Can a single pod make a node become unavailable? | C2C Community


  • 6 August 2021
  • 5 replies

Userlevel 1

Suppose I have 5 pods on 1 node: 4 of them are configured correctly, and the fifth is misconfigured and gets OOMKilled.

Can that one misconfigured pod make the whole node unavailable?



Userlevel 7

@alfonsmr would you have any insight here?

Are you able to elaborate on the scenario? Is this something you have seen? Can you share more details? Typically, OOMKilled pods are the result of the kernel trying to safeguard the stability of the node in a situation where the kubelet itself has not managed to; you can read more on this subject here: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
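For context, the linked section describes how the kubelet biases the kernel's choice of OOM victim by setting an `oom_score_adj` per container according to the pod's QoS class: Guaranteed gets -997, BestEffort gets 1000, and Burstable gets `min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)`. Below is a minimal Python sketch of that mapping, assuming a simplified, hypothetical container model (`Container`, `qos_class`, `oom_score_adj` are illustrative names, not kubelet APIs), and assuming integer division for the rounding.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified container model: memory in bytes, CPU in
# millicores; a resource absent from the dict counts as "not set".
@dataclass
class Container:
    requests: dict = field(default_factory=dict)
    limits: dict = field(default_factory=dict)


def qos_class(containers):
    """Simplified QoS classification for a pod's containers."""
    # BestEffort: no container sets any request or limit.
    if all(not c.requests and not c.limits for c in containers):
        return "BestEffort"
    # Guaranteed: every container sets CPU and memory limits, and any
    # explicit requests equal them (an unset request defaults to the limit).
    for c in containers:
        for resource in ("cpu", "memory"):
            limit = c.limits.get(resource)
            if limit is None or c.requests.get(resource, limit) != limit:
                return "Burstable"
    return "Guaranteed"


def oom_score_adj(qos, memory_request_bytes, machine_memory_bytes):
    """oom_score_adj per the node OOM behavior docs (integer rounding assumed)."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the memory request relative to node capacity,
    # the lower (safer) the score, clamped to [2, 999].
    score = 1000 - (1000 * memory_request_bytes) // machine_memory_bytes
    return min(max(2, score), 999)


GiB = 1024 ** 3
MiB = 1024 ** 2
node_memory = 16 * GiB

# Small request, large limit: a plausible shape for a misconfigured pod.
leaky = Container(requests={"memory": 256 * MiB}, limits={"memory": 8 * GiB})
qos = qos_class([leaky])
print(qos, oom_score_adj(qos, 256 * MiB, node_memory))  # Burstable 985
```

A higher score makes a process a likelier OOM victim, so in this sketch the small-request Burstable container scores close to BestEffort and would be near the front of the line, while Guaranteed pods are strongly protected.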

Userlevel 1

Thanks @alexmoore. That’s the scenario.

“Is this something you have seen?” That’s the thing: I don’t know, and that’s the reason I asked.

From the link you sent (I removed the #node-out-of-memory-behavior anchor): https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/

The kubelet monitors resources like CPU, memory, disk space, and filesystem inodes on your cluster's nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.
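The same page defines the memory signal behind that sentence as `memory.available := node.status.capacity[memory] - node.stats.memory.workingSet`, and the default hard eviction threshold on Linux nodes is `memory.available<100Mi`. A rough Python sketch of that check follows; the function names are hypothetical, and the real kubelet reads these values from cgroup statistics on a periodic basis. That is also why the page notes that if the node hits an OOM before the kubelet can reclaim memory, it falls back on the kernel's oom_killer.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

# Default hard eviction threshold on Linux nodes: memory.available<100Mi.
DEFAULT_HARD_THRESHOLD = 100 * MiB


def memory_available(capacity_bytes, working_set_bytes):
    """memory.available := node capacity - node working set (per the docs)."""
    return capacity_bytes - working_set_bytes


def hard_eviction_triggered(capacity_bytes, working_set_bytes,
                            threshold_bytes=DEFAULT_HARD_THRESHOLD):
    """True when the memory signal drops below the hard eviction threshold."""
    return memory_available(capacity_bytes, working_set_bytes) < threshold_bytes


# 8 GiB node whose working set has grown to within 50 MiB of capacity.
print(hard_eviction_triggered(8 * GiB, 8 * GiB - 50 * MiB))  # True
print(hard_eviction_triggered(8 * GiB, 6 * GiB))             # False
```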

From the above, it doesn’t seem that one bad pod would bring down the whole node.

But again, if anybody has more insight, that would be great.

Typically, a single bad pod shouldn’t bring down a node; the kubelet and the Linux kernel itself should have protections in place to prevent that. That doesn’t mean it is impossible, though. If you can share more information about what you are seeing (errors, logs, etc.), that might help; alternatively, you might want to consider opening a support case, as the support team will be able to help with deeper-level diagnostics.

Userlevel 1

Hi there,

 

If you have overcommitted the node’s memory, intentionally or not, the kernel may start killing seemingly random pods on the node to free up memory (in practice, the kernel OOM killer picks its victims by OOM score). This does not harm the node itself, but it does harm the applications running in those pods.
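Overcommitment here means the pods’ memory limits add up to more than the node can actually provide, so the node is relying on the pods not all using their limits at the same time. The scheduler places pods by their requests, so a node can accept pods whose limits total well over 100% of allocatable memory. A small Python sketch of that arithmetic, assuming you already have each pod’s memory request and limit in bytes and the node’s allocatable memory (for example from the Allocatable section of `kubectl describe node`); the function name and the dict input format are hypothetical.

```python
import math

GiB = 1024 ** 3


def memory_commitment(pods, allocatable_bytes):
    """Return (requests, limits) as fractions of the node's allocatable memory.

    Each pod is a dict with optional "request" and "limit" keys in bytes
    (a hypothetical, simplified input format).
    """
    total_requests = sum(p.get("request", 0) for p in pods)
    if any("limit" not in p for p in pods):
        # A pod without a memory limit can keep growing until the node
        # itself runs out of memory, so treat the total as unbounded.
        total_limits = math.inf
    else:
        total_limits = sum(p["limit"] for p in pods)
    return total_requests / allocatable_bytes, total_limits / allocatable_bytes


# Four well-behaved pods plus one with a generous limit on an 8 GiB node.
pods = [{"request": 1 * GiB, "limit": 2 * GiB}] * 4
pods.append({"request": 1 * GiB, "limit": 6 * GiB})
requests_ratio, limits_ratio = memory_commitment(pods, 8 * GiB)
print(requests_ratio, limits_ratio)  # 0.625 1.75
```

In this made-up example the requests fit comfortably (62.5%), so all five pods get scheduled, but the limits total 175% of allocatable memory. If the fifth pod actually grows toward its limit, the node runs short of memory and the OOM killer has to step in.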

It is possible for a pod to make a node unavailable if it can write to the node’s disk and fill up all the available space there. We have run into this, too, when the node’s tmp directory was mounted into the pod’s temp directory.
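A quick way to see how close the filesystem behind such a mount is to filling up is Python’s `shutil.disk_usage`, run on the node itself. The sketch below uses `/tmp` only as a placeholder path (point it at whichever filesystem actually backs the mounted directory) and flags less than 10% free space, mirroring the kubelet’s default `nodefs.available<10%` hard eviction threshold on Linux; the helper names are hypothetical.

```python
import shutil

# Default kubelet hard eviction threshold for the node filesystem on Linux.
NODEFS_MIN_FREE_FRACTION = 0.10


def below_threshold(free_bytes, total_bytes,
                    min_free_fraction=NODEFS_MIN_FREE_FRACTION):
    """True when the free fraction of the filesystem is under the threshold."""
    return free_bytes / total_bytes < min_free_fraction


def check_path(path="/tmp", min_free_fraction=NODEFS_MIN_FREE_FRACTION):
    """Return (free, total, low) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free, usage.total, below_threshold(usage.free, usage.total,
                                                    min_free_fraction)


if __name__ == "__main__":
    free, total, low = check_path("/tmp")
    print(f"/tmp: {free / total:.1%} free, below threshold: {low}")
```

If a pod keeps writing into a host directory on that filesystem, the free fraction reported here shrinks along with it.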

 

Does this help as a hint?

 

Greetings from Berlin,

Helge Rennicke
