Hadoop YARN 2.2.0 Streaming Memory Limitation?

+1 vote

We are currently facing a frustrating Hadoop streaming memory problem. Our setup:

  • our compute nodes have about 7 GB of RAM
  • Hadoop streaming starts a bash script which uses about 4 GB of RAM
  • therefore it is only possible to start one task per node

Out of the box, each Hadoop instance starts about 7 Hadoop containers with the default Hadoop settings. Each Hadoop task forks a bash script that needs about 4 GB of RAM; the first fork works, and all following ones fail because they run out of memory. So what we are looking for is a way to limit the number of containers to only one. This is what we found on the internet:

  • yarn.scheduler.maximum-allocation-mb and is set to values such that there is at most one container. This means, must be more than half of the maximum memory (otherwise there will be multiple containers).

Done right, this gives us one container per node. But it produces a new problem: since our Java process is now using at least half of the max memory, the child (bash) process we fork will inherit the parent's memory footprint, and since the memory used by our parent was more than half of the total memory, we run out of memory again. If we lower the map memory, Hadoop will allocate 2 containers per node, which will run out of memory too.
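The container arithmetic described here can be sketched as follows. This is a simplified model, not YARN's actual scheduler: it assumes a node fits as many containers as whole multiples of the container size go into the node memory, and it ignores any rounding the scheduler applies to requests. The function name and the 4 GB / 3 GB container sizes are illustrative; only the 7 GB node size comes from the setup above.

```python
def containers_per_node(node_memory_mb: int, container_memory_mb: int) -> int:
    """Simplified model: how many containers of a given size fit on one node."""
    if container_memory_mb <= 0:
        raise ValueError("container memory must be positive")
    return node_memory_mb // container_memory_mb


node_mb = 7 * 1024  # about 7 GB per compute node, as in the setup above

# A container larger than half of the node memory fits only once.
print(containers_per_node(node_mb, 4 * 1024))

# A container at or below half of the node memory fits at least twice.
print(containers_per_node(node_mb, 3 * 1024))
```

In this model the one-container-per-node condition is simply that the container size exceeds half of the node memory, which is the constraint that then collides with the forked script's own 4 GB.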

Since this problem is a blocker in our current project, we are evaluating adapting the source code to solve this issue as a last resort. Any ideas on this are very much welcome.

posted Feb 24, 2014 by Jagan Mishra

Can you try setting yarn.nodemanager.resource.memory-mb (amount of physical memory, in MB, that can be allocated for containers) to, say, 1024, and also set to 1024?
Thanks for the input. Unfortunately it doesn’t solve our problem. If we set the properties:
yarn.nodemanager.resource.memory-mb = 1024 = 1024
there are no containers spawned and no jobs started.

if I set:
yarn.nodemanager.resource.memory-mb = 2048 = 2048

there is one container and one mapper, but the bash process can’t be started by Hadoop streaming.
The logs say:
ContainersMonitorImpl: Memory usage of ProcessTree 7655 for container-id container_1393326502216_0001_01_000001: 164.7 MB of 2 GB physical memory used; 1.5 GB of 4.2 GB virtual memory used
but there is no indication of why our bash script isn’t started.
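To read the numbers in that log line more easily, here is a small sketch that parses it and compares usage against the limits. The regular expression and the helper name `parse_memory_usage` are assumptions made for illustration; they are written against the single line quoted above and may not match other ContainersMonitorImpl output.

```python
import re

LOG_LINE = (
    "ContainersMonitorImpl: Memory usage of ProcessTree 7655 for container-id "
    "container_1393326502216_0001_01_000001: 164.7 MB of 2 GB physical memory used; "
    "1.5 GB of 4.2 GB virtual memory used"
)

UNITS_MB = {"MB": 1.0, "GB": 1024.0}

# Pattern written for the line quoted above (an assumption, not an official format).
PATTERN = re.compile(
    r"([\d.]+) (MB|GB) of ([\d.]+) (MB|GB) physical memory used; "
    r"([\d.]+) (MB|GB) of ([\d.]+) (MB|GB) virtual memory used"
)


def parse_memory_usage(line: str) -> dict:
    """Extract used/limit values (in MB) for physical and virtual memory."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a memory usage line")
    g = match.groups()

    def to_mb(value: str, unit: str) -> float:
        return float(value) * UNITS_MB[unit]

    return {
        "phys_used_mb": to_mb(g[0], g[1]),
        "phys_limit_mb": to_mb(g[2], g[3]),
        "virt_used_mb": to_mb(g[4], g[5]),
        "virt_limit_mb": to_mb(g[6], g[7]),
    }


usage = parse_memory_usage(LOG_LINE)
print(usage)
# Ratio of the virtual limit to the physical limit for this container.
print(usage["virt_limit_mb"] / usage["phys_limit_mb"])
```

The container is far below both limits at this point, so this particular line does not show the monitor killing anything. The virtual limit is 2.1 times the physical one, which is consistent with the default value of yarn.nodemanager.vmem-pmem-ratio.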

1 Answer

+1 vote

Please try with = 5124

answer Feb 24, 2014 by Garima Jain
Thanks a lot for your input. We got it to run correctly; it’s not exactly the solution you proposed, but it’s close:

The main error we made is that on a YARN controller node the memory footprint must be set differently than on a Hadoop worker node. The following rule of thumb seems to apply in our setup:
master: = 1/3 of yarn.nodemanager.resource.memory-mb
worker: = 1/2 of yarn.nodemanager.resource.memory-mb

For both cases we set “Xmx 1024”, or about 1/4 of total memory.

The reason for this behaviour is that the YARN controller spawns 2 subprocesses, while all workers spawn only 1 subprocess:

  • on master: java MRAppMaster and YarnChild (which spawns the mapper)
  • on workers: YarnChild (which spawns the mapper)
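As a rough sketch of the rule of thumb above, the per-node value can be derived from yarn.nodemanager.resource.memory-mb like this. The function name and the 6144 MB example value are hypothetical, and the fractions are only the ones reported for this particular setup, not a general recommendation.

```python
def task_memory_mb(nodemanager_memory_mb: int, is_master: bool) -> int:
    """Memory per task under the rule of thumb reported in this thread.

    master: 1/3 of the node manager memory (MRAppMaster and YarnChild run there)
    worker: 1/2 of the node manager memory (only YarnChild runs there)
    """
    if nodemanager_memory_mb <= 0:
        raise ValueError("node manager memory must be positive")
    divisor = 3 if is_master else 2
    return nodemanager_memory_mb // divisor


nm_memory_mb = 6144  # hypothetical yarn.nodemanager.resource.memory-mb value

print("master:", task_memory_mb(nm_memory_mb, is_master=True))
print("worker:", task_memory_mb(nm_memory_mb, is_master=False))
```

The smaller fraction on the master leaves room for the extra MRAppMaster process next to the YarnChild that spawns the mapper.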

Now everything works smoothly.
