Hadoop / HBase hotspotting / overloading specific nodes

+4 votes

I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this:

  • The job is a data import using Map/Reduce / HBase
  • The data is being imported to one table
  • The table only has a couple of regions
  • As the job runs, HBase (or Hadoop?) begins placing the data in HDFS on the datanodes / regionservers that are hosting the regions
  • As the job progresses (and more data is imported), the two datanodes hosting the regions fill up and eventually hit 100% drive space utilization, while the other nodes in the cluster are at 40% or less
  • The job in Hadoop then begins to hang with multiple "out of space" errors and eventually fails.
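
A rough way to picture the scenario above, assuming HDFS's default block placement applies here (the first replica of each block goes to the datanode local to the writer, in this case the regionserver, and the remaining replicas go to other nodes). The sketch is a toy simulation, not HDFS code: the node names, cluster size, block count, and the rack-free random choice of the other replicas are all hypothetical simplifications.

```python
import random
from collections import Counter

def simulate_placement(nodes, writers, blocks, replication=3, seed=0):
    """Toy model of writer-local block placement, ignoring racks.

    The first replica goes to the writer's own datanode; the remaining
    replicas go to distinct other nodes chosen at random.
    """
    rng = random.Random(seed)
    usage = Counter({n: 0 for n in nodes})
    for i in range(blocks):
        writer = writers[i % len(writers)]  # the writers take turns producing blocks
        others = [n for n in nodes if n != writer]
        replicas = [writer] + rng.sample(others, replication - 1)
        for n in replicas:
            usage[n] += 1
    return usage

nodes = [f"node{i}" for i in range(1, 11)]  # hypothetical 10-node cluster
writers = ["node1", "node2"]                # the two nodes hosting the regions
usage = simulate_placement(nodes, writers, blocks=10_000)
for node in nodes:
    print(node, usage[node])
```

In this toy model the two writer nodes end up holding well over twice as many block replicas as any other node, which is the same shape as the 100% versus 40% utilization described above.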

I have tried running the hadoop balancer during the job run. This helped, but it only really succeeded in delaying the eventual job failure.

How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on?

Am I missing something here?

posted Oct 9, 2014 by anonymous

Can you set aside reserved space for non-DFS usage, just to keep the disks from filling up completely?
The HDFS setting for this is described as: "Reserved space in bytes per volume. Always leave this much space free for non dfs use."
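
The comment does not name the setting, but the quoted description matches that of dfs.datanode.du.reserved in hdfs-default.xml, so the sketch below assumes that is the property meant. It only converts a per-volume reserve into bytes and prints the entry that would go into hdfs-site.xml on each datanode. The helper name and the 20 GiB figure are hypothetical.

```python
def reserved_property(gib_per_volume):
    """Render an hdfs-site.xml <property> entry reserving space per volume.

    Assumes the setting in question is dfs.datanode.du.reserved,
    whose value is expressed in bytes.
    """
    reserved_bytes = int(gib_per_volume * 1024 ** 3)  # GiB -> bytes
    return (
        "<property>\n"
        "  <name>dfs.datanode.du.reserved</name>\n"
        f"  <value>{reserved_bytes}</value>\n"
        "</property>"
    )

print(reserved_property(20))  # 20 GiB is an arbitrary example value
```

As far as I know, this keeps some headroom on each volume rather than evening out block placement by itself, so it would likely need to be combined with spreading the table over more regions.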

