Hadoop: Filtering by value in Reducer

I am currently playing around with Hadoop and have some problems when trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with some additional functionality
and added the possibility to filter by the specific value of each key - e.g. only output the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer

for (IntWritable val : values) {
  sum += val.get();
}
if ( sum > threshold ) {
  result.set(sum);
  context.write(key, result);
}

For threshold smaller any value the above code works as expected and the output contains all key-value pairs. If I increase the threshold to 1 some pairs are missing in the output although the respective value would be larger than the threshold.

I tried to work out the error myself, but I could not get it to work as intended. I use the exact Tutorial setup with Oracle JDK 8 on a CentOS 7 machine.

As far as I understand the respective IterableÂ in the Reducer already contains all the observed values for a specific key. Why is it possible that I am missing some of these key-value pairs then? It only fails in very few cases. The input file is pretty large - 250 MB -

so I also tried to increase the memory for the mapping and reduction steps but it did not help ( tried a lot of different stuff without success )

Maybe someone already experienced similar problems / is more experienced than I am.

Hadoop: Filtering by value in Reducer

Your comment on this post:

Your answer

Preview