top button
Flag Notify
    Connect to us
      Site Registration

Site Registration

Hadoop: Filtering by value in Reducer

+2 votes
80 views

I am currently playing around with Hadoop and have some problems when trying to filter in the Reducer.

I extended the WordCount v1.0 example from the 2.7 MapReduce Tutorial with some additional functionality
and added the possibility to filter by the specific value of each key - e.g. only output the key-value pairs where [[ value > threshold ]].

Filtering Code in Reducer

for (IntWritable val : values) {
  sum += val.get();
}
if ( sum > threshold ) {
  result.set(sum);
  context.write(key, result);
}

For threshold smaller any value the above code works as expected and the output contains all key-value pairs. If I increase the threshold to 1 some pairs are missing in the output although the respective value would be larger than the threshold.

I tried to work out the error myself, but I could not get it to work as intended. I use the exact Tutorial setup with Oracle JDK 8 on a CentOS 7 machine.

As far as I understand the respective Iterable in the Reducer already contains all the observed values for a specific key. Why is it possible that I am missing some of these key-value pairs then? It only fails in very few cases. The input file is pretty large - 250 MB -

so I also tried to increase the memory for the mapping and reduction steps but it did not help ( tried a lot of different stuff without success )

Maybe someone already experienced similar problems / is more experienced than I am.

posted May 11, 2015 by Kumar Mitrasen

Looking for an answer?  Promote on:
Facebook Share Button Twitter Share Button LinkedIn Share Button
What is the type of the threshold variable? sum I believe is a Java int.

Similar Questions
+2 votes

Does anyone knows how to ‘capture’ the exception which actually failed the job running on Mapper or Reducer at runtime? It seems Hadoop is designed to be fault tolerant that the failed jobs will be automatically rerun for a certain amount of times and won’t actually expose the real problem unless you look into the error log?

In my use case, I would like to capture the exception and make different response based on the type of the exception.

+1 vote

I have a file containing one line for each edge in the graph with two vertex ids (source & sink).
sample:

1  2 (here 1 is source and 2 is sink node for the edge)
1  5
2  3
4  2
4  3

I want to assign a unique Id (Long value )to each edge i.e for each line of the file. How to ensure assignment of unique value in distributed mapper process?

Note : File size is large, so using only one reducer is not feasible.

...