MapReduce for complex key/value pairs?

+1 vote

I was wondering if the following is possible using MapReduce.

I would like to create a job that loops over a bunch of documents, tokenizes them into ngrams, and stores, for each ngram, not only its count but also _which_ document(s) contained it. In other words, the key would be the ngram, but the value would be an integer (the count) _and_ an array of document ids.

Is this something that can be done? Any pointers would be helpful...

posted Apr 8, 2014 by Ahmed Patel


1 Answer

+1 vote

Yes, you can write custom Writable classes that describe and serialize your required data structure. If you have "Hadoop: The Definitive Guide", check out the "Serialization" section in the "Hadoop I/O" chapter.

answer Apr 8, 2014 by Majula Joshi
- The simplest, least elegant way is to add parsing logic directly in your mappers/reducers; just writing JSON strings as values is another easy option.
- A more advanced way is to write a custom Writable that serializes and parses the data for you.
- The truly portable and "right" way is to define a schema and use Avro to parse it. Unlike hand-rolled parsing in app logic, or JSON (de)serialization in your mappers/reducers, proper Avro serialization improves performance and app portability while also making the code more maintainable (it inter-operates with plain Java domain objects).
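To make the custom-Writable option concrete, here is a minimal sketch of a value type holding an ngram count plus the ids of the documents that contained it. The class name `NgramValue` and its fields are illustrative, not from the original answer; in a real job the class would also declare `implements org.apache.hadoop.io.Writable` (the `write`/`readFields` signatures below already match that interface; the Hadoop import is omitted only so the sketch compiles without Hadoop on the classpath).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value class: an ngram's count plus the ids of documents
// containing it. Add "implements org.apache.hadoop.io.Writable" in a real job.
public class NgramValue {
    private int count;
    private List<String> docIds = new ArrayList<>();

    public NgramValue() {}  // Writable types need a no-arg constructor

    public NgramValue(int count, List<String> docIds) {
        this.count = count;
        this.docIds = docIds;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());  // length prefix tells readFields how many ids follow
        for (String id : docIds) {
            out.writeUTF(id);
        }
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            docIds.add(in.readUTF());
        }
    }

    public int getCount() { return count; }
    public List<String> getDocIds() { return docIds; }

    // Round-trip the value through a byte stream, as the framework would.
    public static void main(String[] args) throws IOException {
        NgramValue v = new NgramValue(3, Arrays.asList("doc1", "doc7"));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        v.write(new DataOutputStream(bos));

        NgramValue w = new NgramValue();
        w.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(w.getCount() + " " + w.getDocIds());  // 3 [doc1, doc7]
    }
}
```

The length-prefixed encoding in `write` is what lets `readFields` reconstruct the list without any delimiter; the same pattern extends to nested structures.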
Similar Questions
+1 vote

In the XML configuration files of Hadoop 2.x, "mapreduce.input.fileinputformat.split.minsize" is given and can be set, but how do I set "mapreduce.input.fileinputformat.split.maxsize" in the XML file? I need to set it in my MapReduce code.
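For reference, the property can be set like any other in a site configuration file; the 256 MB value below is only an example:

```xml
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value> <!-- 256 MB, in bytes -->
</property>
```

From job code, the equivalent is `FileInputFormat.setMaxInputSplitSize(job, 268435456L)` (in `org.apache.hadoop.mapreduce.lib.input.FileInputFormat`), or `conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 268435456L)` before submitting the job.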

+1 vote

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM; the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempts succeed. Also, if I restart the Hadoop cluster, the error goes away for some time.
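As an aside, the 600-second figure in the error matches the default of `mapreduce.task.timeout` (600000 ms in Hadoop 2.x), which kills a task attempt that neither reads input, writes output, nor reports status for that long. A hedged configuration sketch, if raising the threshold is appropriate:

```xml
<property>
  <name>mapreduce.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes, in ms; example value only -->
</property>
```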

+3 votes

In a KNN-like algorithm, we need to load the model data into the cache for predicting records.

Here is the example for KNN.

So if the model is a large file, say 1 or 2 GB, will we be able to load it into the distributed cache?

One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, then find the minimum distance and the most frequent class label and predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e., 1st record → partition1, partition2, ...; 2nd record → partition1, partition2, ...

This is what came to my mind. Is there any better way? Any pointers would help me.

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I met a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location either in the Hadoop logs or in the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this info through some configuration? If it does, what should I configure?