MapReduce for complex key/value pairs?

+1 vote

I was wondering if the following is possible using MapReduce.

I would like to create a job that loops over a bunch of documents, tokenizes them into ngrams, and stores, for each ngram, not only its count but also _which_ document(s) contained it. In other words, the key would be the ngram, but the value would be an integer (the count) _and_ an array of document ids.

Is this something that can be done? Any pointers would be helpful...

posted Apr 8, 2014 by Ahmed Patel


1 Answer

+1 vote

Yes, you can write custom Writable classes that describe and serialize your required data structure. If you have "Hadoop: The Definitive Guide", check out the "Serialization" section in the "Hadoop I/O" chapter.

answer Apr 8, 2014 by Majula Joshi
- The simplest, least elegant way is to add parsing logic directly in your mappers/reducers; just writing JSON strings as values is another easy option.
- A more advanced way is to write a custom Writable that serializes and parses the data for you.
- The truly portable and "right" way is to define a schema and use Avro to parse it. Unlike hand-rolled parsing in app logic, or JSON (de)serialization in your mappers/reducers, proper Avro serialization improves performance and app portability while also making the code more maintainable (it inter-operates with plain Java domain objects).
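To make the custom-Writable option concrete, here is a minimal sketch of a value type holding an ngram count plus the ids of the documents that contained it. The class name `NgramValue` and its fields are illustrative, not from the original answer; in a real job the class would also declare `implements org.apache.hadoop.io.Writable` (the `write`/`readFields` signatures below already match that interface; the Hadoop import is omitted only so the sketch compiles without Hadoop on the classpath).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical value class: an ngram's count plus the ids of documents
// containing it. Add "implements org.apache.hadoop.io.Writable" in a real job.
public class NgramValue {
    private int count;
    private List<String> docIds = new ArrayList<>();

    public NgramValue() {}  // Writable types need a no-arg constructor

    public NgramValue(int count, List<String> docIds) {
        this.count = count;
        this.docIds = docIds;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(count);
        out.writeInt(docIds.size());  // length prefix tells readFields how many ids follow
        for (String id : docIds) {
            out.writeUTF(id);
        }
    }

    public void readFields(DataInput in) throws IOException {
        count = in.readInt();
        int n = in.readInt();
        docIds = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            docIds.add(in.readUTF());
        }
    }

    public int getCount() { return count; }
    public List<String> getDocIds() { return docIds; }

    // Round-trip the value through a byte stream, as the framework would.
    public static void main(String[] args) throws IOException {
        NgramValue v = new NgramValue(3, Arrays.asList("doc1", "doc7"));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        v.write(new DataOutputStream(bos));

        NgramValue w = new NgramValue();
        w.readFields(new DataInputStream(new ByteArrayInputStream(bos.toByteArray())));
        System.out.println(w.getCount() + " " + w.getDocIds());  // 3 [doc1, doc7]
    }
}
```

The length-prefixed encoding in `write` is what lets `readFields` reconstruct the list without any delimiter; the same pattern extends to nested structures.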
Similar Questions
+1 vote

In the XML configuration files of Hadoop 2.x, "mapreduce.input.fileinputformat.split.minsize" is given and can be set, but how do I set "mapreduce.input.fileinputformat.split.maxsize" in the XML file? I need to set it in my MapReduce code.
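For reference, the property can be set like any other in a site configuration file; the 256 MB value below is only an example:

```xml
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>268435456</value> <!-- 256 MB, in bytes -->
</property>
```

From job code, the equivalent is `FileInputFormat.setMaxInputSplitSize(job, 268435456L)` (in `org.apache.hadoop.mapreduce.lib.input.FileInputFormat`), or `conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 268435456L)` before submitting the job.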

+1 vote

I keep encountering an error when running nutch on hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM; the namenode/resourcemanager has the same specs but with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempts succeed. Also, if I restart the Hadoop cluster, the error goes away for some time.
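As an aside, the 600-second figure in the error matches the default of `mapreduce.task.timeout` (600000 ms in Hadoop 2.x), which kills a task attempt that neither reads input, writes output, nor reports status for that long. A hedged configuration sketch, if raising the threshold is appropriate:

```xml
<property>
  <name>mapreduce.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes, in ms; example value only -->
</property>
```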

+3 votes

In a KNN-like algorithm, we need to load the model data into the cache for predicting records.

Here is the example for KNN.

So if the model is a large file, say 1 or 2 GB, will we be able to load it into the distributed cache?

One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, then find the minimum distance and the most frequent class label and predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e., 1st record → partition1, partition2, ...; 2nd record → partition1, partition2, ...

This is what came to my mind. Is there any better way? Any pointers would help me.

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I met a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location either in the Hadoop logs or in the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this info through some configuration? If it does, what should I configure?