Hadoop InputFormat - Processing large number of small files

+2 votes

I have a use case where I need to process a huge set of files stored in HDFS. The files are non-splittable, so each one must be processed as a whole. I have the following questions that I need answered before I can proceed.

  1. I want each map task to be scheduled on a task tracker where its data is already available. How can I do that? Currently I have a file that contains a list of filenames. Each map gets one line of it via NLineInputFormat, then opens the named file via FSDataInputStream and works with it. Is there a way to ensure that the map task runs on a node where the file is available?

  2. The files are not large; by Hadoop standards they are small files. I came across CombineFileInputFormat, which can process more than one file in a single map task. What I need is an input format that hands more than one file to a single map without reading the files itself, and that carries the filenames in either the key or the value. In the map task I can then loop over these files and process them. Any help?

  3. Any other alternatives?
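
A minimal sketch of the grouping described in items 1 and 2, written in plain Java with no Hadoop dependency so that it runs on its own. It buckets filenames by the first host that holds each file and cuts every bucket into batches of bounded size. The class `SmallFileGrouper`, its `Batch` type, and the parameter `maxFilesPerBatch` are hypothetical names, not Hadoop APIs. In a real job, the host lists would presumably come from `FileSystem.getFileBlockLocations` (via `BlockLocation.getHosts()`), and a custom input format would return each batch's host from `InputSplit.getLocations()` as a locality hint, which the scheduler tries to honor but does not guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: groups small files into batches
// that share a preferred host, the way a custom getSplits() might.
final class SmallFileGrouper {

    /** One batch of filenames plus the host hint a split could report. */
    static final class Batch {
        final String host;
        final List<String> files;

        Batch(String host, List<String> files) {
            this.host = host;
            this.files = files;
        }
    }

    /**
     * @param fileHosts        filename -> hosts holding the file (first host is used as the hint)
     * @param maxFilesPerBatch upper bound on the files handed to a single map task
     */
    static List<Batch> group(Map<String, List<String>> fileHosts, int maxFilesPerBatch) {
        if (maxFilesPerBatch < 1) {
            throw new IllegalArgumentException("maxFilesPerBatch must be at least 1");
        }
        // Bucket files by their first replica host, preserving input order.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : fileHosts.entrySet()) {
            List<String> hosts = e.getValue();
            String host = hosts.isEmpty() ? "unknown" : hosts.get(0);
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(e.getKey());
        }
        // Cut each host's bucket into batches of at most maxFilesPerBatch files.
        List<Batch> batches = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byHost.entrySet()) {
            List<String> files = e.getValue();
            for (int i = 0; i < files.size(); i += maxFilesPerBatch) {
                int end = Math.min(i + maxFilesPerBatch, files.size());
                batches.add(new Batch(e.getKey(), new ArrayList<>(files.subList(i, end))));
            }
        }
        return batches;
    }
}
```

Each `Batch` carries only filenames, so the map task would still open every file itself (for example via FSDataInputStream) and loop over the batch, which matches the "filenames in the key or value, no reading by the format" requirement in item 2.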

posted Aug 20, 2014 by Tarun Singhal


Similar Questions
+1 vote

How can I store images in Hadoop/Hive and perform some processing on them? Is there a built-in library available for this? How does Hadoop store images in HDFS?

+1 vote

I would like to understand how Hadoop is used for more real-time scenarios. Are examples available for machine learning, language processing, and fraud detection? What are other practical use cases?

+2 votes

Is it possible to consolidate two small data volumes (500GB each) into a larger data volume (3TB)?

I'm thinking that as long as the block file names and metadata are unique, then I should be able to shut down the datanode and use something like tar or rsync to copy the contents of each small volume to the large volume.

Will this work?

0 votes

I want to ask: what is the best way to implement a job that imports files into HDFS?

I have an external system offering data through a REST API. My goal is to have a job running in Hadoop that periodically (maybe started by cron?) checks the REST API for new data.

It would be nice if this job could also run on multiple data nodes. But unlike all the MapReduce examples I found, my job looks for new or changed data from an external interface and compares it with the existing data.

This is a conceptual example of the job:

  • The job asks the REST API whether there are new files
  • If so, the job takes the first file in the list
  • It checks whether the file already exists
  • If not, the job imports the file
  • If yes, the job compares the data with the data already stored
  • If the data has changed, the job updates the file
  • If more files exist, the job continues with step 2
  • Otherwise it ends.
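
The loop outlined above can be sketched in plain Java, assuming in-memory maps stand in for both the REST API listing and the HDFS store so the logic can be run in isolation. The class `ImportJobSketch`, the `Action` enum, and the `sync` method are hypothetical names for illustration; a real job would replace the map operations with REST calls and HDFS reads and writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps of filename -> content replace the REST API and HDFS.
final class ImportJobSketch {

    enum Action { IMPORTED, UPDATED, UNCHANGED }

    /**
     * Walks the remote listing once and brings the store in line with it.
     * Assumes neither map contains null values.
     *
     * @param remote files reported by the external system
     * @param store  files already imported; modified in place
     * @return what was done for each remote file, in listing order
     */
    static Map<String, Action> sync(Map<String, String> remote, Map<String, String> store) {
        Map<String, Action> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : remote.entrySet()) {
            String name = e.getKey();
            String data = e.getValue();
            if (!store.containsKey(name)) {
                // File does not exist yet: import it.
                store.put(name, data);
                result.put(name, Action.IMPORTED);
            } else if (!store.get(name).equals(data)) {
                // File exists but its content differs: update it.
                store.put(name, data);
                result.put(name, Action.UPDATED);
            } else {
                // File exists and is identical: nothing to do.
                result.put(name, Action.UNCHANGED);
            }
        }
        return result;
    }
}
```

Comparing full contents is the simplest choice for a sketch; for large files a checksum or modification timestamp, if the external system provides one, would avoid transferring data that has not changed.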

Can anybody give me a little help on how to start? (It's the first job I'm writing.)
