When I submit a map reduce job, would it only work on the files present at that point?

I have a system where files are coming in hdfs at regular intervals and I perform an operation everytime the directory size goes above a particular point.

My Question is that when I submit a map reduce job, would it only work on the files present at that point?

1 Answer

Normally MR job is used for batch processing. So I don't think this is a good use case here for MR. Since you need to run the program periodically, you cannot submit a single mapreduce job for this. An possible way is to create a cron job to scan the folder size and submit a MR job if necessary;

answer Aug 28, 2014 by Amit Parthsarthi

Or, maybe have a look at Apache Falcon:Falcon - Apache Falcon - Data management and processing platform

commented Aug 28, 2014 by Naveena Garg

Similar Questions

+1 vote

How a job works in YARN/Map Reduce? like navigation path...

How a job works in YARN/Map Reduce? like navigation path.

Please check my understanding is right?

When the application or job or client starts, client communicate with Name node the application manager started on node (data node), Application manager communicates with Resource manager (on name node) to get resource.The resource are assigned to container. The job runs on Container which is JVM.

+1 vote

What is Hadoop Map Reduce? How it works?

+2 votes

Query regarding the object creation in map reduce code

I need your help in writing the map reduce program in Java. I am creating a mapper and reducer classes for reading and processing a log file. I also have many other class files which acts as supporting classes to mapper and will be instantiated from mapper class within the map function.

PROBLEM STATEMENT :
Since there are 20 other objects which will be instantiated from mapper class within the map function, we think this could create a performance hit because of multiple object creation .

Please let us know what could be best approach/design to instantiate these 20 classes from Mapper class without compromising on the performance.

Your suggestions/comments are welcome.

+2 votes

Hadoop: Map Reduce in Cache

I have a set of input files which are going through changes. Is there any way by which we can run a Map reduce program which caches results.

Also, whenever there is any change to the input files the Map Reduce program automatically runs again and the resultset is altered according to changes to input files?

Can we use MR to approach this dynamically ?

0 votes

How to get info about which data in hdfs or file system that a MapReduce job visits?

I was trying to implement a Hadoop/Spark audit tool, but l met a problem that I can't get the input file location and file name. I can get username, IP address, time, user command, all of these info from hdfs-audit.log. But When I submit a MapReduce job, I can't see input file location neither in Hadoop logs or Hadoop ResourceManager.

Does hadoop have API or log that contains these info through some configuration ?If it have, what should I configure?

When I submit a map reduce job, would it only work on the files present at that point?

Your comment on this post:

1 Answer

Your comment on this answer:

Your answer

Preview