When I submit a MapReduce job, would it only work on the files present at that point?

0 votes

I have a system where files arrive in HDFS at regular intervals, and I run an operation every time the directory size goes above a certain threshold.

My question: when I submit a MapReduce job, would it only work on the files present at that point?

posted Aug 27, 2014 by Vijay Shukla

1 Answer

+1 vote

Normally an MR job is used for batch processing, so I don't think this is a good use case for MR. Since you need to run the program periodically, you cannot submit a single MapReduce job for this. One possible way is to create a cron job that scans the folder size and submits an MR job if necessary.

answer Aug 28, 2014 by Amit Parthsarthi
Or maybe have a look at Apache Falcon, a data management and processing platform.
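
To make the cron idea above concrete: with the standard file-based input formats, a job's input splits are computed when the job is submitted, so files that land in the directory afterwards are not picked up by that run. Below is a minimal sketch of the check a scheduled task could perform. `ThresholdTrigger` and `snapshotIfOverThreshold` are hypothetical names, the paths and sizes are made up, and a plain map stands in for a real HDFS directory listing, so this is an illustration of the logic rather than a ready-made Hadoop client.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: decides whether a run is due
// and freezes the list of input files for that run.
class ThresholdTrigger {

    // Returns the files to hand to the job if their combined size exceeds
    // thresholdBytes; returns an empty list otherwise.
    static List<String> snapshotIfOverThreshold(Map<String, Long> fileSizes, long thresholdBytes) {
        long total = 0;
        for (long size : fileSizes.values()) {
            total += size;
        }
        if (total <= thresholdBytes) {
            return new ArrayList<>();
        }
        // Copy the names so files that arrive later do not change this run's input.
        return new ArrayList<>(fileSizes.keySet());
    }

    public static void main(String[] args) {
        // Stand-in for a directory listing; on a real cluster the sizes would
        // come from HDFS (for example via FileSystem.listStatus).
        Map<String, Long> listing = new LinkedHashMap<>();
        listing.put("/incoming/part-0001", 400L);
        listing.put("/incoming/part-0002", 700L);

        List<String> inputs = snapshotIfOverThreshold(listing, 1000L);
        if (inputs.isEmpty()) {
            System.out.println("Below threshold, nothing to do");
        } else {
            System.out.println("Submit job for: " + inputs);
        }
    }
}
```

Running this from cron every few minutes, and moving or marking the processed files after each successful run, would keep later runs from reprocessing the same input. That bookkeeping is left out here.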
Similar Questions
+1 vote

How does a job work in YARN/MapReduce? What path does it follow?

Could you please check whether my understanding is right?

When the application (job or client) starts, the client communicates with the NameNode, and the application manager is started on a node (DataNode). The application manager communicates with the Resource Manager (on the NameNode) to get resources. The resources are assigned to a container. The job runs in a container, which is a JVM.

+2 votes

I need your help writing a MapReduce program in Java. I am creating mapper and reducer classes for reading and processing a log file. I also have many other class files which act as supporting classes to the mapper and will be instantiated from the mapper class within the map function.

Since there are 20 other objects which will be instantiated from the mapper class within the map function, we think this could hurt performance because of the repeated object creation.

Please let us know the best approach/design to instantiate these 20 classes from the Mapper class without compromising performance.

Your suggestions/comments are welcome.

+2 votes

I have a set of input files which keep changing. Is there any way to run a MapReduce program that caches its results?

Also, can the MapReduce program automatically run again whenever the input files change, with the result set updated accordingly?

Can we use MR to approach this dynamically?

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this information through some configuration? If it does, what should I configure?