How to customize Hadoop configuration for a job?

+1 vote
839 views

According to the book "Hadoop: The Definitive Guide", it is possible to use "-D property=value" to override any default or site property in the configuration.

I gave it a shot, but it did not work: the property specified with "-D" is ignored.

Then I put the property in an XML file and used "-conf xml_name" on the command line, but I still cannot override the property.
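
The XML file follows the usual Hadoop configuration format, for example (the property name here is only an illustration):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
  </property>
</configuration>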

The only way I have found to override a default property is to get a Configuration reference in the code and set the property via that reference. But that is not convenient, because I need to recompile the code each time I change the property.
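
What I do at the moment is roughly this in the driver (the property name is just an example):

Configuration conf = new Configuration();
conf.set("mapred.reduce.tasks", "10");  // hard-wired, needs a recompile to change
Job job = new Job(conf);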

Now the question is what is the right way to customize the configuration for a job?

posted Apr 2, 2014 by Tarun Singhal


1 Answer

+1 vote

You can implement your driver code using ToolRunner, so that you can pass extra configuration through the command line instead of editing your code every time.

Driver code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception
    {
        // ToolRunner parses the generic options (-D, -conf, ...) before calling run()
        int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(exitCode);
    }

    public int run(String[] args) throws Exception
    {
        if (args.length != 2) {
            System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n",
                    getClass().getSimpleName());
            return -1;
        }
        // getConf() returns the Configuration already populated with the generic options
        Job job = new Job(getConf());
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // WordMapper and SumReducer are the mapper/reducer classes, defined elsewhere
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
}

command line
$ hadoop jar myjar.jar MyDriver -D mapred.reduce.tasks=10 myinputdir myoutputdir

answer Apr 2, 2014 by Sanketi Garg
That is exactly what I did, but the command-line parameters are ignored. The Hadoop version I am using is 2.0.0-cdh4.1.2. The property I tried is io.sort.mb.
Could you try removing the space between -D and the "property=value", i.e. use -Dproperty=value instead of "-D property=value"? That should work :)
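For example:
$ hadoop jar myjar.jar MyDriver -Dmapred.reduce.tasks=10 myinputdir myoutputdir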
Similar Questions
+1 vote

Assume I have a machine on the same network as a Hadoop 2 cluster but separate from it.

My understanding is that by setting certain elements of the config file or local XML files to point to the cluster, I can launch a job without having to log into the cluster, move my jar to HDFS, and start the job from the cluster's Hadoop machine.

Does this work? What parameters do I need to set? Where does the jar file need to be? What issues would I see if the machine is running Windows with Cygwin installed?

0 votes

I want to ask: what is the best way to implement a job that imports files into HDFS?

I have an external system offering data accessible through a REST API. My goal is to have a job running in Hadoop that periodically (maybe started by cron?) checks the REST API for new data.

It would also be nice if this job could run on multiple data nodes. But unlike all the MapReduce examples I found, my job looks for new or changed data from an external interface and compares it with the data already stored.

This is a conceptual outline of the job (a rough code sketch follows the list):

  1. The job asks the REST API if there are new files.
  2. If so, the job imports the first file in the list.
  3. It checks whether the file already exists:
       • if not, the job imports the file;
       • if yes, the job compares the data with the data already stored;
       • if the data has changed, the job updates the file.
  4. If more files exist, the job continues with step 2; otherwise it ends.
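
Very roughly, what I imagine for the HDFS side is something like this (not a working importer; how the data is fetched from the REST API is left out, restData is just the response stream of one file, and /imports is a placeholder path):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RestImportSketch {

    // Import one file fetched from the REST API into HDFS under /imports (placeholder path).
    public static void importFile(Configuration conf, String name, InputStream restData)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/imports/" + name);

        if (fs.exists(target)) {
            // file already there: compare with the stored copy here (e.g. size or checksum)
            // and return without writing if nothing has changed
        }

        FSDataOutputStream out = fs.create(target, true); // true = overwrite existing file
        IOUtils.copyBytes(restData, out, conf, true);     // copies and closes both streams
    }
}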

Can anybody give me a little help on how to start (it's the first job I have written)?

+2 votes

Suppose we change the default block size to 32 MB and the replication factor to 1, and the Hadoop cluster consists of 4 DNs. The input data size is 192 MB. Now I want to place the data on the DNs as follows: DN1 and DN2 each hold 2 blocks (32+32 = 64 MB), and DN3 and DN4 each hold 1 block (32 MB). Is this possible? How can it be accomplished?

+1 vote

To run a job we use the command
$ hadoop jar example.jar inputpath outputpath
If the job is taking too long and we want to stop it midway, which command should be used? Or is there another way to do that?

...