Execute hadoop job remotely and programmatically

My project needs to execute a Hadoop job remotely, and the job requires some third-party libraries (jar files). I tried the following:
1. Copy these jar files to HDFS.
2. Add them to the distributed cache using DistributedCache.addFileToClassPath so that Hadoop can distribute these jar files to each of the slave nodes.

However, my program still throws ClassNotFoundException, indicating that some of the classes cannot be found when the job is running.

So I am looking for answers to the following:
1. What is the correct way to run a job remotely and programmatically when the job requires some third-party jar files?
2. I found that DistributedCache is deprecated (I'm using Hadoop 1.2.0). What is the alternative class?

1 Answer

Please have a look at the -libjars option of the hadoop command. It tells the system which additional libraries have to be sent to the cluster before the job can start. This distribution happens again each time you submit the job, so it is not a good idea for really large libraries. Those you should deploy on all nodes, and then configure the classpath of the JVMs running the tasks.
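As a rough illustration of how such a submission could be assembled from a program, here is a minimal sketch in plain Java. The class name LibJarsCommand, the jar names, the main class and the paths are all made up for the example. It relies on two things about -libjars: it is a generic option that goes after the main class and before the job's own arguments, and it takes a single comma-separated list of jars. Note that the option is only honored when the job's driver parses generic options (for example by running through ToolRunner).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hadoop): builds the argument list for
// a "hadoop jar ... -libjars ..." invocation.
class LibJarsCommand {

    // -libjars is a generic option: it is placed after the main class and
    // before the job's own arguments, as one comma-separated list of jars.
    static List<String> build(String jobJar, String mainClass,
                              List<String> libJars, List<String> jobArgs) {
        List<String> cmd = new ArrayList<>(
                Arrays.asList("hadoop", "jar", jobJar, mainClass));
        if (!libJars.isEmpty()) {
            cmd.add("-libjars");
            cmd.add(String.join(",", libJars));
        }
        cmd.addAll(jobArgs);
        return cmd;
    }

    public static void main(String[] args) {
        // All names below are placeholders for illustration only.
        List<String> cmd = build(
                "my-job.jar",
                "com.example.MyJob",
                Arrays.asList("lib/a.jar", "lib/b.jar"),
                Arrays.asList("/input", "/output"));
        // Prints: hadoop jar my-job.jar com.example.MyJob -libjars lib/a.jar,lib/b.jar /input /output
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder on a machine where the hadoop command is installed and configured for the remote cluster.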

answer Dec 10, 2013 by Sonu Jindal

Similar Questions

+1 vote

How to execute command on remote windows machine using python

I have a requirement where I need to kill a process on a remote Windows machine. The following command works fine if I have to kill a process on the local machine:

os.system('taskkill /f /im processName.exe')

However, I am not able to figure out how to execute this command on a remote Windows machine. Is there any way I can execute a command from one Windows machine on a remote Windows machine?
Note: my local machine (the one from which I have to execute the command) is also Windows.

0 votes

How to write a Job for importing Files from an external Rest API into Hadoop

I want to ask what the best way is to implement a job that imports files into HDFS.

I have an external system offering data through a REST API. My goal is to have a job running in Hadoop that periodically (maybe started by cron?) checks the REST API for new data.

It would also be nice if this job could run on multiple data nodes. But unlike all the MapReduce examples I found, my job looks for new or changed data from an external interface and compares it with the existing data.

This is a conceptual example of the job:

1. The job asks the REST API whether there are new files.
2. If so, the job fetches the first file in the list.
3. It checks whether the file already exists.
4. If not, the job imports the file.
5. If yes, the job compares the data with the data already stored.
6. If the data has changed, the job updates the file.
7. If more files exist, the job continues with step 2.
8. Otherwise it ends.

Can anybody give me a little help on how to start (it's the first job I have written)?
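The steps listed in this question could be sketched roughly as follows. This is only an illustration of the compare-and-update loop in plain Java: the class ImportSketch is invented for the example, and ordinary maps stand in for both the REST API listing and the files stored in HDFS, so no real Hadoop or HTTP calls are made.

```java
import java.util.Arrays;
import java.util.Map;

// Sketch only: maps of file name -> content stand in for the REST API
// listing (remoteFiles) and for the files already stored (store).
class ImportSketch {

    // Walks through every remote file, imports it if it is missing,
    // updates it if its content differs, and returns how many files
    // were written in total.
    static int importNewFiles(Map<String, byte[]> remoteFiles,
                              Map<String, byte[]> store) {
        int written = 0;
        for (Map.Entry<String, byte[]> remote : remoteFiles.entrySet()) {
            byte[] existing = store.get(remote.getKey());
            boolean missing = existing == null;
            boolean changed = !missing
                    && !Arrays.equals(existing, remote.getValue());
            if (missing || changed) {
                store.put(remote.getKey(), remote.getValue());
                written++;
            }
        }
        return written;
    }
}
```

In a real job the two maps would be replaced by calls to the external API and to the HDFS client, which this sketch deliberately leaves out.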

+1 vote

Whenever a client submits a hadoop job, who receives it?

+1 vote

What is Job Tracker role in Hadoop?

+1 vote

How to write a custom partitioner for a Hadoop MapReduce job?
