Estimating the run time of a Hadoop job?

+2 votes
608 views

Currently I'm developing an application which would ingest logs on the order of 70-80 GB of data/day and would then do some analysis on them.

The infrastructure I have is a 4-node cluster (all nodes are virtual machines), and each node has 4 GB of RAM.

But when I run a dataset of about 30 GB (a sample dataset at this point), it takes about 3 hours to process all of it.

I would like to know: is it normal for this kind of infrastructure to take this amount of time?

posted Dec 17, 2013 by Sonu Jindal


2 Answers

+1 vote

It depends on: how many cores each VM node has, and how complicated your analysis application is. But I don't think it's normal to spend 3 hours processing 30 GB of data, even on your *not good* hardware.
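A rough sanity check, assuming the 30 GB is read once end to end: 30 GB in 3 hours is about 30,720 MB / 10,800 s, or roughly 2.8 MB/s of aggregate throughput. A single commodity disk sustains on the order of 100 MB/s for sequential reads, so the job is running far below the hardware's raw I/O capacity; the bottleneck is more likely CPU or VM contention, or job configuration, than data volume.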

answer Dec 17, 2013 by Ahmed Patel
+1 vote

One of the problems you run into with Hadoop in virtual machine environments is performance loss when all the VMs run on the same physical host. Even though you give each VM 4 GB of RAM and a virtual CPU and disk, if the virtual machines share physical components such as the processor and the physical storage medium, they compete for those resources at the physical level. Even if each VM sits on its own host, or on a multi-core host with multiple disks where they share as few resources as possible, there is still a performance hit whenever VM operations have to pass through the hypervisor layer: co-scheduling resources with the host, and things like that.

answer Dec 17, 2013 by anonymous
Similar Questions
+1 vote

Are there any benchmarks or 'performance heuristics' for Hadoop? Is it possible to say something like 'You can process X lines of gzipped log file on a medium AWS server in Y minutes'? I would like to get an idea of what kind of workflow is possible.
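For reference, the Hadoop distribution ships its own standard benchmarks, such as TeraGen/TeraSort in the examples jar and TestDFSIO in the test jar; running those on your own cluster is the usual way to get baseline numbers, since published figures vary widely with hardware and workload.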

+2 votes

The setup consists of Hadoop 1.0.1 and HBase 0.94.x. Loading raw data into HDFS and then into HBase consumes a good amount of time for 10 TB of raw data (using the Hadoop shell's copyFromLocal and a Pig script to load HBase).

  1. My first question: will moving to Hadoop 2.x give better performance? If yes, please provide relevant links or docs which explain how that is achieved.

  2. My second question: I do not need my data sorted while loading into HBase, so what are the ways I can disable the sort at the Mapper and at the Reducer? (See the sketch below.)

Any suggestions?
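On the second question, one minimal sketch, assuming the load needs no aggregation across rows: make the job map-only. With zero reducers there is no shuffle/sort phase at all, so nothing gets sorted on either the map or the reduce side. Class names here are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyLoad {                            // hypothetical driver
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "map-only-load");     // Hadoop 1.x constructor
            job.setJarByClass(MapOnlyLoad.class);
            job.setMapperClass(LoadMapper.class);         // placeholder for your load logic
            job.setNumReduceTasks(0);                     // zero reducers: no shuffle, no sort
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that this only applies when writing to HBase through the normal client path (e.g. TableOutputFormat); bulk loading via HFileOutputFormat inherently requires totally ordered keys, so the sort cannot be skipped there.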

+1 vote

After upgrading to Hadoop 2 (YARN), I found that mapred.jobtracker.taskScheduler.maxRunningTasksPerJob no longer works, right?

One workaround is to use a queue to limit it, but it's not easy to control that from the job submitter.

Is there any way to limit the concurrently running mappers per job? Any documents or pointers?
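For what it's worth, a per-job limit was added later in the 2.x line (Hadoop 2.7, MAPREDUCE-5583). A minimal sketch, assuming a release that has it:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Inside the driver, before submitting the job.
    Configuration conf = new Configuration();
    // Added in Hadoop 2.7 (MAPREDUCE-5583); 0 means unlimited.
    conf.setInt("mapreduce.job.running.map.limit", 10);    // at most 10 mappers at once
    conf.setInt("mapreduce.job.running.reduce.limit", 2);  // same idea for reducers
    Job job = Job.getInstance(conf, "throttled-job");

On releases older than 2.7, the queue-based workaround described above remains the main option.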

+1 vote

Assume I have a machine on the same network as a Hadoop 2 cluster but separate from it.

My understanding is that by setting certain elements of the config file or local xml files to point to the cluster, I can launch a job without having to log into the cluster, move my jar to HDFS, and start the job from the cluster's Hadoop machine.

Does this work? What parameters do I need to set? Where does the jar file live? What issues would I see if the machine is running Windows with Cygwin installed?
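In general this does work: remote submission only needs the client-side configuration to point at the cluster. A minimal sketch with placeholder host names, assuming Hadoop 2 with YARN:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Inside the client-side driver:
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");     // placeholder host
    conf.set("mapreduce.framework.name", "yarn");
    conf.set("yarn.resourcemanager.address", "rm.example.com:8032");  // placeholder host
    // From a Windows client this is reported to help with path and classpath
    // separators (assumption: Hadoop 2.4+):
    conf.setBoolean("mapreduce.app-submission.cross-platform", true);
    Job job = Job.getInstance(conf, "remote-submit");
    job.setJar("/local/path/to/myjob.jar");  // local jar; submission ships it to the cluster
    // ... set mapper/reducer/input/output as usual ...
    job.waitForCompletion(true);

The jar stays on the client machine; job submission copies it into the job's staging directory on HDFS for you.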

+2 votes

Does anyone know how to 'capture' the exception which actually failed a job running in the Mapper or Reducer at runtime? It seems Hadoop is designed to be fault tolerant: failed tasks are automatically rerun a certain number of times, and the real problem is not exposed unless you look into the error log.

In my use case, I would like to capture the exception and respond differently based on its type.
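A common pattern, sketched under the assumption that you control the Mapper code: catch inside map(), record the exception type in a counter, and let the driver inspect the counters after the job finishes (or rethrow when the record really should fail the task):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SafeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            try {
                // Example work: parse the record (stands in for your real logic).
                long parsed = Long.parseLong(value.toString().trim());
                ctx.write(new Text("parsed"), new LongWritable(parsed));
            } catch (RuntimeException e) {
                // Record which exception type occurred; the driver can read these
                // counters after completion and react differently per type.
                ctx.getCounter("MapExceptions", e.getClass().getSimpleName()).increment(1);
                // Or rethrow if this record should fail the task and trigger retries:
                // throw e;
            }
        }
    }

From the driver, job.getCounters().getGroup("MapExceptions") then lists each exception class seen and how often, which you can branch on after completion.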

...