Advantage/disadvantage of dbm vs join vs HBase

+1 vote
577 views

I have a roughly 5 GB file where each row is a key, value pair. I would like to use this as a "hashmap" against another large set of files. From searching around, one way to do it would be to turn it into a dbm like DBD and put it into the distributed cache. Another is to join the data. A third is to put it into HBase and use that for lookups.
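
To make the first two approaches concrete, here is a minimal, framework-free Python sketch of what each one does to the data. It is not Hadoop code: the function names `map_side_lookup` and `reduce_side_join` and the sample rows are made up for illustration. The first function mimics the dbm / distributed cache approach, where every worker has the whole table available locally. The second mimics a join, where both inputs are tagged, sorted by key (standing in for the shuffle) and merged key by key.

```python
from itertools import groupby
from operator import itemgetter


def map_side_lookup(records, lookup):
    """Cache-style lookup: the whole table is available locally to the worker."""
    for key, payload in records:
        if key in lookup:
            yield key, payload, lookup[key]


def reduce_side_join(records, lookup_rows):
    """Join-style lookup: tag both inputs, sort by key, then merge per key."""
    # Tag 0 marks lookup rows and tag 1 marks data records, so that within
    # each key the lookup row sorts ahead of the records that need it.
    tagged = [(k, 0, v) for k, v in lookup_rows]
    tagged += [(k, 1, p) for k, p in records]
    tagged.sort(key=itemgetter(0, 1))  # stands in for the shuffle/sort phase

    for key, group in groupby(tagged, key=itemgetter(0)):
        found = False
        value = None
        for _, tag, item in group:
            if tag == 0:
                found, value = True, item
            elif found:
                yield key, item, value


# Hypothetical sample data, for illustration only.
records = [("b", "r1"), ("a", "r2"), ("c", "r3"), ("a", "r4")]
lookup = {"a": 1, "b": 2}

cached = list(map_side_lookup(records, lookup))
joined = list(reduce_side_join(records, lookup.items()))
```

Both functions produce the same matched rows. The difference is where the work goes: the first needs the full table on every worker but never reorders the big input, while the second never holds the full table in one place but has to sort and move both inputs.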

I'm more familiar with the first approach, so it seems simpler to me. However, I have read that using the distributed cache for files beyond a few megabytes is not recommended because the file is replicated across all the data nodes. This doesn't seem that bad to me because I just pay this overhead once at the beginning of the job, and then each node gets a copy locally, right? If I were to go with a join, would it not increase the workload (more entries) and create the same network congestion issue? And wouldn't going with HBase mean making it a bottleneck?

What are the advantages and disadvantages of going for one solution over the others? What if, for example, the "hashmap" needed to come from, say, a 40 GB file? How would my options change? At what point would each option make sense?
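
One rough way to think about the 5 GB versus 40 GB question is to compare how many bytes each approach has to move. The sketch below is a crude back-of-envelope model, not a measurement: it assumes the cached file is copied once to every node, and that a join shuffles both inputs across the network exactly once. The function names, the 20-node cluster and the 500 GB size of the other data set are all hypothetical, and the model ignores compression, data locality and HBase entirely.

```python
GB = 1024 ** 3


def cache_transfer_bytes(lookup_bytes, num_nodes):
    """Bytes copied if the lookup file is shipped once to every node."""
    return lookup_bytes * num_nodes


def join_shuffle_bytes(lookup_bytes, data_bytes):
    """Upper bound on bytes shuffled if both inputs are joined by key."""
    return lookup_bytes + data_bytes


def cheaper_option(lookup_bytes, data_bytes, num_nodes):
    """Return which approach moves fewer bytes under this crude model."""
    cache = cache_transfer_bytes(lookup_bytes, num_nodes)
    join = join_shuffle_bytes(lookup_bytes, data_bytes)
    return "cache" if cache < join else "join"


# Hypothetical cluster and data sizes, for illustration only.
NODES = 20
DATA = 500 * GB

small = cheaper_option(5 * GB, DATA, NODES)
large = cheaper_option(40 * GB, DATA, NODES)
```

With these made-up numbers, the 5 GB table costs about 100 GB of copying against roughly 505 GB of shuffle, so the cache looks cheaper. The 40 GB table costs about 800 GB of copying against roughly 540 GB of shuffle, so the join looks cheaper. The crossover moves with the node count and with the size of the other data set, and the model says nothing about whether each node has the disk space for a local copy.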

posted Jun 7, 2015 by anonymous

Do you have HBase running in your cluster?
I ask because bringing in HBase as a new component of your deployment incurs operational overhead that you may not be familiar with.

Nope. I have never used HBase before. I'm also new to Hadoop in general. I'll be running the MapReduce job on EMR.

Disregarding what I'm familiar with, I'd also like to know when one would do one thing vs. another. Maybe it's something we can only tell by experimenting, but it sounds like a problem others have run into before.

Similar Questions
+1 vote

I am having some difficulty using join to display the array elements. Here is the code snippet:

[code]
use strict;
use warnings;
my @fruits = qw/apple mango orange banana guava/;
#print '[', join '][', @fruits;
#print ']';
print '[', join '][', @fruits, ']';
[/code]

[output]
      [apple][mango][orange][banana][guava][]
[/output]

How can I eliminate the trailing empty square brackets [] using a single print statement? I used two print statements, shown commented out in the snippet above. Any help is greatly appreciated.

0 votes

Having looked at "man join", I wasn't sure of its use here.

Unknown number of files, constant is extension .list
(For testing purposes only using two)

cat *.list >> output.joined | sort -u

How can I test whether output.joined is indeed the two lists combined, with dupes removed?

...