Advantage/disadvantage of dbm vs join vs HBase

I have a roughly 5 GB file where each row is a key-value pair. I would like to use it as a "hashmap" against another large set of files. From searching around, one way to do this would be to turn it into a dbm file (like DBD) and put it into the distributed cache. Another is to join the data. A third is to put it into HBase and use that for lookups.
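To make the first option concrete, here is a minimal sketch of the lookup I have in mind. It assumes Python's standard `dbm` module as a stand-in for whichever dbm library ends up being used; the function names `build_dbm` and `lookup_records` and the sample keys are hypothetical, and shipping the file through the distributed cache is not shown.

```python
import dbm
import os
import tempfile


def build_dbm(pairs, path):
    """Write (key, value) string pairs into a new dbm file at `path`."""
    with dbm.open(path, "n") as db:  # "n": always create a new, empty database
        for key, value in pairs:
            db[key.encode("utf-8")] = value.encode("utf-8")


def lookup_records(keys, path):
    """Yield (key, value) for each key, with value None when the key is absent.

    This is roughly what each map task would do against its local copy
    of the dbm file: open it read-only once, then probe it per record.
    """
    with dbm.open(path, "r") as db:
        for key in keys:
            raw = db.get(key.encode("utf-8"))
            yield key, (raw.decode("utf-8") if raw is not None else None)


if __name__ == "__main__":
    # Hypothetical sample data standing in for the 5 GB key-value file.
    workdir = tempfile.mkdtemp()
    db_path = os.path.join(workdir, "lookup")
    build_dbm([("k1", "v1"), ("k2", "v2")], db_path)
    for key, value in lookup_records(["k1", "missing", "k2"], db_path):
        print(key, value)
```

The point of the sketch is that the lookup file stays on disk and is probed per record, so the whole 5 GB never has to fit in a task's memory.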

I'm more familiar with the first approach, so it seems simpler to me. However, I have read that using the distributed cache for files larger than a few megabytes is not recommended because the file is replicated across all the data nodes. This doesn't seem that bad to me, because I pay this overhead only once at the beginning of the job, and then each node has a local copy, right? If I were to go with a join, would it not increase the workload (more entries) and create the same network congestion issue? And wouldn't going with HBase mean making it a bottleneck?
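As a rough way to size that one-time overhead, the sketch below multiplies the file size by the number of nodes that receive a copy. The node count of 20 is purely a hypothetical figure for illustration, and the function name `cache_replication_gb` is my own; it ignores compression and any reuse of an already-cached copy.

```python
def cache_replication_gb(file_gb, nodes):
    """Total data copied (in GB) when every node receives its own copy of the file."""
    return file_gb * nodes


if __name__ == "__main__":
    hypothetical_nodes = 20  # assumption: not a figure from my actual cluster
    for size_gb in (5, 40):
        total = cache_replication_gb(size_gb, hypothetical_nodes)
        print(f"{size_gb} GB file on {hypothetical_nodes} nodes: {total} GB copied")
```

Under that assumption, the 5 GB file costs 100 GB of transfer per job and a 40 GB file costs 800 GB, which is the number I would want to compare against the shuffle cost of a join.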

What are the advantages and disadvantages of each solution compared with the others? What if, for example, the "hashmap" had to come from, say, a 40 GB file? How would my options change? At which point would each option make sense?

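[code]
use strict;
use warnings;

my @fruits = qw/apple mango orange banana guava/;

# join consumes the whole remaining list, so the trailing ']' is joined
# as one more element and the output ends in an extra '[]'.
#print '[', join '][', @fruits;
#print ']';
print '[', join '][', @fruits, ']';
[/code]
[output]
[apple][mango][orange][banana][guava][]
[/output]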
