How to keep data consistency between Machines using Hadoop?

+2 votes
288 views

My Configuration:
Hadoop 2.2.0, two computers; one is the master, the other is node1.

I want to know about the following scenario: node1 is down for some reason, but I don't know that node1 can't work, and then I use a hadoop command to put a file, such as:
$ hadoop fs -put graph.txt graphin/graph.txt

I know the graph.txt file will be put onto the master machine, but the node1 computer won't contain this file. After some time node1 can be repaired, and there will be an inconsistency because of the graph.txt file. How do I achieve consistency between the master machine and the node1 machine?

posted Feb 19, 2014 by Abhay Kulkarni


2 Answers

+1 vote

You can't achieve data consistency with your cluster configuration. To do this you need at least 3 datanodes and replication enabled with a factor of 3 (the dfs.replication property in hdfs-site.xml).
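
For illustration, a minimal sketch of the two usual ways to get there, assuming a standard tarball layout and the question's file path (both are assumptions, not stated in the answer):

$ grep -A 1 dfs.replication $HADOOP_HOME/etc/hadoop/hdfs-site.xml   # check the configured default replication factor
$ hadoop fs -setrep -w 3 graphin/graph.txt                          # or raise replication of one existing file and wait until done (needs >= 3 live datanodes)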

answer Feb 19, 2014 by anonymous
0 votes

It does not seem like your hadoop -put ... command will even complete - the master is not receiving the file at any point. It instructs node1 to connect to the client, after asking node1 whether it is in a state where it can receive data to be written, which depends on several other daemons being available and on several successful internal RPC calls. A file does not get written to the master for storage - the only thing the master keeps is a transaction log recording that client A sent a request to -put the file graph.txt into the HDFS storage location hdfs://user/$submitterusername/graphin/ and whether that request succeeded. There is another set of processes that makes node1 periodically tell the master what files, or pieces of files, it has, and that record is stored in another file, but nothing goes to the master for storage.

So if node1 is down, you will know when you try to -put something, since it is your only data node - if it is not able to receive anything for storage, the master knows there is no node alive to put anything on, and tells you "I can't put that in HDFS because there are no storage nodes that are alive" - but in Java exceptions instead of plain language.
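
Both points can be checked from the command line; a sketch, where the path is a guess based on the question and the exact output varies by version:

$ hdfs dfsadmin -report                                                          # which datanodes the master (namenode) currently considers live
$ hdfs fsck /user/<your-username>/graphin/graph.txt -files -blocks -locations    # which datanodes actually hold the file's blocks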

answer Feb 19, 2014 by Seema Siddique
Similar Questions
+2 votes

Suppose we change the default block size to 32 MB and the replication factor to 1, and the Hadoop cluster consists of 4 DNs. The input data size is 192 MB. Now I want to place the data on the DNs as follows: DN1 and DN2 contain 2 blocks (32+32 = 64 MB) each, and DN3 and DN4 contain 1 block (32 MB) each. Can this be done? How would I accomplish it?
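
For reference, both settings can be passed per upload from the client; a sketch with placeholder file names (note that which DN receives which block is still decided by the HDFS placement policy, not by these options):

$ hadoop fs -D dfs.blocksize=33554432 -D dfs.replication=1 -put input.dat /data/input.dat   # 32 MB blocks, single replica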

+3 votes

From the documentation + code: "when kerberos is enabled, all tasks are run as the end user (e.g. as user "joe" and not as hadoop user "mapred") using the task-controller (which is setuid root and, when it runs, does a setuid/setgid etc. to Joe and his groups). For this to work, the Linux account for user "joe" has to be present on all nodes of the cluster."

In an environment with a large and dynamic user population, it is not practical to add every end user to every node of the cluster (and drop users when they are deactivated, etc.).

What are the other options to get this working? I am assuming that if the users are in LDAP, using PAM for LDAP can solve the issue. Any other suggestions?
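
One quick sanity check, assuming accounts come from LDAP via NSS/PAM (e.g. sssd or nslcd - an assumption, not something stated above): every worker node should resolve the user without a local /etc/passwd entry.

$ getent passwd joe   # should return the LDAP account on each node
$ id joe              # should also list the user's groups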

+2 votes

I am running a hadoop-2.4.0 cluster. Each datanode has 10 disks, and the directories for the 10 disks are specified in dfs.datanode.data.dir.

A few days ago, I modified dfs.datanode.data.dir of a datanode () to reduce its disks, so two disks were excluded from dfs.datanode.data.dir. After the datanode was restarted, I expected that the namenode would update the block locations. In other words, I thought the namenode should remove it from the block locations associated with the blocks which were stored on the excluded disks, but the namenode didn't update the block locations...

In my understanding, a datanode sends a block report to the namenode when it starts, so the namenode should update the block locations immediately.

Is this a bug? Could anyone please explain?
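
A sketch of how to see what the namenode currently believes about those replicas (the file path is a placeholder):

$ hdfs fsck /path/to/some/file -files -blocks -locations   # lists the datanodes reported as holding each block
$ hdfs dfsadmin -report                                     # per-datanode capacity and block counts as the namenode sees them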

+3 votes

As I have studied, data distribution, load balancing, and fault tolerance are implicit in Hadoop. But I need to customize them - can we do that?
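
As a sketch, two of the standard command-line knobs (paths are placeholders; deeper customization, such as a custom block placement policy, needs configuration or code changes not shown here):

$ hadoop fs -setrep 2 /data/myfile   # change the replication (fault-tolerance) level of a single file
$ hdfs balancer -threshold 5         # redistribute blocks so datanode usage stays within 5% of the cluster average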

...