Hadoop: HTTPS communication between nodes, how can I confirm ?

716 views

We are trying to measure performance between HTTP and HTTPS version on Hadoop DFS, Mapreduce and other related modules.

As of now, we have tested using several metrics on Hadoop HTTP Mode. Similarly we are trying to test the same metrics on HTTPS Platform. Basically our test suite cluster consists of one Master Node and two Slave Nodes.

We have configured HTTPS connection and now we need to verify whether Nodes are communicating directly through HTTPS. Tried checking logs, clusters webhdfs ui, health check information, dfs admin report but of no help. Since there is only limited documentation available in HTTPS, we are unable to verify whether Nodes are communicating through HTTPS.

Hence any experts around here can shed some light on how to confirm HTTPS communication status between nodes (might be with mapreduce/DFS).

posted Feb 21, 2015 by anonymous

Share this question

Facebook Share Button

Twitter Share Button

LinkedIn Share Button

1 Answer

Be careful, HTTPS is to secure WebHDFS. If you want to protect all network streams you need more than that :

https://s3.amazonaws.com/dev.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.2/bk_reference/content/reference_chap-wire-encryption.html

If you're just interested in HTTPS an lsof -p | grep TCP will show you that DN listening on 50075 for HTTP, 50475 for HTTPS. For namenode that would be 50070 and 50470.

answer Feb 22, 2015 by Navneet

Similar Questions

+4 votes

Hadoop / HBase hotspotting / overloading specific nodes

I am having a problem with Hadoop maxing out drive space on a select few nodes when I am running an HBase job. The scenario is this:

The job is a data import using Map/Reduce / HBase
The data is being imported to one table
The table only has a couple of regions
As the job runs, HBase? / Hadoop? begins placing the data in HDFS on the datanode / regionserver that is hosting the regions
As the job progresses (and more data is imported) the two datanodes hosting the regions start to get full and eventually drive space hits 100% utilization whilst the other nodes in the cluster are at 40% or less drive space utilization
The job in Hadoop then begins to hang with multiple "out of space" errors and eventually fails.

I have tried running hadoop balancer during the job run and this helped but only really succeeded in prolonging the eventual job failure.

How can I get Hadoop / HBase to distribute the data to HDFS more evenly when it is favoring the nodes that the regions are on?

Am I missing something here?

+2 votes

How do I customize data placement on DataNodes (DN) of Hadoop cluster?

Let we change the default block size to 32 MB and replication factor to 1. Let Hadoop cluster consists of 4 DNs. Let input data size is 192 MB. Now I want to place data on DNs as following. DN1 and DN2 contain 2 blocks (32+32 = 64 MB) each and DN3 and DN4 contain 1 block (32 MB) each. Can it be possible? How to accomplish it?

0 votes

Can anyone provide me steps to grant a particular user access equivalnent to that of "hdfs" user in hadoop.

The reason behind this is I want to have my custom user who can create anything on the entire hdfs file system (/).
I tried couple of links however, none of them were useful. Is there any way by adding/modifying some property tags I can do that ?

+3 votes

Can we control data distribution and load balancing in Hadoop Cluster?

As I studied that data distribution, load balancing, fault tolerance are implicit in Hadoop. But I need to customize it, can we do that?

+1 vote

Can we run mapreduce job from eclipse IDE on fully distributed mode hadoop cluster?

A mapreduce job can be run as jar file from terminal or directly from eclipse IDE. When a job run as jar file from terminal it uses multiple jvm and all resources of cluster. Does the same thing happen when we run from IDE. I have run a job on both and it takes less time on IDE than jar file on terminal.

...