
Hadoop: What's the best way to check the compression codec that an HDFS file was written with?

+2 votes

We use both Gzip and Snappy compression, so I want a way to determine how a specific file is compressed. The closest I found is CompressionCodecFactory.getCodec(), but that relies on the file name suffix, which doesn't exist since reducers typically don't add a suffix to the filenames they create.
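For illustration, here is a minimal sketch of that suffix-based lookup and where it falls short (the paths are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SuffixCodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Works: the ".gz" suffix maps to GzipCodec
        CompressionCodec byName = factory.getCodec(new Path("/data/logs.gz"));
        System.out.println(byName); // GzipCodec instance

        // Fails: typical reducer output has no suffix, so getCodec() returns null
        CompressionCodec noSuffix = factory.getCodec(new Path("/data/part-r-00000"));
        System.out.println(noSuffix); // null
    }
}
```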

posted Dec 4, 2013 by Luv Kumar


1 Answer

+1 vote

If you're looking for header/contents-based inspection, you could download the file and run the Linux utility 'file' on it, which should tell you the format.

I don't know about Snappy, but Gzip files can be identified simply by the magic sequence in their header bytes.
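If you'd rather not pull the file down first, the same check can be done directly against HDFS. A minimal sketch (gzip streams always begin with the magic bytes 0x1f 0x8b; as noted, this says nothing about Snappy):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipMagicCheck {
    // Returns true if the first two bytes match gzip's magic sequence 0x1f 0x8b.
    public static boolean looksLikeGzip(FileSystem fs, Path file) throws IOException {
        try (FSDataInputStream in = fs.open(file)) {
            int b1 = in.read();
            int b2 = in.read();
            return b1 == 0x1f && b2 == 0x8b;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: HDFS path of the file to inspect
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);
        System.out.println(file + (looksLikeGzip(fs, file) ? ": looks like gzip" : ": not gzip"));
    }
}
```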

If it's sequence files you are looking to analyse, a simple way is to read the first few hundred bytes, which should have the codec class name in them. Programmatically, you can use the SequenceFile.Reader API for sequence files.
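A sketch of that programmatic route, reading the codec recorded in a sequence file's header via SequenceFile.Reader (treating "null means uncompressed" as my reading of the API, so an assumption):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;

public class SeqFileCodec {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // args[0]: path to the sequence file to inspect
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path(args[0])))) {
            // The codec is recorded in the file header; null suggests no compression.
            CompressionCodec codec = reader.getCompressionCodec();
            System.out.println(codec == null
                ? "uncompressed"
                : codec.getClass().getName());
        }
    }
}
```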

answer Dec 5, 2013 by Majula Joshi
Similar Questions
+3 votes

I am trying to access a Hadoop 1 installation via the Hadoop 2.2.0 command-line tools. Is this possible at all?

From hadoop 1 I get:

$ hadoop fs -ls hdfs://
Found 2 items
drwxr-xr-x - cs supergroup 0 2014-02-01 08:18 /tmp
drwxr-xr-x - cs supergroup 0 2014-02-01 08:19 /user

From hadoop 2.2.0 I get:

$ hadoop fs -ls hdfs://
ls: Failed on local exception:; Host Details : 
local host is: "i7/"; destination host is: "localhost":9000;

I have tried to find this information via a web search, but so far without success.

0 votes

The reason behind this is that I want a custom user who can create anything on the entire HDFS file system (/).
I tried a couple of links; however, none of them were useful. Is there any way I can do this by adding/modifying some property tags?
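One property-based sketch that may be what's being asked for (an assumption on my part, not a verified fix): HDFS treats members of the group named by dfs.permissions.superusergroup as superusers, so pointing that property in hdfs-site.xml at a group the custom user belongs to would give that user rights over the whole tree:

```xml
<!-- hdfs-site.xml (sketch): members of this group act as HDFS superusers -->
<property>
  <name>dfs.permissions.superusergroup</name>
  <!-- "hdfsadmins" is a hypothetical group containing the custom user -->
  <value>hdfsadmins</value>
</property>
```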

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this information, perhaps through some configuration? If it does, what should I configure?
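One avenue that might help (an assumption about the setup, not a confirmed audit feature): for MapReduce jobs that use FileInputFormat, the input paths end up in the submitted job configuration under the key mapreduce.input.fileinputformat.inputdir, so a tool with access to a job's job.xml (for example, as preserved by the JobHistory server) can recover them. A minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class InputDirFromJobConf {
    public static void main(String[] args) {
        // args[0]: path to a finished job's job.xml (hypothetical usage)
        Configuration jobConf = new Configuration(false); // skip cluster defaults
        jobConf.addResource(new Path(args[0]));

        // FileInputFormat records the comma-separated input paths under this key.
        String inputDirs = jobConf.get("mapreduce.input.fileinputformat.inputdir");
        System.out.println("Input paths: " + inputDirs);
    }
}
```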
