Hadoop: What's the best way to check the compression codec that an HDFS file was written with?

+2 votes

We use both Gzip and Snappy compression, so I want a way to determine how a specific file is compressed. The closest I found is the GETCODEC, but that relies on the file name suffix, which doesn't exist since reducers typically don't add a suffix to the filenames they create.

posted Dec 4, 2013 by Luv Kumar

1 Answer

+1 vote

If you're looking for inspection based on the file header/contents, you could download the file and run the Linux utility 'file' on it, and it should tell you the format.

I don't know about Snappy, but Gzip files can be identified simply by checking their header bytes for the magic sequence (0x1f 0x8b).
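As a rough illustration of the header-byte check, here is a minimal Python sketch. It assumes the file (or at least its first two bytes) has already been copied out of HDFS to a local path; the function names are made up for this example and are not part of any Hadoop API. It only recognises Gzip and says nothing about Snappy.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(header: bytes) -> bool:
    """Return True if the given leading bytes start with the gzip magic."""
    return header[:2] == GZIP_MAGIC


def file_looks_like_gzip(path: str) -> bool:
    """Hypothetical helper: check a local copy of the file by its first two bytes."""
    with open(path, "rb") as f:
        return looks_like_gzip(f.read(2))
```

Only the first two bytes are needed, so something like `hdfs dfs -cat <path> | head -c 2` should be enough to fetch them without downloading the whole file.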

If it's sequence files you are looking to analyse, a simple way is to read the first few hundred bytes of the file, which should have the codec string in it. Programmatically you can use for sequence files.
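The "read the first few hundred bytes" idea could look roughly like the Python sketch below. It assumes a sequence file starts with the bytes `SEQ` and that, for a compressed file, the codec class name (for example `org.apache.hadoop.io.compress.SnappyCodec`) appears as plain text in the header. It does not parse the header properly; it just searches for something that looks like a class name ending in `Codec`, so treat the result as a hint. The function name and the regular expression are made up for this example.

```python
import re
from typing import Optional

SEQ_MAGIC = b"SEQ"  # sequence files begin with these three bytes

# Heuristic (an assumption of this sketch): a dotted Java class name ending in "Codec".
CODEC_PATTERN = re.compile(rb"(?:[A-Za-z_][A-Za-z0-9_]*\.)+[A-Za-z0-9_$]*Codec")


def sequence_file_codec(header: bytes) -> Optional[str]:
    """Return the codec class name found in the header bytes, or None.

    None means either "not a sequence file" or "no codec string found",
    which is what an uncompressed sequence file would give.
    """
    if not header.startswith(SEQ_MAGIC):
        return None
    match = CODEC_PATTERN.search(header)
    return match.group(0).decode("ascii") if match else None
```

To use it, read the first few hundred bytes of a local copy, e.g. `sequence_file_codec(open(path, "rb").read(512))`.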

answer Dec 5, 2013 by Majula Joshi
