Reputation: 119
I have the following Sqoop script, which is supposed to import the data as Parquet files using Snappy compression.
sqoop import \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--connect '<Connection URL>' \
--query 'select * from <db_name>.<table_name> where $CONDITIONS' \
--username <username> \
--password <password> \
--split-by '<split-by-key>' \
-m 4 \
--input-null-string '' \
--input-null-non-string '' \
--inline-lob-limit 0 \
--target-dir <hdfs/location/where/files/should/land> \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--as-parquetfile \
--map-column-java NOTES_DETAIL=String,NOTES=String
Once the script finishes successfully, I go into the HDFS location (hdfs/location/where/files/should/land) and see that neither Snappy compression was applied nor is a _SUCCESS file showing up. Why is this happening?
This is what I see when I list the files in that folder:
21cbd1a6-d58b-4fdc-b332-7433e582ce0b.parquet
3956b0ff-58fd-4a87-b383-4fecc337a72a.parquet
3b42a1a9-4aa7-4668-bdd8-41624dec5ac6.parquet
As you can see, there is no .snappy in the file names, nor a _SUCCESS file.
Upvotes: 0
Views: 2678
Reputation: 329
Enable compression using the parameter below:
-z,--compress
Reference: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
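For example, a minimal sketch of the adjusted command with --compress added, keeping the placeholders from your question (untested; your remaining options would still apply):

sqoop import \
--connect '<Connection URL>' \
--query 'select * from <db_name>.<table_name> where $CONDITIONS' \
--username <username> \
--password <password> \
--split-by '<split-by-key>' \
--target-dir <hdfs/location/where/files/should/land> \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec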
Upvotes: 0
Reputation: 8796
You can't tell from the extension of a Parquet file which compression was used. In Parquet files, the data is internally compressed in chunks. With the codec selection, you specify which codec should be used for each chunk in the whole file. Still, the Parquet specification allows you to change the compression codec per data chunk, so you could mix compression codecs inside a single Parquet file. Some tools produce .snappy.parquet files to indicate the chosen compression codec, but that is only decorative, since the compression information is stored in the file's metadata.
To check whether your Parquet files have been Snappy-compressed, inspect them using parquet-tools.
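For example, assuming parquet-tools is installed and you have copied one of the files to the local filesystem (substitute one of your own file names):

parquet-tools meta 21cbd1a6-d58b-4fdc-b332-7433e582ce0b.parquet

If Snappy was applied, the codec name SNAPPY should appear next to each column in the row-group section of the output.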
Upvotes: 2