How to read multiple partitioned .gzip files into a Spark Dataframe?

Question

I have the following folder of partitioned data:

my_folder
 |--part-0000.gzip
 |--part-0001.gzip
 |--part-0002.gzip
 |--part-0003.gzip

I tried to read this data into a DataFrame using:

>>> my_df = spark.read.csv("/path/to/my_folder/*")
>>> my_df.show(5)
+--------------------+
|                 _c0|
+--------------------+
|��[I���...|
|��RUu�[*Ք��g��T...|
|�t���  �qd��8~��...|
|�(���b4�:������I�...|
|���!y�)�PC��ќ\�...|
+--------------------+
only showing top 5 rows

I also tried this to inspect the data:

>>> rdd = sc.textFile("/path/to/my_folder/*")
>>> rdd.take(4)
['\x1f�\x08\x00\x00\x00\x00\x00\x00\x00�͎\ǖ�7�~�\x04�\x16��\'��"b�\x04�AR_



NOTE: Running zcat part-0000.gzip | head -1 shows that the data is tab-separated, plain, readable English text.

How do I read these files properly into a dataframe?

kev · Accepted Answer

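Spark does not recognize the .gzip file extension. It picks the decompression codec from the file name, and the gzip codec is registered for .gz, so the .gzip files were being read as raw compressed bytes. So I had to change the file extensions before reading the partitioned data: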

import os

# go to my_folder
os.chdir("/path/to/my_folder")

# renaming all `.gzip` extensions to `.gz` within my_folder
# (this is the Perl-style `rename`; other `rename` variants use a different syntax)
cmd = r"rename 's/\.gzip$/.gz/' *.gzip"
result_code = os.system(cmd)

if result_code == 0:
    print("Successfully renamed the file extensions!")

    # finally reading the data into a dataframe
    my_df = spark.read.csv("/path/to/my_folder/*", sep="\t")
else:
    print("Could not rename the file extensions!")
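If the rename command is not available on your system, the rename step above can also be done in pure Python. This is only a minimal sketch: the helper name rename_gzip_to_gz is made up for illustration, and it assumes the files sit on a local filesystem (not HDFS or an object store, which need their own tools).

```python
from pathlib import Path


def rename_gzip_to_gz(folder):
    """Rename every *.gzip file in `folder` to *.gz and return the new paths.

    Hypothetical helper; assumes `folder` is a local directory.
    """
    renamed = []
    for src in sorted(Path(folder).glob("*.gzip")):
        # Only the final suffix changes: part-0000.gzip -> part-0000.gz
        dst = src.with_suffix(".gz")
        if dst.exists():
            # Do not silently overwrite a file that is already there
            raise FileExistsError(f"refusing to overwrite {dst}")
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

After calling rename_gzip_to_gz("/path/to/my_folder"), the same spark.read.csv("/path/to/my_folder/*", sep="\t") call should pick up the .gz files and decompress them.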
