Reputation: 1
Hello Stack Overflow community,
I am having some issues reading Parquet files. The problems start after I upload the Parquet file to Azure Data Lake Gen 2 using Python.
I am using the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
Besides the authentication, I am using this part:
def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")
        file_client = directory_client.create_file("uploaded-file.txt")

        # Read the local file, opened in text mode ('r') as in the documentation's text-file example
        local_file = open("C:\\file-to-upload.txt", 'r')
        file_contents = local_file.read()

        # Append the contents and flush to commit the upload
        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
        file_client.flush_data(len(file_contents))
    except Exception as e:
        print(e)
When I use the code to upload a small CSV file, it works fine: the CSV file is uploaded, and when I download it I can open it without any problems.
If I convert the same data frame to a small Parquet file and upload it, the upload also succeeds. But when I download the file and try to open it, I get this error message:
ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
If I read the Parquet file directly without uploading it, it works fine.
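For context, here is a minimal sketch of how I create and read the Parquet file locally (the data and file name are just placeholders):

    import pandas as pd

    df = pd.DataFrame({'one': [-1, 2.5], 'two': ['foo', 'bar']})

    # Writing and reading back locally works without errors
    df.to_parquet("C:\\file-to-upload.parquet")
    df_check = pd.read_parquet("C:\\file-to-upload.parquet")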
Does anyone have a suggestion for how I need to modify the code so that I don't corrupt my Parquet file?
Thanks!
Upvotes: 0
Views: 2459
Reputation: 108
I just resolved this error in my project today.
I am using pyarrow.parquet.write_table to write my Parquet file. I was passing a native Python file object to the where parameter, which somehow caused the footer to never get written.
When I switched to using PyArrow output streams instead of native Python file objects, the footer got written correctly on stream close, which resolved this issue for me.
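A minimal sketch of the change (the table and file name here are just examples):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({'one': [-1.0, 2.5], 'two': ['foo', 'bar']})

    # Before: passing a native Python file object as `where` left the footer unwritten for me
    # with open('data.parquet', 'wb') as f:
    #     pq.write_table(table, where=f)

    # After: a PyArrow output stream; the footer is written when the stream closes
    with pa.OSFile('data.parquet', 'wb') as sink:
        pq.write_table(table, where=sink)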
Upvotes: 0
Reputation: 8660
I'm not sure what's wrong with your code (it seems incomplete), but you can try this code; it works on my side:
import numpy as np
import pandas as pd

try:
    file_system_client = service_client.get_file_system_client(file_system="my-file-system")
    directory_client = file_system_client.get_directory_client("my-directory")
    file_client = directory_client.create_file("data.parquet")

    # to_parquet() with no path returns the whole Parquet file as bytes,
    # so nothing is corrupted by reading it in text mode
    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]},
                      index=list('abc')).to_parquet()

    file_client.append_data(data=df, offset=0, length=len(df))
    file_client.flush_data(len(df))
except Exception as e:
    print(e)
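If you want to verify the upload stayed intact, a hypothetical read-back check using the same file_client could look like this (downloading the bytes and parsing them back into a DataFrame):

    import io
    import pandas as pd

    # Download the uploaded bytes and read them back as Parquet
    downloaded = file_client.download_file().readall()
    df_check = pd.read_parquet(io.BytesIO(downloaded))
    print(df_check)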
Upvotes: 1