JD D

Reputation: 8127

Improve performance of rewriting timestamps in parquet files

Due to some limitations of the consumer of my data, I need to "rewrite" some parquet files to convert timestamps that are in nanosecond precision to timestamps that are in millisecond precision.

I have implemented this and it works but I am not completely satisfied with it.

import pandas as pd

# Read the parquet file from S3
df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

# Downcast every nanosecond-precision timestamp column to millisecond precision
for col_name in df.columns:
    if df[col_name].dtype == 'datetime64[ns]':
        df[col_name] = df[col_name].values.astype('datetime64[ms]')

# Write the converted file back to S3
df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)

I'm currently running this job in Lambda for each file, but I can see this may be expensive, and it may not always work if the job takes longer than 15 minutes, which is the maximum time a Lambda function can run.

The files can be on the larger side (>500 MB).

Any ideas or other methods I could consider? I am unable to use pyspark as my dataset has unsigned integers in it.

Upvotes: 1

Views: 384

Answers (2)

matthewpark319

Reputation: 1267

Add use_deprecated_int96_timestamps=True to df.to_parquet() when you first write the file, and it will be saved using the INT96 timestamp type, which preserves nanosecond precision. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
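A minimal sketch of how that could look, assuming the pyarrow engine forwards extra keyword arguments to pyarrow.parquet.write_table (the DataFrame and output path are made up for illustration):

import pandas as pd

# Hypothetical DataFrame with a nanosecond-precision timestamp column
df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01 00:00:00.123456789"])})

# Extra keyword arguments are passed through to pyarrow.parquet.write_table,
# so the INT96 flag can be supplied here directly.
df.to_parquet("out.parquet",
              engine="pyarrow",
              index=False,
              use_deprecated_int96_timestamps=True)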

Upvotes: 1

w-m

Reputation: 11232

You could try rewriting all columns at once. Maybe this would reduce some memory copies in pandas, thus speeding up the process if you have many columns:

df_datetimes = df.select_dtypes(include="datetime64[ns]")
df[df_datetimes.columns] = df_datetimes.astype("datetime64[ms]")
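For context, a sketch of how this might slot into the read/convert/write flow from the question (the bucket, key, and output-path variables are the question's own placeholders):

import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

# Cast all nanosecond timestamp columns in one call instead of looping per column
df_datetimes = df.select_dtypes(include="datetime64[ns]")
df[df_datetimes.columns] = df_datetimes.astype("datetime64[ms]")

df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)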

Upvotes: 1
