Reputation: 155
I have a Pandas dataframe that looks similar to this:
datetime data1 data2
2021-01-23 00:00:31.140 a1 a2
2021-01-23 00:00:31.140 b1 b2
2021-01-23 00:00:31.140 c1 c2
2021-01-23 00:01:29.021 d1 d2
2021-01-23 00:02:10.540 e1 e2
2021-01-23 00:02:10.540 f1 f2
The real dataframe is very large and for each unique timestamp, there are a few thousand rows.
I want to save this dataframe to a Parquet file so that I can quickly read all the rows that have a specific datetime index, without loading the whole file or looping through it. How do I save it correctly in Python and how do I quickly read only the rows for one specific datetime?
After reading, I would like to have a new dataframe that contains all the rows for that specific datetime. For example, I want to read only the rows for datetime "2021-01-23 00:00:31.140" from the Parquet file and receive this dataframe:
datetime data1 data2
2021-01-23 00:00:31.140 a1 a2
2021-01-23 00:00:31.140 b1 b2
2021-01-23 00:00:31.140 c1 c2
I am wondering if it may first be necessary to convert the data for each timestamp into a column, like this, so that it can be accessed by reading a column instead of rows?
2021-01-23 00:00:31.140 2021-01-23 00:01:29.021 2021-01-23 00:02:10.540
['a1', 'a2'] ['d1', 'd2'] ['e1', 'e2']
['b1', 'b2'] NaN ['f1', 'f2']
['c1', 'c2'] NaN NaN
I appreciate any help, thank you very much in advance!
Upvotes: 6
Views: 2597
Reputation: 16561
One solution is to index your data by time and use dask. Here's an example:
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries(
    start='2000-01-01',
    end='2000-01-02',
    freq='1s',
    partition_freq='1h')
df
print(len(df))
# 86400 rows across 24 files/partitions
# in a Jupyter notebook, put %%time at the top of the cell to time this lookup
df.loc['2000-01-01 03:40'].compute()
# result returned in about 8 ms
Working with a transposed dataframe like you suggest is not optimal, since you will end up with thousands of columns (if not more) that are unique to each file/partition.
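To make the point about the transposed layout concrete, here is a minimal pandas sketch using the six sample rows from the question. The names `long_df`, `wide_df` and `pos` are made up for illustration, and only `data1` is pivoted to keep it short:

```python
import pandas as pd

long_df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2021-01-23 00:00:31.140",
        "2021-01-23 00:00:31.140",
        "2021-01-23 00:00:31.140",
        "2021-01-23 00:01:29.021",
        "2021-01-23 00:02:10.540",
        "2021-01-23 00:02:10.540",
    ]),
    "data1": ["a1", "b1", "c1", "d1", "e1", "f1"],
    "data2": ["a2", "b2", "c2", "d2", "e2", "f2"],
})

# Long format: the schema stays at three columns however many timestamps exist.
print(long_df.shape)

# Transposed layout from the question: one column per unique timestamp.
# 'pos' (hypothetical helper) numbers the rows within each timestamp group.
pos = long_df.groupby("datetime").cumcount()
wide_df = long_df.assign(pos=pos).pivot(index="pos", columns="datetime", values="data1")
print(wide_df.shape)

# Timestamps with fewer rows are padded with NaN.
print(int(wide_df.isna().sum().sum()))
```

In the wide layout the number of columns grows with the number of unique timestamps and shorter groups are padded with NaN, whereas the long layout keeps a fixed schema across all partitions.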
So on your data the workflow would look roughly like this:
import io
data = io.StringIO("""
datetime|data1|data2
2021-01-23 00:00:31.140|a1|a2
2021-01-23 00:00:31.140|b1|b2
2021-01-23 00:00:31.140|c1|c2
2021-01-23 00:01:29.021|d1|d2
2021-01-23 00:02:10.540|e1|e2
2021-01-23 00:02:10.540|f1|f2""")
import pandas as pd
df = pd.read_csv(data, sep='|', parse_dates=['datetime'])
# make sure the date time column was parsed correctly before
# setting it as an index
df = df.set_index('datetime')
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)
ddf.to_parquet('test_parquet')
# note this will create a folder with one file per partition
ddf2 = dd.read_parquet('test_parquet')
ddf2.loc['2021-01-23 00:00:31'].compute()
# the string above matches every row within that second; to match an exact
# timestamp (including milliseconds), convert it to a datetime first
ts_exact = pd.to_datetime('2021-01-23 00:00:31.140')
ddf2.loc[ts_exact].compute()
Upvotes: 3