Reputation: 103
I have a pandas dataframe with datetime in multiple columns. It looks similar to the sample below, but with hundreds of columns, almost pushing 1k.
datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11
What I have done is import the data and then convert every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4 GB to 1.0 GB), but I'm wondering if this is still inefficient and whether there is a better way.
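Simplified, the import and conversion step looks something like this (the file name is a placeholder):
import pandas as pd

# Placeholder file name; the real file has close to 1k columns.
df = pd.read_csv("data.csv")

# pandas deduplicates the repeated headers to datetime, datetime.1, ...,
# so every datetime column can be converted in one pass.
for col in df.columns:
    if col.startswith("datetime"):
        df[col] = pd.to_datetime(df[col])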
Would I benefit from converting this down to three columns: datetime, data name, data measurement? If so, what is the best method of doing this? I've tried it, but I end up with a lot of empty values.
Is there another way to handle this data that I'm just not seeing, or does what I'm doing make sense and is it efficient enough?
I eventually want to plot some of this data by selecting specific data names.
Upvotes: 0
Views: 101
Reputation: 12503
I ran a small experiment with the above data, and converting it to datetime / value / type columns does reduce the overall memory consumption:
print(df)
              datetime  battery           datetime.1  temperature           datetime.2  pressure
0  2020-01-01 01:01:01     13.8  2020-01-01 01:01:02           97  2020-01-01 01:01:03        10
1  2020-01-01 01:01:04     13.8  2020-01-01 01:01:05           97  2020-01-01 01:01:06        11
print(df.memory_usage().sum())
==> 224
After converting the dataframe:
dfs = []
# Walk the columns in (datetime, measurement) pairs.
for i in range(0, df.shape[1], 2):
    d = df.iloc[:, i:i + 2].copy()  # copy to avoid SettingWithCopyWarning
    d["type"] = d.columns[1]        # the measurement name becomes the label
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)
print(new_df)
print(new_df)
==>
              datetime  value         type
0  2020-01-01 01:01:01   13.8      battery
1  2020-01-01 01:01:04   13.8      battery
0  2020-01-01 01:01:02   97.0  temperature
1  2020-01-01 01:01:05   97.0  temperature
0  2020-01-01 01:01:03   10.0     pressure
1  2020-01-01 01:01:06   11.0     pressure
print(new_df.memory_usage().sum())
==> 192
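Two follow-up tweaks (not measured above) that typically help at real scale: store the repeated type labels as a categorical dtype and parse the datetime strings into datetime64. Selecting a single measurement for plotting then becomes a simple filter. A minimal sketch, assuming matplotlib is installed for DataFrame.plot:
# The "type" labels repeat heavily, so a categorical dtype stores them
# as small integer codes instead of one string reference per row.
new_df["type"] = new_df["type"].astype("category")
new_df["datetime"] = pd.to_datetime(new_df["datetime"])
print(new_df.memory_usage().sum())

# Plot a specific measurement by selecting its type:
battery = new_df[new_df["type"] == "battery"]
battery.plot(x="datetime", y="value", title="battery")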
Upvotes: 2