tlab
tlab

Reputation: 103

Pandas dataframe with multiple datetime

I have a pandas dataframe that has datetime in multiple columns and looks similar to below but with hundreds of columns, almost pushing 1k.

datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11

What I have done is imported it and then converted every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4GB to 1.0GB), but I'm wondering if this is still inefficient and maybe a better way.

  1. Would I benefit from converting this down to 3 columns where I have datetime, data name, data measurement? If so what is the best method of doing this? I've tried this but end up with a lot of empty spaces.

  2. Would there be another way to handle this data that I'm just not presenting?

  3. or what I'm doing makes sense and is efficient enough?

I eventually want to plot some of this data by selecting specific data names.

Upvotes: 0

Views: 101

Answers (1)

Roy2012
Roy2012

Reputation: 12503

I ran a small experiment with the above data and converting the data to date / type / value columns reduces the overall memory consumption:

print(df) 

             datetime  battery          datetime.1  temperature          datetime.2  pressure
0 2020-01-01 01:01:01     13.8 2020-01-01 01:01:02           97 2020-01-01 01:01:03        10
1 2020-01-01 01:01:04     13.8 2020-01-01 01:01:05           97 2020-01-01 01:01:06        11

print(df.memory_usage().sum())
==> 224

After converting the dataframe:

dfs = []
for i in range(0, 6, 2):
    d = df.iloc[:, i:i+2]
    d["type"] = d.columns[1]
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)

print(new_df) 
==>
             datetime  value         type
0 2020-01-01 01:01:01   13.8      battery
1 2020-01-01 01:01:04   13.8      battery
0 2020-01-01 01:01:02   97.0  temperature
1 2020-01-01 01:01:05   97.0  temperature
0 2020-01-01 01:01:03   10.0     pressure
1 2020-01-01 01:01:06   11.0     pressure

print(new_df.memory_usage().sum())
==> 192

Upvotes: 2

Related Questions