Reputation: 65
I'm trying to resample a DataFrame in pandas. I receive as input a .csv like this (the lists in the Data column are strings):
Name,Timestamp,Data
A1,5.26,"[1.0,1.2,1.9]"
A1,5.28,"[1.8,2.1,3.9]"
A1,5.30,"[1.2,1.4,0.9]"
A1,5.32,"[...]"
...
A2,5.26,"[...]"
A2,5.28,"[...]"
A2,5.30,"[...]"
A2,5.32,"[...]"
...
A3,5.26,"[...]"
A3,5.28,"[...]"
A3,5.30,"[...]"
A3,5.32,"[...]"
Data is recorded at 50 Hz (so one sample every 20 ms). I want to resample it to 25 Hz (so every 40 ms).
I converted the Data column from strings to actual lists with
df['Data'] = df['Data'].apply(ast.literal_eval)
and the Timestamp (which is in seconds) to a datetime with:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
I know that I have to use the .resample() function, so I tried
df.groupby('Name').resample("40L", on='Timestamp')
It doesn't raise any error, but it doesn't seem to resample at all: I get the same number of rows with the same data, just with the Timestamp column converted to datetime (and if I append .mean() after the resample call, it gives the error No numeric types to aggregate).
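For reference, here is a minimal reproduction of the problem (made-up values matching the sample above), showing that after ast.literal_eval the Data column holds Python lists, i.e. dtype object, which .mean() cannot aggregate:
import ast
import pandas as pd

# made-up sample matching the CSV layout above
df = pd.DataFrame({
    'Name': ['A1', 'A1', 'A1'],
    'Timestamp': [5.26, 5.28, 5.30],
    'Data': ['[1.0,1.2,1.9]', '[1.8,2.1,3.9]', '[1.2,1.4,0.9]'],
})
df['Data'] = df['Data'].apply(ast.literal_eval)              # strings -> lists
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

print(df.dtypes)     # Data is dtype object (lists), so there is nothing numeric to aggregate
resampled = df.groupby('Name').resample('40L', on='Timestamp')
print(resampled)     # a Resampler object; no aggregation has happened yet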
I want my table after the resample to look like:
Name Timestamp Data
A1 5.26 [...]
A1 5.30 [...]
...
A2 5.26 [...]
A2 5.30 [...]
...
A3 5.26 [...]
A3 5.30 [...]
What should I do?
Upvotes: 2
Views: 689
Reputation: 16758
If you do not want to interpolate the samples and only want to keep every other one, it is enough to drop every other row, keeping the first:
df.groupby('Name').apply(lambda x: x.iloc[::2])
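If you then want the flat Name/Timestamp/Data table from the question back, a sketch (assuming the CSV layout from the question and a hypothetical file name your.csv):
import ast
import pandas as pd

df = pd.read_csv('your.csv')                       # hypothetical file name
df['Data'] = df['Data'].apply(ast.literal_eval)    # strings -> lists

out = (df.groupby('Name', group_keys=False)
         .apply(lambda g: g.iloc[::2])             # keep rows 0, 2, 4, ... per Name
         .reset_index(drop=True))
print(out)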
Any interpolation of the samples would require some information on the aggregation algorithm you would like to employ, as Quang Hoang suggested in the comments.
Please note that to obtain alignment on the Timestamp you need to set base=20 in this particular situation: with a 40 ms rule the bin edges default to whole multiples of 40 ms (..., 5.24, 5.28, ...), so shifting the origin by 20 ms makes the bins start at 5.26, 5.30, ... (please refer to the documentation for the rationale).
# Timestamp2 holds the datetime version of Timestamp, so the original float
# Timestamp column is preserved and can be restored as the index below.
df['Timestamp2'] = pd.to_datetime(df['Timestamp'], unit='s')

for name, df2 in df.groupby('Name'):
    # take the first sample of each 40 ms bin; base=20 aligns the bins to 5.26, 5.30, ...
    df3 = pd.DataFrame(df2.resample("40L", on='Timestamp2', base=20, convention='start').apply(lambda r: r.iloc[0]))
    df3 = df3.set_index('Timestamp')
    print(name, df3.drop(columns=['Timestamp2']))
Resulting in:
          Name             Data
Timestamp
5.26        A1  [1.0, 1.2, 1.9]
5.30        A1  [1.2, 1.4, 0.9]

          Name             Data
Timestamp
5.26        A2  [1.0, 1.2, 1.9]
5.30        A2  [1.2, 1.4, 0.9]
A different approach that does not use resample but yields the same result (since you are not using the samples in between the desired sample points):
for name, df2 in df.groupby('Name'):
    df2['Timestamp2'] = pd.to_datetime(df2['Timestamp'], unit='s')
    first = df2['Timestamp2'].iloc[0]
    # grid of desired sample times: every 40 ms starting from the group's first sample
    selector = pd.date_range(start=first, freq='40L', periods=len(df2) // 2 + 1)
    df3 = df2[df2['Timestamp2'].isin(selector)]
    print(df3.drop(columns=['Timestamp2']))
Result:
  Name  Timestamp             Data
0   A1       5.26  [1.0, 1.2, 1.9]
2   A1       5.30  [1.2, 1.4, 0.9]

  Name  Timestamp             Data
4   A2       5.26  [1.0, 1.2, 1.9]
6   A2       5.30  [1.2, 1.4, 0.9]
Please note that I copied the data for the 'A2' Name label.
Upvotes: 1
Reputation: 150815
Your problem is to convert the data part into actual numeric data. ast.literal_eval doesn't cut it because you cannot perform arithmetic operations on lists. Here's what I would do:
df = pd.read_csv('your.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

# strip the brackets and expand the comma-separated values into numeric columns
df = df.join(df['Data'].str[1:-1]
                       .str.split(',', expand=True)
                       .astype(float))

# resample
df.groupby('Name').resample('40L', on='Timestamp').mean()
After that, your df would be something like:
                                0     1    2
Name Timestamp
A1   1970-01-01 00:00:05.240  1.0  1.20  1.9
     1970-01-01 00:00:05.280  1.5  1.75  2.4
     1970-01-01 00:00:05.320  1.4  1.65  2.9
     1970-01-01 00:00:05.360  1.5  1.75  2.4
     1970-01-01 00:00:05.400  1.2  1.40  0.9
A2   1970-01-01 00:00:05.240  1.0  1.20  1.9
     1970-01-01 00:00:05.280  1.5  1.75  2.4
     1970-01-01 00:00:05.320  1.4  1.65  2.9
     1970-01-01 00:00:05.360  1.5  1.75  2.4
     1970-01-01 00:00:05.400  1.2  1.40  0.9
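If you want the Data column back as lists afterwards, a possible follow-up sketch (res is a hypothetical variable name; the only columns left after the resample are the numeric ones from the split):
res = df.groupby('Name').resample('40L', on='Timestamp').mean()

# glue the numeric columns back together into one list per row
res['Data'] = res.values.tolist()
res = res[['Data']].reset_index()      # Name / Timestamp (as datetime) / Data
print(res)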
Upvotes: 1