Kraton

Reputation: 65

Resample a dataframe with a column of lists

Trying to resample a dataframe in pandas. I receive as input a .csv like this (the lists in the Data column are stored as strings):

Name,Timestamp,Data
A1,5.26,"[1.0,1.2,1.9]"
A1,5.28,"[1.8,2.1,3.9]"
A1,5.30,"[1.2,1.4,0.9]"
A1,5.32,"[...]"
...
A2,5.26,"[...]"
A2,5.28,"[...]"
A2,5.30,"[...]"
A2,5.32,"[...]"
...
A3,5.26,"[...]"
A3,5.28,"[...]"
A3,5.30,"[...]"
A3,5.32,"[...]"

The data are recorded at 50 Hz (one sample every 20 ms). I want to resample to 25 Hz (one sample every 40 ms).

I converted the Data column from string to an actual list with

df['Data'] = df['Data'].apply(ast.literal_eval)

and the Timestamp into seconds with:

df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

I know I have to use the .resample() function, so I tried:

df.groupby('Name').resample("40L", on='Timestamp')

This doesn't give me errors, but it doesn't seem to resample at all: I get the same number of rows with the same data, just the Timestamp column converted to Datetime (and if I add .mean() at the end of the resample call, it gives me the error "No numeric types to aggregate").

I want my table after the resample to look like:

Name Timestamp  Data
A1    5.26     [...]
A1    5.30     [...]
...
A2    5.26     [...]
A2    5.30     [...]
...
A3    5.26     [...]
A3    5.30     [...]

What should I do?

Upvotes: 2

Views: 689

Answers (2)

sophros

Reputation: 16758

If you do not want to interpolate the samples and only want to keep every other one, it is enough to drop every other row, keeping the first:

df.groupby('Name').apply(lambda x: x.iloc[::2]) 

Any interpolation of the samples would require some information on the aggregation algorithm you would like to employ, as Quang Hoang suggested in the comments; the sketch below shows one possibility.
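
For illustration only, a minimal sketch of one such aggregation: an element-wise mean of each pair of consecutive samples. The helper name mean_of_pairs and the choice of a pairwise mean are assumptions, not something the question specifies:

import numpy as np

# Purely illustrative: aggregate every pair of consecutive samples with an
# element-wise mean (one possible choice of aggregation).
def mean_of_pairs(group):
    values = np.array(group['Data'].tolist())      # shape (n_rows, 3)
    n_pairs = len(values) // 2
    pairs = values[:n_pairs * 2].reshape(n_pairs, 2, -1)
    out = group.iloc[::2].head(n_pairs).copy()     # keep the first row of each pair
    out['Data'] = pairs.mean(axis=1).tolist()
    return out

df.groupby('Name', group_keys=False).apply(mean_of_pairs)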


Resampling approach

Please note that to obtain alignment on the Timestamp you need to set base=20 in this particular situation (please refer to the documentation for the rationale).

# assumes a datetime column 'Timestamp2', e.g. df['Timestamp2'] = pd.to_datetime(df['Timestamp'], unit='s')
for name, df2 in df.groupby('Name'):
    df3 = pd.DataFrame(df2.resample("40L", on='Timestamp2', base=20, convention='start').apply(lambda r: r.iloc[0]))
    df3 = df3.set_index('Timestamp')
    print(name, df3.drop(columns=['Timestamp2']))

Resulting in:

          Name             Data
Timestamp                      
5.26        A1  [1.0, 1.2, 1.9]
5.30        A1  [1.2, 1.4, 0.9]
          Name             Data
Timestamp                      
5.26        A2  [1.0, 1.2, 1.9]
5.30        A2  [1.2, 1.4, 0.9]

Selecting samples by index

A different approach that does not use resample but yields the same result (since you are not using the samples in-between the desired sample points):

for name, df2 in df.groupby('Name'):
    df2['Timestamp2'] = pd.to_datetime(df2['Timestamp'], unit='s')
    # keep only the rows whose timestamp falls on the 40 ms grid starting at the first sample
    first = df2['Timestamp2'].iloc[0]
    selector = pd.date_range(start=first, freq='40L', periods=len(df2) // 2 + 1)
    df3 = df2[df2['Timestamp2'].isin(selector)]
    print(df3.drop(columns=['Timestamp2']))

Result:

  Name  Timestamp             Data
0   A1       5.26  [1.0, 1.2, 1.9]
2   A1       5.30  [1.2, 1.4, 0.9]
  Name  Timestamp             Data
4   A2       5.26  [1.0, 1.2, 1.9]
6   A2       5.30  [1.2, 1.4, 0.9]

Please note that I copied the 'A1' data for the 'A2' Name label.

Upvotes: 1

Quang Hoang

Reputation: 150815

Your problem is to convert the data part into actual numeric data. ast.literal_eval doesn't cut it because you cannot perform arithmetic operations on lists. Here's what I would do:

df = pd.read_csv('your.csv')
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')

# strip the surrounding brackets, split "[a,b,c]" on commas and cast to float columns
df = df.join(df['Data'].str[1:-1]
                       .str.split(',', expand=True)
                       .astype(float)
            )

# resample
df.groupby('Name').resample('40L', on='Timestamp').mean()

After that, your df would be something like:

                                0     1    2
Name Timestamp                              
A1   1970-01-01 00:00:05.240  1.0  1.20  1.9
     1970-01-01 00:00:05.280  1.5  1.75  2.4
     1970-01-01 00:00:05.320  1.4  1.65  2.9
     1970-01-01 00:00:05.360  1.5  1.75  2.4
     1970-01-01 00:00:05.400  1.2  1.40  0.9
A2   1970-01-01 00:00:05.240  1.0  1.20  1.9
     1970-01-01 00:00:05.280  1.5  1.75  2.4
     1970-01-01 00:00:05.320  1.4  1.65  2.9
     1970-01-01 00:00:05.360  1.5  1.75  2.4
     1970-01-01 00:00:05.400  1.2  1.40  0.9
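
If you then want to get back to the layout shown in the question (one list per row and a float-seconds Timestamp), a minimal sketch, assuming the resampled frame above with its integer column names 0, 1, 2 (the name out is just illustrative):

out = df.groupby('Name').resample('40L', on='Timestamp').mean()

# re-assemble the three numeric columns into a single list column and turn the
# datetime level of the index back into float seconds
out['Data'] = out[[0, 1, 2]].values.tolist()
out = out.drop(columns=[0, 1, 2]).reset_index()
out['Timestamp'] = out['Timestamp'].astype('int64') / 1e9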

Upvotes: 1
