alkamyst

Reputation: 15

Most efficient way to process a list of lists of varying lengths in Python

I have a dictionary that contains lists of values of varying lengths. I need to process all the values at a particular index (column) across the lists together. The only way I have found is to convert the dictionary to a pandas DataFrame. However, this is very slow for the actual dataset, which can include 1000+ events (rows) with hundreds of observations (columns) per event.

A simplified example would look something like this:

import pandas as pd

event_dict = {}
event_dict['event1'] = [1, 2, 3, 4, 5]
event_dict['event2'] = [1, 3, 5, 4, 7, 8, 9, 8]
event_dict['event3'] = [1, 3, 2, 4]
event_dict['event4'] = [1, -1, 1, 2, 2, 5]
# the actual dictionary can have 1000+ events with 100+ entries each

event_df = pd.DataFrame()
for key in event_dict:
    # wrap the list so each event becomes a single row
    temp_df = pd.DataFrame([event_dict[key]])
    event_df = pd.concat([event_df, temp_df], ignore_index=True)

print(event_df)
print(event_df.mean())

The output would be something like:

   0  1  2  3    4    5    6    7
0  1  2  3  4  5.0  NaN  NaN  NaN
1  1  3  5  4  7.0  8.0  9.0  8.0
2  1  3  2  4  NaN  NaN  NaN  NaN
3  1 -1  1  2  2.0  5.0  NaN  NaN

0    1.000000
1    1.750000
2    2.750000
3    3.500000
4    4.666667
5    6.500000
6    9.000000
7    8.000000

Since each list contains a different number of values, I'm having trouble figuring out an efficient implementation that doesn't use DataFrames. Most of the runtime in the actual code goes into building event_df itself, because of the row-by-row appends. Once I have the DataFrame I can vectorize everything, but getting there is where I'm stuck.
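Even within pandas, the repeated append can be replaced by collecting all the rows first and concatenating once; a rough sketch of that idea (it still builds a DataFrame, so it does not fully answer the question):

# build one single-row frame per event, then concatenate a single time
rows = [pd.DataFrame([values]) for values in event_dict.values()]
event_df = pd.concat(rows, ignore_index=True)
print(event_df.mean())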

Upvotes: 0

Views: 168

Answers (1)

jezrael

Reputation: 862851

Use DataFrame.from_dict with the parameter orient='index':

s = pd.DataFrame.from_dict(event_dict, orient='index').mean()
print(s)
0    1.000000
1    1.750000
2    2.750000
3    3.500000
4    4.666667
5    6.500000
6    9.000000
7    8.000000
dtype: float64
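from_dict with orient='index' builds the whole padded frame in a single call, so the per-row appends that dominate the runtime in the question disappear entirely; the shorter lists are padded with NaN, and mean() skips NaN by default.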

Another idea is to use zip_longest to fill in missing values where the lengths differ:

from itertools import zip_longest

import numpy as np

# transpose the ragged lists, padding the short ones with NaN, then
# take the mean of each row while ignoring the NaN padding
a = np.nanmean(np.array(list(zip_longest(*event_dict.values(), fillvalue=np.nan))),
               axis=1)
print(a)
[1.         1.75       2.75       3.5        4.66666667 6.5
 9.         8.        ]

s = pd.Series(a)
print(s)
0    1.000000
1    1.750000
2    2.750000
3    3.500000
4    4.666667
5    6.500000
6    9.000000
7    8.000000
dtype: float64
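Another option, if pandas (and the padded array) should be avoided entirely: a minimal NumPy-only sketch of the same computation, aggregating per position with np.bincount (this assumes all values are numeric):

from itertools import chain

import numpy as np

# flatten all values and record each value's position within its own event
flat = np.fromiter(chain.from_iterable(event_dict.values()), dtype=float)
pos = np.concatenate([np.arange(len(v)) for v in event_dict.values()])

# per-position sums divided by per-position counts give the column means
sums = np.bincount(pos, weights=flat)
counts = np.bincount(pos)
print(sums / counts)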

Upvotes: 4
