Reputation: 15
I have a dictionary that contains lists of values of varying lengths. I need to process all the values at a particular index (column) across the lists together. The only way I have found is to convert the dictionary to a pandas dataframe. However, this is very slow for the actual dataset, which can include 1000+ events (rows) with hundreds of observations (columns) per event.
A simplified example would look something like this:
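import pandas as pd

event_dict = {}
event_dict['event1'] = [1, 2, 3, 4, 5]
event_dict['event2'] = [1, 3, 5, 4, 7, 8, 9, 8]
event_dict['event3'] = [1, 3, 2, 4]
event_dict['event4'] = [1, -1, 1, 2, 2, 5]
# the actual dictionary can have 1000+ rows with 100+ entries per row

# build the frame one row at a time (this loop is the slow part)
values_df = pd.DataFrame()
for key in event_dict:
    temp_df = pd.DataFrame([event_dict[key]])  # wrap in a list to get a single row
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    values_df = pd.concat([values_df, temp_df], ignore_index=True)
print(values_df)
print(values_df.mean())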
The output would be something like:
   0  1  2  3    4    5    6    7
0  1  2  3  4  5.0  NaN  NaN  NaN
1  1  3  5  4  7.0  8.0  9.0  8.0
2  1  3  2  4  NaN  NaN  NaN  NaN
3  1 -1  1  2  2.0  5.0  NaN  NaN
0 1.000000
1 1.750000
2 2.750000
3 3.500000
4 4.666667
5 6.500000
6 9.000000
7 8.000000
Since each list contains a different number of values, I'm having trouble finding an efficient implementation that doesn't use dataframes. Most of the runtime goes into building values_df itself, because of the number of loop iterations needed. Once I have the dataframe I can use vectorized operations, but building it is where I'm stuck.
Upvotes: 0
Views: 168
Reputation: 862851
Use DataFrame.from_dict with the parameter orient='index':
s = pd.DataFrame.from_dict(event_dict, orient='index').mean()
print (s)
0 1.000000
1 1.750000
2 2.750000
3 3.500000
4 4.666667
5 6.500000
6 9.000000
7 8.000000
dtype: float64
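To make it clearer why this one-liner replaces the row-by-row loop, here is a minimal sketch that splits it into steps, using the toy data from the question. The names values_df, col_means and col_counts are only illustrative. It relies on from_dict padding shorter lists with NaN and on mean() skipping NaN by default.

```python
import pandas as pd

# Same toy data as in the question.
event_dict = {
    'event1': [1, 2, 3, 4, 5],
    'event2': [1, 3, 5, 4, 7, 8, 9, 8],
    'event3': [1, 3, 2, 4],
    'event4': [1, -1, 1, 2, 2, 5],
}

# orient='index' turns each key into a row label and each list into a row;
# shorter lists are padded with NaN up to the length of the longest one.
values_df = pd.DataFrame.from_dict(event_dict, orient='index')
print(values_df.shape)

# mean() skips NaN by default (skipna=True), so each column is averaged
# only over the events that actually reach that index.
col_means = values_df.mean()

# count() shows how many events contribute to each column.
col_counts = values_df.count()

print(col_means.tolist())
print(col_counts.tolist())
```

The whole frame is built in a single call, so there is no per-row concatenation. The count per column is worth checking when the later columns are backed by only one or two events.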
Another idea is to use zip_longest, filling in missing values for the shorter lists with NaN:
import numpy as np
from itertools import zip_longest

# transpose the ragged lists, padding short ones with NaN, then average each column
a = np.nanmean(np.array(list(zip_longest(*event_dict.values(), fillvalue=np.nan))),
               axis=1)
print(a)
[1. 1.75 2.75 3.5 4.66666667 6.5
9. 8. ]
s = pd.Series(a)
print (s)
0 1.000000
1 1.750000
2 2.750000
3 3.500000
4 4.666667
5 6.500000
6 9.000000
7 8.000000
dtype: float64
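To show what zip_longest contributes on its own, the same column-wise mean can be sketched in plain Python, without NumPy or pandas. This is a swapped-in variant, not the code above: column_means is a hypothetical helper name, and it assumes the real data never contains None, since None is used as the fill marker.

```python
from itertools import zip_longest

# Same toy data as in the question.
event_dict = {
    'event1': [1, 2, 3, 4, 5],
    'event2': [1, 3, 5, 4, 7, 8, 9, 8],
    'event3': [1, 3, 2, 4],
    'event4': [1, -1, 1, 2, 2, 5],
}


def column_means(rows):
    """Mean of each column across ragged rows, ignoring missing positions.

    Hypothetical helper; assumes no row contains None as a real value.
    """
    means = []
    # zip_longest(*rows) yields one tuple per column index; rows that are
    # too short contribute the fill value (None) at that position.
    for column in zip_longest(*rows, fillvalue=None):
        present = [v for v in column if v is not None]
        # Every column produced by zip_longest comes from at least one row,
        # so present is never empty here.
        means.append(sum(present) / len(present))
    return means


means = column_means(event_dict.values())
print(means)
```

For large inputs the NumPy version is likely to be faster, but this form makes the transpose-and-pad step explicit and is easy to adapt to other per-column reductions.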
Upvotes: 4