Reputation: 1879
I have the following dataframe:
        name         day  value      time
0  MAC000002  2012-12-16  0.147  09:30:00
1  MAC000002  2012-12-16  0.110  10:00:00
2  MAC000002  2012-12-16  0.736  10:30:00
3  MAC000003  2012-12-16  0.404  09:30:00
4  MAC000003  2012-12-16  0.845  10:00:00
I want to convert the values only to a numpy array:
[[0.147, 0.110, 0.736],[0.404, 0.845 ...],...]
The only way I can think to do this is to pivot the dataframe then dump the values:
new_df = pd.pivot_table(df, index=["name"], values=["value"])
data = new_df.values
However, the dataset is very large, with thousands of unique names, and I can't pivot the table due to memory constraints. Is there another way to dump the values grouped by name, keeping the day-then-time ordering?
Upvotes: 1
Views: 144
Reputation: 164843
You're probably going down the wrong track: pd.pivot_table won't get you what you want here, because by default it aggregates to the mean of each group, while you want to keep all the values. I'll assume you've already sorted your dataframe by day and time. One solution is then to use GroupBy + apply with list:
res = df.groupby('name', sort=False)['value'].apply(list).values.tolist()
print(res)
[[0.147, 0.11, 0.736], [0.40399999999999997, 0.845]]
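Note this relies on the rows already being in day-then-time order within each name; if that isn't guaranteed, a sort_values call beforehand restores it (a minimal sketch reusing the question's column names):

# sort within each name by day, then time, so apply(list) sees values in order
df = df.sort_values(['name', 'day', 'time'])
res = df.groupby('name', sort=False)['value'].apply(list).values.tolist()

If every name has the same number of readings you can pass res straight to np.array to get a 2-D array; with ragged groups a list of lists (or an object-dtype array) is as far as you can go.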
You may see some performance improvement by converting 'name' to a categorical. Another solution is possible via collections.defaultdict, but this will likely be slower:
from collections import defaultdict

def group_apply(df):
    # GroupBy + apply(list): one list of values per name, in row order
    return df.groupby('name', sort=False)['value'].apply(list).values.tolist()

def group_dict(df):
    # pure-Python alternative: accumulate values per name in a defaultdict
    dd = defaultdict(list)
    for name, value in df[['name', 'value']].itertuples(index=False):
        dd[name].append(value)
    return list(dd.values())

df = pd.concat([df]*10000, ignore_index=True)  # enlarge the frame for benchmarking

assert group_apply(df) == group_dict(df)

%timeit group_apply(df)  # 8.07 ms
%timeit group_dict(df)   # 39.1 ms
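As for the categorical suggestion: if the grouping itself dominates, converting 'name' to a categorical dtype before calling group_apply may shave some time off, since groupby then works on integer codes rather than strings. A rough, untimed sketch:

# convert once, then reuse the same GroupBy + apply(list) as above
df['name'] = df['name'].astype('category')
res = group_apply(df)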
Upvotes: 1