Pandas column to numpy arrays

Question

I have the following dataframe:

    name        day       value     time
0   MAC000002   2012-12-16  0.147   09:30:00
1   MAC000002   2012-12-16  0.110   10:00:00
2   MAC000002   2012-12-16  0.736   10:30:00
3   MAC000003   2012-12-16  0.404   09:30:00
4   MAC000003   2012-12-16  0.845   10:00:00

I want to convert the values only to a numpy array:

[[0.147, 0.110, 0.736],[0.404, 0.845 ...],...]

The only way I can think to do this is to pivot the dataframe then dump the values:

new_df = pd.pivot_table(df,index=["name"],values=["value"])
data = new_df.values()

However the dataset is very large and there are thousands of unique names and I cant pivot the table due to memory constraints. Is there another way to dump the values grouped by name keeping day then time ordering?

jpp · Accepted Answer

You're probably going down the wrong track:

pd.pivot_table won't get you what you want here, by default it gives the mean by group. While you want to keep all values.
NumPy arrays only give large benefits for fixed dimensions, e.g. same number of columns for each row. Here, it appears that this may not be true: one group may have 2 values and another 3. A list of lists may be more appropriate.

I'll assume you've already sorted your dataframe by date and time. Then one solution is to use GroupBy + apply with list:

res = df.groupby('name', sort=False)['value'].apply(list).values.tolist()

print(res)

[[0.147, 0.11, 0.736], [0.40399999999999997, 0.845]]

You may see some performance improvement by converting 'name' to a categorical. Another solution is possible via collections.defaultdict, but this will likely be slower:

from collections import defaultdict

def group_apply(df):
    return df.groupby('name', sort=False)['value'].apply(list).values.tolist()

def group_dict(df):
    dd = defaultdict(list)
    for name, value in df[['name', 'value']].itertuples(index=False):
        dd[name].append(value)
    return list(dd.values())

df = pd.concat([df]*10000, ignore_index=True)

assert group_apply(df) == group_dict(df)

%timeit group_apply(df)  # 8.07 ms
%timeit group_dict(df)   # 39.1 ms

Pandas column to numpy arrays

Answers (1)

Related Questions