proximacentauri
proximacentauri

Reputation: 1879

Pandas column to numpy arrays

I have the following dataframe:

    name        day       value     time
0   MAC000002   2012-12-16  0.147   09:30:00
1   MAC000002   2012-12-16  0.110   10:00:00
2   MAC000002   2012-12-16  0.736   10:30:00
3   MAC000003   2012-12-16  0.404   09:30:00
4   MAC000003   2012-12-16  0.845   10:00:00

I want to convert the values only to a numpy array:

[[0.147, 0.110, 0.736],[0.404, 0.845 ...],...]

The only way I can think to do this is to pivot the dataframe then dump the values:

new_df = pd.pivot_table(df,index=["name"],values=["value"])
data = new_df.values()

However the dataset is very large and there are thousands of unique names and I cant pivot the table due to memory constraints. Is there another way to dump the values grouped by name keeping day then time ordering?

Upvotes: 1

Views: 144

Answers (1)

jpp
jpp

Reputation: 164843

You're probably going down the wrong track:

  • pd.pivot_table won't get you what you want here, by default it gives the mean by group. While you want to keep all values.
  • NumPy arrays only give large benefits for fixed dimensions, e.g. same number of columns for each row. Here, it appears that this may not be true: one group may have 2 values and another 3. A list of lists may be more appropriate.

I'll assume you've already sorted your dataframe by date and time. Then one solution is to use GroupBy + apply with list:

res = df.groupby('name', sort=False)['value'].apply(list).values.tolist()

print(res)

[[0.147, 0.11, 0.736], [0.40399999999999997, 0.845]]

You may see some performance improvement by converting 'name' to a categorical. Another solution is possible via collections.defaultdict, but this will likely be slower:

from collections import defaultdict

def group_apply(df):
    return df.groupby('name', sort=False)['value'].apply(list).values.tolist()

def group_dict(df):
    dd = defaultdict(list)
    for name, value in df[['name', 'value']].itertuples(index=False):
        dd[name].append(value)
    return list(dd.values())

df = pd.concat([df]*10000, ignore_index=True)

assert group_apply(df) == group_dict(df)

%timeit group_apply(df)  # 8.07 ms
%timeit group_dict(df)   # 39.1 ms

Upvotes: 1

Related Questions