Gulzar

Reputation: 27936

How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.

I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.

Code:

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6, 8)]

    data = np.stack(some_df['A'].values)  #shape (10, 4, 6, 8)
    processed = np.max(data, axis=1)  # shape (10, 6, 8)

    some_df['B'] = processed  # This fails

I want the new column 'B' to contain numpy arrays of shape (6, 8)

How can this be done?

Upvotes: 3

Views: 8674

Answers (3)

jezrael

Reputation: 862481

This is not recommended: it is painful, slow, and makes later processing difficult.

One possible solution is to use a list comprehension:

some_df['B'] = [x for x in processed]

Or convert to list and assign:

some_df['B'] = processed.tolist()
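
A minimal sketch of how the two assignments above differ, assuming a `processed` array of shape (10, 6, 8) as in the question. The random data and the column names "B" and "C" are only for illustration. The list comprehension keeps each cell as a NumPy array, while `tolist()` turns each cell into nested Python lists:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
processed = rng.random((10, 6, 8))  # stand-in for the question's array

some_df = pd.DataFrame({"A": range(10)})

# Each cell of "B" is a NumPy array of shape (6, 8).
some_df["B"] = [x for x in processed]

# Each cell of "C" is a nested Python list (6 lists of 8 floats).
some_df["C"] = processed.tolist()

cell_b = some_df["B"].iloc[0]
cell_c = some_df["C"].iloc[0]
print(type(cell_b), cell_b.shape)
print(type(cell_c), len(cell_c), len(cell_c[0]))
```

If the cells need to stay usable as arrays (as the question asks), the list comprehension is probably the form to prefer. Both columns end up with object dtype.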

Upvotes: 5

Gulzar

Reputation: 27936

Coming back to this after 2 years, here is a much better practice:

from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict


def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))


def create_flat_columns_df_from_dict_of_numpy(
        named_np: Dict[str, np.ndarray],
        n_samples_per_np: int,
):
    named_np_correct_lenth = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_lenth.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_lenth.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df


def parse_series_into_np(df, col_name, shp):
    # can parse the shape from the col names
    n_samples = len(df)
    col_names = sorted(c for c in df.columns if col_name in c)
    col_names = list(filter(lambda c: c.startswith(col_name + "_") or len(col_names) == 1, col_names))
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))  # np.float was removed in NumPy 1.24
    return col_as_np

Usage, to put an ndarray into a DataFrame:

full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)

where d is a dict of ndarrays that all have the same shape[0], keyed by "name1" and "name2".

The reverse operation can be obtained by parse_series_into_np.
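
The underlying idea (one flat column per array element, then a reshape to get the array back) can be sketched in a few lines. This is a simplified illustration rather than the functions above: the column prefix "B", the random data, and the use of `np.ndindex` to build the names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
arr = rng.random((10, 6, 8))  # hypothetical data: 10 samples of shape (6, 8)

# Flatten everything except the sample axis: (10, 6, 8) -> (10, 48).
flat = arr.reshape(len(arr), -1)

# One column per element, named "B_<i>_<j>", generated in row-major order
# so the names line up with the order produced by reshape.
col_names = ["B_" + "_".join(map(str, idx)) for idx in np.ndindex(arr.shape[1:])]
df = pd.DataFrame(flat, columns=col_names)

# Reverse: select the columns in their original order and restore the shape.
restored = df[col_names].to_numpy().reshape(len(df), *arr.shape[1:])

print(df.shape)
print(np.array_equal(arr, restored))
```

Keeping the original column order for the reverse step matters: sorting the names alphabetically would place "B_0_10" before "B_0_2" once an axis has more than ten entries.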


The accepted answer remains, as it answers the original question, but this one is a much better practice.

Upvotes: 1

R2D2_2024

Reputation: 51

I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is generally not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important functionality when working with time-series data and a bunch of related metadata. In general, I organize my experimental time series as pandas dataframes, with one column holding same-length numpy arrays and the other columns containing metadata about measurement conditions etc.

The solution proposed by jezrael works very well, and I have used it on a regular basis for the last 4 years. But this method can run into huge memory problems. In my case, I came across these problems when working with dataframes beyond 5 million rows and time series with approx. 100 data points each.

The solution to these problems is extremely simple, and since I did not find it anywhere, I wanted to share it here: simply convert your 2D array to a pandas Series object and assign it to a column of your dataframe:

df["new_list_column"] = pd.Series(list(numpy_array_2D))
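
A small sketch of this pattern, assuming a hypothetical dataframe with one metadata column ("sensor") and five random time series of 100 points each. The sizes and names are illustrative only, and the sketch does not measure memory use:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
numpy_array_2D = rng.random((5, 100))  # 5 time series, 100 points each

# Hypothetical metadata column; the default RangeIndex 0..4 is assumed.
df = pd.DataFrame({"sensor": ["a", "b", "c", "d", "e"]})

# list(...) splits the 2D array into 5 row arrays; the Series holds one per row.
df["new_list_column"] = pd.Series(list(numpy_array_2D))

first = df["new_list_column"].iloc[0]
print(type(first), first.shape)

# Stacking the cells gives back the original 2D array.
restored = np.stack(df["new_list_column"].to_numpy())
print(restored.shape)
```

One caveat: assigning a Series aligns on the index, so if the dataframe does not use the default 0..n-1 index, passing `index=df.index` to `pd.Series` is likely needed to avoid NaN cells.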

Upvotes: 0
