How to manage NumPy arrays in pandas DataFrames

Question

Let's assume one has a DataFrame with some integer values and some arrays defined as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=['rand_int'])
array_a = np.arange(5)
array_b = np.arange(7)
# Each cell holds a slice of the array, cut at that row's rand_int value
df['array_a'] = df['rand_int'].apply(lambda x: array_a[:x])
df['array_b'] = df['rand_int'].apply(lambda x: array_b[:x])

Two questions that would help me understand how to manage NumPy arrays in pandas DataFrames:

How can one define the array_a and array_b columns in df as the product of the rand_int value in the n-th row and array_a or array_b, respectively?
Is it possible to create another column, say array_diff, that holds the np.setdiff1d between array_a and array_b for each row?

filippo · Accepted Answer

I'd say it's better to work with NumPy and import data into the dataframe as a last step.
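
As a rough illustration of that NumPy-first approach, the sketch below does all the array work in NumPy and builds the DataFrame only at the end. It assumes the outer product is what is wanted; the names rand_int, a, b, diff and df_np are chosen here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)                            # seed only for reproducibility
rand_int = np.random.randint(0, 100, size=5)  # 1-D array of 5 integers

# Outer products: row i holds rand_int[i] * np.arange(n)
a = np.outer(rand_int, np.arange(5))          # shape (5, 5)
b = np.outer(rand_int, np.arange(7))          # shape (5, 7)

# Row-wise set difference; results can differ in length, so keep a list
diff = [np.setdiff1d(row_b, row_a) for row_a, row_b in zip(a, b)]

# Build the DataFrame last; list(a) yields one 1-D array per row
df_np = pd.DataFrame({
    'rand_int': rand_int,
    'a': list(a),
    'b': list(b),
    'd': diff,
})
```

Keeping a and b as regular 2-D arrays until the last step means NumPy's vectorised operations stay available; once the rows sit in an object column, most operations fall back to Python-level loops.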

Anyway, here's a solution that stores the arrays in the DataFrame step by step. I'm not sure you actually want the outer product; it would help if you could post the expected result.

import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=['rand_int'])
>>> df
   rand_int
0        51
1        92
2        14
3        71
4        60

# np.outer gives a (5, n) array; np.split cuts it into 5 arrays of shape (1, n), one per row
df['a'] = np.split(np.outer(df['rand_int'], np.arange(5)), 5)
df['b'] = np.split(np.outer(df['rand_int'], np.arange(7)), 5)

>>> df
   rand_int                         a                                   b
0        51  [[0, 51, 102, 153, 204]]  [[0, 51, 102, 153, 204, 255, 306]]
1        92  [[0, 92, 184, 276, 368]]  [[0, 92, 184, 276, 368, 460, 552]]
2        14     [[0, 14, 28, 42, 56]]       [[0, 14, 28, 42, 56, 70, 84]]
3        71  [[0, 71, 142, 213, 284]]  [[0, 71, 142, 213, 284, 355, 426]]
4        60  [[0, 60, 120, 180, 240]]  [[0, 60, 120, 180, 240, 300, 360]]

df['d'] = df.b.combine(df.a, func=np.setdiff1d)
>>> df['d']
0    [255, 306]
1    [460, 552]
2      [70, 84]
3    [355, 426]
4    [300, 360]
Name: d, dtype: object
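
The argument order in combine matters here. As far as I understand it, Series.combine calls func(self_value, other_value) for each index label, so df.b.combine(df.a, ...) computes "elements of b that are not in a". A small sketch with made-up Series (s_a and s_b are illustrative, not taken from the DataFrame above) shows the difference:

```python
import numpy as np
import pandas as pd

# Two small object Series holding arrays of different lengths
s_a = pd.Series([np.arange(3), np.arange(2)])
s_b = pd.Series([np.arange(5), np.arange(4)])

# func receives (value from the calling Series, value from the other Series)
b_minus_a = s_b.combine(s_a, np.setdiff1d)  # elements of b missing from a
a_minus_b = s_a.combine(s_b, np.setdiff1d)  # elements of a missing from b

print(b_minus_a.iloc[0])  # [3 4]
print(len(a_minus_b.iloc[0]))  # 0, because every element of a is also in b
```

Swapping the two Series therefore gives empty arrays whenever a is a subset of b, as it is for the columns built in this answer.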

Note that np.split leaves an extra dimension; I'm not sure whether this can be avoided. You might want to remove it with np.squeeze:

>>> df['a'].apply(np.squeeze)
0    [0, 51, 102, 153, 204]
1    [0, 92, 184, 276, 368]
2       [0, 14, 28, 42, 56]
3    [0, 71, 142, 213, 284]
4    [0, 60, 120, 180, 240]
Name: a, dtype: object
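
One way that seems to sidestep the extra dimension altogether is to pass the outer product through list() instead of np.split, since iterating over a 2-D array yields its rows as 1-D arrays. This is a sketch of that idea, not something tested against every pandas version; df2 is a fresh DataFrame used only for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df2 = pd.DataFrame(np.random.randint(0, 100, size=(5, 1)), columns=['rand_int'])

values = df2['rand_int'].to_numpy()

# list(2-D array) gives one flat 1-D array per row, so no squeeze is needed
df2['a'] = list(np.outer(values, np.arange(5)))
df2['b'] = list(np.outer(values, np.arange(7)))

print(df2['a'].iloc[0].shape)  # (5,)
```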
