Reputation: 1599
I would like to create a numpy array from pandas dataframe.
My code:
import pandas as pd
_df = pd.DataFrame({'itme': ['book', 'book' , 'car', ' car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
item color val
book green -22.70
book blue -109.60
car red -57.19
car green -11.20
bike blue -25.60
bike red -33.61
There are about 12k million rows.
I need to create a numpy array like :
item green blue red
book -22.70 -109.60 null
car -11.20 null -57.19
bike null -25.60 -33.16
each row is the item name and each col is color name. The order of the items and colors are not important. But, in numpy array, there are no row and column names, I need to keep the item and color name for each value, so that I know what the value represents in the numpy array.
For example
how to know that -57.19 is for "car" and "red" in numpy array ?
So, I need to create a dictionary to keep the mapping between :
item <--> row index in the numpy array
color <--> col index in the numpy array
I do not want to use iteritems and itertuples because they are not efficient for large dataframe due to How to iterate over rows in a DataFrame in Pandas and How to iterate over rows in a DataFrame in Pandas and Python Pandas iterate over rows and access column names and Does pandas iterrows have performance issues?
I prefer numpy vectorization solution for this.
How to efficiently convert the pandas dataframe to numpy array ? The array will also be transformed to torch.tensor.
thanks
Upvotes: 5
Views: 2487
Reputation: 62503
numpy.recarry
using pandas.DataFrame.to_records
, and also use Boolean indexing.item
is a method for both pandas
and numpy
, so don't use 'item'
as a column name. It has been changed to '_item'
.numpy
is a pandas
dependency, and much of pandas
vectorized functionality directly corresponds to numpy
.import pandas as pd
import numpy as np
# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
# Use pandas Boolean index to
selected = df[(df._item == 'book') & (df.color == 'blue')]
# print(selected)
_item color val
book blue -109.6
# Alternatively, create a recarray
v = df.to_records(index=False)
# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])
# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]
# print(selected)
[('book', 'blue', -109.6)]
pandas.DataFrame.pivot
, and then use the previously mentioned methods.dfp = df.pivot(index='_item', columns='color', values='val')
# display(dfp)
color blue green red
_item
bike -25.6 NaN -33.61
book -109.6 -22.7 NaN
car NaN -11.2 -57.19
# create a numpy recarray
v = dfp.to_records(index=True)
# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])
# select data
selected = v.blue[(v._item == 'book')]
# print(selected)
array([-109.6])
Upvotes: 4