Reputation: 659
I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N
, and the column values are all integers below some M
. The values are all positive integers.
How do I efficiently create a Dataframe with N
rows and M
columns, with at index i, j
the value val
if (i, j , val)
is a row in my original Dataframe, and some default value (0
) otherwise? Furthermore, is it possible to create a sparse Dataframe immediately, since the data is already quite large, but N*M
is still about 10 times the size of my data?
Upvotes: 2
Views: 1055
Reputation: 221754
A NumPy solution would suit here for performance -
a = df.values
m,n = a[:,:2].max(0)+1
out = np.zeros((m,n),dtype=a.dtype)
out[a[:,0], a[:,1]] = a[:,2]
df_out = pd.DataFrame(out)
Sample run -
In [58]: df
Out[58]:
row col val
0 7 1 30
1 3 3 0
2 4 8 30
3 5 8 18
4 1 3 6
5 1 6 48
6 0 2 6
7 4 7 6
8 5 0 48
9 8 1 48
10 3 2 12
11 6 8 18
In [59]: df_out
Out[59]:
0 1 2 3 4 5 6 7 8
0 0 0 6 0 0 0 0 0 0
1 0 0 0 6 0 0 48 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 6 30
5 48 0 0 0 0 0 0 0 18
6 0 0 0 0 0 0 0 0 18
7 0 30 0 0 0 0 0 0 0
8 0 48 0 0 0 0 0 0 0
Upvotes: 1