Reputation: 1430
I have a dataframe like this:
import pandas as pd
data = [{'id': 'Jones', 'tf': [(0, 0.5), (1,2.0)]},
{'id': 'Alpha', 'tf': [(1,2.0)]},
{'id': 'Blue', 'tf': [(2,0.1),(1,0.2)]}]
df = pd.DataFrame(data)
I want to get a dataframe in this form:
'id', 'var', 'value'
Jones, 0, 0.5
Jones, 1, 2.0
Alpha, 1, 2.0
Blue, 2, 0.1
Blue, 1, 0.2
I can do it in two steps:
i) unnest to a wide frame with columns id, 0, 1, 2:
id   ,0  ,1  ,2
Jones,0.5,2.0,NaN
Alpha,NaN,2.0,NaN
Blue ,NaN,0.2,0.1
ii) melt with id
But there is a problem with step i). My dataset is rather sparse, so unnesting takes a lot of memory just to hold the NaNs.
I'm looking for a pandastic solution that avoids unnesting and is memory efficient.
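For concreteness, here is a minimal sketch of the two-step route above (assuming only the three vars shown; my real data has many more vars, mostly absent per row):

```python
import pandas as pd

data = [{'id': 'Jones', 'tf': [(0, 0.5), (1, 2.0)]},
        {'id': 'Alpha', 'tf': [(1, 2.0)]},
        {'id': 'Blue', 'tf': [(2, 0.1), (1, 0.2)]}]
df = pd.DataFrame(data)

# i) unnest: one row per id, one column per var (the NaN padding appears here)
wide = pd.DataFrame([dict(pairs) for pairs in df['tf']],
                    index=df['id']).reset_index()

# ii) melt back to long form and drop the NaN padding
long = wide.melt('id', var_name='var', value_name='value').dropna(subset=['value'])
```

With a sparse dataset, `wide` is the part that blows up in memory.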
Upvotes: 0
Views: 192
Reputation: 164773
This is the loopy way. It will not be fast, but it requires minimal memory.
I use .iat for fast integer-based lookup, so care is required if you have other columns in your dataframe.
import pandas as pd
data = [{'id': 'Jones', 'tf': [(0, 0.5), (1,2.0)]},
{'id': 'Alpha', 'tf': [(1,2.0)]},
{'id': 'Blue', 'tf': [(2,0.1),(1,0.2)]}]
df = pd.DataFrame(data)
df = df.join(pd.DataFrame(columns=[0, 1, 2]))
for idx, lst in enumerate(df['tf']):
    for tup in lst:
        df.iat[idx, tup[0] + 2] = tup[1]  # offset 2: 'id' and 'tf' come first
df = df.drop(columns='tf').melt('id').dropna(subset=['value'])
# id variable value
# 0 Jones 0 0.5
# 3 Jones 1 2
# 4 Alpha 1 2
# 5 Blue 1 0.2
# 8 Blue 2 0.1
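The same loopy idea can also be sketched without the wide intermediate at all, building the long records directly (a variant of mine, not the code above):

```python
import pandas as pd

data = [{'id': 'Jones', 'tf': [(0, 0.5), (1, 2.0)]},
        {'id': 'Alpha', 'tf': [(1, 2.0)]},
        {'id': 'Blue', 'tf': [(2, 0.1), (1, 0.2)]}]
df = pd.DataFrame(data)

# Emit one (id, var, value) row per tuple, so the wide
# NaN-padded frame never exists.
records = [(row_id, var, value)
           for row_id, pairs in zip(df['id'], df['tf'])
           for var, value in pairs]
out = pd.DataFrame(records, columns=['id', 'var', 'value'])
```

Memory use is then proportional to the number of non-missing entries only.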
Upvotes: 0
Reputation: 323316
Should be fast
s = df.tf.str.len()
t = pd.DataFrame({'id': df.id.repeat(s), 'V': df.tf.sum()})
t[['var', 'value']] = pd.DataFrame(t.V.tolist()).values
t
Out[550]:
V id var value
0 (0, 0.5) Jones 0.0 0.5
0 (1, 2.0) Jones 1.0 2.0
1 (1, 2.0) Alpha 1.0 2.0
2 (2, 0.1) Blue 2.0 0.1
2 (1, 0.2) Blue 1.0 0.2
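A possible variant of the same repeat trick (my sketch, not the code above) that flattens the lists linearly and skips the helper V column; note that df.tf.sum() concatenates the lists pairwise, which scales quadratically with the number of rows:

```python
import pandas as pd
from itertools import chain

data = [{'id': 'Jones', 'tf': [(0, 0.5), (1, 2.0)]},
        {'id': 'Alpha', 'tf': [(1, 2.0)]},
        {'id': 'Blue', 'tf': [(2, 0.1), (1, 0.2)]}]
df = pd.DataFrame(data)

# Repeat each id once per tuple in its list, then flatten all the
# (var, value) tuples in a single linear pass.
s = df['tf'].str.len()
out = pd.DataFrame(chain.from_iterable(df['tf']), columns=['var', 'value'])
out.insert(0, 'id', df['id'].repeat(s).to_numpy())
```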
Upvotes: 1