Reputation: 3852
I have a dataframe that looks like this:
df = pd.DataFrame(
{'tod': [[('a',10),('b',6),('h',3),('p',2)],
[('x',11),('e',2),('l',2)],
[('r',5),('l',5)],
[('n',15)]]})
tod
0 [(a, 10), (b, 6), (h, 3), (p, 2)]
1 [(x, 11), (e, 2), (l, 2)]
2 [(r, 5), (l, 5)]
3 [(n, 15)]
I want to expand list of tuples to new columns to get:
tod l1 n1 l2 n2 l3 n3 l4 n4
0 [(a, 10), (b, 6), (h, 3), (p, 2)] a 10 b 6 h 3 p 2
1 [(x, 11), (e, 2), (l, 2)] x 11 e 2 l 2 - nan
2 [(r, 5), (l, 5)] r 5 l 5 - nan - nan
3 [(n, 15)] n 15 - nan - nan - nan
I want to end up with missing values in the corresponding columns if the tuples do not exist.
I'm having trouble as the length of each list (number of tuples) within each cell is different for each row, so I want to dynamically assign new column values as they appear. Also each cell contains a list of tuple pairs, not a single tuple.
I tried something like in this question, but this only allowed expansion of a single tuple into multiple columns (when you know the columns before hand).
Then i looked at this and this but again, the number of columns is unknown so I ended up with:
pd.DataFrame.from_records([{k: v for v, k in row} for row in df.tod])
Out[171]:
2 3 5 6 10 11 15
0 p h NaN b a NaN NaN
1 l NaN NaN NaN NaN x NaN
2 NaN NaN l NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN n
Then I looked at splitting cells which contain tuples here and here which looked at converting the tuples to a series, but again, this did not work, as these examples only deal with single tuples of known length, not lists of tuples
How can I approach this problem?
Note: I realise I've not put up a lot of code for "What have you tried" - my console is a heap of garbage that has yielded continuous errors. In the interests of keeping this clean I've left this off.
Upvotes: 1
Views: 745
Reputation: 12015
n = max([len(row) for row in df.tod])
f = lambda l: sum(l, ()) + ('-', np.nan)*(n-len(l))
l = [list(f(row)) for row in df.tod]
ndf = pd.DataFrame(l,columns='l1 n1 l2 n2 l3 n3 l4 n4'.split())
# l1 n1 l2 n2 l3 n3 l4 n4
#0 a 10 b 6.0 h 3.0 p 2.0
#1 x 11 e 2.0 l 2.0 - NaN
#2 r 5 l 5.0 - NaN - NaN
#3 n 15 - NaN - NaN - NaN
df.join(ndf)
#
# tod l1 n1 l2 n2 l3 n3 l4 n4
#0 [(a, 10), (b, 6), (h, 3), (p, 2)] a 10 b 6.0 h 3.0 p 2.0
#1 [(x, 11), (e, 2), (l, 2)] x 11 e 2.0 l 2.0 - NaN
#2 [(r, 5), (l, 5)] r 5 l 5.0 - NaN - NaN
#3 [(n, 15)] n 15 - NaN - NaN - NaN
Upvotes: 0
Reputation: 862591
You can flatten tuples and then create columns names by generator, last join
to original DataFrame:
#https://stackoverflow.com/a/45122198/2901002
def mygen(lst):
for item in lst:
yield 'l{}'.format(item)
yield 'n{}'.format(item)
df1 = pd.DataFrame([[b for a in row for b in a] for row in df.tod])
df1.columns = list(mygen(range(1, len(df1.columns) // 2 + 1)))
print(df1)
l1 n1 l2 n2 l3 n3 l4 n4
0 a 10 b 6.0 h 3.0 p 2.0
1 x 11 e 2.0 l 2.0 None NaN
2 r 5 l 5.0 None NaN None NaN
3 n 15 None NaN None NaN None NaN
df = df.join(df1)
print (df)
tod l1 n1 l2 n2 l3 n3 l4 n4
0 [(a, 10), (b, 6), (h, 3), (p, 2)] a 10 b 6.0 h 3.0 p 2.0
1 [(x, 11), (e, 2), (l, 2)] x 11 e 2.0 l 2.0 None NaN
2 [(r, 5), (l, 5)] r 5 l 5.0 None NaN None NaN
3 [(n, 15)] n 15 None NaN None NaN None NaN
Upvotes: 2