Reputation: 25
I'm new to Python so I'm sorry if terminology is not correct; I've searched for similar posts but didn't find anything helpful for my case. I have a dataframe like this:
Column1 Column2
0 0001 [('A','B'),('C','D'),('E','F')]
1 0001 [('A','B'),('C','D'),('E','F')]
2 0001 [('A','B'),('C','D'),('E','F')]
3 0002 [('G','H'),('I','J')]
4 0002 [('G','H'),('I','J')]
Each row is replicated n times based on the number of tuples contained in the list of Column2. What I'd like to do is to add a new column containing only one tuple per row:
Column1 Column2 Column2_new
0 0001 [('A','B'),('C','D'),('E','F')] 'A' 'B'
1 0001 [('A','B'),('C','D'),('E','F')] 'C' 'D'
2 0001 [('A','B'),('C','D'),('E','F')] 'E' 'F'
3 0002 [('G','H'),('I','J')] 'G' 'H'
4 0002 [('G','H'),('I','J')] 'I' 'J'
Can you please help me with this?
Thanks in advance for any suggestion
Upvotes: 2
Views: 57
Reputation: 28729
data = {'Column1':["0001"]*3 + ["0002"]*2,
'Column2':[[('A','B'),('C','D'),('E','F')]]*3 + [[('G','H'),('I','J')]]*2
}
df = pd.DataFrame(data)
print(df)
Column1 Column2
0 0001 [(A, B), (C, D), (E, F)]
1 0001 [(A, B), (C, D), (E, F)]
2 0001 [(A, B), (C, D), (E, F)]
3 0002 [(G, H), (I, J)]
4 0002 [(G, H), (I, J)]
M = df.drop_duplicates('Column1')
print(M)
Column1 Column2
0 0001 [(A, B), (C, D), (E, F)]
3 0002 [(G, H), (I, J)]
pd.concat([df,M.Column2.explode().reset_index(drop=True).rename('new')],axis=1)
Column1 Column2 new
0 0001 [(A, B), (C, D), (E, F)] (A, B)
1 0001 [(A, B), (C, D), (E, F)] (C, D)
2 0001 [(A, B), (C, D), (E, F)] (E, F)
3 0002 [(G, H), (I, J)] (G, H)
4 0002 [(G, H), (I, J)] (I, J)
Alternatively, you could use the itertools functions - product and chain to get ur data, and concat back to the original dataframe:
from itertools import product,chain
res = chain.from_iterable(product([first],last)
for first, last
in zip(M.Column1, M.Column2))
out = pd.DataFrame(res,columns=['Column1','new'])
pd.concat((df,out.new),axis=1)
Upvotes: 1
Reputation: 75150
We can do df.lookup
after groupby+cumcount
idx = df.groupby('Column1').cumcount()
df['new']= pd.DataFrame(df['Column2'].tolist()).lookup(df.index,idx)
print(df)
Column1 Column2 new
0 1 [(A, B), (C, D), (E, F)] (A, B)
1 1 [(A, B), (C, D), (E, F)] (C, D)
2 1 [(A, B), (C, D), (E, F)] (E, F)
3 2 [(G, H), (I, J)] (G, H)
4 2 [(G, H), (I, J)] (I, J)
Upvotes: 2