Reputation: 177
I have in my opinion a little tricky problem with tables in pandas. I have a Table of tags data and documents in below format :
Tags Docs
0 A1-0001-AA-001/002 Doc1
1 A1-0001-AA-003/004 Doc1
2 A1-0001-AA-006/009 Doc2
3 A1-111-BB-001A/B/C Doc3
4 A1-222-BC-002A/B/C Doc4
5 A1-222-BB-001/2/3 Doc4
6 A1-3333-CD-001/003 Doc6
Or if you want to copy the code:
d= {'Tags':['A1-0001-AA-001/002',
'A1-0001-AA-003/004',
'A1-0001-AA-006/009',
'A1-111-BB-001A/B/C',
'A1-222-BC-002A/B/C',
'A1-222-BB-001/2/3',
'A1-3333-CD-001/003',
], 'Docs': ['Doc1', 'Doc1','Doc2','Doc3','Doc4','Doc4','Doc6']}
df = pd.DataFrame(data=d)
And I would like to split the rows in "Tags" by delimiter "/" and group by document. The problem is that if I split by delimiter, the second part will be left out without the suffix and this suffix is different on each row. Is there a way to split those? Thanks in advance for any ideas!
Upvotes: 0
Views: 52
Reputation: 862591
Solution working if same length of values after /
per groups created from rows of original data.
First use Series.str.split
with DataFrame.explode
and then split values by lengths from first values per groups and add to all another values:
df = df.assign(Tags=df['Tags'].str.split('/')).explode('Tags')
m = df.index.duplicated()
s = df['Tags'].str.len().groupby(level=0).transform('last')
df['new'] = [x[:-y] for x, y in zip(df['Tags'], s)]
df['Tags'] = (df['Tags'].mask(m, df.pop('new')
.groupby(level=0)
.transform('first')
.add(df['Tags'])))
df = df.reset_index(drop=True)
print (df)
Tags Docs
0 A1-0001-AA-001 Doc1
1 A1-0001-AA-002 Doc1
2 A1-0001-AA-003 Doc1
3 A1-0001-AA-004 Doc1
4 A1-0001-AA-006 Doc2
5 A1-0001-AA-009 Doc2
6 A1-111-BB-001A Doc3
7 A1-111-BB-001B Doc3
8 A1-111-BB-001C Doc3
9 A1-222-BC-002A Doc4
10 A1-222-BC-002B Doc4
11 A1-222-BC-002C Doc4
12 A1-222-BB-001 Doc4
13 A1-222-BB-002 Doc4
14 A1-222-BB-003 Doc4
15 A1-3333-CD-001 Doc6
16 A1-3333-CD-003 Doc6
Upvotes: 1