Reputation: 27
I have the following df with string col (small example, origin has more col & rows):
p = pd.DataFrame([
{'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun'},
{'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun'},
{'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']"},
{'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun'},
{'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun'}
])
If a value in col_1 matches a value in col_2 - col_5, write value of col_1 in new col, but if value already found let it unique in new col. How do I achieve this?
this matches only where is one value, but not multiple values:
mask = p[p.columns[2:6]].isin(p['col_1']).any(1)
# if value of col_1 is in col_2,col_3,col_4,col_5 write matching value in col_6, else xx
p['col_1'] = np.where(mask, p['col_1'], 'xx')
expected output in col_6:
p_new = pd.DataFrame([
{'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun','col_6':'xx'},
{'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun','col_6':'earth,sun'},
{'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']",'col_6':'saturn'},
{'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun','col_6':'mercury,itokawa,venus,earth', },
{'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun','col_6':'saturn'}
])
Upvotes: 0
Views: 85
Reputation: 30032
You can convert the values to set
df = pd.DataFrame({'col1': p['col_1'].str.split(', ?').apply(set),
'col2': p.filter(regex='col_[2-5]').agg(','.join, axis=1).str.split(',').apply(set)})
print(df)
col1 \
0 {pluto}
1 {sun, earth}
2 {saturn}
3 {venus, mercury, earth, itokawa}
4 {saturn}
col2
0 {makemake, uranus, ceres, saturn, mars, sun, planet x, venus, earth, eris, haumea, neptune}
1 {, jupyter, eros, sun, vesta, bennu, other, venus, apophis, earth, hale-bopp, didymos}
2 {oumuamua, mars, g1, saturn, ['saturn'], tempel}
3 {saturn, mars, sun, other, venus, earth, mercury, itokawa}
4 {sun, other, mars, saturn}
Then find the intersection part and convert it back to string
p['col_6'] = df.apply(lambda row: ','.join(row['col1'] & row['col2']), axis=1)
print(p)
ID col_1 \
0 1 pluto
1 2 sun, earth
2 3 saturn
3 4 mercury, itokawa, venus, earth
4 5 saturn
col_2 \
0 saturn,neptune,uranus,saturn,eris,haumea
1 earth,venus,,jupyter,bennu,apophis
2 oumuamua,g1,tempel
3 mercury,venus,itokawa
4 saturn
col_3 col_4 \
0 saturn,neptune,uranus,haumea,makemake,ceres mars,venus,planet x,earth
1 bennu,apophis,vesta,eros,didymos earth,venus,other,hale-bopp
2 saturn mars
3 mercury,itokawa,saturn venus,other,mars,earth
4 saturn mars,other
col_5 col_6
0 sun
1 sun sun,earth
2 ['saturn'] saturn
3 sun venus,mercury,earth,itokawa
4 sun saturn
Upvotes: 1