Pandas look for string values of one col in multiple columns, write value if in other col in new column

Question

I have the following df with string col (small example, origin has more col & rows):

p = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']"},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun'},
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun'}
                  ])

If a value in col_1 matches a value in col_2 - col_5, write value of col_1 in new col, but if value already found let it unique in new col. How do I achieve this?

this matches only where is one value, but not multiple values:


mask = p[p.columns[2:6]].isin(p['col_1']).any(1)
# if value of col_1 is in col_2,col_3,col_4,col_5 write matching value in col_6, else xx
p['col_1'] = np.where(mask, p['col_1'], 'xx')

expected output in col_6:


p_new = pd.DataFrame([
                   {'ID': 1,'col_1': 'pluto', 'col_2':'saturn,neptune,uranus,saturn,eris,haumea', 'col_3':'saturn,neptune,uranus,haumea,makemake,ceres','col_4':'mars,venus,planet x,earth','col_5':'sun','col_6':'xx'}, 
                   {'ID': 2,'col_1': 'sun, earth', 'col_2':'earth,venus,,jupyter,bennu,apophis', 'col_3':'bennu,apophis,vesta,eros,didymos','col_4':'earth,venus,other,hale-bopp','col_5':'sun','col_6':'earth,sun'}, 
                   {'ID': 3,'col_1': 'saturn', 'col_2':'oumuamua,g1,tempel', 'col_3':'saturn','col_4':'mars','col_5':"['saturn']",'col_6':'saturn'},
                   {'ID': 4,'col_1': 'mercury, itokawa, venus, earth', 'col_2':'mercury,venus,itokawa', 'col_3':'mercury,itokawa,saturn','col_4':'venus,other,mars,earth','col_5':'sun','col_6':'mercury,itokawa,venus,earth', },
                   {'ID': 5,'col_1': 'saturn', 'col_2':'saturn', 'col_3':'saturn','col_4':'mars,other','col_5':'sun','col_6':'saturn'}
                  ])

Ynjxsjmh · Accepted Answer

You can convert the values to set

df = pd.DataFrame({'col1': p['col_1'].str.split(', ?').apply(set),
                   'col2': p.filter(regex='col_[2-5]').agg(','.join, axis=1).str.split(',').apply(set)})

print(df)

                               col1  \
0                           {pluto}
1                      {sun, earth}
2                          {saturn}
3  {venus, mercury, earth, itokawa}
4                          {saturn}

                                                                                          col2
0  {makemake, uranus, ceres, saturn, mars, sun, planet x, venus, earth, eris, haumea, neptune}
1       {, jupyter, eros, sun, vesta, bennu, other, venus, apophis, earth, hale-bopp, didymos}
2                                             {oumuamua, mars, g1, saturn, ['saturn'], tempel}
3                                   {saturn, mars, sun, other, venus, earth, mercury, itokawa}
4                                                                   {sun, other, mars, saturn}

Then find the intersection part and convert it back to string

p['col_6'] = df.apply(lambda row: ','.join(row['col1'] & row['col2']), axis=1)

print(p)

   ID                           col_1  \
0   1                           pluto
1   2                      sun, earth
2   3                          saturn
3   4  mercury, itokawa, venus, earth
4   5                          saturn

                                      col_2  \
0  saturn,neptune,uranus,saturn,eris,haumea
1        earth,venus,,jupyter,bennu,apophis
2                        oumuamua,g1,tempel
3                     mercury,venus,itokawa
4                                    saturn

                                         col_3                        col_4  \
0  saturn,neptune,uranus,haumea,makemake,ceres    mars,venus,planet x,earth
1             bennu,apophis,vesta,eros,didymos  earth,venus,other,hale-bopp
2                                       saturn                         mars
3                       mercury,itokawa,saturn       venus,other,mars,earth
4                                       saturn                   mars,other

        col_5                        col_6
0         sun
1         sun                    sun,earth
2  ['saturn']                       saturn
3         sun  venus,mercury,earth,itokawa
4         sun                       saturn

Pandas look for string values of one col in multiple columns, write value if in other col in new column

Answers (1)

Related Questions