MikeM

Reputation: 31

Compare two dataframe columns in same dataframe and return text contained in first column

I have data loaded into a dataframe but cannot figure out how to compare the parsed data against the other column and return only matches.

This seems like it should be easy but I just don't see it. I've tried splitting the values out to compare but here's where I get stuck.

import pandas as pd

df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
                    'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})

df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')


# desired output, something like...
df['output'] = [None, ';c1312;', ';d1310;']

I'd expect to see something like -

1st row - return null, as t9010 is not contained in col2_split

2nd row - return c1312, as it is in col2_split

3rd row - return d1310 but not c1512, as only d1310 is in col2_split

Lastly, the final text should be returned semicolon-delimited, with leading and trailing semicolons, e.g. ;t9010; or ;c1312;, or ;d1310;c1512; if there is more than one match.

Upvotes: 2

Views: 119

Answers (3)

ALFAFA

Reputation: 648

You can try this method to get every value in col1 that also appears in col2. First split the string in each row into a list and drop the empty strings that the leading and trailing semicolons produce. Then, for each row, keep the col1 values that are found in the col2 list and write the result to the output column.

import pandas as pd

df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
                    'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})

# split on ';' and drop the empty strings
df['col1_split'] = df['col1'].apply(lambda x: [s for s in x.split(';') if s])
df['col2_split'] = df['col2'].apply(lambda x: [s for s in x.split(';') if s])

def check(list1, list2):
    res = ''
    for i in list1:
        # prefix each matching value with a semicolon
        if i in list2:
            res += ';' + str(i)
    # close the string with a trailing semicolon if anything matched
    if len(res) > 0:
        res += ';'
    return res

df['output']=df.apply(lambda x: check(x.col1_split, x.col2_split), axis=1)
df

Hope this can help you.

Upvotes: 0

Madhuri Sangaraju

Reputation: 309

Splitting on ";" is the right first step. After that, you need to check each element of col1_split against the elements of col2_split. You can put that check in a small function and let the pandas apply function run it on every row.

Here is some sample code:

import pandas as pd

df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
                    'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})

df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')

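def value_check(list1, list2):
    # keep the non-empty items of list1 that also appear in list2
    matches = [i for i in list1 if i and i in list2]
    # wrap in semicolons only when something matched, so that several
    # matches give ';a;b;' rather than ';a;;b;'
    return ";" + ";".join(matches) + ";" if matches else ""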

df['output'] = df.apply(lambda x: value_check(x.col1_split, x.col2_split), axis=1)
df


Upvotes: 2

gmds

Reputation: 19885

We can use a nested list comprehension for this, working on the col1_split and col2_split columns from the question:

df['common'] = pd.Series([[sub for sub in left if sub in right] for left, right in zip(df['col1_split'], df['col2_split'])]).str.join(';')

print(df['common'])

Output:

0          ;
1    ;c1312;
2    ;d1310;
Name: common, dtype: object
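
Note that row 0 comes out as a bare ";" rather than null, because the empty strings produced by the split match each other. If that matters, one possible tweak is sketched below. It drops the empty tokens before comparing and returns None when nothing matches. The helper name common_tokens is made up for illustration, and it works on the raw col1 and col2 strings rather than the split columns.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [';t9010;', ';c1312;', ';d1310;c1512;'],
    'col2': [';t1010;d1010;c1012;', ';t1210;d1210;c1312;', ';t1310;d1310;c1412;'],
})

def common_tokens(left, right):
    # hypothetical helper, not part of pandas
    # non-empty tokens of the right-hand string, as a set for fast lookup
    right_tokens = {tok for tok in right.split(';') if tok}
    # non-empty tokens of the left-hand string that also occur on the right,
    # kept in their original order
    matches = [tok for tok in left.split(';') if tok and tok in right_tokens]
    # wrap in leading/trailing semicolons, or return None if nothing matched
    return ';' + ';'.join(matches) + ';' if matches else None

df['common'] = [common_tokens(l, r) for l, r in zip(df['col1'], df['col2'])]
print(df['common'])
```

With several matches in one row the values are joined with single semicolons, e.g. common_tokens(';a;b;', ';b;a;') gives ';a;b;'.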

Upvotes: -1
