Reputation: 31
I have data loaded into a dataframe but cannot figure out how to compare the parsed data against the other column and return only matches.
This seems like it should be easy but I just don't see it. I've tried splitting the values out to compare but here's where I get stuck.
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
# output something like...
df['output'] = [None, ';c1312;', ';d1310;']
I'd expect to see something like:
1st row - return null, as t9010 is not contained in col2_split
2nd row - return c1312, as it is in col2_split
3rd row - return d1310 but not c1512, as only d1310 is in col2_split
Lastly, the final text should be returned semicolon-delimited, with leading and trailing semicolons, i.e. ;t9010; or ;c1312; or ;d1310;c1512; if there is more than one.
Upvotes: 2
Views: 119
Reputation: 648
You can try this method to get all the values in col1 that also appear in col2. It first splits the string in each row into a list and drops the empty strings (length 0) that the leading and trailing semicolons produce. It then keeps the col1 values that are found in col2 and writes the result to the output column.
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
                    'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
# split on ';' and drop the empty strings left by the leading/trailing semicolons
df['col1_split'] = df.col1.apply(lambda x: [s for s in x.split(';') if s])
df['col2_split'] = df.col2.apply(lambda x: [s for s in x.split(';') if s])
def check(list1, list2):
    res = ''
    for i in list1:
        if i in list2:
            res += ';' + str(i)
    # add the closing semicolon only when something matched
    if len(res) > 0:
        res += ';'
    return res
df['output']=df.apply(lambda x: check(x.col1_split, x.col2_split), axis=1)
df
Hope this can help you.
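The splitting step can also be done without filtering afterwards. As a minimal sketch, assuming every value is wrapped in leading and trailing semicolons and has no empty fields in between, stripping the outer semicolons before splitting leaves no empty strings to drop (the names col1_tokens and col2_tokens are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [';t9010;', ';c1312;', ';d1310;c1512;'],
    'col2': [';t1010;d1010;c1012;', ';t1210;d1210;c1312;', ';t1310;d1310;c1412;'],
})

# Strip the outer ';' first, so split(';') yields only real tokens.
# Assumption: no empty fields such as ';;' inside a value.
col1_tokens = df['col1'].str.strip(';').str.split(';')
col2_tokens = df['col2'].str.strip(';').str.split(';')

print(col1_tokens.tolist())
# [['t9010'], ['c1312'], ['d1310', 'c1512']]
```

The resulting lists can be passed to check() in the same way as col1_split and col2_split.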
Upvotes: 0
Reputation: 309
The part where you split on ";" is correct. After that, you need to compare each element in col1_split with each element in col2_split. You can write a simple function to avoid many loops and use the pandas apply function to do the rest.
Here is some sample code for this:
import pandas as pd
df = pd.DataFrame({ 'col1': [';t9010;',';c1312;',';d1310;c1512;'],
'col2': [';t1010;d1010;c1012;',';t1210;d1210;c1312;',';t1310;d1310;c1412;']})
df['col1_split'] = df['col1'].str.split(';')
df['col2_split'] = df['col2'].str.split(';')
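def value_check(list1, list2):
    string = ""
    for i in list1:
        # skip the empty strings produced by the leading/trailing ';'
        if i and i in list2:
            string += ";" + i
    # close with a single ';' so several matches read ';a;b;' rather than ';a;;b;'
    return string + ";" if string else string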
df['output'] = df.apply(lambda x: value_check(x.col1_split, x.col2_split), axis=1)
df
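The test `i in list2` scans the whole list for every element. With three or four tokens per row that hardly matters, but if the lists were long, one option would be to build a set once per row. A rough sketch, where value_check_set is a hypothetical variant and not part of the answer's code:

```python
def value_check_set(list1, list2):
    # Build the lookup once per row; set membership is O(1) on average.
    lookup = set(list2)
    # Keep list1's order and skip the empty strings left by split(';').
    matches = [item for item in list1 if item and item in lookup]
    return ";" + ";".join(matches) + ";" if matches else ""


print(value_check_set(["", "d1310", "c1512", ""],
                      ["", "t1310", "d1310", "c1412", ""]))
# ;d1310;
```

It takes the same two arguments as value_check, so it could be dropped into the same df.apply call.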
Upvotes: 2
Reputation: 19885
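We can use a nested list comprehension for this. The empty strings that str.split(';') leaves at both ends of each list are common to both columns, so joining on ';' restores the leading and trailing semicolons, and a row with no match comes out as a bare ';':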
df['common'] = pd.Series([[sub for sub in left if sub in right] for left, right in zip(df['col1_split'], df['col2_split'])]).str.join(';')
print(df['common'])
Output:
0 ;
1 ;c1312;
2 ;d1310;
Name: common, dtype: object
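The question asked for null when nothing matches, while the result above holds a bare ';' in that case. One way to convert it, as a sketch only (blank_to_none is a hypothetical helper name, and the list below just repeats the values printed above):

```python
def blank_to_none(joined):
    # If nothing is left after stripping the semicolons, no token matched.
    return joined if joined.strip(';') else None


common = [';', ';c1312;', ';d1310;']
print([blank_to_none(value) for value in common])
# [None, ';c1312;', ';d1310;']
```

On the dataframe this could presumably be applied with df['common'].map(blank_to_none).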
Upvotes: -1