Intersection of 2 columns within a single Dataframe pandas

Question

import pandas as pd

df = pd.DataFrame({'Environment': [['AppleOS X','postgres','Apache','tomcat']], 'Description': [['Apache', 'Commons', 'Base32', 'decoding', 'invalid', 'rejecting', '.', 'via','valid', '.']] })

                             Environment                                                                Description
0  [AppleOS X, postgres, Apache, tomcat]  [Apache, Commons, Base32, decoding, invalid, rejecting, ., via, valid, .]

I am new to Pandas and dataframes, and I am having trouble finding the intersection of the two columns shown above.

Objective:

Environment and Description are two columns in a dataframe. The objective is to create a new column with the intersection of strings present in the first two columns.

Existing Implementation:

def f(param):
    return set.intersection(set(param['Environment']),set(param['Description']))

df['unique_words'] = df.apply(f, axis=1)
print(df['unique_words'])

I found this set intersection syntax at https://www.kite.com/python/answers/how-to-find-the-intersection-of-two-lists-in-python

Problem:

I am not sure how the above syntax works, but it returns {}.

Expected Output:

As ['Apache'] is present in both the columns, it should be the value in the new column created in the dataframe.

If anyone has written a similar function, please let me know. Any help is appreciated.

Trenton McKinney · Accepted Answer

Use set.intersection.
Map str.lower over the values in each list.
In terms of natural language processing, the list values should all be converted to lowercase before they are compared.
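To see why the lower-casing step matters, here is a small sketch on plain Python lists. The lower-case 'apache' in desc is a made-up value for illustration and is not taken from the question's data.

```python
env = ['AppleOS X', 'postgres', 'Apache', 'tomcat']
desc = ['apache', 'Commons', 'Base32']  # hypothetical: 'apache' in lower case

# Exact comparison: 'Apache' and 'apache' are different strings.
exact = set(env).intersection(desc)

# Case-insensitive comparison: lower-case both sides before intersecting.
folded = set(map(str.lower, env)).intersection(map(str.lower, desc))

print(exact)   # set()
print(folded)  # {'apache'}
```

set.intersection accepts any iterable as its argument, so only the left-hand side needs to be wrapped in set().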

# assumes only the two columns in the dataframe
# .iloc is used because positional access with x[0] on a labelled row is deprecated in recent pandas versions
df['common_words'] = df.apply(lambda x: list(set(map(str.lower, x.iloc[0])).intersection(map(str.lower, x.iloc[1]))), axis=1)

# if there are many columns, specify the two desired columns to compare
df['common_words'] = df[['Environment', 'Description']].apply(lambda x: list(set(map(str.lower, x['Environment'])).intersection(map(str.lower, x['Description']))), axis=1)

# display(df)
                             Environment                                                                Description common_words
0  [AppleOS X, postgres, Apache, tomcat]  [Apache, Commons, Base32, decoding, invalid, rejecting, ., via, valid, .]     [apache]
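For reference, the same idea can be written as a self-contained sketch that avoids apply by zipping the two columns together. The helper name common_words is hypothetical, and sorting the result is an added choice to make the output order deterministic; neither comes from the answer itself.

```python
import pandas as pd

df = pd.DataFrame({
    'Environment': [['AppleOS X', 'postgres', 'Apache', 'tomcat']],
    'Description': [['Apache', 'Commons', 'Base32', 'decoding', 'invalid',
                     'rejecting', '.', 'via', 'valid', '.']],
})

def common_words(left, right):
    """Return the sorted, lower-cased words present in both lists."""
    return sorted(set(map(str.lower, left)) & set(map(str.lower, right)))

# Pair up the two columns row by row and intersect each pair.
df['common_words'] = [
    common_words(env, desc)
    for env, desc in zip(df['Environment'], df['Description'])
]

print(df['common_words'].tolist())  # [['apache']]
```

Selecting the columns by name means this version does not depend on how many other columns the dataframe has.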
