Bits

Reputation: 276

Keyword matching between dictionary values (lists) and a pandas column

Let's say I have a dataframe df with a column named news_text:

news_text
lebron james is the great basketball player.
leonardo di caprio has won the oscar for best actor
avatar was directed by steven speilberg.
ronaldo has resigned from manchester united.
argentina beats france in fifa world cup 2022.
joe biden has won the president elections.
2026 fifa WC will be host by canada,mexico and usa combined.

and a large dictionary with hundreds of keys, something like:

{'category_1': ['lebron james', 'oscar', 'leonardo dicaprio'], 'category_2': ['basketball', 'steven speilberg','manchester united'], 
'category_3': ['ronaldo', 'argentina','world cup']...so on}

I want to perform exact keyword matching between the dictionary values (each of which is a list of keywords) and df['news_text']. When a keyword matches, the corresponding dictionary key should be added to a list in a new column, mapped_category; if no keyword from any of the lists is found, the column value should be NA. The final output would be something like:

news_text                                                    mapped_category
lebron james is the great basketball player.               ['category_1', 'category_2']
leonardo di caprio has won the oscar for best actor        ['category_1','category_1']
avatar was directed by steven speilberg.                   ['category_2']
ronaldo has resigned from manchester united.               ['category_2','category_3']
argentina beats france in fifa world cup 2022.             ['category_3','category_3']
joe biden has won the president elections.                        NA
2026 fifa WC will be host by canada,mexico and usa combined.      NA

Upvotes: 0

Views: 28

Answers (1)

shadowtalker

Reputation: 13823

The simplest (not necessarily fastest or fanciest) way to do this is to write a function that produces the desired list of categories for one news document, and then apply that function to the series of documents:

categories = {
    'category_1': ['lebron james', 'oscar', 'leonardo dicaprio'],
    'category_2': ['basketball', 'steven speilberg','manchester united'],
    'category_3': ['ronaldo', 'argentina','world cup'],
}

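def find_categories(document):
    # Collect every category that has at least one keyword in the document.
    found = []
    for category, keywords in categories.items():
        # Substring test; any() stops at the first hit, so each category
        # is listed at most once per document.
        if any(keyword in document for keyword in keywords):
            found.append(category)
    return found

# Rows with no matching keyword get an empty list.
df['news_categories'] = df['news_text'].apply(find_categories)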
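
Note that `keyword in document` is a substring test, so 'oscar' would also match inside 'oscars', and unmatched rows get an empty list rather than NA. If whole-word matching and NA are what you need, one possible variant is sketched below. It assumes that "exact" means matching on word boundaries and ignoring case, and that each category should appear once per row; the names `patterns` and `map_categories` are only illustrative, and the data is the sample from the question.

```python
import re

import pandas as pd

# Sample data copied from the question (only the three categories shown there).
categories = {
    'category_1': ['lebron james', 'oscar', 'leonardo dicaprio'],
    'category_2': ['basketball', 'steven speilberg', 'manchester united'],
    'category_3': ['ronaldo', 'argentina', 'world cup'],
}

df = pd.DataFrame({'news_text': [
    'lebron james is the great basketball player.',
    'leonardo di caprio has won the oscar for best actor',
    'avatar was directed by steven speilberg.',
    'ronaldo has resigned from manchester united.',
    'argentina beats france in fifa world cup 2022.',
    'joe biden has won the president elections.',
    '2026 fifa WC will be host by canada,mexico and usa combined.',
]})

# One compiled pattern per category. re.escape guards against special
# characters in keywords; \b keeps 'oscar' from matching inside 'oscars'.
# Assumption: case-insensitive matching is acceptable.
patterns = {
    category: re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        flags=re.IGNORECASE,
    )
    for category, keywords in categories.items()
}

def map_categories(document):
    # Each category is listed once, in dictionary order.
    found = [c for c, p in patterns.items() if p.search(document)]
    # Return NA instead of an empty list when nothing matched.
    return found if found else pd.NA

df['mapped_category'] = df['news_text'].apply(map_categories)
```

Compiling one alternation per category means each document is scanned once per category instead of once per keyword, which may help with hundreds of keys, though I have not benchmarked it.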

Upvotes: 0
