James Plinkus

Reputation: 55

Search a list to see if it contains strings stored in a different list in Python

I have a list of single words (word_list) and another list of article headlines taken from one column of a CSV (headline_col). Each headline is a string of many words. I want to search the headlines to see whether they contain any of the words in word_list and, if so, append the headline to another list (slam_list).

Everything I've found when looking this up only matches one exact string against another, for example checking whether an entry is exactly "apple", not whether "apple" appears in "john ate an apple today".

I've tried using sets, but I was only able to get it to return True when there was a match. I didn't know how to get it to append to slam_list, or even just print the entry. This is what I have. How would I use this to get what I need?

import csv

word_list = ["Slam", "Slams", "Slammed", "Slamming",
             "Blast", "Blasts", "Blasting", "Blasted"]

slam_list = []
csv_data = []

# Creating the list I need by opening a csv and getting the column I need
with open("website_headlines.csv", encoding="utf8") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        csv_data.append(row)

headline_col = [headline[2] for headline in csv_data]

Upvotes: 1

Views: 111

Answers (2)

Eric Truett

Reputation: 3010

Since you are reading a CSV, it will likely be easier to use pandas.

Identify the column by its index, which looks like it is 2 (the third column), then select the values in that column that are in word_list. Note that isin tests each value for exact equality with an entry in word_list; it does not look for those words inside a longer headline.

import pandas as pd

df = pd.read_csv("website_headlines.csv")
col = df.columns[2]
df.loc[df[col].isin(word_list), col]

Consider the following example

import numpy as np
import pandas as pd

word_list = ["Slam", "Slams", "Slammed", "Slamming",
             "Blast", "Blasts", "Blasting", "Blasted"]

# add some extra characters to see if limited to exact matches
word_list_mutated = np.random.choice(word_list + [item + '_extra' for item in word_list], 10)

data = {'a': range(1, 11), 'b': range(1, 11), 'c': word_list_mutated}
df = pd.DataFrame(data)
col = df.columns[2]

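>>> df
    a   b               c
0   1   1           Slams
1   2   2           Slams
2   3   3   Blasted_extra
3   4   4          Blasts
4   5   5     Slams_extra
5   6   6  Slamming_extra
6   7   7            Slam
7   8   8     Slams_extra
8   9   9            Slam
9  10  10        Blasting

>>> # isin() keeps only the exact matches; the "_extra" rows are dropped
>>> df.loc[df[col].isin(word_list), col].tolist()
['Slams', 'Slams', 'Blasts', 'Slam', 'Slam', 'Blasting']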
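
Because isin only matches whole cell values, headlines that merely contain one of the words need a substring test instead. Here is a minimal sketch of one way to do that with Series.str.contains. The DataFrame is a made-up stand-in for website_headlines.csv (the column names and headlines are assumptions), and matching whole words case-insensitively is my choice, not something the question requires.

```python
import re

import pandas as pd

word_list = ["Slam", "Slams", "Slammed", "Slamming",
             "Blast", "Blasts", "Blasting", "Blasted"]

# Hypothetical stand-in for the CSV: the headlines sit in the third column
df = pd.DataFrame({
    "site": ["a", "b", "c"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "headline": [
        "Senator Slams New Budget Proposal",
        "Local Bakery Wins Award",
        "Critics blast the sequel",
    ],
})
col = df.columns[2]

# One regex that matches any of the words as a whole word.
# re.escape guards against regex metacharacters in the words.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in word_list) + r")\b"

# case=False ignores capitalisation; na=False treats missing cells as no match
mask = df[col].str.contains(pattern, case=False, regex=True, na=False)

slam_list = df.loc[mask, col].tolist()
print(slam_list)
# ['Senator Slams New Budget Proposal', 'Critics blast the sequel']
```

Drop case=False if the capitalised forms in word_list should be the only ones that count.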

Upvotes: 0

James Grammatikos

Reputation: 480

Using sets, as you mentioned, is definitely the way to go here. Membership tests on a set take roughly constant time on average, whereas a list has to be scanned element by element. If you want to know why, do a quick search on hashing. All you need to do to make this change is replace the square brackets around word_list with curly braces.
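
As a quick illustration of that change, the sketch below builds word_list with curly braces and runs a few membership tests. The lookups are only examples I picked to show that membership is exact and case-sensitive.

```python
# Curly braces make this a set rather than a list
word_list = {"Slam", "Slams", "Slammed", "Slamming",
             "Blast", "Blasts", "Blasting", "Blasted"}

print(type(word_list).__name__)  # set

# `in` on a set is a hash lookup, not a scan over every element
print("Blast" in word_list)      # True

# Membership is exact: different case or a longer string does not match
print("blast" in word_list)                  # False
print("Blast from the past" in word_list)    # False
```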

The real issue you need to deal with is this: "The headlines are strings of many words, while the word_list are single words"

What you need to do is iterate over the many words. I'm assuming headline_col is a list of headlines, where headline is a string containing one or more words. We'll iterate over all the headlines, then iterate over each word in the headline.

word_list = {"Slam", "Slams", "Slammed", "Slamming", "Blast", "Blasts", "Blasting", "Blasted"}

slam_list = []

# Iterate over each headline
for headline in headline_col:

    # Iterate over each word in the headline
    # headline.split() breaks the headline into a list of words (splitting on whitespace)
    for word in headline.split():

        # if we've found one of our words
        if word in word_list:
            # add the headline to our list
            slam_list.append(headline)
            # we're done with this headline, so break out of the inner for loop
            break
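
One caveat with the loop above: the comparison is exact, so "slams" (lowercase) or "Slams," (with trailing punctuation) would not match. Below is a minimal sketch of one way to loosen that, assuming you want case-insensitive matching and want to ignore punctuation attached to a word. The sample headlines are made up for illustration.

```python
import string

word_list = {"Slam", "Slams", "Slammed", "Slamming",
             "Blast", "Blasts", "Blasting", "Blasted"}

# Lowercase the search words once so comparisons can ignore case
lowered_words = {w.lower() for w in word_list}

# Made-up headlines standing in for headline_col
headline_col = [
    "Senator slams new budget proposal",
    '"Blasted!" says coach after loss',
    "Local bakery wins award",
]

slam_list = []
for headline in headline_col:
    # Strip leading/trailing punctuation from each word, then lowercase it
    words = {w.strip(string.punctuation).lower() for w in headline.split()}
    # isdisjoint is False as soon as the two sets share at least one word
    if not lowered_words.isdisjoint(words):
        slam_list.append(headline)

print(slam_list)
# ['Senator slams new budget proposal', '"Blasted!" says coach after loss']
```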

Upvotes: 1
