kartikeya saraswat

Reputation: 23

Python program to find if a certain keyword is present in a list of documents (strings)

Question: A researcher has gathered thousands of news articles, but she wants to focus her attention on articles that include a specific word.

The function should meet the following criteria:

Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”

She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”.

Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
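Taken together, the criteria amount to: split on whitespace, strip trailing periods and commas, and compare case-insensitively. A minimal sketch of that check (`contains_keyword` is a hypothetical helper name, not part of the assignment):

```python
def contains_keyword(document, keyword):
    # Split on whitespace, strip trailing periods/commas, compare lowercased
    tokens = [word.rstrip('.,').lower() for word in document.split()]
    return keyword.lower() in tokens

print(contains_keyword('It is closed.', 'closed'))    # True: trailing period ignored
print(contains_keyword('It is enclosed.', 'closed'))  # False: part of a larger word
```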

My code:

keywords=["casino"]
def multi_word_search(document,keywords):
    dic={}
    z=[]
    for word in document:
        i=document.index(word)
        token=word.split()
        new=[j.rstrip(",.").lower() for j in token]
        
        for k in keywords:
            if k.lower() in new:
                dic[k]=z.append(i)
            else:
                dic[k]=[]
    return dic

It should return {'casino': [0]} for document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'] and keywords=['casino'], but I got {'casino': []} instead. Could someone help me figure out why?

Upvotes: 0

Views: 654

Answers (3)

Victor V

Reputation: 23

I would first tokenize each document using split(), then build a set from the tokens to speed up lookup.

If you want case-insensitive matching, you need to lowercase both sides:

for k in keywords:
    s = set(new)  # `new` is already a list of cleaned tokens
    if k.lower() in s:
        dic.setdefault(k, []).append(i)
    else:
        dic.setdefault(k, [])
return dic
   

Upvotes: 1

Prakash

Reputation: 192

This should work too:

document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords=['casino', 'car']

def findme(term):
    for i, x in enumerate(document):
        # strip trailing periods/commas so "closed." still matches "closed"
        val = [v.rstrip('.,') for v in x.split(' ')]
        for v in val:
            if term.lower() == v.lower():
                return i

for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')

Upvotes: 0

ygorg

Reputation: 770

This is not as trivial as it seems. From an NLP (natural language processing) standpoint, splitting a text into words is not trivial (it is called tokenisation).

import nltk

# stemmer = nltk.stem.PorterStemmer()

def multi_word_search(documents, keywords):
    # Initialize result dictionary
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)
        # tokens = [stemmer.stem(token) for token in tokens]  # optional stemming
        # Search each keyword
        for kw in keywords:
            # kw_norm = stemmer.stem(kw.lower())  # optional stemming
            kw_norm = kw.lower()
            if kw_norm in tokens:
                # If found, add to result dictionary
                dic[kw].append(i)
    return dic

documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?', 'Some casinos']
keywords=['casino']
multi_word_search(documents, keywords)

To increase matching you can use stemming (it removes plurals and verb inflections, e.g. running -> run).
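The stemming idea can be folded directly into the search. A sketch using nltk's PorterStemmer (`multi_word_search_stemmed` is a hypothetical name, and this assumes nltk is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def multi_word_search_stemmed(documents, keywords):
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Stem every cleaned token so "casinos" matches the keyword "casino"
        tokens = {stemmer.stem(tok.rstrip('.,').lower()) for tok in doc.split()}
        for kw in keywords:
            if stemmer.stem(kw.lower()) in tokens:
                dic[kw].append(i)
    return dic

documents = ['The Learn Python Challenge Casino', 'Some casinos', 'Casinoville?']
print(multi_word_search_stemmed(documents, ['casino']))
# {'casino': [0, 1]}
```

Note that stemming still respects the whole-word rule: 'Casinoville?' does not match, but the plural 'casinos' now does.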

Upvotes: 0
