pylearner

Reputation: 1460

Searching for a phrase in a document

The task is to match keywords against a paragraph. I split the paragraph into a list of words, then checked each word against a second list of search terms.

Data:

Automatic Product Title Tagging
Aim: To automate the process of product title tagging using manually tagged data. 

ROUTE OPTIMIZATION – Spring Clean
Aim:  Minimizing the overall travel time using optimization techniques. 

CUSTOMER SEGMENTATION:
Aim:  Develop an engine which segments and provides the score for
      customers based on their behavior and analyze their purchasing pattern. 

Attempted code:

s = ['tagged', 'product title',  'tagging', 'analyze']

skills = []
for word in data.split():

    print(word)    
    word.lower()
    if word in s:

        skills.append(word)
skills1 = list(set(skills))

print(skills1)

['tagged', 'tagging', 'analyze'] 

Because I used split(), the paragraph is broken into single words, so I cannot detect the phrase 'product title', even though it appears in the paragraph.

I would appreciate any help with this.

Upvotes: 0

Views: 78

Answers (4)

wailinux

Reputation: 139

Each description in "data" contains "Aim:", so I find the index of that marker and search only the text after it.

p = "Automatic Product Title Tagging  Aim: To automate the process of product title tagging using manually tagged data."
index = p.find("Aim:")  # 33
print(p[index:])
# output: Aim: To automate the process of product title tagging using manually tagged data.
marker_length = len("Aim:")  # 4, used to skip the marker "Aim:" itself
print(p[index + marker_length:])
# output:  To automate the process of product title tagging using manually tagged data.

example:

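s = ['tagged', 'product title',  'tagging', 'analyze']
skills = []
for line in data.split("\n"):
    pos = line.find("Aim:")
    if pos != -1:  # skip lines that do not contain the marker
        start = pos + len("Aim:")  # first character after "Aim:"
        # Note: split() yields single words, so only one-word keywords can match here.
        for word in line[start:].split():
            if word.lower() in s:
                skills.append(word)
                print(word)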

Upvotes: 0

Leo K

Reputation: 5354

What you are searching for is not a 'keyword' but a phrase. One solution is a regular expression search. A plain substring test (phrase in text) won't work well because, given 'product title', it would also match 'byproduct titles', which isn't what you want.

This should do it:

import re
skills = [k for k in s if re.search(r'\b' + re.escape(k) + r'\b', data, flags=re.IGNORECASE)]
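
For illustration, here is a minimal self-contained sketch of the same idea. The helper name find_phrases and the shortened sample text are assumptions made for this example, not part of the question. It also shows where a plain substring test and the word-boundary regex disagree.

```python
import re

def find_phrases(phrases, text):
    """Return the phrases that occur in text as whole words, ignoring case."""
    found = []
    for phrase in phrases:
        # re.escape keeps any regex metacharacters in the phrase literal;
        # \b anchors the match at a word boundary on each side.
        pattern = r'\b' + re.escape(phrase) + r'\b'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(phrase)
    return found

# Shortened sample text (assumption: one entry from the question's data).
text = "Aim: To automate the process of product title tagging using manually tagged data."
keywords = ['tagged', 'product title', 'tagging', 'analyze']

print(find_phrases(keywords, text))   # ['tagged', 'product title', 'tagging']

# A plain substring test also matches inside longer words; the regex does not.
print('product title' in 'byproduct titles')                # True
print(find_phrases(['product title'], 'byproduct titles'))  # []
```

Escaping each phrase matters only if a search term could contain characters such as '.' or '+'; for plain words it changes nothing.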

Upvotes: 3

Rakesh

Reputation: 82755

Iterate over the list s and check whether each element occurs in the string.

Demo:

data = """
 Automatic Product Title Tagging  
 Aim: To automate the process of product title tagging using manually tagged data.
 ROUTE OPTIMIZATION – Spring Clean
 Aim:  Minimizing the overall travel time using optimization techniques.
 CUSTOMER SEGMENTATION:
 Aim:  Develop an engine which segments and provides the score for  
       customers based on their behavior and analyze their purchasing
       pattern. 
"""
s = ['tagged', 'product title',  'tagging', 'analyze']
data = data.lower()

skills = []
for i in s:
    if i.lower() in data:
        skills.append(i)
print(skills)

Or in a single line.

skills = [i for i in s if i.lower() in data]

Output:

['tagged', 'product title', 'tagging', 'analyze']

Upvotes: 2

guroosh

Reputation: 652

split() splits the string on the separator you pass. With no argument, it splits on runs of whitespace. Since you want to search for 'product title', which itself contains a space, you can do one of these:

1) Search for the phrase directly in the paragraph.

2) If you split, compare each pair of adjacent words (indices i and i+1) against the phrase.
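
Option 2 could be sketched roughly as follows. The helper name match_ngrams, the punctuation stripping, and the generalisation from two words to phrases of any length are assumptions added for this example.

```python
import string

def match_ngrams(phrases, text):
    """Return phrases found as runs of consecutive words in text (case-insensitive)."""
    # Lower-case each word and strip surrounding punctuation so "data." becomes "data".
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    found = []
    for phrase in phrases:
        target = phrase.lower().split()
        n = len(target)
        # Slide a window of n words over the text and compare it with the phrase.
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            found.append(phrase)
    return found

# Shortened sample text (assumption: one entry from the question's data).
text = "Aim: To automate the process of product title tagging using manually tagged data."
print(match_ngrams(['tagged', 'product title', 'tagging', 'analyze'], text))
# ['tagged', 'product title', 'tagging']
```

Because whole words are compared, 'product title' does not match inside 'byproduct titles', which a plain substring search under option 1 would.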

Upvotes: 0
