Reputation: 67
I want to write topic lists to check whether a review talks about one of the defined topics. It's important to me to write the topic lists myself rather than use topic modeling to discover possible topics.
I thought this was called dictionary analysis, but I can't find anything on it.
I have a DataFrame with reviews from Amazon:
df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
                   'text': ['Example text where he talks about a phone and his charging cable',
                            'Example text where he talks about a car with some wheels',
                            'Example text where he talks about a plane']})
Now I want to define topic lists:
phone = ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']
car = ['car', 'wheel', 'steering', 'seat', 'roof', 'other car related words']
plane = ['plane', 'wings', 'turbine', 'fly']
The result of the method should be 3/12 for the "phone" topic of the first review (3 words from the topic list were in the review, which has 12 words) and 0 for the other two topics.
The second review would result in 2/11 for the "car" topic and 0 for the other topics, and the third review in 1/8 for the "plane" topic and 0 for the others.
Results as a list:
phone_results = [0.25, 0, 0]
car_results = [0, 0.18181818182, 0]
plane_results = [0, 0, 0.125]
Of course I would only use lowercase word stems of the reviews, which makes defining topics easier, but this should not be of concern now.
Is there a method for this or do I have to write one? Thank you in advance!
Upvotes: 0
Views: 1455
Reputation: 67
I thought I'd give back to the community and post my finished code, which is based on @David542's answer:
import pandas as pd
import numpy as np
import re

i = 0
total_length = len(sentences)
print("Process started:")
s = 1
# Iterates through the reviews
for sentence in sentences:
    # Splits a review text into single words
    words = sentence.split()
    # Iterates through the topics, each is one column in a table
    for column in dictio:
        # Saves the topic words in the pattern list
        pattern = list(dictio[column])
        # Removes nan values
        clean_pattern = [x for x in pattern if str(x) != 'nan']
        match_score = 0
        # Iterates through each entry of the topic list
        for search_words in clean_pattern:
            # Resets so the last word of the review is not carried over
            # as the "previous word" of the first word
            previous_word = ""
            # Iterates through each word of the review
            for word in words:
                # When two consecutive words are searched for, the first if statement gets activated
                if len(search_words.split()) > 1:
                    pattern2 = r"( " + re.escape(search_words.split()[0]) + r"([a-z]+|) " + re.escape(search_words.split()[1]) + r"([a-z]+|))"
                    # The leading spaces are important so a search for "time" doesn't match "bedtime"
                    if re.search(pattern2, " " + previous_word + " " + word, re.IGNORECASE):
                        match_score += 1
                        # print(pattern2, " match ", previous_word, " ", word)
                if len(search_words.split()) == 1:
                    pattern1 = r" " + re.escape(search_words) + r"([a-z]+|)"
                    if re.search(pattern1, " " + word, re.IGNORECASE):
                        match_score += 1
                        # print(pattern1, " match ", word)
                # Saves the word for the next iteration to be used as the previous word
                previous_word = word
        result = 0
        if match_score > 0:
            result = 1
        df.at[i, column] = int(result)
    i += 1
    # Status bar: reports progress at every full 5% step
    # (comparing rounded integer percentages avoids float-modulo pitfalls)
    percent = round(100 * s / total_length)
    if percent % 5 == 0:
        print("Status: " + str(percent) + "%")
    s += 1
The texts I want to analyze are in a list of strings called sentences. The topics I want to look for in my texts are in the DataFrame dictio. Each topic column starts with the topic name and contains rows of search words. The analysis takes one or two consecutive words and looks for them, with variable endings, in each string. If the regex matches, the original DataFrame df gets a "1" in the corresponding row of the column assigned to that topic. Other than specified in my question, I am not calculating the percentage of words, since I found it doesn't add value to my analysis. Punctuation should be removed from the strings beforehand, but stemming is not necessary. If you have specific questions, please comment and I will edit this code or answer your comment.
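For reference, a minimal sketch of the inputs the code above expects might look like this; the topic words here are placeholders, and shorter topic columns are padded with NaN so pandas accepts the unequal list lengths:

import numpy as np
import pandas as pd

# One column per topic; entries are one- or two-word search patterns
dictio = pd.DataFrame({'phone': ['phone', 'charging cable', 'call', np.nan],
                       'car': ['car', 'wheel', 'steering', 'seat'],
                       'plane': ['plane', 'wing', 'turbine', 'fly']})

sentences = ['Example text where he talks about a phone and his charging cable',
             'Example text where he talks about a car with some wheels',
             'Example text where he talks about a plane']

# The DataFrame the loop writes its 0/1 topic flags into, one row per review
df = pd.DataFrame({'text': sentences})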
Upvotes: 0
Reputation: 110163
NLP can be quite deep, but for something like the ratio of known words, you could probably do something more basic. For example:
word_map = {
    'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
    'car': ['car', 'wheels', 'steering', 'seat', 'roof', 'other car related words'],
    'plane': ['plane', 'wings', 'turbine', 'fly']
}

sentences = [
    'Example text where he talks about a phone and his charging cable',
    'Example text where he talks about a car with some wheels',
    'Example text where he talks about a plane'
]
for sentence in sentences:
    print('==== %s ====' % sentence)
    words = sentence.split()
    for prefix in word_map:
        match_score = 0
        for word in words:
            if word in word_map[prefix]:
                match_score += 1
        print('Prefix: %s | MatchScore: %.2f' % (prefix, float(match_score) / len(words)))
And you'd get something like this:
==== Example text where he talks about a phone and his charging cable ====
Prefix: phone | MatchScore: 0.25
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a car with some wheels ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.18
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a plane ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.12
This is a basic example, of course, and words don't always end in spaces -- they can be followed by commas, periods, etc., so you'd want to take that into account. Tense matters too: I can "phone" someone, or have "phoned" or be "phoning" them, yet we wouldn't want a word such as "phonetic" to get mixed up. So it gets pretty tricky on edge cases, but for a very basic working(!) example, I would see if you can do it in Python without a natural language library. And eventually, if it doesn't meet your use case, you can start testing libraries out.
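For instance, here is a rough sketch of how punctuation and a few common suffixes could be handled with the standard re module alone; the suffix tuple and the blocked words are made-up placeholders rather than a real stemmer:

import re

def match_score(sentence, topic_words, blocked=('phonetic',)):
    # Pull out runs of letters so commas and periods don't stick to words
    words = re.findall(r'[a-z]+', sentence.lower())
    suffixes = ('', 's', 'd', 'ed', 'ing')  # crude tense/plural handling
    score = 0
    for w in words:
        if w in blocked:
            continue  # explicit block list for false friends like 'phonetic'
        if any(w == t + suf for t in topic_words for suf in suffixes):
            score += 1
    return float(score) / len(words)

print(match_score('He phoned me about the charging cable.',
                  ['phone', 'cable', 'charge', 'charging', 'call']))
# 3 of 7 words match ('phoned', 'charging', 'cable') -> roughly 0.43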
Beyond that, you can look at something like Rasa NLU or nltk.
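If you do end up reaching for nltk, for example, a stemmer can normalize both the topic list and the review words before the lookup. This is just a sketch and assumes the Porter stemmer behaves reasonably on your reviews:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
topic = {stemmer.stem(w) for w in ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']}

sentence = 'Example text where he talks about a phone and his charging cable'
words = sentence.lower().split()

# 'charging' and 'charge' both stem to 'charg', so either form matches
score = sum(1 for w in words if stemmer.stem(w) in topic) / len(words)
print(score)  # 3 of 12 words match -> 0.25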
Upvotes: 2
Reputation: 1
You can use RASA-NLU's intent classification with a pretrained model.
Upvotes: 0