Reputation: 1659
I have a function that scores words. My texts range from single sentences to documents several pages long. I'm stuck on how to score the words and then return the text in close to its original form.
Here's an example sentence:
"My body lies over the ocean, my body lies over the sea."
What I want to produce is the following:
"My body (2) lies over the ocean (3), my body (2) lies over the sea."
I've figured out how to take the text, tear it apart, and score it.
However, I'm stuck on how to put it back together in the format I need.
Here's a dummy version of my function:
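from textblob import TextBlob
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed setup: NLTK's WordNetLemmatizer

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(text)
    # Normalize each word: singular form, lowercase, then lemma
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    # Score each normalized word
    for word in words_to_work_with:
        if word == 'body':
            score = 2
        elif word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word, score))
    return words_to_return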
I'm a relative newbie so I have two questions:
I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.
Thank you for helping me!
Upvotes: 0
Views: 756
Reputation: 4547
Here's a working implementation. The function first splits the input text into a list in which each element is either a word or a run of non-word characters (e.g., a comma followed by a space). Once the words in the list have been processed, it joins the list back into a string and returns it.
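import re
import inflection
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # assumed setup: NLTK's WordNetLemmatizer

def word_score(text):
    # Split into alternating word and non-word (spaces, punctuation) chunks
    words_to_work_with = re.findall(r"\b\w+|\b\W+", text)
    for i, word in enumerate(words_to_work_with):
        if word.isalpha():
            # Note: the second assignment overwrites the first, so only the lemma is kept
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
                words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
                words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)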
Output:
My body (2) lie over the ocean (3), my body (2) lie over the sea.
If you have more than 2 words that you want to score, using a dictionary instead of if conditions is indeed a good idea.
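As a rough sketch of that dictionary idea: the names SCORES and annotate below are made up for illustration, and this version skips the singularize/lemmatize step and only lowercases each word for the lookup. It uses re.sub with a callback, so only the scored words are touched and the spaces, punctuation, and original word forms are left alone.

```python
import re

# Hypothetical score table; extend it with whatever words you need.
SCORES = {"body": 2, "ocean": 3}

def annotate(text, scores=SCORES):
    """Append ' (score)' after every word found in the score table."""
    def add_score(match):
        word = match.group(0)
        score = scores.get(word.lower())  # case-insensitive lookup
        return word if score is None else f"{word} ({score})"
    # Only runs of word characters reach add_score;
    # spaces and punctuation are copied through unchanged.
    return re.sub(r"\w+", add_score, text)

print(annotate("My body lies over the ocean, my body lies over the sea."))
# My body (2) lies over the ocean (3), my body (2) lies over the sea.
```

Adding a new scored word is then a one-line change to the dictionary rather than another elif branch.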
Upvotes: 0
Reputation: 719
Hope this helps. It worked for me on the example from your question.
Best regards!
"""
Python 3.7.2
Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea.
"""
output_text = []
# Read the text from one file and save the processed text to another
with open('original_text.txt', 'r') as input_file, \
        open('processed_text.txt', 'w') as output_file:
    for line in input_file:
        words = line.split()
        for word in words:
            if word == 'body':
                output_text.append('body (2)')
                output_file.write('body (2) ')
            elif word == 'body,':
                output_text.append('body (2),')
                output_file.write('body (2), ')
            elif word == 'ocean':
                output_text.append('ocean (3)')
                output_file.write('ocean (3) ')
            elif word == 'ocean,':
                output_text.append('ocean (3),')
                output_file.write('ocean (3), ')
            else:
                output_text.append(word)
                output_file.write(word + ' ')
print(output_text)
Upvotes: 0
Reputation: 604
So basically, you want to assign a score to each word. The function you posted could be improved by using a dictionary instead of several if statements.
Also, you have to return all the scores, instead of just the score of the first word in words_to_work_with, which is the current behavior of the function since it returns an integer on the first iteration.
So the new function would be:
# Assumes TextBlob and lemmatizer are set up as in the question
def word_score(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}  # etc.
    # If a word is not recognized, its score is None
    return [dict_scores.get(word, None) for word in words_to_work_with]
For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question):
# Assumes TextBlob and lemmatizer are set up as in the question
def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in passed_text.words:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}
    word_scores = []
    reconstructed_words = []
    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None))  # we still construct the scores list here
        # we add 'word' + '(word's score)' only if the word has a score;
        # if not, we add the default value '', meaning we don't add anything
        reconstructed_words.append(word + dict_strings.get(word, ''))
    # join with spaces so the words don't run together
    reconstructed_text = ' '.join(reconstructed_words)
    return reconstructed_text, word_scores
I can't guarantee this code will work on the first try since I can't test it, but it should give you the main idea.
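If you want to keep the original words in the output and only use the normalized form for the lookup, something along these lines might work. It's only a sketch using the standard library: normalize is a stand-in for your singularize/lemmatize step (here it just lowercases), and score_and_reconstruct and its parameters are names I made up for the example.

```python
import re

def score_and_reconstruct(text, scores, normalize=str.lower):
    """Return (annotated_text, [(word, score), ...]) without altering the original tokens."""
    # The capturing group makes re.split keep the words: they land at the odd
    # indexes, and the pieces in between (spaces, punctuation) at the even ones.
    pieces = re.split(r"(\w+)", text)
    word_scores = []
    for i in range(1, len(pieces), 2):
        word = pieces[i]
        score = scores.get(normalize(word))  # look up the normalized form only
        word_scores.append((word, score))
        if score is not None:
            pieces[i] = f"{word} ({score})"
    return "".join(pieces), word_scores

annotated, word_scores = score_and_reconstruct(
    "The ocean, my body.", {"body": 2, "ocean": 3}
)
print(annotated)    # The ocean (3), my body (2).
print(word_scores)  # [('The', None), ('ocean', 3), ('my', None), ('body', 2)]
```

Because the separators are kept in the list, joining the pieces gives back the original text exactly, apart from the added scores. To use your real normalization, you could pass something like normalize=lambda w: lemmatizer.lemmatize(w.lower()), assuming lemmatizer is set up as in your question.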
Upvotes: 1