Reputation: 1659

Split sentences, process words, and put sentence back together?

I have a function that scores words. I have lots of text from sentences to several page documents. I'm stuck on how to score the words and return the text near its original state.

Here's an example sentence:

"My body lies over the ocean, my body lies over the sea."

What I want to produce is the following:

"My body (2) lies over the ocean (3), my body (2) lies over the sea."

Below is a dummy version of my scoring algorithm. I've figured out how to take text, tear it apart and score it.

However, I'm stuck on how to put it back together into the format I need it in.

Here's a dummy version of my function:

def word_score(text):
    words_to_work_with = []
    words_to_return = []
    passed_text = TextBlob(passed_text)
    for word in words_to_work_with:
        word = word.singularize().lower()
        word = str(word)
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)
    for word in words to work with:
        if word == 'body':
            score = 2
        if word == 'ocean':
            score = 3
        else:
            score = None
        words_to_return.append((word,score))
    return words_to_return

I'm a relative newbie so I have two questions:

How can I put the text back together, and
Should that logic be put into the function or outside of it?

I'd really like to be able to feed entire segments (i.e. sentences, documents) into the function and have it return them.

Thank you for helping me!

Upvotes: 0

Answers (3)

glhr

Reputation: 4547

Here's a working implementation. The function first parses the input text as a list, such that each list element is a word or a combination of punctuation characters (eg. a comma followed by a space.) Once the words in the list have been processed, it combines the list back into a string and returns it.

def word_score(text):
    words_to_work_with = re.findall(r"\b\w+|\b\W+",text)
    for i,word in enumerate(words_to_work_with):
        if word.isalpha():
            words_to_work_with[i] = inflection.singularize(word).lower()
            words_to_work_with[i] = lemmatizer.lemmatize(word)
            if word == 'body':
               words_to_work_with[i] = 'body (2)'
            elif word == 'ocean':
               words_to_work_with[i] = 'ocean (3)'
    return ''.join(words_to_work_with)

txt = "My body lies over the ocean, my body lies over the sea."
output = word_score(txt)
print(output)

Output:

My body (2) lie over the ocean (3), my body (2) lie over the sea.

If you have more than 2 words that you want to score, using a dictionary instead of if conditions is indeed a good idea.

Upvotes: 0

Shashank Singh

Reputation: 719

Hope this would help. Based on your question, it has worked for me.

best regards!!

"""
Python 3.7.2

Input:
Saved text in the file named as "original_text.txt"
My body lies over the ocean, my body lies over the sea. 
"""
input_file = open('original_text.txt', 'r') #Reading text from file
output_file = open('processed_text.txt', 'w') #saving output text in file

output_text = []

for line in input_file:
    words =  line.split()
    for word in words:
        if word == 'body':
            output_text.append('body (2)')
            output_file.write('body (2) ')
        elif word == 'body,':
            output_text.append('body (2),')
            output_file.write('body (2), ')
        elif word == 'ocean':
            output_text.append('ocean (3)')
            output_file.write('ocean (3) ')
        elif word == 'ocean,':
            output_text.append('ocean (3),')
            output_file.write('ocean (3), ')
        else:
            output_text.append(word)
            output_file.write(word+' ')

print (output_text)
input_file.close()
output_file.close()

Upvotes: 0

MaximGi

Reputation: 604

So basically, you want to attribute a score for each word. The function you give may be improved using a dictionary instead of several if statements. Also you have to return all scores, instead of just the score of the first wordin words_to_work_with which is the current behavior of the function since it will return an integer on the first iteration. So the new function would be :

def word_score(text)
    words_to_work_with = []
    passed_text = TextBlob(text)
    for word in words_to_work_with:
        word = word.singularize().lower()
        word = str(word) # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body' : 2, 'ocean' : 3, etc ...}
    return [dict_scores.get(word, None)] # if word is not recognized, score is None

For the second part, which is reconstructing the string, I would actually do this in the same function (so this answers your second question) :

def word_score_and_reconstruct(text):
    words_to_work_with = []
    passed_text = TextBlob(text)

    reconstructed_text = ''

    for word in words_to_work_with:
        word = word.singularize().lower()
        word = str(word)  # Is this line really useful ?
        e_word_lemma = lemmatizer.lemmatize(word)
        words_to_work_with.append(e_word_lemma)

    dict_scores = {'body': 2, 'ocean': 3}
    dict_strings = {'body': ' (2)', 'ocean': ' (3)'}

    word_scores = []

    for word in words_to_work_with:
        word_scores.append(dict_scores.get(word, None)) # we still construct the scores list here

        # we add 'word'+'(word's score)', only if the word has a score
        # if not, we add the default value '' meaning we don't add anything
        reconstructed_text += word + dict_strings.get(word, '')

    return reconstructed_text, word_scores

I'm not guaranteeing this code will work at first try, I can't test it but it'll give you the main idea

Upvotes: 1

Split sentences, process words, and put sentence back together?

Answers (3)

Related Questions