toyhtoza

Reputation: 43

Python: how to skip the part of a string marked by certain symbols?

I'm trying to reconstruct a sentence by matching the words in a word list, one to one, against a string with no spaces:

text='acatisananimal'
words=['cat','is','an','a','animal']

for i in words:
    if i in text:
        final=text.replace(i,' '+i)
        text=final
print(final)

The expected output is:

a cat is an animal

If I run my code, the 'a' and 'an' inside 'animal' inevitably get separated too. So I want to sort the word list by length and search for the long words first.

words.sort(key=len)
words=words[::-1]

Then I would like to mark the long words with special symbols and have the program skip the parts I marked. For example:

acatisan%animal&

Finally, I would erase the symbols. But I'm stuck here: I don't know how to make the program skip the parts between '%' and '&'. Can anyone help? Or are there better ways to solve the spacing problem? Many thanks!

As a further case, what if the text includes words that are not in the word list? How could I handle this?

text='wowwwwacatisananimal'

Upvotes: 1

Views: 11254

Answers (6)

Kim

Reputation: 1686

I wouldn't recommend using different delimiters on either side of your matched words (% and & in your example).

It's easier to use the same delimiter on both sides of each marked word and use Python's list slicing.

The solution below uses the [::n] syntax for getting every nth element of a list.

a[::2] gets even-numbered elements, a[1::2] gets the odd ones.

>>> fox = "the|quick|brown|fox|jumpsoverthelazydog" 

Because they have | characters on either side, 'quick' and 'fox' are odd-numbered elements when you split the string on |:

>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']

and the rest are even:

>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']

So, by enclosing known words in | characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.

from itertools import chain


def flatten(list_of_lists):
    return chain.from_iterable(list_of_lists)


def parse(source_text, words):
    # sorted() returns a new list, so the caller's word list is not reordered
    words = sorted(words, key=len, reverse=True)
    texts = [source_text, '']  # even number of elements helps zip function
    for word in words:
        new_matches_and_text = []
        for text in texts[::2]:
            new_matches_and_text.append(text.replace(word, f"|{word}|"))
        previously_matched = texts[1::2]
        # merge new matches back in
        merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
        texts = merged.split('|')
    # remove blank words (matches at start or end of a string)
    texts = [text for text in texts if text]
    return ' '.join(texts)

>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an',  'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'

The merge code uses the zip and flatten functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
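To make the merge step concrete, here is a minimal sketch of the zip-and-flatten interleaving on its own. The two lists are made-up sample data standing in for what texts[::2] and texts[1::2] might hold part-way through parse(); they are not taken from an actual run.

```python
from itertools import chain

# Hypothetical sample data: unmatched pieces (even positions) and
# already-matched words (odd positions) part-way through parsing.
unmatched = ['a', 'isan', '']
matched = ['cat', 'animal', '']

# zip pairs each unmatched piece with the matched word that follows it...
pairs = list(zip(unmatched, matched))
# ...and chain.from_iterable flattens the pairs back into one sequence.
interleaved = list(chain.from_iterable(pairs))

print(pairs)        # [('a', 'cat'), ('isan', 'animal'), ('', '')]
print(interleaved)  # ['a', 'cat', 'isan', 'animal', '', '']
```

Joining the interleaved list with '|' and splitting it again gives back the same list, which is why the even/odd positions stay aligned from one word to the next.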

This approach leaves the unrecognised words in the text.

'beautiful' and 'a' are split out correctly because each is bounded by recognised words (or by the start of the text).

'enormous' and 'scary' are not known and, as they're next to each other, they're left stuck together.

Here's how to list the unknown words:

>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']

I'm curious: is this a bioinformatics project?

Upvotes: 1

zipa

Reputation: 27869

A list comprehension is another way to do it:

result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])

Here zip pairs each word with the position of its first occurrence in text, sorted orders the pairs by that position, and the words are finally joined with ' '.
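As a sketch, the same ordering can probably be written more compactly by passing text.find directly as the sort key. This assumes every word occurs in the text; a word that is missing gets -1 from find() and would sort to the front.

```python
text = 'acatisananimal'
words = ['cat', 'is', 'an', 'a', 'animal']

# Sort the words by the index of their first occurrence in the text.
# Assumption: every word is present; str.find returns -1 for a missing word.
result = ' '.join(sorted(words, key=text.find))
print(result)  # a cat is an animal
```

Note that either form reorders the word list rather than splitting the text, so a word that appears twice in the text is emitted only once and letters not covered by any word are dropped.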

Upvotes: 0

AChampion

Reputation: 30258

A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:

def compose(letters, words):
    q = [(letters, [])]
    while q:
        letters, result = q.pop()
        if not letters:
            return ' '.join(result)
        for word in words:
            if letters.startswith(word):
                q.append((letters[len(word):], result+[word]))

>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'

If there are potentially multiple possible sentence compositions, it would be trivial to turn this into a generator: replace return with yield to yield all matching sentence compositions.

Contrived example (just replace return with yield):

>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
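Spelled out, the generator variant might look like the sketch below. The name compose_all is only illustrative, and the continue after yield is an optional addition that skips the pointless word loop once no letters remain.

```python
def compose_all(letters, words):
    """Yield every way of splitting `letters` into words from `words`."""
    stack = [(letters, [])]
    while stack:
        remaining, result = stack.pop()
        if not remaining:
            # Every letter has been consumed: this is a full composition.
            yield ' '.join(result)
            continue
        for word in words:
            if remaining.startswith(word):
                # Split the word off and explore the rest later.
                stack.append((remaining[len(word):], result + [word]))


words = ['adult', 'sex', 'adults', 'exchange', 'change']
print(list(compose_all('adultsexchange', words)))
# ['adults exchange', 'adult sex change']

# Text that starts with letters outside the word list has no composition.
print(list(compose_all('wowwwwacatisananimal', ['cat', 'is', 'an', 'a', 'animal'])))
# []
```

The second call shows the limit of this approach for the follow-up case in the question: any stretch of unknown letters means no sentence is produced at all.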

Upvotes: 3

Eliethesaiyan

Reputation: 2322

If you are only stuck on the part about removing the symbols, then a regex is what you are looking for. Import the re module and do this:

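import re

# ... your code that builds `final`, with the marker symbols still in it ...

# replace every run of non-word characters (such as % and &) with one space
print(re.sub(r'\W+', ' ', final))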
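To tie this back to the %...& idea in the question, here is a minimal sketch of the whole mark, skip and erase cycle. The helper name mark_words is made up for illustration, and the sketch assumes the text itself never contains % or &.

```python
import re


def mark_words(text, words):
    """Wrap each known word in %...& markers, longest words first."""
    for word in sorted(words, key=len, reverse=True):
        # The capturing group keeps the marked spans in the split result,
        # so the pieces alternate: unmarked, marked, unmarked, ...
        pieces = re.split(r'(%[^&]*&)', text)
        # Only the even-indexed pieces are unmarked; marked spans are skipped.
        pieces[::2] = [p.replace(word, f'%{word}&') for p in pieces[::2]]
        text = ''.join(pieces)
    return text


marked = mark_words('acatisananimal', ['cat', 'is', 'an', 'a', 'animal'])
print(marked)  # %a&%cat&%is&%an&%animal&

# Erase the markers: each run of % and & becomes a single space.
print(re.sub(r'[%&]+', ' ', marked).strip())  # a cat is an animal
```

Letters that match no word are simply left unmarked, so with this sketch unknown stretches stay in the output as they are.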

Upvotes: 1

Luminos

Reputation: 331

You only need a small modification to your code. Change the line

final=text.replace(i,' '+i)

to

final=text.replace(i,' '+i, 1)

This replaces only the first occurrence.

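So the updated code would be:

text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
    if i in text:
        # count=1: put a space before the first occurrence only
        final=text.replace(i,' '+i, 1)
        text=final
# strip() drops the leading space inserted before the first word
print(final.strip())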

Output is:

a cat is an animal

Upvotes: 1

McGrady

Reputation: 11477

Maybe you can replace each word with its index in the word list, so that the intermediate string looks like 3 0 1 2 4, and then convert it back to a sentence:

text='acatisananimal'
words=['cat','is','an','a','animal']


for i in sorted(words,key=len,reverse=True):
    if i in text:
        final=text.replace(i,' %s'%words.index(i))
        text=final
print(" ".join(words[int(i)] for i in final.split()))

Output:

a cat is an animal

Upvotes: 2
