Reputation: 43
I‘m trying to reconstruct a sentence by one-to-one matching the words in a word list to a sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
if i in text:
final=text.replace(i,' '+i)
text=final
print(final)
the expected output will be like:
a cat is an animal
If I run my code, the 'a' and 'an' in 'animal' will be unavoidably separated too. So I want to sort the word list by the length, and search for the long words first.
words.sort(key=len)
words=words[::-1]
Then I would like to mark the long words with special symbols, and expect the program could skip the part I marked. For example:
acatisan%animal&
And finally I will erase the symbols. But I'm stuck here. I don't know what to do to make the program skip the certain parts between '%' and '&' . Can anyone help me?? Or are there better ways to solve the spacing problem? Lots of Thanks!
**For another case,what if the text include the words that are not included in the word list?How could I handle this?
text=‘wowwwwacatisananimal’
Upvotes: 1
Views: 11254
Reputation: 1686
I wouldn't recommend using different delimeters either side of your matched words(%
and &
in your example.)
It's easier to use the same delimiter either side of your marked word and use Python's list slicing.
The solution below uses the [::n]
syntax for getting every n
th element of a list.
a[::2]
gets even-numbered elements, a[1::2]
gets the odd ones.
>>> fox = "the|quick|brown|fox|jumpsoverthelazydog"
Because they have |
characters on either side, 'quick'
and 'fox'
are odd-numbered elements when you split the string on |
:
>>> splitfox = fox.split('|')
>>> splitfox
['the', 'quick', 'brown', 'fox', 'jumpsoverthelazydog']
>>> splitfox[1::2]
['quick', 'fox']
and the rest are even:
>>> splitfox[::2]
['the', 'brown', 'jumpsoverthelazydog']
So, by enclosing known words in |
characters, splitting, and scanning even-numbered elements, you're searching only those parts of the text that are not yet matched. This means you don't match within already-matched words.
from itertools import chain
def flatten(list_of_lists):
return chain.from_iterable(list_of_lists)
def parse(source_text, words):
words.sort(key=len, reverse=True)
texts = [source_text, ''] # even number of elements helps zip function
for word in words:
new_matches_and_text = []
for text in texts[::2]:
new_matches_and_text.append(text.replace(word, f"|{word}|"))
previously_matched = texts[1::2]
# merge new matches back in
merged = '|'.join(flatten(zip(new_matches_and_text, previously_matched)))
texts = merged.split('|')
# remove blank words (matches at start or end of a string)
texts = [text for text in texts if text]
return ' '.join(texts)
>>> parse('acatisananimal', ['cat', 'is', 'a', 'an', 'animal'])
'a cat is an animal'
>>> parse('atigerisanenormousscaryandbeautifulanimal', ['tiger', 'is', 'an', 'and', 'animal'])
'a tiger is an enormousscary and beautiful animal'
The merge
code uses the zip
and flatten
functions to splice the new matches and old matches together. It basically works by pairing even and odd elements of the list, then "flattening" the result back into one long list ready for the next word.
This approach leaves the unrecognised words in the text.
'beautiful'
and 'a'
are handled well because they're on their own (i.e. next to recognised words.)
'enormous'
and 'scary'
are not known and, as they're next to each other, they're left stuck together.
Here's how to list the unknown words:
>>> known_words = ['cat', 'is', 'an', 'animal']
>>> sentence = parse('anayeayeisananimal', known_words)
>>> [word for word in sentence.split(' ') if word not in known_words]
['ayeaye']
I'm curious: is this a bioinformatics project?
Upvotes: 1
Reputation: 27869
List and dict comprehension is another way to do it:
result = ' '.join([word for word, _ in sorted([(k, v) for k, v in zip(words, [text.find(word) for word in words])], key=lambda x: x[1])])
So, I used zip
to combine words and their position in text, sorted
the words by their position in original text and finally joined the result with ' '
.
Upvotes: 0
Reputation: 30258
A more generalized approach would be to look for all valid words at the beginning, split them off and explore the rest of the letters, e.g.:
def compose(letters, words):
q = [(letters, [])]
while q:
letters, result = q.pop()
if not letters:
return ' '.join(result)
for word in words:
if letters.startswith(word):
q.append((letters[len(word):], result+[word]))
>>> words=['cat','is','an','a','animal']
>>> compose('acatisananimal', words)
'a cat is an animal'
If there are potentially multiple possible sentence compositions it would trivial to turn this into a generator and replace return
with yield
to yield all matching sentence compositions.
Contrived example (just replace return
with yield
):
>>> words=['adult', 'sex', 'adults', 'exchange', 'change']
>>> list(compose('adultsexchange', words))
['adults exchange', 'adult sex change']
Upvotes: 3
Reputation: 2322
if you are getting on that part of removing only the symbols...then regex is your what you are looking for..import a module called re and do this.
import re
code here
print re.sub(r'\W+', ' ', final)
Upvotes: 1
Reputation: 331
You need a small modification in your code, update the code line
final=text.replace(i,' '+i)
to
final=text.replace(i,' '+i, 1)
. This will replace only the first occurrence.
So the updated code would be
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in words:
if i in text:
final=text.replace(i,' '+i, 1)
text=final
print(final)
Output is:
a cat is an animal
Upvotes: 1
Reputation: 11477
Maybe you can replace the word with the index, so the final
string should be like this 3 0 1 2 4
and then convert it back to sentence:
text='acatisananimal'
words=['cat','is','an','a','animal']
for i in sorted(words,key=len,reverse=True):
if i in text:
final=text.replace(i,' %s'%words.index(i))
text=final
print(" ".join(words[int(i)] for i in final.split()))
Output:
a cat is an animal
Upvotes: 2