Reputation: 2928
I have a UTF-8 Unicode text file as below (non-English).
So I declared the encoding as UTF-8 in Python and imported the file into Python:
# -*- coding: utf-8 -*-
I have tokenized the text into sentences by splitting on "." and got a list of sentences.
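Roughly how I read and split the file (a minimal sketch; 'source.txt' is a placeholder for my actual file name):

import io

# Read the file and decode it to a unicode string.
with io.open('source.txt', encoding='utf-8') as f:
    text = f.read()

# Split on "." to get the list of sentences.
sentences = text.split(u".")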
Now I need to compare against another Unicode word list and find out whether any of those words occur in each sentence.
This is my code, but it shows only the first match identified.
for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print sentence
EDIT:
Finally I noticed there is an invalid Unicode character in the source text files. This is described here: Tokenizing unicode using nltk
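For anyone hitting the same thing, one way to keep such bytes from breaking the processing (a sketch, not necessarily the exact fix from the linked question; 'source.txt' is again a placeholder):

# Decode with 'replace' so invalid byte sequences become u'\ufffd'
# instead of raising UnicodeDecodeError.
raw = open('source.txt', 'rb').read()
text = raw.decode('utf-8', 'replace')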
Upvotes: 2
Views: 5859
Reputation: 98
I tried to simulate your problem, but I get the expected result; maybe the problem is in the encoding (see the note after the output below) or in your list of pronouns.
pronouns = ['aa','bb','cc']
sentences = ['aa dkdje asdf aesr','bb asersada','cc ase aser sa sa c ','aa saef sf se s', 'aa','bb']
for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print(sentence)
The output of the code was:
aa dkdje asdf aesr
bb asersada
cc ase aser sa sa c
aa saef sf se s
aa
bb
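If it is an encoding issue, one thing worth checking (a sketch, assuming Python 2) is that pronouns and sentences are both unicode objects; mixing byte strings and unicode with non-ASCII text makes the "in" test fail:

# Decode any byte strings so both sides of the comparison are unicode.
pronouns = [p.decode('utf-8') if isinstance(p, str) else p for p in pronouns]
sentences = [s.decode('utf-8') if isinstance(s, str) else s for s in sentences]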
Hope this is helpful.
Upvotes: 2