ChamingaD

Reputation: 2928

Iterate through a Python list

I have a UTF-8 encoded Unicode text file (non-English), shown below.

[image: Unicode text file]

So I declared the source encoding as UTF-8 and read the file into Python.

# -*- coding: utf-8 -*-

I split the text on "." and got a list of sentences.

[image: sentence list]

Now I need to compare each sentence against another Unicode word list and find out whether any of those words appear in it.

This is my code, but it only reports the first match.

for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print sentence

EDIT:

I finally noticed that there is an invalid Unicode character in the source text files. It is described in "Tokenizing unicode using nltk".
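For anyone hitting the same thing, here is a minimal sketch of how the matching could be made tolerant of such a character. It assumes Python 3 and assumes the stray character is an invisible one such as a byte order mark (U+FEFF) or a zero-width space (U+200B); the thread does not say which character it was, so `INVISIBLE`, `clean_token` and `sentences_with_pronouns` are hypothetical names for illustration only.

```python
import unicodedata

# Assumption: the "invalid" characters are invisible ones like these.
# Adjust this set to whatever actually appears in your files.
INVISIBLE = {"\ufeff", "\u200b"}  # byte order mark, zero-width space


def clean_token(token):
    """NFC-normalize a token and drop the invisible characters listed above."""
    normalized = unicodedata.normalize("NFC", token)
    return "".join(ch for ch in normalized if ch not in INVISIBLE)


def sentences_with_pronouns(sentences, pronouns):
    """Return every sentence that contains at least one of the pronouns."""
    pronoun_set = {clean_token(p) for p in pronouns}
    matches = []
    for sentence in sentences:
        words = (clean_token(w) for w in sentence.split())
        # any() stops at the first hit, so each sentence is kept only once
        if any(w in pronoun_set for w in words):
            matches.append(sentence)
    return matches
```

Without the cleaning step, a token like "\ufeffaa" is not equal to "aa", so a sentence that starts with a BOM would be silently skipped.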

Upvotes: 2

Views: 5859

Answers (1)

KarTo

Reputation: 98

I tried to reproduce your problem, but I get the expected result. Maybe the problem is in the encoding or in your list of pronouns.

pronouns = ['aa','bb','cc']

sentences = ['aa dkdje asdf aesr','bb asersada','cc ase aser sa sa c ','aa saef sf se s', 'aa','bb']

for sentence in sentences:
    for word in sentence.split(" "):
        if word in pronouns:
            print(sentence)

The output of the code was:

aa dkdje asdf aesr
bb asersada
cc ase aser sa sa c 
aa saef sf se s
aa
bb
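If the same loop misses matches on the real data, one way to check the encoding or the pronoun list is to print the code points of a word from a sentence next to the pronoun it should match. This is only a debugging sketch, assuming Python 3 (where strings are Unicode); `codepoints` is a hypothetical helper name.

```python
def codepoints(text):
    """List the code points of a string, e.g. ['U+0061', 'U+0061'] for 'aa'."""
    return ["U+%04X" % ord(ch) for ch in text]


# Two strings that look identical on screen can still differ:
print(codepoints("aa"))
print(codepoints("\ufeffaa"))  # same letters, but with a leading byte order mark
```

If the two lists differ for words that look the same, the mismatch is in the data rather than in the loop.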

Hope this is helpful.

Upvotes: 2
