Reputation: 79
I am trying to search for Hindi words, stored one per line in file-1, within the lines of file-2. I have to print the line numbers along with the number of words found in each line. This is the code:
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1
for iterator, count in enumerate(count_arr):
    if count>0:
        print iterator, ' ', count
This finds some words but ignores others. The input files are:
File-1:
पौधा
वनस्पति
File-2:
वनस्पति, पेड़-पौधा
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग
पादप_समूह, पेड़-पौधे, वनस्पति_समूह
पेड़-पौधा
This gives output:
0 1
3 1
Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?
Upvotes: 1
Views: 212
Reputation: 273
Add the following code after count_arr = [] and before for counter, line... and you will see why that happens. It is because of the spaces: in file 1 the first word is पौधा[space]....
# hypernyms holds the lines of file-2, words holds the words of file-1
for i in hypernyms:
    print "file2",i
for i in words:
    print "file1",i
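A plain print renders trailing whitespace invisibly, so it can help to print an escaped form of each string instead. The snippet below is only a sketch in Python 3 (the thread itself uses Python 2), and the list is an assumed stand-in for what readlines() would return for file-1:

```python
# Assumed stand-in for the list that readlines() returns for file-1.
words = ["पौधा\n", "वनस्पति\n"]

for word in words:
    # ascii() escapes non-ASCII characters and control characters,
    # so a trailing "\n" shows up literally in the output.
    print(ascii(word))

# strip() removes leading and trailing whitespace, including the "\n".
stripped = [word.strip() for word in words]
```

Each printed entry ends in \n inside the quotes, which is the extra character that ends up being searched for.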
Upvotes: 0
Reputation:
That is because you don't remove the "\n" character at the end of each line. So you are searching for "some_pattern\n", not "some_pattern". Use the strip() function to chop it off, like this:
import codecs
# Strip the trailing "\n" from every search word
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []
for line in hypernyms:
    count_arr.append(0)
    for word in words:
        # True counts as 1, False as 0
        count_arr[-1] += (word in line)
for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count
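To check the effect of strip() without touching the files, here is a small Python 3 sketch that runs on the sample data from the question. The helper name count_matches is made up for illustration, and the lists assume that every line in both files ends with "\n":

```python
def count_matches(lines, words):
    """Return, for each line, how many of the words occur in it as substrings."""
    return [sum(word in line for word in words) for line in lines]

# Sample data from the question, as readlines() would return it
# (assumption: every line ends with "\n").
file1 = ["पौधा\n", "वनस्पति\n"]
file2 = [
    "वनस्पति, पेड़-पौधा\n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n",
    "पेड़-पौधा\n",
]

# Words still carry "\n": they only match at the very end of a line.
raw_counts = count_matches(file2, file1)

# Words stripped first: they match anywhere in a line.
clean_counts = count_matches(file2, [w.strip() for w in file1])
```

With the unstripped words only lines 0 and 3 get a count, which is the output reported in the question. After stripping, line 0 counts both words and line 2 is picked up as well.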
Upvotes: 0
Reputation: 353379
I think the problem is here:
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
.readlines() will leave the line break at the end, so you're not searching for पौधा, you're searching for पौधा\n, and you'll only match at the end of a line. If I use .read().split() instead, I get
0 2
2 1
3 1
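One thing to keep in mind with .read().split(): it splits on any whitespace, not only on line breaks. That is fine for the one-word-per-line file here, but an entry containing a space would be cut into pieces. The Python 3 sketch below compares the options on an in-memory stand-in for file-1; the third entry is a hypothetical multi-word line added only to show the difference:

```python
import io

# In-memory stand-in for file-1. The last line is a hypothetical
# multi-word entry, not part of the original data.
text = "पौधा\nवनस्पति\nहरा पौधा\n"

# readlines() keeps the trailing "\n" on every entry.
with_newlines = io.StringIO(text).readlines()

# read().split() drops the "\n" but also splits on the space.
by_whitespace = io.StringIO(text).read().split()

# read().splitlines() drops the "\n" and keeps each line whole.
by_line = io.StringIO(text).read().splitlines()
```

If the entries can ever contain spaces, splitlines() or a per-line strip() is probably the safer choice.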
Upvotes: 1