Reputation: 57
Why is the first word printed but not found when it is looked up in the dictionary?
Can anyone tell me the reason, and how to make the lookup match the first word as well?
Here is my code:
my_dic = {
    "a": "1",
    "b": "2",
    "c": "3",
    "d": "4",
    "e": "5",
}
with open('c:\\english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic.keys():
                print('word from if word in ...', word)
The contents of the text file are:
a b c d e
The output of the code is:
word from line.split: a
word from line.split: b
word from if word in ... b
word from line.split: c
word from if word in ... c
word from line.split: d
word from if word in ... d
word from line.split: e
word from if word in ... e
Upvotes: 1
Views: 95
Reputation: 31614
This happens because some Windows editors (Notepad, for example) add a BOM
to the start of a text file when saving it as UTF-8.
What is a BOM?
It stands for byte-order mark. The common values are as follows:
Byte-order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16 (aka UCS-2), little-endian
FE FF              UTF-16 (aka UCS-2), big-endian
FF FE 00 00        UTF-32 (aka UCS-4), little-endian
00 00 FE FF        UTF-32 (aka UCS-4), big-endian
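As a rough illustration of the table above, the sketch below checks the first bytes of some data against the BOM constants that Python's codecs module provides. The helper name detect_bom and the returned labels are made up for this example; the only real API used is the codecs.BOM_* constants. The UTF-32 entries are checked first because the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return a label for the BOM that `data` starts with, or None.

    `detect_bom` is a hypothetical helper written for this example.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfa b c d e\r\n"))  # utf-8
print(detect_bom(b"a b c d e\r\n"))              # None
```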
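Open your english_text_file.txt in any hex editor and you will see that the content is:
efbb bf61 2062 2063 2064 2065 0d0a
Here, efbb bf is the BOM, and 61 2062 2063 2064 2065 0d0a is the ASCII encoding of a b c d e\r\n.
Opening the file with encoding = 'utf8' does not strip the BOM; it is decoded as the invisible character U+FEFF, so the first word is actually '\ufeffa' rather than 'a'. It looks like a when printed, but it does not match the key "a".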
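The mismatch can be reproduced without the file at all. The following sketch decodes the same bytes shown in the hex dump and inspects the first word with repr(); the variable names are only for illustration.

```python
# The exact bytes from the hex dump: BOM + "a b c d e" + CRLF.
raw = bytes.fromhex("ef bb bf 61 20 62 20 63 20 64 20 65 0d 0a")

# Plain 'utf8' keeps the BOM as the character U+FEFF.
text = raw.decode("utf8")
first = text.split()[0]

print(repr(first))   # '\ufeffa'
print(first == "a")  # False

# Stripping the leading U+FEFF gives the word that was expected.
cleaned = first.lstrip("\ufeff")
print(cleaned == "a")  # True
```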
So for a UTF-8 file, we need to check whether it has a BOM at the start and, if it does, remove it.
Below is some sample code for your reference. If you do not mind changing the original file, you can overwrite the input file directly; here I write a new file with no BOM in it instead.
import codecs

my_dic = {
    "a": "1",
    "b": "2",
    "c": "3",
    "d": "4",
    "e": "5",
}

# Read the raw bytes so the BOM can be detected before decoding.
with open('./english_text_file.txt', 'rb') as src:
    content = src.read()

# Drop the 3-byte UTF-8 BOM if the file starts with it.
if content.startswith(codecs.BOM_UTF8):
    content = content[len(codecs.BOM_UTF8):]

# Write the (possibly stripped) bytes to a new file.
with open('./changed_english_text_file.txt', 'wb') as dst:
    dst.write(content)

with open('./changed_english_text_file.txt', encoding='utf8') as file:
    for line in file:
        for word in line.split():
            print('word from line.split: ', word)
            if word in my_dic:
                print('word from if word in ...', word)
Output is:
word from line.split: a
word from if word in ... a
word from line.split: b
word from if word in ... b
word from line.split: c
word from if word in ... c
word from line.split: d
word from if word in ... d
word from line.split: e
word from if word in ... e
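An alternative that avoids rewriting the file, offered here as a sketch and not as part of the code above: Python ships a 'utf-8-sig' codec that skips a leading UTF-8 BOM when decoding and reads BOM-less files unchanged. The helper name words_in_dic and the temporary file are assumptions made so the example is self-contained.

```python
import codecs
import os
import tempfile

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}


def words_in_dic(path, dic):
    """Return the words in the file at `path` that are keys of `dic`.

    Hypothetical helper for this example; 'utf-8-sig' drops a leading BOM.
    """
    found = []
    with open(path, encoding="utf-8-sig") as file:
        for line in file:
            for word in line.split():
                if word in dic:
                    found.append(word)
    return found


# Build a throwaway file that starts with a BOM, like the one in the question.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "english_text_file.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"a b c d e\r\n")
    print(words_in_dic(path, my_dic))  # ['a', 'b', 'c', 'd', 'e']
```

In the question's code this would amount to changing encoding = 'utf8' to encoding = 'utf-8-sig' in the open() call.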
Upvotes: 2