Python - first word is not being included in search

Question

Why is the first word printed but not found when I look it up in my_dic?

Can anyone tell me the reason, and how to make the first word match as well?

Here is my code:

my_dic = {
"a":"1", 
"b":"2", 
"c":"3", 
"d":"4", 
"e":"5", 
}

with open(r'c:\english_text_file.txt', encoding='utf8') as file:
  for line in file:
    for word in line.split():
      print('word from line.split: ', word)
      if word in my_dic.keys():
        print('word from if word in ...', word)

The contents of the test file are:

a b c d e

The output is:

word from line.split:  a
word from line.split:  b
word from if word in ... b
word from line.split:  c
word from if word in ... c
word from line.split:  d
word from if word in ... d
word from line.split:  e
word from if word in ... e

atline · Accepted Answer

This happens because some Windows programs (Notepad, for example) can add a BOM to the start of a text file when saving it as UTF-8.

What is a BOM?

BOM stands for byte order mark. Its value depends on the encoding:

Byte-order mark   Description
EF BB BF          UTF-8
FF FE             UTF-16, little-endian
FE FF             UTF-16, big-endian
FF FE 00 00       UTF-32, little-endian
00 00 FE FF       UTF-32, big-endian
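
The standard codecs module exposes these marks as constants, so the byte values do not have to be memorized. A minimal sketch that prints them (the BOMS dictionary and its labels are only illustrative names):

```python
import codecs

# BOM constants shipped with the standard library, keyed by a readable label.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 LE": codecs.BOM_UTF16_LE,
    "UTF-16 BE": codecs.BOM_UTF16_BE,
    "UTF-32 LE": codecs.BOM_UTF32_LE,
    "UTF-32 BE": codecs.BOM_UTF32_BE,
}

for name, bom in BOMS.items():
    # bytes.hex(' ') needs Python 3.8+; it puts a space between bytes.
    print(f"{name:10} {bom.hex(' ').upper()}")
```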

Open your english_text_file.txt in any hex editor and you will see that the content is:

efbb bf61 2062 2063 2064 2065 0d0a

Here, efbb bf is the BOM, and 61 2062 2063 2064 2065 is the ASCII encoding of a b c d e (61 to 65 are the letters, 20 is a space), followed by 0d0a, the Windows line ending.

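When such a file is read with encoding='utf8', the BOM is decoded as the invisible character U+FEFF and stays attached to the first word. The word prints as a, but it is really '\ufeffa', which is not a key in my_dic. So for a UTF-8 file we need to check whether it starts with a BOM and, if it does, remove it.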
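
To see the effect without touching any file, here is a small sketch that decodes the same bytes as in the hex dump above. The variable names raw and words are just for illustration:

```python
# The bytes from the hex dump: UTF-8 BOM, then "a b c d e", then CRLF.
raw = b"\xef\xbb\xbfa b c d e\r\n"

# Decoding with plain 'utf8' keeps the BOM as the character U+FEFF.
words = raw.decode("utf8").split()

print(words[0])          # looks like a, because U+FEFF has zero width
print(repr(words[0]))    # '\ufeffa'
print(words[0] == "a")   # False
```

repr() is a handy way to spot invisible characters like this one when a comparison fails for no visible reason.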

Below is some sample code for reference. If you do not mind changing the original file, you can overwrite it directly; here I write to a new file that has no BOM.

import codecs

my_dic = {
    "a":"1",
    "b":"2",
    "c":"3",
    "d":"4",
    "e":"5",
}

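# Read the raw bytes so that the BOM, if any, is visible.
with open('./english_text_file.txt', 'rb') as src:
    content = src.read()

# Strip the UTF-8 BOM if the file starts with one.
if content.startswith(codecs.BOM_UTF8):
    content = content[len(codecs.BOM_UTF8):]

# Write the (possibly stripped) bytes to a new file.
with open('./changed_english_text_file.txt', 'wb') as dst:
    dst.write(content)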

with open('./changed_english_text_file.txt',encoding = 'utf8') as file :
    for line in file:
        for word in line.split():
            print('word from line.split: ',word)
            if word in my_dic.keys():
                print('word from if word in ...',word)

Output is:

word from line.split:  a
word from if word in ... a
word from line.split:  b
word from if word in ... b
word from line.split:  c
word from if word in ... c
word from line.split:  d
word from if word in ... d
word from line.split:  e
word from if word in ... e
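
As an alternative that is not part of the code above, Python also ships a 'utf-8-sig' codec that skips a leading UTF-8 BOM while decoding and reads BOM-less UTF-8 unchanged, so no second file is needed. A minimal sketch, assuming you only need to read the file; the helper name find_known_words and the temporary file are made up for this demo:

```python
import codecs
import os
import tempfile

my_dic = {"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}

def find_known_words(path, known):
    """Return the words in the file at `path` that are keys of `known`."""
    found = []
    # 'utf-8-sig' drops a leading UTF-8 BOM if one is present.
    with open(path, encoding="utf-8-sig") as file:
        for line in file:
            for word in line.split():
                if word in known:
                    found.append(word)
    return found

# Demo: create a throwaway file that starts with a BOM, like the one in the question.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(codecs.BOM_UTF8 + b"a b c d e\r\n")
    demo_path = tmp.name

try:
    print(find_known_words(demo_path, my_dic))  # ['a', 'b', 'c', 'd', 'e']
finally:
    os.remove(demo_path)
```

In the original script this would amount to changing only the encoding argument of open() to 'utf-8-sig'.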
