Reputation: 370
I am trying to create a Python script that reads data from a text file and checks whether each line starts with a dot followed by two letters, which will tell me if it is a country code. I have tried using split and other methods but have not got it to work. Here is the code I have so far:
# Python program to
# demonstrate reading files
# using for loop
import re
file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')
# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0
# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)
        noncountrycount += 1
print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()
The txt file has this in it
.aaa generic American Automobile Association, Inc.
.aarp generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc generic Disney Enterprises, Inc.
.able generic Able Inc.
.abogado generic Minds + Machines Group Limited
.abudhabi generic Abu Dhabi Systems and Information Centre
.ac country-code Internet Computer Bureau Limited
.academy generic Binky Moon, LLC
.accenture generic Accenture plc
.accountant generic dot Accountant Limited
.accountants generic Binky Moon, LLC
.aco generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor generic United TLD Holdco Ltd.
.ad country-code Andorra Telecom
.adac generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads generic Charleston Road Registry Inc.
.adult generic ICM Registry AD LLC
.ae country-code Telecommunication Regulatory Authority (TRA)
.aeg generic Aktiebolaget Electrolux
.aero sponsored Societe Internationale de Telecommunications Aeronautique (SITA INC USA)
I am getting this error now:
File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in <module>
    for line in file1:
File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to <undefined>
Upvotes: 1
Views: 1907
Reputation: 9418
It's usually not only an issue with the code, so we need all the context to reproduce, debug and solve.
The final hint was the console output (error, stacktrace) you pasted.
This is how I read & analyze the error-output (Python's stacktrace):
... C:/Users/tyler/Desktop ...
... findcountrycodes/Test.py", line 15 ...
... Python36\lib\encodings\cp1252.py ...
... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:
From this output we can extract important contextual information to research & solve the issue:
Test.py points to the erroneous statement reading the file: file1 = open('myfile.txt', 'r')
UnicodeDecodeError is a frequently occurring Python exception when reading files.
You can now research UnicodeDecodeError. One answer suggests using the nowadays common UTF-8:
open(filename, encoding="utf8")
A methodical approach would be to check the file's actual encoding (for example in Notepad++) and pass that encoding to open.
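The encoding fix can be sketched as follows. This is a minimal sketch, assuming the input file is really UTF-8 encoded, which the question does not confirm. The helper name read_lines and the second demo line are hypothetical and only serve to reproduce the error: the character U+00D0 encodes to the bytes C3 90 in UTF-8, and byte 0x90 has no mapping in Python's cp1252 codec.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding and return its lines."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "myfile.txt")
    # Hypothetical demo data; the second line contains U+00D0 (bytes C3 90).
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write(".ac country-code Internet Computer Bureau Limited\n")
        handle.write(".demo generic \u00d0emo Registry\n")

    # Decoding with cp1252 (the default seen in the traceback) fails on 0x90.
    try:
        read_lines(demo_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding with the encoding the file was written in succeeds.
    lines = read_lines(demo_path)

print(cp1252_failed, len(lines))
```

If the file turns out to use a different encoding, pass that name instead of "utf-8".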
So you want only the lines with country-codes.
These 3 lines of your input file are then expected to pass the filter:
.ad country-code Andorra Telecom
.ac country-code Internet Computer Bureau Limited
.ae country-code Telecommunication Regulatory Authority (TRA)
As you already did, test each line of the file.
Test if the line starts with these 4 characters: a dot, two letters xx, and a whitespace (where xx can be any ASCII letters).
This regular expression tests for a valid two-letter country code:
^\.\w{2}\s
^ from the start of the string (line)
\. (first) letter should be a dot
\w{2} (followed by) any two word-characters (⚠️ also matches _0)
\s (followed by) a single whitespace (blank, tab, etc.)
This is done in your code as follows (assuming line is populated from read lines):
import re
line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')
Here is a runnable demo on IDEone
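Since \w also accepts digits and the underscore, a stricter variant may be preferable. The following is a sketch, not part of the original answer: the names LOOSE and STRICT and the third sample line are made up for illustration, and it assumes country codes consist of ASCII letters only.

```python
import re

# Pattern from the question: \w also matches digits and "_".
LOOSE = re.compile(r'^\.\w{2}\s')
# Stricter variant (assumption): exactly two ASCII letters after the dot.
STRICT = re.compile(r'^\.[A-Za-z]{2}\s')

samples = [
    '.ad country-code Andorra Telecom',
    '.abb generic ABB Ltd',
    '._0 hypothetical entry',  # made-up line to show the difference
]

for line in samples:
    token = line.split()[0]
    print(token, bool(LOOSE.match(line)), bool(STRICT.match(line)))
```

Only the made-up third line is treated differently by the two patterns; on the sample data from the question both behave the same.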
Upvotes: 1
Reputation: 23783
You are splitting on three spaces, but the country codes are followed by only one space, so your logic is wrong.
>>> s = '.ac country-code    Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>
Check if the third character is not a space and the fourth character is a space.
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
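The same position check could be wrapped in a small function so it can be applied to every line of the file. This is only a sketch: the function name two_letter_code is hypothetical, and the length guard and the check for the leading dot are additions that the session above does not include.

```python
def two_letter_code(line):
    """Return the leading '.xx' token, or None if the line has no such token."""
    s = line.strip()
    # Guard against short or empty lines before indexing (added assumption).
    if len(s) < 4:
        return None
    # Dot first, no space in positions 1-2, and a space in position 3.
    if s[0] == '.' and ' ' not in s[1:3] and s[3] == ' ':
        return s[:3]
    return None


print(two_letter_code('.ac country-code Internet Computer Bureau Limited'))
print(two_letter_code('.abogado generic Minds + Machines Group Limited'))
```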
Upvotes: 1
Reputation: 6176
Is this something you were looking for?
with open('lorem.txt') as file:
    data = file.readlines()
for line in data:
    temp = line.split()[0]
    if len(temp) == 3:
        print(temp)
In short:
file.readlines() in this case returns a list of all lines in the file; in effect it splits the file on \n (each line keeps its trailing newline).
Each line is then split on whitespace. Since the code you need comes first in the line, it is also the first item in the resulting list. Because your formatting seems consistent, a first item that is exactly 3 characters long (the dot plus two letters) is a country code.
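A variant of this idea that separates the lines into two groups, as the question intends with its two output files, might look like the sketch below. The function name split_by_code_length is hypothetical. Skipping blank lines and requiring a leading dot are additions, assumed here to avoid an IndexError on empty lines and false matches on other 3-character tokens.

```python
def split_by_code_length(lines):
    """Split lines into (country, other) based on the first token's length."""
    country, other = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines (added guard)
        first = parts[0]
        if len(first) == 3 and first.startswith('.'):
            country.append(line.rstrip('\n'))
        else:
            other.append(line.rstrip('\n'))
    return country, other


# Sample lines taken from the question's input file.
sample = [
    '.abudhabi generic Abu Dhabi Systems and Information Centre\n',
    '.ac country-code Internet Computer Bureau Limited\n',
    '\n',
    '.ad country-code Andorra Telecom\n',
    '.ads generic Charleston Road Registry Inc.\n',
]

country, other = split_by_code_length(sample)
print(len(country), len(other))
```

The two lists could then be written to the output files with one write call per line.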
Upvotes: 2