ShadowGunn

Reputation: 370

How to check txt file for content

I am trying to create a Python script that reads data from a text file and checks whether each line has .(two letters), which will tell me if it is a country code. I have tried using split and other methods but have not got it to work. Here is the code I have so far:

# Python program to
# demonstrate reading files
# using for loop
import re

file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')
# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0
# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)

        noncountrycount += 1

print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()

The txt file has this in it

.aaa    generic American Automobile Association, Inc.
.aarp   generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb    generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc    generic Disney Enterprises, Inc.
.able   generic Able Inc.
.abogado    generic Minds + Machines Group Limited
.abudhabi   generic Abu Dhabi Systems and Information Centre
.ac country-code    Internet Computer Bureau Limited
.academy    generic Binky Moon, LLC
.accenture  generic Accenture plc
.accountant generic dot Accountant Limited
.accountants    generic Binky Moon, LLC
.aco    generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor  generic United TLD Holdco Ltd.
.ad country-code    Andorra Telecom
.adac   generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads    generic Charleston Road Registry Inc.
.adult  generic ICM Registry AD LLC
.ae country-code    Telecommunication Regulatory Authority (TRA)
.aeg    generic Aktiebolaget Electrolux
.aero   sponsored   Societe Internationale de Telecommunications Aeronautique (SITA INC USA)

I am getting this error now:

File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in
    for line in file1:
File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to

Upvotes: 1

Views: 1907

Answers (3)

hc_dev

Reputation: 9418

It's usually not only an issue with the code, so we need all the context to reproduce, debug and solve.

Encoding error

The final hint was the console output (error, stacktrace) you pasted.

Read the stacktrace & research

This is how I read & analyze the error-output (Python's stacktrace):

... C:/Users/tyler/Desktop ...

... findcountrycodes/Test.py", line 15 ...

... Python36\lib\encodings\cp1252.py ...

... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:

From this output we can extract important contextual information to research & solve the issue:

  1. you are using Windows
  2. line 15 in your script Test.py is where reading fails: for line in file1: (the loop reads and decodes the file opened by file1 = open('myfile.txt', 'r'))
  3. you are using Python 3.6 and the encoding in use was Windows-1252 (cp1252)
  4. the root cause is UnicodeDecodeError, a frequently occurring Python exception when reading files

You can now:

  • research Stack Overflow and the web for this exception: UnicodeDecodeError.
  • improve your question by adding this context (as keywords, tag, or dump as plain output)

Try a different encoding

One answer suggests using UTF-8, which is the most common encoding nowadays: open(filename, encoding="utf8")
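A minimal sketch of reading with an explicit encoding, assuming the file really is UTF-8 (the question does not confirm this). The temporary file, its second line and the helper read_lines are made up for the demo. The Cyrillic letter in the sample is encoded in UTF-8 as the bytes 0xD0 0x90, so it contains the same byte 0x90 that cp1252 refused to decode:

```python
import os
import tempfile

# Hypothetical sample data: '\u0410' (Cyrillic 'A') becomes 0xD0 0x90 in UTF-8,
# and 0x90 has no mapping in Python's cp1252 codec.
sample = ".ac country-code\tInternet Computer Bureau Limited\n.xx made-up entry \u0410\n"

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)


def read_lines(filename, encoding="utf-8"):
    """Read all lines using an explicit encoding instead of the platform default."""
    with open(filename, "r", encoding=encoding) as f:
        return f.readlines()


lines = read_lines(path)
print(len(lines))
```

Calling read_lines(path, encoding="cp1252") on the same file raises the UnicodeDecodeError from the question, which is why the explicit encoding matters on Windows.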

Detect the file encoding

A methodical approach would be:

  1. check the file's encoding or charset, e.g. using an editor such as Notepad or Notepad++ on Windows
  2. open the file in your Python code with the proper encoding
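If no editor is at hand, the two steps above can be roughly approximated in Python by trial decoding. This is only a sketch, not real detection: the helper guess_encoding and its candidate list are my own assumptions, and a successful decode does not prove the encoding is the right one (latin-1, for example, accepts every byte).

```python
def guess_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# 0xD0 0x90 is valid UTF-8, while 0x90 alone is undefined in cp1252.
print(guess_encoding(b".ac country-code \xd0\x90"))  # utf-8
```

For a real file you would pass the raw bytes, e.g. guess_encoding(open('myfile.txt', 'rb').read()), and then use the result as the encoding argument of open.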


Filtering lines for country-codes

So you want only the lines with country-codes.

Filtering expected

Then these 3 lines of your input file are expected to pass the filter:

.ad country-code    Andorra Telecom
.ac country-code    Internet Computer Bureau Limited
.ae country-code    Telecommunication Regulatory Authority (TRA)

Solution using regex

As you already did, test each line of the file. Test if the line starts with these 4 characters: a dot, two letters xx (where xx can be any ASCII letters), and a whitespace.

Regex explained

This regular expression tests for a valid two-letter country code:

^\.\w{2}\s
  • ^ from the start of the string (line)
  • \. (first) character should be a literal dot
  • \w{2} (followed by) any two word-characters (⚠️ also matches _0)
  • \s (followed by) a single whitespace (blank, tab, etc.)
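To illustrate the warning about \w, here is a small comparison with a stricter character class. The strict pattern assumes the codes are always lowercase ASCII letters, which holds for the sample in the question but is an assumption for the full file. The line starting with ._0 is made up for the demo.

```python
import re

loose = re.compile(r'^\.\w{2}\s')       # pattern from the question
strict = re.compile(r'^\.[a-z]{2}\s')   # assumption: lowercase ASCII letters only

for line in ['.ac country-code', '.aaa    generic', '._0 made-up entry']:
    print(repr(line), bool(loose.match(line)), bool(strict.match(line)))
```

Both patterns accept .ac and reject .aaa, but only the loose one accepts the made-up ._0 line.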

Python code

This is done in your code as follows (assuming the line is populated from read lines):

import re

line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')

Here is a runnable demo on IDEone
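Putting both parts together: the script in the question opens contry.txt and noncountry.txt but never writes to them. Below is a sketch of how the filtering and writing could look. The function names are mine, and encoding="utf-8" is an assumption about the input file.

```python
import re

PATTERN = re.compile(r'^\.\w{2}\s')  # same pattern as in the question


def split_lines(lines):
    """Return (country_lines, other_lines) according to PATTERN."""
    country, other = [], []
    for line in lines:
        if PATTERN.match(line):
            country.append(line)
        else:
            other.append(line)
    return country, other


def filter_file(src, country_dst, other_dst, encoding="utf-8"):
    """Read src, write matching and non-matching lines to two files, return both counts."""
    with open(src, "r", encoding=encoding) as f:
        country, other = split_lines(f)
    with open(country_dst, "w", encoding=encoding) as f:
        f.writelines(country)
    with open(other_dst, "w", encoding=encoding) as f:
        f.writelines(other)
    return len(country), len(other)


# Four lines taken from the sample in the question
sample = [
    ".abudhabi   generic Abu Dhabi Systems and Information Centre\n",
    ".ac country-code    Internet Computer Bureau Limited\n",
    ".academy    generic Binky Moon, LLC\n",
    ".ad country-code    Andorra Telecom\n",
]
country, other = split_lines(sample)
print(len(country), len(other))  # 2 2
```

With the file names from the question the call would be filter_file('myfile.txt', 'contry.txt', 'noncountry.txt'). The with blocks also close the files automatically, so the explicit close() calls are no longer needed.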


Upvotes: 1

wwii

Reputation: 23783

You are splitting on three spaces, but the country codes are only followed by one space, so your logic is wrong.

>>> s = '.ac country-code    Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>

Check if the third character is not a space and the fourth character is a space.

>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado    generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
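The same check can be wrapped in a function so that it is safe to call on every line of the file. This is a sketch with two small additions of mine that are not in the session above: a length guard so that short or empty lines do not raise an IndexError, and isspace() instead of a comparison with ' ' so that a tab after the code also counts. The name leading_country_code is hypothetical.

```python
def leading_country_code(line):
    """Return the first three characters if the fourth is whitespace and the third is not, else None."""
    if len(line) >= 4 and not line[2].isspace() and line[3].isspace():
        return line[:3]
    return None


print(leading_country_code('.ac country-code    Internet Computer Bureau Limited'))  # .ac
print(leading_country_code('.abogado    generic Minds + Machines Group Limited'))    # None
```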

Upvotes: 1

Matiiss

Reputation: 6176

Is this what you were looking for?

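with open('lorem.txt') as file:
    data = file.readlines()

for line in data:
    # split() without arguments splits on any run of whitespace
    parts = line.split()
    # Skip blank lines, which would otherwise raise an IndexError
    if parts and len(parts[0]) == 3:
        print(parts[0])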

In short:

file.readlines() returns a list of all lines in the file; in effect it splits the file at each \n.

Each of those lines is then split further on whitespace. Since the code you need comes first in the line, it is also the first item in the resulting list. Because your formatting seems pretty consistent, it is enough to check whether that first item is 3 characters long (the dot plus two letters): only such an item will be a country code.
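The claim that only a first item of length 3 is a country code can be cross-checked against the second column of the file. The sketch below assumes that the label country-code seen in the question's sample is used consistently in the whole file. The helper names by_length and by_label are made up.

```python
# Lines taken from the sample in the question
sample = [
    ".aaa    generic American Automobile Association, Inc.",
    ".abb    generic ABB Ltd",
    ".ac country-code    Internet Computer Bureau Limited",
    ".ad country-code    Andorra Telecom",
    ".aero   sponsored   Societe Internationale de Telecommunications Aeronautique (SITA INC USA)",
]


def by_length(line):
    """True if the first whitespace-separated item is 3 characters long."""
    parts = line.split()
    return bool(parts) and len(parts[0]) == 3


def by_label(line):
    """True if the second whitespace-separated item is the label 'country-code'."""
    parts = line.split()
    return len(parts) > 1 and parts[1] == "country-code"


for line in sample:
    print(line.split()[0], by_length(line), by_label(line))
```

On these sample lines both checks agree, which supports relying on the length alone as long as the formatting stays consistent.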

Upvotes: 2
