Reputation: 47

Python - How to read multiple lines from text file as a string and remove all encoding?

I have a list of 77 items. I have placed all 77 items in a text file (one per line).

I am trying to read this into my Python script (where I will then compare each item to another list pulled via an API).

Problem: for some reason, 2 of the 77 items come through with extra characters, "\u00c2" and "\u00a2", so they don't compare equal and are missed. I have no idea why these 2 are affected while the other 75 are fine, and I don't know how to get rid of the extra characters in Python.

Question:

In Python, how can I strip these out so that every item is plain text with no special/weird characters?
Is there a way to do this while reading the file in?

Here is how I am reading the text file into python:

with open("name_list2.txt", "r") as myfile:
    policy_match_list = myfile.readlines()

policy_match_list = [x.strip() for x in policy_match_list]

Note - "policy_match_list" is the list of 77 policies read in from the text file.

Here is how I am comparing my two lists:

    for policy_name in policy_match_list:
        for us_policy in us_policies:
            if policy_name == us_policy["name"]:
                print(f"Match #{match} | {policy_name}")
                match += 1

Note - "us_policies" is another list of thousands of policies, pulled via API, that I am comparing against.

This gives 75 of the 77 expected matches, because the other 2 policies end up comparing e.g. "text12 - text" to "text12\u00c2-\u00a2text" rather than "text12 - text" to "text12 - text".

I hope this makes sense, let me know if I can add any further info

Cheers!

Upvotes: 1

Answers (2)

Abhinav Mathur

Reputation: 8101

Some characters in the file aren't being decoded properly. In your case, the characters \u00c2 and \u00a2 are causing the mismatch. I see two possible fixes:

Replace the offending characters to repair the text (see https://stackoverflow.com/a/56967370).
Copy the text into a new plain-text file (if possible) and save it. The extra characters tend to be dropped in that case.
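
For the first option, here is a minimal sketch of what replacing the characters could look like. It is not taken from the linked answer. The helper name `clean_name` and the choice to turn each stray character into a space are assumptions, picked because the question's example expects "text12 - text"; adjust the mapping if your real data differs.

```python
import re

# Assumption: these are the two stray characters reported in the question.
STRAY_CHARS = "\u00c2\u00a2"


def clean_name(text):
    """Replace each stray character with a space, then collapse whitespace."""
    table = str.maketrans({ch: " " for ch in STRAY_CHARS})
    return re.sub(r"\s+", " ", text.translate(table)).strip()


print(clean_name("text12\u00c2-\u00a2text"))  # text12 - text
```

You could apply it when building the list, e.g. `[clean_name(x) for x in policy_match_list]`, so that both sides of the comparison are plain text.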

Upvotes: 0

Nimrod Rappaport

Reputation: 153

Did you try opening the file with an explicit UTF-8 encoding? Since I can't see the file, I can't be sure this is the problem, but the file might contain characters that the default encoding can't decode properly. The default is platform-dependent (it comes from the locale, and on Windows it is often a Latin-style code page such as cp1252). Try:

with open("name_list2.txt", "r", encoding="utf-8") as myfile:
    policy_match_list = [x.strip() for x in myfile]

Also, see this question about how to handle control characters: Python - how to delete hidden signs from string?
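
The two suggestions above (decode as UTF-8, then remove hidden characters) could be combined roughly as follows. This is only a sketch: it assumes the file really is UTF-8, the function name `read_names` is made up for illustration, and the demo writes a throwaway file in place of name_list2.txt. The non-breaking space and zero-width space in the demo are example characters, not necessarily the ones in your file.

```python
import os
import tempfile
import unicodedata


def read_names(path):
    """Read one name per line as UTF-8 and drop invisible characters."""
    names = []
    # "utf-8-sig" decodes UTF-8 and also skips a leading byte order mark.
    with open(path, "r", encoding="utf-8-sig") as myfile:
        for line in myfile:
            # NFKC turns compatibility characters such as the
            # non-breaking space (U+00A0) into their plain equivalents.
            line = unicodedata.normalize("NFKC", line)
            # Drop control (Cc) and format (Cf) characters; this also
            # removes the trailing newline and any tabs.
            line = "".join(
                ch for ch in line
                if unicodedata.category(ch) not in ("Cc", "Cf")
            )
            names.append(line.strip())
    return names


# Demo with a throwaway file standing in for name_list2.txt.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("text12\u00a0-\u00a0text\nplain\u200bname\n")
demo_path = tmp.name
demo_names = read_names(demo_path)
os.remove(demo_path)
print(demo_names)  # ['text12 - text', 'plainname']
```

If the names from the API can contain the same kind of characters, it may be worth passing them through the same normalization before comparing.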

Sorry for posting this as an answer rather than a comment (I really don't know if this is the solution); I don't have enough reputation to comment.

Upvotes: 1
