jon rios

Reputation: 398

Unknown characters in scraped data

I am using pandas to import a CSV file containing a table of rows and columns, mostly text. Some of the text contains the characters shown below, and some of them repeat multiple times. I'm not sure what they are or how to handle them. I have tried multiple encodings; the characters change but don't go away. Is there a script, process, or encoding that cleans up these kinds of characters?

ENCODING UTF-8
.billion stored in the world’s largest database bought for £6, according to an investigation
.Caused, the NMBS said, by a data worker “clicking on the wrong button”.'
.there’s a good chance that you’re one of, one of the nation’s three major credit reporting agencies.'
ENCODING CP1252
.billion stored in the world’s largest database bought for £6, according to an investigation
.Caused, the NMBS said, by a data worker “clicking on the wrong button”.
.there’s a good chance that you’re one of, one of the nation’s three major credit reporting agencies.'

Upvotes: 1

Views: 206

Answers (2)

jon rios

Reputation: 398

I ended up just finding all the individual characters and replacing them with nothing. The sentences seem to read fine; they are missing a couple of apostrophes but are still readable.

spec_chars = ['Ä', 'ù', 'ú', 'ì', 'ô', 'ˆ', 'š', '€', '¢', 'Ã', '„', '¬', 'º', '¹', 'Â', '£', '§', '´']

# mytext is a single string to clean
for ch in spec_chars:
    mytext = mytext.replace(ch, "")

# or over the entire DataFrame (same characters, written as a regex character class)

df.replace(regex='[ÄùúìôˆšÃ€¢„¬º¹Â£§´]', value="", inplace=True)
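Deleting the characters also deletes the apostrophes and quotes they stood for. Sequences like these usually appear when UTF-8 bytes were decoded with a different codec, so an alternative is to reverse that step. The following is only a sketch under that assumption: `fix_mojibake` is a hypothetical helper, and the candidate codecs (`mac_roman`, `cp1252`) are guesses based on the characters listed above, not something confirmed for this file.

```python
def fix_mojibake(text, wrong_encodings=("mac_roman", "cp1252")):
    """Try to undo UTF-8 text that was decoded with the wrong codec.

    Each candidate codec is an assumption; the first one that
    round-trips to valid UTF-8 wins, otherwise the text is returned
    unchanged.
    """
    for enc in wrong_encodings:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# Simulate the damage, then repair it.
garbled = "the world’s largest database".encode("utf-8").decode("mac_roman")
repaired = fix_mojibake(garbled)
```

On a DataFrame this could be applied per cell, for example `df["text"].map(fix_mojibake)`, where `"text"` is a placeholder column name. Text that is already clean passes through unchanged because the round trip fails and the function falls back to the input.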

Upvotes: 1

DonCarleone

Reputation: 839

You can keep only the words that contain no foreign characters:

from string import ascii_letters, punctuation

words = ["database", "world’s", "£6"]  # example input; replace with your own list of words
allowed = set(ascii_letters+punctuation)

output = [word for word in words if all(letter in allowed for letter in word)]

See Python - remove elements (foreign characters) from list
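Note that the list comprehension in this answer discards a whole word as soon as one of its characters falls outside the allowed set. If the aim is to keep the word and drop only the offending characters, a per-character variant might look like the sketch below. `strip_disallowed` and `ALLOWED` are illustrative names, and adding digits and the space character to the allowed set is an assumption about what should survive.

```python
from string import ascii_letters, digits, punctuation

# Characters that are kept; everything else is dropped.
ALLOWED = set(ascii_letters + digits + punctuation + " ")


def strip_disallowed(text):
    """Return text with every character outside ALLOWED removed."""
    return "".join(ch for ch in text if ch in ALLOWED)


cleaned = strip_disallowed("the world’s largest database")
```

Like the replace-based approach in the other answer, this loses the original punctuation: the apostrophe is removed along with the stray characters.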

Upvotes: 0
