Nazim Kerimbekov

Reputation: 4783

Decoding UTF8 literals in a CSV file

Question:

Does anyone know how I could transform b"it\\xe2\\x80\\x99s time to eat" into it's time to eat?


More details & my code:

Hello everyone,

I'm currently working with a CSV file that is full of rows containing UTF-8 literals, for example:

b"it\xe2\x80\x99s time to eat"

The end goal is to get something like this:

it's time to eat

To achieve this I have tried using the following code:

import pandas as pd


file_open = pd.read_csv("/Users/Downloads/tweets.csv")

file_open["text"]=file_open["text"].str.replace("b\'", "")

file_open["text"]=file_open["text"].str.encode('ascii').astype(str)

file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]

print(file_open["text"])

After running the code, the row that I took as an example is printed out as:

it\xe2\x80\x99s time to eat

I have tried solving this issue using the following code to open the CSV file:

file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")

which printed out the example row in the following manner:

it\xe2\x80\x99s time to eat

and I have also tried decoding the rows using this:

file_open["text"]=file_open["text"].str.decode('utf-8')

Which gave me the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Thank you very much in advance for your help.

Upvotes: 0

Views: 831

Answers (1)

jedwards

Reputation: 30260

b"it\\xe2\\x80\\x99s time to eat" suggests that your file contains escaped byte sequences stored as plain text, rather than the UTF-8 bytes themselves.

In general, you can convert this to a proper Python 3 string with something like:

x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)     # it’s time to eat

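(The .encode('latin1') step works because Latin-1 maps code points 0-255 to the identical byte values, so it turns the characters produced by unicode-escape back into the original UTF-8 bytes.)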
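To make the three-step chain easier to follow, here is a minimal sketch that keeps each intermediate value. The variable names step1, step2 and step3 are only illustrative and are not part of the original answer.

```python
# Bytes that contain literal backslash escapes, as in the example above
raw = b"it\\xe2\\x80\\x99s time to eat"

# Step 1: interpret the escapes; each \xNN becomes the code point U+00NN
step1 = raw.decode("unicode_escape")

# Step 2: Latin-1 maps code points 0-255 to the same byte values,
# which recovers the raw UTF-8 bytes
step2 = step1.encode("latin1")

# Step 3: decode those bytes as UTF-8 to get the intended text
step3 = step2.decode("utf8")

print(repr(step1))
print(repr(step2))
print(step3)  # it’s time to eat
```

After step 1 the string is still mis-decoded (three separate characters instead of one apostrophe); only step 3 produces the right text.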

So, if after you use pd.read_csv(..., encoding="utf8") you still have escaped strings, you can do something like:

pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
#    itâs time to eat
#
# So we encode to bytes then decode properly
# (val stands for a single cell value, i.e. a str):
val = val.encode('latin1').decode('utf8')
print(val)   # it’s time to eat
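If the cells still carry the literal b"..." wrapper as text, as in the question's sample row, one alternative to the encode/decode chain is to parse each cell as a Python bytes literal with ast.literal_eval and then decode it as UTF-8. This is only a sketch under that assumption: decode_cell is a hypothetical helper name, and values that do not look like a bytes literal are returned unchanged.

```python
import ast


def decode_cell(value):
    """Hypothetical helper: decode text that looks like a Python bytes literal."""
    # Leave non-strings (e.g. NaN) and ordinary text untouched
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        # Safely evaluate the literal; this does not execute arbitrary code
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value
    if not isinstance(parsed, bytes):
        return value
    try:
        return parsed.decode("utf8")
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the original text rather than guess
        return value


cell = 'b"it\\xe2\\x80\\x99s time to eat"'  # the cell as it appears in the CSV
print(decode_cell(cell))  # it’s time to eat
```

With pandas, this could be applied to a whole column with something like file_open["text"].map(decode_cell), assuming the column is named "text" as in the question.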

But I think it's probably better to do this to the whole file instead of to each value individually, for example with StringIO (if the file isn't too big):

from io import StringIO

import pandas as pd

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)    # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")

Upvotes: 2
