How to read a column as bytes?

Question

I have a pandas DataFrame in which one column consists of strings like the following:

import pandas as pd
df = pd.DataFrame(...)
df
       WORD         
0      '0% de mati\xc3\xa8res grasses'       
1      '115 apr\xc3\xa8s J.-C.'

I can decode each string individually by writing it as a bytes literal, e.g. b'0% de mati\xc3\xa8res grasses'.decode("utf-8") and b'115 apr\xc3\xa8s J.-C.'.decode("utf-8"). How can I decode the whole column? I tried df['WORD'].astype('bytes').str.decode("utf-8"), but to no avail.

Thank you so much for your help!

BallpointBen · Accepted Answer

It's hard to know what the initial encoding is, but it looks like latin-1:

df['WORD'].str.encode('latin-1').str.decode('utf-8')

0    0% de matières grasses
1           115 après J.-C.
Name: WORD, dtype: object

Since the output looks sensible, I'd say this is correct, but in general there's no surefire way to recover text whose original encoding is unknown.
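
If some rows might not follow the same pattern, a per-value helper that falls back to the original string may be safer than the vectorized call. This is only a sketch: fix_mojibake is a made-up name, the sample list stands in for the column, and it still assumes the text is UTF-8 that was mis-decoded as latin-1.

```python
def fix_mojibake(value):
    """Undo a latin-1 mis-decoding of UTF-8 bytes (hypothetical helper).

    Anything that is not a str, or that does not round-trip cleanly,
    is returned unchanged.
    """
    if not isinstance(value, str):
        return value
    try:
        # latin-1 maps code points 0-255 straight back to bytes 0-255,
        # which recovers the original UTF-8 byte sequence.
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin-1, or not valid UTF-8 afterwards.
        return value


# Sample values standing in for the WORD column (assumed contents).
words = ["0% de mati\xc3\xa8res grasses", "115 apr\xc3\xa8s J.-C.", None]
fixed = [fix_mojibake(w) for w in words]
```

On the DataFrame itself the same helper can be applied with df['WORD'].map(fix_mojibake). Strings that were already correct (for example ones containing a plain "è") fail the UTF-8 decode and are left as they are.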
