Reputation: 2870
I have a pandas DataFrame in which one column contains strings like the following:
import pandas as pd
df = pd.DataFrame(...)
df
WORD
0 '0% de mati\xc3\xa8res grasses'
1 '115 apr\xc3\xa8s J.-C.'
Each string can be decoded individually if I write it as a bytes literal, for example b'0% de mati\xc3\xa8res grasses'.decode("utf-8") and b'115 apr\xc3\xa8s J.-C.'.decode("utf-8"). How can I decode the whole column? I tried df['WORD'].astype('bytes').str.decode("utf-8"), but to no avail.
Thank you so much for your help!
Upvotes: 0
Views: 718
Reputation: 13750
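It's hard to know for certain what the original encoding is, but this looks like UTF-8 bytes that were read as latin-1, so encoding back to latin-1 and then decoding as UTF-8 should work: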
df['WORD'].str.encode('latin-1').str.decode('utf-8')
0 0% de matières grasses
1 115 après J.-C.
Name: WORD, dtype: object
Since the output looks sensible, I'd say this is correct, but in general there's no surefire way to repair text whose original encoding is unknown.
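If the column might mix broken strings with ones that are already correct, a more defensive variant could look like the sketch below. It is only an illustration: `fix_mojibake` is a hypothetical helper name (not part of pandas), the sample list stands in for the column, and the latin-1/UTF-8 pair is the same guess as above. Strings that do not survive the round trip are returned unchanged.

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Hypothetical helper: undo text whose `right`-encoded bytes were read as `wrong`.

    Assumes the damage is exactly one bad decode; anything that cannot be
    round-tripped is returned as-is rather than raising.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside `wrong`, or the recovered
        # bytes are not valid `right`; in both cases leave it alone.
        return text


# Sample values: two broken strings plus one that is already correct.
words = [
    "0% de mati\xc3\xa8res grasses",
    "115 apr\xc3\xa8s J.-C.",
    "d\xe9j\xe0 vu",
]
fixed = [fix_mojibake(w) for w in words]
print(fixed)
```

With the DataFrame from the question, the same function could be applied to the column with df['WORD'].map(fix_mojibake).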
Upvotes: 1