String formatted as UTF-8 in Pandas Dataframe

Question

I'm reading from a CSV file with columns of various types.

df = pd.read_csv('file_name.csv')
df.head()[columnname]

0    b'Hi,

I hope you are well.'
1    b"\xc2\xa0Hello,
\xc2\xa0
 "
2    b"

 blah blah blah"
3    NaN
4    b'blah blah blah'
Name: columnname, dtype: object

From my understanding, the b'' prefix means the value is a byte string, so I have to call .decode('utf-8') to turn it into a regular string. That should remove the b'' wrapper as well as escape sequences like \xc2\xa0. However, when I try to decode, I get an error.

df[columnname] = df[columnname].apply(lambda x: x.decode('utf-8'))

AttributeError: 'str' object has no attribute 'decode'

I think that when the CSV file is read, the column is given the str data type, so each value is the literal text "b'Hi...'" rather than a bytes object. I checked the raw CSV file and saw previous_column,"b'Hi....'", next_column. Is there a way to read this column as a byte string so that I can call decode on it afterwards?

I've also tried setting dtype=np.bytes_ for that specific column in pd.read_csv(), and calling df.astype after reading the CSV, but neither works. My last resort would be to remove the escape sequences manually with a regex.

Andrej Kesely · Accepted Answer

If your column values really are strings like "b'some string'", you can try applying ast.literal_eval to them, which turns each string back into a bytes object that you can then decode:

from ast import literal_eval

df['columnname'] = df['columnname'].fillna("b''").apply(lambda x: literal_eval(x).decode('utf-8'))
print(df)

Should print:

   index                       columnname
0      1  Hi,

I hope you are well.
1      2                 Hello,
 
 
2      3          

 blah blah blah
3      4                                 
4      5                   blah blah blah
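If you would rather keep NaN as NaN instead of turning it into an empty string, the same idea can be wrapped in a small helper. This is only a sketch: parse_bytes_literal is a name I made up, the sample values are modelled on the question rather than taken from the real file, and swapping the non-breaking space (what \xc2\xa0 decodes to) for a plain space is an assumption about what you want.

```python
from ast import literal_eval


def parse_bytes_literal(value):
    """Decode text that looks like a bytes literal; return anything else unchanged."""
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        # literal_eval only evaluates Python literals, so no arbitrary code is run
        text = literal_eval(value).decode("utf-8")
        # b"\xc2\xa0" is the UTF-8 encoding of U+00A0 (non-breaking space);
        # replacing it with a normal space is an assumption, drop this line if unwanted
        return text.replace("\u00a0", " ")
    # NaN (a float) and ordinary strings fall through untouched
    return value


# Hypothetical sample values shaped like the column in the question
samples = [
    "b'Hi,\\n\\nI hope you are well.'",
    'b"\\xc2\\xa0Hello,\\n\\xc2\\xa0\\n "',
    float("nan"),
    "b'blah blah blah'",
]
cleaned = [parse_bytes_literal(v) for v in samples]

# With pandas the same helper can be applied to the column:
# df['columnname'] = df['columnname'].apply(parse_bytes_literal)
```

Checking for the b' or b" prefix first means rows that are already plain text are not passed to literal_eval, which would otherwise raise on them.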
