jazz

Reputation: 13

String formatted as UTF-8 in Pandas Dataframe

I'm reading from a CSV file with columns of various types.

df = pd.read_csv('file_name.csv')
df.head()[columnname]
0    b'Hi,\r\n\r\nI hope you are well.'
1    b"\xc2\xa0Hello,\r\n\xc2\xa0\r\n "
2    b"\r\n\r\n blah blah blah"
3    NaN
4    b'blah blah blah'
Name: columnname, dtype: object

As I understand it, the b'' prefix means each value is a byte string, so I need to call .decode('utf-8') to turn it into a regular string. That should drop the b'' wrapper and resolve escape sequences such as \xc2\xa0. However, when I try to decode, I get an error.

df[columnname] = df[columnname].apply(lambda x: x.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'

I think that when the CSV file is read, the column is loaded as str, so each value is the literal text "b'Hi...'" rather than a bytes object. I checked the raw CSV file and saw previous_column,"b'Hi....'", next_column. Is there a way to read this column as byte strings so that I can call decode on it afterwards?
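A quick way to check this diagnosis outside pandas is to compare a str that only looks like a bytes literal with a real bytes object. This is a minimal sketch; the sample value is shortened from the output above, and the variable names are made up for illustration.

```python
# The text read from the CSV cell: a str that merely looks like a bytes literal.
cell = "b'Hi,\\r\\n'"
# A genuine bytes object with the same content, for comparison.
real = b"Hi,\r\n"

# Only the bytes object has a .decode method, which explains the AttributeError.
print(type(cell).__name__, hasattr(cell, "decode"))
print(type(real).__name__, hasattr(real, "decode"))

# The b'' wrapper is part of the string's own characters.
print(cell.startswith("b'"))
```

If the first line reports str and no decode attribute, the values need to be parsed back into bytes before they can be decoded.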

I've also tried setting dtype=np.bytes_ for that specific column in pd.read_csv(), and calling df.astype after reading the CSV, but neither works. My last resort would be to strip the encodings manually with a regex.

Upvotes: 1

Views: 2468

Answers (1)

Andrej Kesely

Reputation: 195428

If your column values really are strings like "b'some string'", then you can try applying ast.literal_eval to them:

from ast import literal_eval

df['columnname'] = df['columnname'].fillna("b''").apply(lambda x: literal_eval(x).decode('utf-8'))
print(df)

Should print:

   index                       columnname
0      1  Hi,\r\n\r\nI hope you are well.
1      2                 Hello,\r\n \r\n 
2      3          \r\n\r\n blah blah blah
3      4                                 
4      5                   blah blah blah
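If you would rather keep NaN as NaN instead of turning it into an empty string, one option is a small helper that only touches values that parse as bytes literals. This is a sketch, not a tested drop-in: decode_bytes_repr is a hypothetical name, and it assumes every non-missing cell is either a str holding a bytes literal or plain text that should be left alone.

```python
from ast import literal_eval


def decode_bytes_repr(value, encoding="utf-8"):
    """Hypothetical helper: turn the text "b'...'" into a decoded str."""
    # NaN is a float, so it (and any other non-string) passes through untouched.
    if not isinstance(value, str):
        return value
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: keep the original text.
        return value
    # Decode only when the literal really was a bytes object.
    if isinstance(parsed, bytes):
        return parsed.decode(encoding)
    return value


print(repr(decode_bytes_repr("b'Hi,\\r\\n'")))
print(repr(decode_bytes_repr(float("nan"))))
```

It could then be applied with df['columnname'] = df['columnname'].apply(decode_bytes_repr). Note that decoding \xc2\xa0 yields a non-breaking space (U+00A0), not nothing, so you may still want a .str.replace('\xa0', ' ') afterwards.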

Upvotes: 1
