Reputation: 13
I'm reading from a CSV file with columns of various types.
df = pd.read_csv('file_name.csv')
df.head()[columnname]
0 b'Hi,\r\n\r\nI hope you are well.'
1 b"\xc2\xa0Hello,\r\n\xc2\xa0\r\n "
2 b"\r\n\r\n blah blah blah"
3 NaN
4 b'blah blah blah'
Name: columnname, dtype: object
From my understanding, the b'' prefix implies that the value is a byte string, and I have to .decode('utf-8') it to get a regular string, which would remove the b'' wrapper and turn byte sequences like \xc2\xa0 into their corresponding characters. However, when I try to decode, I get an error.
df[columnname] = df[columnname].apply(lambda x: x.decode('utf-8'))
AttributeError: 'str' object has no attribute 'decode'
I think what is going on is that when reading from the CSV file, the column is parsed as the str data type, so each value is literally the text "b'Hi...'". I checked the raw CSV file and saw previous_column,"b'Hi....'",next_column. Is there a way to properly read this column as a byte string, so that I can later call the decode function on it?
I've also tried setting dtype=np.bytes_ for that specific column in the pd.read_csv() call, and calling df.astype after reading the CSV, but neither works. My last resort would be to manually strip the b'' wrapper and the escape sequences with a regex.
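A minimal reproduction (with hypothetical sample data mimicking the CSV layout described above) that confirms the cells arrive as ordinary strings rather than bytes:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: one quoted cell whose literal text is b'Hi'.
raw = 'previous_column,columnname,next_column\n1,"b\'Hi\'",2\n'
df = pd.read_csv(StringIO(raw))

# The cell is parsed as an ordinary str, not bytes, so .decode fails.
print(type(df['columnname'][0]))  # <class 'str'>
print(df['columnname'][0])        # b'Hi'
```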
Upvotes: 1
Views: 2468
Reputation: 195428
If your column values are really strings like this: "b'some string'"
, then you can try applying ast.literal_eval
to them:
from ast import literal_eval
df['columnname'] = df['columnname'].fillna("b''").apply(lambda x: literal_eval(x).decode('utf-8'))
print(df)
Should print:
index columnname
0 1 Hi,\r\n\r\nI hope you are well.
1 2 Hello,\r\n \r\n
2 3 \r\n\r\n blah blah blah
3 4
4 5 blah blah blah
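Alternatively (a sketch assuming the same "b'...'" text layout), the conversion can happen at read time via the converters parameter of pd.read_csv, so the column never exists in its "b'...'" string form inside the DataFrame; the sample data and the to_text helper name here are hypothetical:

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Hypothetical sample mimicking the CSV layout from the question.
raw = 'columnname\n"b\'Hi, how are you?\'"\n"b\'\\xc2\\xa0Hello\'"\n'

def to_text(cell):
    # literal_eval safely turns the text "b'...'" into a real bytes
    # object, which we can then decode; empty cells fall back to ''.
    return literal_eval(cell).decode('utf-8') if cell else ''

# converters runs per cell while parsing the file.
df = pd.read_csv(StringIO(raw), converters={'columnname': to_text})
print(df['columnname'].tolist())
```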
Upvotes: 1