Reputation: 17631
I have a Python 3.x pandas DataFrame in which certain columns are strings expressed as bytes (as in Python 2.x):
import pandas as pd
df = pd.DataFrame(...)
df
COLUMN1 ....
0 b'abcde' ....
1 b'dog' ....
2 b'cat1' ....
3 b'bird1' ....
4 b'elephant1' ....
When I access the column with df.COLUMN1, I see Name: COLUMN1, dtype: object
However, if I access an element, it is a "bytes" object:
df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'
How do I convert these into "regular" strings? That is, how can I get rid of the b'' prefix?
Upvotes: 50
Views: 80901
Reputation: 121
I had an issue with some columns being either full of str or a mix of str and bytes in a DataFrame. Solved with a minor modification of the solution provided by @Christabella Irwanto (I'm more of a fan of the str.decode('utf-8') as suggested by @Mad Physicist):
for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # Decode, or keep the original value where decode returns NaN.
        df[col] = df[col].str.decode('utf-8').fillna(df[col])
>>> df[col]
0 Element
1 b'Element'
2 b'165'
3 165
4 25
5 25
>>> df[col].str.decode('utf-8').fillna(df[col])
0 Element
1 Element
2 165
3 165
4 25
5 25
(Replaced np.object with object to work with recent NumPy versions.)
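Putting the loop and the decode-then-fillna step together, a minimal self-contained sketch (the sample values are illustrative, mirroring the output above):

```python
import pandas as pd

# Mixed column: str, bytes, and ints all living in one object column.
df = pd.DataFrame({"COLUMN1": ["Element", b"Element", b"165", 165, 25]})

for col, dtype in df.dtypes.items():
    if dtype == object:  # only object columns can hold bytes
        # str.decode yields NaN for non-bytes entries; fillna restores them
        df[col] = df[col].str.decode("utf-8").fillna(df[col])

print(df["COLUMN1"].tolist())  # → ['Element', 'Element', '165', 165, 25]
```

Note that non-string values (like the ints here) pass through unchanged, since fillna puts the original value back wherever decode produced NaN.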
Upvotes: 6
Reputation: 728
I came across this thread while trying to solve the same problem, but more generally for a Series where some values may be of type str, others of type bytes. Drawing from earlier solutions, I achieved this selective decoding as follows, resulting in a Series all of whose values are of type str (Python 3.6.9, pandas 1.0.5):
>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)
Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode, but I didn't find one documented.
EDIT: One can definitely achieve the same in one line, but the ways I have thought of to do so take about 25% longer (tested for Series of length 10^4 and 10^6), though they presumably do no copying. E.g.:
ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
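The one-liner can be checked end-to-end; a minimal sketch with illustrative values:

```python
import pandas as pd

# Series mixing bytes and str values in one object column.
ser = pd.Series([b"value_1", "value_2", b"value_3"])

# Boolean mask selects only the bytes entries; str.decode yields NaN for
# non-bytes, but those positions are excluded by the mask, so only the
# bytes values are overwritten with their decoded str counterparts.
ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")

print(ser.tolist())  # → ['value_1', 'value_2', 'value_3']
```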
Upvotes: 1
Reputation: 1201
Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:
for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process byte object columns.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))
Upvotes: 6
Reputation: 394031
You can use the vectorised str.decode to decode byte strings into ordinary strings:
df['COLUMN1'].str.decode("utf-8")
To do this for multiple columns, you can select just the object (str) columns:
str_df = df.select_dtypes([object])
and convert all of them:
str_df = str_df.stack().str.decode('utf-8').unstack()
You can then swap out converted cols with the original df cols:
for col in str_df:
df[col] = str_df[col]
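The whole pipeline can be run end to end; a minimal sketch using data like the question's (the column names and values are illustrative):

```python
import pandas as pd

# DataFrame with one bytes column and one numeric column.
df = pd.DataFrame({"COLUMN1": [b"abcde", b"dog", b"cat1"], "n": [1, 2, 3]})

# Select the object columns, decode every cell, then restore the shape.
str_df = df.select_dtypes([object])
str_df = str_df.stack().str.decode("utf-8").unstack()

# Swap the converted columns back into the original DataFrame.
for col in str_df:
    df[col] = str_df[col]

print(df["COLUMN1"].tolist())  # → ['abcde', 'dog', 'cat1']
```

The stack/unstack round trip flattens the selected columns into a single Series so one vectorised str.decode call handles every cell, then rebuilds the original column layout.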
Upvotes: 82