Reputation: 45
I have a UTF-16 CSV file that I'm trying to load into pandas. By default the data comes in with the object dtype. I plan to do some modeling with the caption column, so I'd like to convert df['caption'] from object to a unicode string. Currently, df['caption']=df['caption'].astype(unicode) fails with the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 6: ordinal not in range(128).
I tried to solve this by using the encode and decode functions on the individual values in the df['caption'] column but I couldn't get it to work.
I'm very new to pandas and unicode so I was wondering if there's some insight as to what I am doing wrong.
Thanks in advance.
Teresa
Additional information is below:
The traceback is as follows:
UnicodeEncodeError: Traceback (most recent call last)
<ipython-input-5-aad36f4acf38> in <module>()
10 print df['caption'].head(10)
11
---> 12 df['caption']=df['caption'].astype(unicode)
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/generic.pyc in astype(self, dtype, copy, raise_on_error)
2016
2017 mgr = self._data.astype(
-> 2018 dtype, copy=copy, raise_on_error=raise_on_error)
2019 return self._constructor(mgr).__finalize__(self)
2020
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, *args, **kwargs)
2414
2415 def astype(self, *args, **kwargs):
-> 2416 return self.apply('astype', *args, **kwargs)
2417
2418 def convert(self, *args, **kwargs):
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in apply(self, f, *args, **kwargs)
2373
2374 else:
-> 2375 applied = getattr(blk, f)(*args, **kwargs)
2376
2377 if isinstance(applied, list):
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in astype(self, dtype, copy, raise_on_error, values)
425 def astype(self, dtype, copy=False, raise_on_error=True, values=None):
426 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 427 values=values)
428
429 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/internals.pyc in _astype(self, dtype, copy, raise_on_error, values, klass)
442 # force the copy here
443 if values is None:
--> 444 values = com._astype_nansafe(self.values, dtype, copy=True)
445 newb = make_block(values, self.items, self.ref_items,
446 ndim=self.ndim, placement=self._ref_locs,
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/core/common.pyc in _astype_nansafe(arr, dtype, copy)
2222 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
2223 elif issubclass(dtype.type, compat.string_types):
-> 2224 return lib.astype_str(arr.ravel()).reshape(arr.shape)
2225
2226 if copy:
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.astype_str (pandas/lib.c:12944)()
/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.astype_str (pandas/lib.c:12862)()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 6: ordinal not in range(128)
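For readers on Python 3, the final line of the traceback can be reproduced without pandas. This is only a sketch: it assumes that the `lib.astype_str` frame shown above ends up calling `str()` on each value, which on Python 2 implicitly encodes unicode text to ASCII. The sample caption is made up for illustration.

```python
# Hypothetical caption containing U+201C / U+201D (curly quotes),
# the same kind of character named in the error message.
caption = "0:03 A \u201cMan\u201d And His Truck"

# Encoding to ASCII fails because U+201C is outside the 0-127 range.
try:
    caption.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)

# An encoding that covers the character (UTF-8 here) round-trips cleanly.
utf8_bytes = caption.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

If this assumption holds, the values in the column are already unicode text, and the error comes from the conversion step rather than from the data.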
My code is as follows:
import pandas as pd
import numpy as np
df = pd.read_csv('Chevrolet_4-7-2014_cvid_data.csv',encoding='utf-16',header=0,na_values=['N/A',''],names=['channel','link','title','posted','views','likes','dislikes','description','category','statdate','statviews','timewatched','averagetw','subsdriven','shares','caption'])
print df.head(5)
print df.dtypes
print df['caption'].head(10)
df['caption']=df['caption'].astype(unicode)
The data looks like the following:
channel link \
0 Chevrolet http://www.youtube.com/watch?v=dCayKZe6WvI
1 Chevrolet http://www.youtube.com/watch?v=IRXK35dPXbE
2 Chevrolet http://www.youtube.com/watch?v=XXdj4QMw748
3 Chevrolet http://www.youtube.com/watch?v=_ger32ROs94
4 Chevrolet http://www.youtube.com/watch?v=Chfm7Pou49k
5 Chevrolet http://www.youtube.com/watch?v=ySmEJyQ94BI
title posted views \
0 Chevy Open House Event: From Our House to Your... Apr 1 2014 73111
1 Truck Towing Capabilities: 2014 Silverado -- #... Mar 26 2014 11934
2 Potholes at the Milford Proving Grounds: Tips ... Mar 20 2014 8037
3 Diesel Trucks: Heavy Duty Strengths -- 2015 Si... Mar 20 2014 12096
4 Captain America: All in a Day's Work -- 2014 T... Mar 14 2014 93377
5 Media Blasting: Camaro Engineering -- 2014 Cam... Mar 13 2014 109931
likes dislikes description \
0 43 13 In March over 100000 people visited our Chevy ...
1 183 56 Farmer Dewayne Kleman and General Motors engin...
2 58 10 Chevrolet vehicles are carefully designed to w...
3 210 6 Introducing the all-new 2015 Silverado HD. The...
4 1095 35 From saving the world to working on math homew...
category statdate statviews timewatched averagetw subsdriven \
0 Autos & Vehicles NaN NaN NaN NaN NaN
1 Autos & Vehicles NaN NaN NaN NaN NaN
2 Autos & Vehicles NaN NaN NaN NaN NaN
3 Autos & Vehicles NaN NaN NaN NaN NaN
4 Autos & Vehicles NaN NaN NaN NaN NaN
shares caption
0 NaN The Chevy Spring Open House Sale the perfect ...
1 NaN 0:03 A Man And His Truck And An Engineer / To...
2 NaN 0:02 Severe Bump road sign 0:07 Pothole Facil...
3 NaN 0:03 And there's no stronger Silverado than t...
4 NaN 0:03 Are you doing anything fun Saturday nigh...
5 NaN 0:05 Camaro Z/28 logo 0:07 Z/28 Bead Lock 0:0...
[5 rows x 16 columns]
channel object
link object
title object
posted object
views object
likes int64
dislikes int64
description object
category object
statdate object
statviews float64
timewatched object
averagetw object
subsdriven float64
shares float64
caption object
dtype: object
0 The Chevy Spring Open House Sale the perfect ...
1 0:03 A Man And His Truck And An Engineer / To...
2 0:02 Severe Bump road sign 0:07 Pothole Facil...
3 0:03 And there's no stronger Silverado than t...
4 0:03 Are you doing anything fun Saturday nigh...
5 0:05 Camaro Z/28 logo 0:07 Z/28 Bead Lock 0:0...
Name: caption, dtype: object
Upvotes: 2
Views: 3525
Reputation: 985
Can you try adding dtype={'caption' : str} to your read_csv() call? Like:
df = pd.read_csv('Chevrolet_4-7-2014_cvid_data.csv',
                 encoding='utf-16',
                 header=0,
                 na_values=['N/A',''],
                 names=[...],
                 dtype={'caption' : str})
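A self-contained version of the same call shape, as a minimal sketch for Python 3, where `str` is already Unicode. The two-column sample, the file name `sample_utf16.csv`, and the column list are made up for illustration; whether `dtype` avoids the error on Python 2.7 is not tested here.

```python
import os
import tempfile

import pandas as pd

# Made-up sample standing in for the real file; the caption holds curly quotes.
sample = "channel,caption\nChevrolet,0:03 \u201cA Man And His Truck\u201d\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample as UTF-16 so read_csv needs the encoding argument.
    path = os.path.join(tmp_dir, "sample_utf16.csv")
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write(sample)

    df = pd.read_csv(
        path,
        encoding="utf-16",
        header=0,
        names=["channel", "caption"],
        dtype={"caption": str},
    )

# Each caption value comes back as a Python text string.
value = df["caption"].iloc[0]
print(type(value).__name__, value)
```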
BTW, pandas only defaults to header=0 when you don't pass names. Not that I can see your CSV, but since you do pass the names keyword argument, header=0 isn't redundant here: it tells pandas that row 0 of your CSV is a header row and that your names should replace it. Without it, pandas would treat that row as data. But anyway, let me know if the other thing works for you. :)
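How `header` and `names` interact can be checked on a small made-up sample. This is a sketch against the behavior described in the current pandas documentation (`header` defaults to 'infer'); the column names and values are illustrative only.

```python
import io

import pandas as pd

# Made-up sample with a header row followed by two data rows.
csv_text = "col_a,col_b\n1,2\n3,4\n"

# names + header=0: row 0 is treated as a header and replaced by the given names.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=["x", "y"])

# names without header: pandas acts as if header=None, so row 0 is read as data.
as_data = pd.read_csv(io.StringIO(csv_text), names=["x", "y"])

print(len(replaced), len(as_data))
```

In the second call the original header text ends up as the first data row, which also forces both columns to the object dtype.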
Upvotes: 1