Reputation: 11755
The default behavior of pandas.read_html appears to be to convert &nbsp; characters (non-breaking spaces) to unicode \xa0 codes:
url = 'http://www.reuters.com/finance/stocks/company-officers/IBM'
ibm = pd.read_html(url, header=0)[0]
ibm.iloc[0,0]
'Virginia\xa0Rometty'
I know I can use a converter to convert these to spaces as follows:
spacer = lambda s: s.replace(u'\xa0', ' ')
ibm = pd.read_html(url, header=0, converters={'Name':spacer})[0]
ibm.iloc[0,0]
'Virginia Rometty'
This seems unnecessarily complicated for something that must be pretty common. Is there another way? Perhaps an encoding option?
Upvotes: 1
Views: 1264
Reputation: 402743
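I don't think an encoding option will fix this, but you can just get rid of them. Using str.replace, you can replace any non-ASCII character with a space. Note that the ASCII range ends at \x7F, and that pandas 2.0 and later need regex=True for the pattern to be treated as a regular expression.
ibm['Name'] = ibm['Name'].str.replace(r'[^\x00-\x7F]', ' ', regex=True)
Or, to replace just the non-breaking space: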
ibm['Name'] = ibm['Name'].str.replace('\xa0', ' ')
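If more than one column is affected, the same replacement can be applied to every text column at once. Here is a minimal sketch, not tested against the live Reuters page: the small DataFrame stands in for the result of pd.read_html, the second row and the Rank column are made up for illustration, and strip_nbsp is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Stand-in for the table returned by pd.read_html(url, header=0)[0].
# The second row and the Rank column are invented for this example.
ibm = pd.DataFrame({
    "Name": ["Virginia\xa0Rometty", "Jane\xa0Doe"],
    "Rank": [1, 2],
})


def strip_nbsp(df):
    """Return a copy of df with non-breaking spaces in text columns
    replaced by ordinary spaces. Non-text columns are left untouched."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col]):
            # regex=False makes this a literal substring replacement.
            out[col] = out[col].str.replace("\xa0", " ", regex=False)
    return out


cleaned = strip_nbsp(ibm)
print(cleaned.loc[0, "Name"])  # Virginia Rometty
```

Because the helper works on a copy, the original frame keeps its \xa0 characters, which is handy if you want to compare before and after.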
Upvotes: 4