Reputation: 157
I was given a latin-1 .txt dataset, which I am trying to clean up for proper analysis using Python 3 and pandas. The dataset was scraped from HTML and contains a number of \xa0 occurrences, which I can't seem to deal with using decode, strip, a replace with u'' string literals, or any other method I found on Stack Overflow. All of my attempts seem to be ignored by Python, which still prints out the same results. As I am new to data scraping, chances are I missed something obvious, but right now I don't see a way forward.
I have tried decoding to ASCII, stripping to str and then replacing, and replacing with u'' literals, none of which led anywhere.
import pandas as pd

filepath = 'meow3.txt'
outF = open("myOutFile.txt", "a")   # 'a' appends, so earlier contents of myOutFile.txt are kept
with open(filepath, encoding="latin-1") as fp:
    for line in fp:
        # drop a leading comma, if there is one
        if line.strip().startswith(','):
            line = line.replace(',', '', 1)
        # replace the non-breaking space with a normal space
        line = line.replace(u'\xa0', u' ')
        print(line)
        outF.write(line)   # write the cleaned line out, not just print it
outF.close()

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None,
                 names=["Company name", "Total", "Invested since-to"])
print(df)
3sun Group, £10m ,Feb 2014
,Abacus, £4.15m ,Aug 2013
,Accsys Group ,£12m, Mar 2017,
Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,
ACS Clothing, £25.3m ,Jan 2014
This is what the dataset looks like, and why my code removes the first comma when it appears at the start of a line. But none of the suggested answers I tried helped with removing the \xa0 part of the dataset; I still get the same result (seen above). If anyone has any clue how I could make this work, I would be very grateful. Cheers, Jericho
Edit: While I know this would best be dealt with by pre-processing before the data is turned into a txt file, I have no access to or control over that process, and I have to work with the data I was given.
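For reference, here is a standalone snippet showing what I expect replace to do with the non-breaking space character; the sample line is copied from the data above:

# Standalone check: replacing the non-breaking space (U+00A0) in one sample line
sample = "Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,"
print(sample.replace("\xa0", " "))   # expected: "Acro , £7.8m, Nov 2015 – Sep 2017,"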
Upvotes: 3
Views: 16231
Reputation: 858
To remove it from the whole DataFrame at once, the code below should work. Note that replace returns a new DataFrame, so assign the result back.
df = df.replace(r"\xa0", "", regex=True)
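A quick check on a toy DataFrame (the column names and values here are just made up for the example):

import pandas as pd

df = pd.DataFrame({"Company name": ["Acro\xa0Ltd", "ACS Clothing"],
                   "Total": ["\xa0£7.8m", "£25.3m"]})

df = df.replace(r"\xa0", "", regex=True)   # the regex matches the U+00A0 character inside strings
print(df)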
Upvotes: 0
Reputation: 181
Do this after reading the file:
df['col'] = df['col'].apply(lambda x: str(x).replace(u'\xa0', u''))
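If the column already holds strings, the same thing can be done with the vectorized string method instead of the lambda (a sketch, assuming 'col' is the affected column):

df['col'] = df['col'].astype(str).str.replace('\xa0', '', regex=False)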
Upvotes: 6
Reputation: 1197
I was suddenly hit by this problem today and finally found a quick and neat solution. Say your pandas DataFrame has a column with values like 'South\xa0Carolina'.
Use the following code to remove all '\xa0'. I had actually tried .replace("\xa0"," ") and .replace(r'\xa0',r' '), but neither of them worked.
data['col'].str.split().str.join(' ')
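A small sanity check of this trick (the frame and column name are just for illustration); note that it also collapses runs of ordinary whitespace into single spaces:

import pandas as pd

data = pd.DataFrame({'col': ['South\xa0Carolina', 'North  Dakota']})
data['col'] = data['col'].str.split().str.join(' ')
print(data['col'].tolist())   # ['South Carolina', 'North Dakota']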
Upvotes: 13
Reputation: 96
Maybe decoding the line to UTF-8 will help:
line = line.decode('UTF-8')
Then do the string replacement after that, if necessary.
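Note that .decode only exists on bytes, so this applies when the file is read in binary mode, roughly like this (the encoding may need to be latin-1 for the file in the question):

with open(filepath, 'rb') as fp:        # binary mode: each line is a bytes object
    for raw in fp:
        line = raw.decode('UTF-8')      # or 'latin-1', depending on the source file
        line = line.replace(u'\xa0', u' ')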
Upvotes: 0