Reputation: 157
I was given a latin-1 .txt dataset, which I am trying to clean up for proper analysis using Python 3 and pandas. The dataset was scraped from HTML and contains a number of \xa0 occurrences, which I can't seem to deal with using decode, strip, a replace with u'' string literals, or any other method I found on Stack Overflow. All of my attempts seem to be ignored by Python, which still prints out the same results. As I am new to data scraping, chances are I missed something obvious, but right now I don't see a way forward.
I have tried decoding to ASCII, stripping to str and then replacing, and replacing with u'' literals, none of which led anywhere.
import pandas as pd

filepath = 'meow3.txt'
outF = open("myOutFile.txt", "a")   # 'a' appends, so earlier contents of myOutFile.txt are kept
with open(filepath, encoding="latin-1") as fp:
    for line in fp:
        # drop a leading comma, if there is one
        if line.strip().startswith(','):
            line = line.replace(',', '', 1)
        # replace the non-breaking space with a normal space
        line = line.replace(u'\xa0', u' ')
        print(line)
        outF.write(line)   # write the cleaned line out, not just print it
outF.close()

df = pd.read_csv('myOutFile.txt', sep=",", encoding="latin-1", header=None,
                 names=["Company name", "Total", "Invested since-to"])
print(df)
3sun Group, £10m ,Feb 2014
,Abacus, £4.15m ,Aug 2013
,Accsys Group ,£12m, Mar 2017,
Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,
ACS Clothing, £25.3m ,Jan 2014
This is what the dataset looks like, and why my code removes the first comma when it appears at the start of a line. But none of the suggested answers I tried helped with removing the \xa0 part of the dataset; I still get the same result (seen above). If anyone has any clue how I could make this work, I would be very grateful. Cheers, Jericho
Edit: While I know this would best be dealt with by pre-processing before the data is turned into a txt file, I have no access to or control over that process, and I have to work with the data I was given.
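For reference, here is a standalone snippet showing what I expect replace to do with the non-breaking space character; the sample line is copied from the data above:

# Standalone check: replacing the non-breaking space (U+00A0) in one sample line
sample = "Acro ,\xa0£7.8m, Nov 2015 – Sep 2017,"
print(sample.replace("\xa0", " "))   # expected: "Acro , £7.8m, Nov 2015 – Sep 2017,"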
Upvotes: 3
Views: 16231
Reputation: 858
To remove it from the whole DataFrame at once, the code below should work. Note that replace returns a new DataFrame, so assign the result back.
df = df.replace(r"\xa0", "", regex=True)
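A quick check on a toy DataFrame (the column names and values here are just made up for the example):

import pandas as pd

df = pd.DataFrame({"Company name": ["Acro\xa0Ltd", "ACS Clothing"],
                   "Total": ["\xa0£7.8m", "£25.3m"]})

df = df.replace(r"\xa0", "", regex=True)   # the regex matches the U+00A0 character inside strings
print(df)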
Upvotes: 0
Reputation: 181
Do this after reading the file:
df['col'] = df['col'].apply(lambda x: str(x).replace(u'\xa0', u''))
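If the column already holds strings, the same thing can be done with the vectorized string method instead of the lambda (a sketch, assuming 'col' is the affected column):

df['col'] = df['col'].astype(str).str.replace('\xa0', '', regex=False)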
Upvotes: 6
Reputation: 1197
I was suddenly hit by this problem today and finally found a quick and neat solution. Say your pandas DataFrame has a column with values like 'South\xa0Carolina'.
Use the following code to remove all '\xa0'. I had actually tried .replace("\xa0"," ") and .replace(r'\xa0',r' '), but neither of them worked.
data['col'].str.split().str.join(' ')
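A small sanity check of this trick (the frame and column name are just for illustration); note that it also collapses runs of ordinary whitespace into single spaces:

import pandas as pd

data = pd.DataFrame({'col': ['South\xa0Carolina', 'North  Dakota']})
data['col'] = data['col'].str.split().str.join(' ')
print(data['col'].tolist())   # ['South Carolina', 'North Dakota']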
Upvotes: 13
Reputation: 96
Maybe decoding the line to UTF-8 will help:
line = line.decode('UTF-8')
Then do the string replacement after that, if necessary.
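Note that .decode only exists on bytes, so this applies when the file is read in binary mode, roughly like this (the encoding may need to be latin-1 for the file in the question):

with open(filepath, 'rb') as fp:        # binary mode: each line is a bytes object
    for raw in fp:
        line = raw.decode('UTF-8')      # or 'latin-1', depending on the source file
        line = line.replace(u'\xa0', u' ')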
Upvotes: 0