Bazman

Reputation: 2150

Pandas remove white space/unknown character when importing csv file

I can download the files with the following code:

import pandas as pd

seasons = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]

epl_tables = {}
epl_seasons = {}
for year in seasons:
    # Build the season code used in the URL, e.g. 2005 -> "0506"
    start_year = str(year)[-2:]
    end_year = str(year+1)[-2:]
    season = start_year + end_year
    epl_seasons[season] = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(start_year, end_year)).dropna(how='all')
    # league() is my own function (defined elsewhere) that builds the league table
    epl_tables[season] = league(epl_seasons[season])

This works fine.

However, when I try to add the 2004-05 season by adding 2004 to seasons, the code fails.

seasons = [2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]

The problem is caused by a white space before the referee's name in rows 337 to 345 of the CSV file.

I can work around it by manually deleting the white space and then loading the file from disk. This works, but obviously it's not ideal.

I've tried various ways to get it to work, such as the one shown below, but nothing seems to work:

epl_seasons[season] = pd.read_csv("https://www.football-data.co.uk/mmz4281/{}{}/E0.csv".format(start_year, end_year), delimiter=',', encoding="utf-8", skipinitialspace=True).dropna(how='all')

A potential complication: when I open the file in Excel the character appears as a white space, but when I open it in LibreOffice Calc (on Ubuntu, which is what I'm working in) it appears as an unknown character: a question mark in a black box tipped at 45 degrees. See the answer from PeterMau in the link below to see what this unknown character looks like.

https://ask.libreoffice.org/en/question/113125/characters-turned-into-question-marks/
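One possible explanation, not confirmed anywhere in this thread, is that the character is a non-breaking space stored as the single byte 0xA0 (Latin-1 / Windows-1252). That would look like a space in Excel and like a replacement character when the file is decoded as UTF-8. A minimal sketch of how this could be checked, using a made-up sample row rather than the real file:

```python
# Hypothetical sample row; the byte 0xA0 stands in for the suspect character.
raw = b"Arsenal,Everton,\xa0M Riley\n"

# Strict UTF-8 decoding fails, because a lone 0xA0 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With errors="replace" the bad byte becomes U+FFFD, the replacement character.
replaced = raw.decode("utf-8", errors="replace")

# Latin-1 maps every byte to a character, so 0xA0 becomes a non-breaking space.
latin1 = raw.decode("latin-1")
referee = latin1.strip().split(",")[2]

print(utf8_ok, repr(replaced), repr(referee), repr(referee.strip()))
```

If the real file behaves the same way, the failure would come from the encoding rather than from an ordinary space, which might be why skipinitialspace did not help.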

Can someone please tell me how I can automatically remove these white spaces/unknown characters?

Upvotes: 0

Views: 1199

Answers (2)

Salih Osman

Reputation: 71

I just ran str.strip() on my data columns and passed it the character causing the issue, which was ? in my case:

df[newcol] = df[oldCol].str.strip('?')
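Note that a literal '?' only helps if the data really contains a question mark. If the character is instead the Unicode replacement character (U+FFFD) or a non-breaking space (U+00A0), those would need to be passed explicitly. A small sketch with made-up data; the column names and values here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: one value starts with U+FFFD, one with a non-breaking
# space, and one with a literal question mark.
df = pd.DataFrame({"Referee": ["\ufffdM Riley", "\xa0G Poll", "?S Dunn"]})

# Strip the replacement character, the non-breaking space, and ordinary spaces.
df["Referee_clean"] = df["Referee"].str.strip("\ufffd\xa0 ")

# Stripping a literal "?" only changes values that really contain "?".
df["Referee_qmark"] = df["Referee"].str.strip("?")

print(df)
```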

Upvotes: 0

Hofbr

Reputation: 1010

You can remove leading and trailing white space from a string column with .str.strip():

epl_seasons[season]['COLUMN NAME'] = epl_seasons[season]['COLUMN NAME'].str.strip()

This shouldn't be a manual process. Just add a line so that when you import a CSV file you also clean up the problematic column.

Obviously this only works for a specific column. Here's an answer on a different thread that addresses removing white space from every DataFrame cell:
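As a rough sketch of what that could look like, the helper below reads a CSV and strips one column in the same step. The function name read_clean, the column name "Referee", and the encoding="latin-1" default are all assumptions for illustration (the encoding in particular is a guess at why the 2004-05 file fails), and the sample bytes are made up:

```python
import io

import pandas as pd


def read_clean(source, column="Referee", encoding="latin-1"):
    """Read a CSV and strip leading/trailing whitespace from one column.

    `source` can be a URL, a path, or a file-like object.
    """
    df = pd.read_csv(source, encoding=encoding).dropna(how="all")
    if column in df.columns:
        # .str.strip() with no argument removes whitespace at both ends.
        df[column] = df[column].str.strip()
    return df


# Hypothetical in-memory file with a 0xA0 byte before one referee's name.
sample = io.BytesIO(
    b"HomeTeam,AwayTeam,Referee\n"
    b"Arsenal,Everton,\xa0M Riley\n"
    b"Chelsea,Fulham,G Poll\n"
)
cleaned = read_clean(sample)
print(cleaned["Referee"].tolist())
```

In the loop from the question, the pd.read_csv(...).dropna(how='all') call could then be swapped for read_clean(url), assuming the column really is called "Referee" in those files.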

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it
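One way this can be done, as a sketch rather than the exact approach from the linked thread, is to map a small function over every column and strip only the values that are strings. The helper name strip_all_strings and the sample data are hypothetical:

```python
import pandas as pd


def strip_all_strings(df):
    """Return a copy of df with whitespace stripped from every string cell."""
    return df.apply(
        # Non-string values (numbers, NaN) are passed through unchanged.
        lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)
    )


# Hypothetical data: a string column with stray whitespace and a numeric column.
df = pd.DataFrame({"Referee": ["\xa0M Riley", " G Poll "], "FTHG": [2, 1]})
out = strip_all_strings(df)
print(out)
```

This uses Series.map on each column instead of a DataFrame-wide elementwise method, since the name of that method has changed between pandas versions.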

Upvotes: 1
