How to delete some rows with comments from a CSV file to load data to DataFrame?

Question

There is a relatively big CSV file with data (around 80Mb). When I open it in MS Excel, I see that it contains 100 columns with many rows of data. However, the first row is not the column names, it's a web link. Furthermore, the last two rows are some comments. So, now I want to load this data into pandas DataFrame:

import pandas as pd
df = pd.read_csv('myfile.csv')

Then I want to read a column called Duration (I see that it exists in CSV file) and delete a word years from it's values:

Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])

It gives me this error:

AttributeError: 'float' object has no attribute 'rstrip'

If I open the file in MS Excel and delete the first row (a web link) and the last two rows (the comments), then the code works!

So, how can I clean this CSV file automatically in Python (to extract only columns with values)?

Update: When I write print df.head(), it outputs:

have mixed types. Specify dtype option on import or set low_memory=False.

Do I need ot specify Type for all 100 columns? What if I don't know the types apriori.

Update: I cannot attach the file, but as an example you can check this one. Download the file 2015-2016.

Alexander · Accepted Answer

There are some parameters in pd.read_csv() that you should use:

df = pdread_csv('myfile.csv', skiprows=1, skip_footer=2)

I looked at the link you provided in the comments and tried to import it. I saw two mixed data types (for id and desc), so I explicitly set the dtype for these two columns. Also, by observation, the footer contains 'Total', so I excluded any row starting with the letter T. Other than the headers, valid rows should start with integers for the id column. If there are other footers not starting with T that are introduced, this will throw an error when read.

If you first download and uncompress the zip file, you can proceed as follows:

file_loc = ...  # Specify location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True, 
                 dtype={'id': int, 'desc': str}, comment='T')

And this will strip out year or years from the emp_length column, although you are still left with text categories.

df['emp_length'] = df.emp_length.str.replace(r'( years|year)', '')

How to delete some rows with comments from a CSV file to load data to DataFrame?

Answers (2)

Related Questions