Klausos Klausos
Klausos Klausos

Reputation: 16050

How to delete some rows with comments from a CSV file to load data to DataFrame?

There is a relatively big CSV file with data (around 80Mb). When I open it in MS Excel, I see that it contains 100 columns with many rows of data. However, the first row is not the column names, it's a web link. Furthermore, the last two rows are some comments. So, now I want to load this data into pandas DataFrame:

import pandas as pd
df = pd.read_csv('myfile.csv')

Then I want to read a column called Duration (I see that it exists in CSV file) and delete a word years from it's values:

Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])

It gives me this error:

AttributeError: 'float' object has no attribute 'rstrip'

If I open the file in MS Excel and delete the first row (a web link) and the last two rows (the comments), then the code works!

So, how can I clean this CSV file automatically in Python (to extract only columns with values)?

Update: When I write print df.head(), it outputs:

have mixed types. Specify dtype option on import or set low_memory=False.

Do I need ot specify Type for all 100 columns? What if I don't know the types apriori.

Update: I cannot attach the file, but as an example you can check this one. Download the file 2015-2016.

Upvotes: 1

Views: 2464

Answers (2)

Alexander
Alexander

Reputation: 109546

There are some parameters in pd.read_csv() that you should use:

df = pdread_csv('myfile.csv', skiprows=1, skip_footer=2)

I looked at the link you provided in the comments and tried to import it. I saw two mixed data types (for id and desc), so I explicitly set the dtype for these two columns. Also, by observation, the footer contains 'Total', so I excluded any row starting with the letter T. Other than the headers, valid rows should start with integers for the id column. If there are other footers not starting with T that are introduced, this will throw an error when read.

If you first download and uncompress the zip file, you can proceed as follows:

file_loc = ...  # Specify location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True, 
                 dtype={'id': int, 'desc': str}, comment='T')

And this will strip out year or years from the emp_length column, although you are still left with text categories.

df['emp_length'] = df.emp_length.str.replace(r'( years|year)', '')

Upvotes: 3

YS-L
YS-L

Reputation: 14738

To skip the first line, you could use the skiprows option in read_csv. If the last two lines are not too tricky (i.e. that they cause some parsing errors), you could use .iloc to ignore them. Finally, a vectorized version of rstrip is available via the str attribute of the Duration column, assuming it contains strings.

See the following code for an example:

import pandas as pd
from StringIO import StringIO
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")
df = pd.read_csv(content, skiprows=1).iloc[:-2]
df['Duration'] = df.Duration.str.rstrip('years').astype(float)
print df

Output:

  col1  col2 Duration
0    1    11       5 
1    2    22       4 
2    3    33       2

If reading speed is not a concern, you can also use the skip_footer=2 option in read_csv to skip the last two lines. This will cause read_csv to use the Python parser engine instead of the faster C engine.

Upvotes: 1

Related Questions