Reputation: 16050
There is a relatively big CSV file with data (around 80Mb). When I open it in MS Excel, I see that it contains 100 columns with many rows of data. However, the first row is not the column names, it's a web link. Furthermore, the last two rows are some comments. So, now I want to load this data into pandas DataFrame:
import pandas as pd
df = pd.read_csv('myfile.csv')
Then I want to read a column called Duration
(I see that it exists in CSV file) and delete a word years
from it's values:
Duration = map(lambda x: float(x.rstrip('years')), df['Duration'])
It gives me this error:
AttributeError: 'float' object has no attribute 'rstrip'
If I open the file in MS Excel and delete the first row (a web link) and the last two rows (the comments), then the code works!
So, how can I clean this CSV file automatically in Python (to extract only columns with values)?
Update:
When I write print df.head()
, it outputs:
have mixed types. Specify dtype option on import or set low_memory=False.
Do I need ot specify Type for all 100 columns? What if I don't know the types apriori.
Update: I cannot attach the file, but as an example you can check this one. Download the file 2015-2016.
Upvotes: 1
Views: 2464
Reputation: 109546
There are some parameters in pd.read_csv() that you should use:
df = pdread_csv('myfile.csv', skiprows=1, skip_footer=2)
I looked at the link you provided in the comments and tried to import it. I saw two mixed data types (for id
and desc
), so I explicitly set the dtype for these two columns. Also, by observation, the footer contains 'Total', so I excluded any row starting with the letter T
. Other than the headers, valid rows should start with integers for the id
column. If there are other footers not starting with T
that are introduced, this will throw an error when read.
If you first download and uncompress the zip
file, you can proceed as follows:
file_loc = ... # Specify location where you saved the unzipped file.
df = pd.read_csv(file_loc, skiprows=1, skip_blank_lines=True,
dtype={'id': int, 'desc': str}, comment='T')
And this will strip out year
or years
from the emp_length
column, although you are still left with text categories.
df['emp_length'] = df.emp_length.str.replace(r'( years|year)', '')
Upvotes: 3
Reputation: 14738
To skip the first line, you could use the skiprows
option in read_csv
. If the last two lines are not too tricky (i.e. that they cause some parsing errors), you could use .iloc
to ignore them. Finally, a vectorized version of rstrip
is available via the str
attribute of the Duration
column, assuming it contains strings.
See the following code for an example:
import pandas as pd
from StringIO import StringIO
content = StringIO("""http://www.example.com
col1,col2,Duration
1,11,5 years
2,22,4 years
3,33,2 years
# Some comments in the
# last two lines here.
""")
df = pd.read_csv(content, skiprows=1).iloc[:-2]
df['Duration'] = df.Duration.str.rstrip('years').astype(float)
print df
Output:
col1 col2 Duration
0 1 11 5
1 2 22 4
2 3 33 2
If reading speed is not a concern, you can also use the skip_footer=2
option in read_csv
to skip the last two lines. This will cause read_csv
to use the Python parser engine instead of the faster C engine.
Upvotes: 1