Reputation: 19
import numpy as np
import csv
filename = "a.csv"
def convert(s):
    s = s.strip().replace(',', '.')
    return str(s)
salary_data = np.genfromtxt(filename,
    delimiter=',',
    dtype=[('year','i8'),('university','U50'),('school','U250'),
           ('degree','U250'),('employement_rate_overall','f8'),
           ('basic_monthly_mean','f8'),('gross_monthly_mean','i8'),
           ('gross_monthly_median','i8'),('gross_mthly_25_percentile','i8'),
           ('gross_mthly_75_percentile','i8')],
    encoding=None,  # avoid having the deprecated warning
    skip_header=1,
    missing_values=['na','-'], filling_values=[0],
    converters={2: convert},
    comments=None)
print(salary_data)
I was trying to load the CSV data, but the data is quite dirty: some of the value fields contain quotation marks and commas, which causes an error.
Some errors were detected!
Line #5 (got 13 columns instead of 12)
I tried to clean up the commas by using the converters argument, but the code doesn't seem to work. I also tried
converters={2: lambda s: str(s.replace(',', '.'))}
This does not work for my case either. I would like to know what my mistake is, and thanks for helping! Thank you to those who spotted my mistake! Even when I tried to replace the quotation marks, the code still did not work. The text below is the CSV file that I am loading.
year,university,school,degree,employment_rate_overall,employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,gross_mthly_75_percentile
2013,Nanyang Technological University,College of Business (Nanyang Business School),Accountancy and Business,97.4,96.1,3701,3200,3727,3350,2900,4000
2013,Nanyang Technological University,College of Business (Nanyang Business
School),Accountancy (3-yr direct Honours Programme),97.1,95.7,2850,2700,2938,2700,2700,2900
2013,Nanyang Technological University,College of Business (Nanyang Business
School),Business (3-yr direct Honours Programme),90.9,85.7,3053,3000,3214,3000,2700,3500
2013,Nanyang Technological University,"College of Humanities, Arts & Social
Sciences",Economics,89.9,83.5,3085,3000,3148,3000,2800,3545
2013,Nanyang Technological University,College of Sciences,Biomedical Sciences
**,na,na,na,na,na,na,na,na
2013,Nanyang Technological University,College of Sciences,Biomedical Sciences
(Traditional Chinese Medicine) #,90.7,88.4,2840,2800,2883,2807,2700,3000
2013,Nanyang Technological University,College of Sciences,Mathematics & Economics
**,na,na,na,na,na,na,na,na
2014,Nanyang Technological University,"College of Humanities, Arts & Social
Sciences","Art, Design & Media",80,68,2761,2600,2791,2700,2300,3000
Upvotes: 1
Views: 141
Reputation: 229
I imported your file as a .csv and, as @fischmalte pointed out, there are new lines, for instance in Nanyang Business School. However, this is not causing your error.
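In fact, the error Line #5 (got 13 columns instead of 12) is caused by the quotation marks around "College of Humanities, Arts & Social Sciences". np.genfromtxt does not treat the quotes specially, so the comma inside that field is read as a delimiter and the reader generates one more column.
Remove the quotes together with the embedded comma and your error will disappear.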
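If editing the file by hand is not an option, one possible approach is to let the standard csv module do the splitting, since it honors quoted fields, and only then build the NumPy array. The snippet below is a rough sketch, not a drop-in fix: the inline sample, the reduced column list and the to_float helper are my own assumptions for illustration, and it assumes each record sits on a single line.

```python
import csv
import io

import numpy as np

# Inline sample standing in for the real file (assumption: one record per
# line, and fields that contain commas are wrapped in double quotes).
sample = '''year,university,school,degree,employment_rate_overall
2013,Nanyang Technological University,"College of Humanities, Arts & Social Sciences",Economics,89.9
2013,Nanyang Technological University,College of Sciences,Biomedical Sciences **,na
'''

def to_float(value):
    """Hypothetical helper: map 'na', '-' and empty strings to NaN."""
    value = value.strip()
    return float("nan") if value in ("na", "-", "") else float(value)

reader = csv.reader(io.StringIO(sample))  # csv.reader keeps quoted commas intact
header = next(reader)                     # skip the header row
rows = [(int(r[0]), r[1], r[2], r[3], to_float(r[4])) for r in reader]

# Structured dtype covering only the columns used in this sketch.
dtype = [("year", "i8"), ("university", "U50"), ("school", "U250"),
         ("degree", "U250"), ("employment_rate_overall", "f8")]
salary_data = np.array(rows, dtype=dtype)
print(salary_data["school"])
```

For the real file you would open it with open(filename, newline="") instead of io.StringIO and extend the tuple and the dtype to all 12 columns.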
Also, if you use pandas, the quotation marks will be handled automatically:
import pandas as pd
df = pd.read_csv("my_file.csv")
(It will not take care of line breaks outside quoted fields, though.)
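To make the pandas route a bit more concrete, here is a small sketch that also turns the 'na' and '-' placeholders into missing values. The inline sample and the reduced set of columns are assumptions for illustration only; with the real file you would pass the path instead of the StringIO object.

```python
import io

import pandas as pd

# Inline sample standing in for the real file (assumption: one record per line).
sample = '''year,university,school,degree,employment_rate_overall
2013,Nanyang Technological University,"College of Humanities, Arts & Social Sciences",Economics,89.9
2013,Nanyang Technological University,College of Sciences,Biomedical Sciences **,na
'''

# read_csv respects double-quoted fields by default; na_values marks the
# placeholders used in this data set as missing.
df = pd.read_csv(io.StringIO(sample), na_values=["na", "-"])
print(df.dtypes)
print(df["school"].tolist())
```

If you still need a NumPy array afterwards, df.to_numpy() or df.to_records(index=False) should get you there.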
Upvotes: 1