LLynn

Reputation: 19

Import csv using genfromtxt() and converters

import numpy as np
import csv
filename = "a.csv"

def convert(s): 
    s = s.strip().replace(',', '.')
    return str(s)

salary_data = np.genfromtxt(filename,
                     delimiter= ',',
                     dtype=[('year','i8'),('university','U50'),('school','U250'), 
                     ('degree','U250'),('employement_rate_overall','f8'), 
                     ('basic_monthly_mean','f8'),('gross_monthly_mean','i8'), 
                     ('gross_monthly_median','i8'),('gross_mthly_25_percentile','i8'), 
                     ('gross_mthly_75_percentile','i8')], 
                     encoding= None, #avoid having the deprecated warning
                     skip_header=1,
                     missing_values=['na','-'],filling_values=[0],
                     converters={2: convert} ,
                     comments=None)
print(salary_data)

I was trying to load the csv data, but the data is quite dirty: some of the value fields contain quotation marks and commas, which causes an error.

      Some errors were detected!
      Line #5 (got 13 columns instead of 12) 

I was trying to clean the commas by using the converters. However, the code doesn't seem to work. I also tried

      converters={2: lambda s: str(s.replace(',', '.'))}

This is also not working for my case. I would like to know what my mistake is, and thanks for helping! Thank you to those who spotted my mistake! Even when I tried to replace the quotation marks, the code did not work. The text below is the csv file that I am loading.

      year,university,school,degree,employment_rate_overall,employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,gross_mthly_75_percentile
     2013,Nanyang Technological University,College of Business (Nanyang Business School),Accountancy and Business,97.4,96.1,3701,3200,3727,3350,2900,4000
     2013,Nanyang Technological University,College of Business (Nanyang Business 
     School),Accountancy (3-yr direct Honours Programme),97.1,95.7,2850,2700,2938,2700,2700,2900
     2013,Nanyang Technological University,College of Business (Nanyang Business 
     School),Business (3-yr direct Honours Programme),90.9,85.7,3053,3000,3214,3000,2700,3500
     2013,Nanyang Technological University,"College of Humanities, Arts & Social 
     Sciences",Economics,89.9,83.5,3085,3000,3148,3000,2800,3545
     2013,Nanyang Technological University,College of Sciences,Biomedical Sciences 
     **,na,na,na,na,na,na,na,na
     2013,Nanyang Technological University,College of Sciences,Biomedical Sciences 
     (Traditional Chinese Medicine) #,90.7,88.4,2840,2800,2883,2807,2700,3000
     2013,Nanyang Technological University,College of Sciences,Mathematics & Economics 
     **,na,na,na,na,na,na,na,na
     2014,Nanyang Technological University,"College of Humanities, Arts & Social 
     Sciences","Art, Design & Media",80,68,2761,2600,2791,2700,2300,3000

Upvotes: 1

Views: 141

Answers (1)

pguardati

Reputation: 229

I imported your file as a .csv and as @fischmalte pointed out there are new lines, for instance in Nanyang Business School.

However this is not causing your error.
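In fact, the error "Line #5 (got 13 columns instead of 12)" is caused by the quoted field "College of Humanities, Arts & Social Sciences": genfromtxt does not treat the " characters specially, so it also splits on the comma inside them.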

The reader generates one extra column because of the comma inside that quoted field.
Remove the quotes together with that comma and your error will disappear.
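
To see the extra column for yourself, here is a minimal sketch using only the standard library. The `sample` string is a shortened version of one row from your file (only four columns, line break already joined), so treat it as an illustration rather than your real data. Splitting on every comma, which is effectively what genfromtxt does, yields one field too many, while `csv.reader` respects the quotes. As far as I can tell, this is also why the converter does not help: converters are applied to the fields after the line has already been split.

```python
import csv
import io

# Shortened illustration: header plus one row from the question's data,
# cut down to four columns with the line break already joined.
sample = (
    "year,university,school,degree\n"
    '2013,Nanyang Technological University,'
    '"College of Humanities, Arts & Social Sciences",Economics\n'
)

header, row = sample.splitlines()

# Naive split on every comma: the comma inside the quotes adds a column.
naive_fields = row.split(",")

# csv.reader honours the quotes and keeps the field in one piece.
parsed_rows = list(csv.reader(io.StringIO(sample)))
quote_aware_fields = parsed_rows[1]

print(len(header.split(",")), len(naive_fields), len(quote_aware_fields))
print(quote_aware_fields[2])
```

The rows parsed this way could then be converted to a NumPy array yourself instead of handing the raw text to genfromtxt.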

Also, if you use pandas, the " will be handled automatically:

import pandas as pd
df = pd.read_csv("my_file.csv")

(It will not take care of the stray line breaks, though.)
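
For the line breaks, one possible workaround is to glue physical lines together until a record has the expected number of columns. This is only a sketch under assumptions: `join_broken_records` is a hypothetical helper name, it assumes every record should have 12 columns like your header, and it assumes a partial line never already contains 12 fields on its own.

```python
import csv

def join_broken_records(lines, expected_fields):
    """Merge physical lines until each record parses to expected_fields columns.

    Hypothetical helper: assumes a broken record always has fewer than
    expected_fields columns before its remaining pieces are appended.
    """
    records = []
    buffer = ""
    for line in lines:
        piece = line.strip()
        if not piece:
            continue  # skip blank lines
        # Re-join the pieces with a single space where the break was.
        buffer = f"{buffer} {piece}" if buffer else piece
        fields = next(csv.reader([buffer]))
        if len(fields) >= expected_fields:
            records.append(fields)
            buffer = ""
    return records

# Two records from the question, each broken across two physical lines.
lines = [
    "2013,Nanyang Technological University,College of Business (Nanyang Business ",
    "School),Accountancy (3-yr direct Honours Programme),97.1,95.7,2850,2700,2938,2700,2700,2900",
    '2013,Nanyang Technological University,"College of Humanities, Arts & Social ',
    'Sciences",Economics,89.9,83.5,3085,3000,3148,3000,2800,3545',
]

records = join_broken_records(lines, expected_fields=12)
for record in records:
    print(len(record), record[2])
```

The repaired records can then be passed to pandas or converted to a structured array; "na" values would still need to be handled separately.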

Upvotes: 1
