Reputation: 1194
I have a CSV file that looks like this:
"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row2","",6
"Row3","5",7
"Row4","5",8
"Row5",,9
"Row6","nan",
"Row7","nan",
"Row8","nan",0
"Row9","nan",3
"Row10","nan",
All quoted entries are strings. Non-quoted entries are numerical. Empty fields are missing values (NaN), but quoted empty fields should still be treated as empty strings. I tried to read the file with pandas read_csv, but I cannot get it to work the way I want. It still treats both ,"", and ,, as NaN, which is wrong for the first one.
d = pd.read_csv(csv_filename, sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)
Can anybody help? Is it possible at all?
Upvotes: 0
Views: 10585
Reputation: 1194
I found a way to get it more or less working. I just don't know why I need to specify dtype=type(None) for it to work. Comments on this piece of code are very welcome!
import re
import pandas as pd
import numpy as np

# Strip the surrounding quote characters from a quoted field;
# a field without a closing quote (e.g. an empty one) becomes NaN.
def filterTheField(s):
    m = re.match(r'^"?(.*)?"$', s.strip())
    if m:
        return m.group(1)
    else:
        return np.nan

file = 'test.csv'
y = np.genfromtxt(file, delimiter=',', filling_values=np.nan, names=True, dtype=type(None), converters={'row_ID': filterTheField, 'label': filterTheField, 'val': float})
d = pd.DataFrame(y)
print(d)
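For comparison, the quoted/unquoted distinction can also be made by hand before the data ever reaches pandas. The following is only a minimal sketch using the standard library, assuming that fields never contain embedded newlines and that every unquoted, non-empty field is a valid number. The names FIELD, parse_line and parse_csv are hypothetical helpers, not part of pandas or numpy.

```python
import math
import re

# One field followed by a comma or the end of the line: either a quoted
# string (a doubled "" inside it is an escaped quote) or a bare token.
FIELD = re.compile(r'(?:"((?:[^"]|"")*)"|([^,"]*))(,|$)')


def parse_line(line):
    """Split one CSV line, keeping quoted and unquoted empties distinct."""
    line = line.rstrip("\r\n")
    fields, pos = [], 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError(f"malformed CSV line: {line!r}")
        quoted, bare, sep = m.groups()
        if quoted is not None:
            # Quoted field: always a string, even when empty.
            fields.append(quoted.replace('""', '"'))
        elif bare.strip() == "":
            # Unquoted empty field: a missing value.
            fields.append(math.nan)
        else:
            # Unquoted non-empty field: numeric.
            fields.append(float(bare))
        pos = m.end()
        if sep == "":  # reached the end of the line instead of a comma
            return fields


def parse_csv(text):
    """Return (header, rows) for CSV text, skipping blank lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = parse_line(lines[0])
    rows = [parse_line(ln) for ln in lines[1:]]
    return header, rows


sample = '''"row ID","label","val"
"Row0","5",6
"Row1","",6
"Row5",,9
"Row6","nan",
'''

header, rows = parse_csv(sample)
print(header)
for row in rows:
    print(row)
```

If this behaves as intended, the result could be passed on with pd.DataFrame(rows, columns=header), so that "" stays an empty string and only the truly empty fields end up as NaN.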
Upvotes: 0
Reputation: 119
You can try numpy.genfromtxt and specify the missing_values parameter:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
Upvotes: 1
Reputation: 5408
Maybe something like:
import pandas as pd
import csv
import numpy as np
d = pd.read_csv('test.txt', sep=',', keep_default_na=False, na_values=[''], quoting = csv.QUOTE_NONNUMERIC)
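# Replace the literal string 'nan' in the label column with a real NaN.
# Use .loc rather than chained indexing so the assignment hits d itself.
mask = d['label'] == 'nan'
d.loc[mask, 'label'] = np.nan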
Upvotes: 0