Reputation: 11
I'm trying to use genfromtxt to extract a csv file that contains missing values such as 'na' and '-' I am required to look for the minimum value in the list of data, but the missing values got returned as -1.
this is my code:
data = np.genfromtxt('median-resale-prices-for-registered-applications-by-town-and-flat-type.csv',
skip_header=1,
dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",",
missing_values=['na','-'], filling_values=[0])
min_price = np.min(data['price'])
print(min_price)
and this is what i have in return
-1
i have also tried isnan()
print("Original data: " + str(data.shape))
null_rows = np.isnan(data['price'])
print(null_rows)
nonnull_values = data[null_rows==False]
print("Filtered data: " + str(nonnull_values.shape))
however, python did not perceive the na and - values as isnan Original data: (9360,) [False False False ... False False False] Filtered data: (9360,)
is there something wrong with my code?
Upvotes: 0
Views: 76
Reputation: 231325
With the sample, adapted from the comment:
In [26]: txt1="""quarter,town,flat_type,price
...: 2007-Q2,ANG MO KIO,1-ROOM,na
...: 2007-Q2,ANG MO KIO,2-ROOM,-
...: 2007-Q2,ANG MO KIO,3-ROOM,172000
...: 2007-Q2,ANG MO KIO,4-ROOM,260000
...: 2007-Q2,ANG MO KIO,5-ROOM,372000"""
Accepting that last field as floats (no fill stuff):
In [27]: data = np.genfromtxt(txt1.splitlines(),
...: skip_header=1,
...: dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'f8')], delimiter=",")
In [28]: data
Out[28]:
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
('2007-Q2', 'ANG MO KIO', '2-ROOM', nan),
('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])
genfromtxt
normally uses nan
for values it can't parse as floats.
For integer, it apparently uses -1 instead:
In [29]: data = np.genfromtxt(txt1.splitlines(),
...: skip_header=1,
...: dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",")
In [30]: data
Out[30]:
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', -1),
('2007-Q2', 'ANG MO KIO', '2-ROOM', -1),
('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])
After some fiddling, I got this to work. The key was to use a single value of filling_values
, not a list.
In [54]: data = np.genfromtxt(txt1.splitlines(),
...: skip_header=1,
...: dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')],
...: delimiter=",",
...: missing_values=['na','-'],
...: filling_values=-999)
In [55]: data
Out[55]:
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', -999),
('2007-Q2', 'ANG MO KIO', '2-ROOM', -999),
('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])
Looking at the code (via [source]
in the docs), I see we can use dict
, specifying different values for different columns. Thus
missing_values={3:['na','-']},
filling_values={3:-999})
There are more details in the code than in the documentation. I haven't used these values much, so each time I have learn more.
In today's question you seem to have forgotten all that you learned here. data
is a structured array. With f8
you get nan
for the missing values, not -1
. And you attempt to treat the array as a list of tuples. Why not continue to treat it as a structured array?
The price
field:
In [104]: data['price']
Out[104]: array([ nan, nan, 172000., 260000., 372000.])
In [106]: mask = np.isnan(data['price']) # or test against -1 for ints
In [107]: mask
Out[107]: array([ True, True, False, False, False])
Use the mask to select, or "delete" elements from the 1d array:
In [108]: data[mask]
Out[108]:
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
('2007-Q2', 'ANG MO KIO', '2-ROOM', nan)],
dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])
In [109]: data[~mask]
Out[109]:
array([('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])
But here's a list based approach:
In [110]: alist = data.tolist()
In [111]: alist
Out[111]:
[('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
('2007-Q2', 'ANG MO KIO', '2-ROOM', nan),
('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]
In [112]: [i for i in alist if not np.isnan(i[3])]
Out[112]:
[('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]
Upvotes: 1