Yue Min
Yue Min

Reputation: 11

genfromtxt file returns filling values as -1

I'm trying to use genfromtxt to extract a csv file that contains missing values such as 'na' and '-' I am required to look for the minimum value in the list of data, but the missing values got returned as -1.

this is my code:

data = np.genfromtxt('median-resale-prices-for-registered-applications-by-town-and-flat-type.csv',
                     skip_header=1,
                     dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",",
                     missing_values=['na','-'], filling_values=[0])

min_price = np.min(data['price'])
print(min_price)

and this is what i have in return

-1

i have also tried isnan()

print("Original data: " + str(data.shape))
null_rows = np.isnan(data['price'])
print(null_rows)
nonnull_values = data[null_rows==False]
print("Filtered data: " + str(nonnull_values.shape))

however, python did not perceive the na and - values as isnan Original data: (9360,) [False False False ... False False False] Filtered data: (9360,)

is there something wrong with my code?

Upvotes: 0

Views: 76

Answers (1)

hpaulj
hpaulj

Reputation: 231325

With the sample, adapted from the comment:

In [26]: txt1="""quarter,town,flat_type,price
    ...: 2007-Q2,ANG MO KIO,1-ROOM,na 
    ...: 2007-Q2,ANG MO KIO,2-ROOM,- 
    ...: 2007-Q2,ANG MO KIO,3-ROOM,172000 
    ...: 2007-Q2,ANG MO KIO,4-ROOM,260000 
    ...: 2007-Q2,ANG MO KIO,5-ROOM,372000"""

Accepting that last field as floats (no fill stuff):

In [27]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'f8')], delimiter=",")

In [28]: data
Out[28]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',     nan),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',     nan),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

genfromtxt normally uses nan for values it can't parse as floats.

For integer, it apparently uses -1 instead:

In [29]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",")

In [30]: data
Out[30]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',     -1),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',     -1),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])

After some fiddling, I got this to work. The key was to use a single value of filling_values, not a list.

In [54]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], 
    ...:                      delimiter=",", 
    ...:                      missing_values=['na','-'], 
    ...:                      filling_values=-999)

In [55]: data
Out[55]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',   -999),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',   -999),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])

Looking at the code (via [source] in the docs), I see we can use dict, specifying different values for different columns. Thus

missing_values={3:['na','-']}, 
filling_values={3:-999})

There are more details in the code than in the documentation. I haven't used these values much, so each time I have learn more.

edit

In today's question you seem to have forgotten all that you learned here. data is a structured array. With f8 you get nan for the missing values, not -1. And you attempt to treat the array as a list of tuples. Why not continue to treat it as a structured array?

The pricefield:

In [104]: data['price']
Out[104]: array([    nan,     nan, 172000., 260000., 372000.])

In [106]: mask = np.isnan(data['price'])   # or test against -1 for ints
In [107]: mask
Out[107]: array([ True,  True, False, False, False])

Use the mask to select, or "delete" elements from the 1d array:

In [108]: data[mask]
Out[108]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM', nan)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

In [109]: data[~mask]
Out[109]: 
array([('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

But here's a list based approach:

In [110]: alist = data.tolist()
In [111]: alist
Out[111]: 
[('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
 ('2007-Q2', 'ANG MO KIO', '2-ROOM', nan),
 ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
 ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
 ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]

In [112]: [i for i in alist if not np.isnan(i[3])]
Out[112]: 
[('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
 ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
 ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]

Upvotes: 1

Related Questions