Reputation: 29
I have an array that I created by calling np.loadtxt on a CSV file.
dataresale = np.loadtxt(
resale, skiprows=1, usecols=(0,2,10),
dtype=[('month', 'U50'),
('flat_type', 'U50'),
('resale_price', 'f8')], delimiter=',')
print(dataresale['month'])
Below is the output:
['2017-01' '2017-01' '2017-01' ... '2021-03' '2021-10' '2021-12']
I would like to extract only the data from the year 2021 (all months).
Below is a script I used to select rows by year in another array, but in this particular dataset the year is combined with the month.
x = datap[datax['year'] == 2019]
Is there a way I can modify the script above to take out all 2021 data?
Upvotes: 2
Views: 328
Reputation: 114230
Starting with numpy version 1.23.0, you can create views of non-contiguous arrays (PR #20722) [1]. The change was motivated by this question, among others, so I'm posting a second answer. Now you can do
dataresale['month'][:, None].view('U1')[:, :4].view('U4').squeeze().astype(int)
This is still annoying, so I added another layer, np.char.slice_, based on my other answer here. PR #20694 [1] is still currently a WIP, pending a change to np.lib.stride_tricks.as_strided that is unrelated to this question. If/when that goes through, you will be able to do this instead:
np.char.slice_(dataresale['month'], 4).astype(int)
The code in the PR is completely functional, but has some hacky workarounds, so I do not recommend using it just yet.
[1] I am the author. This is a shameless plug, not spam.
Upvotes: 0
Reputation: 231335
Construct a sample array:
In [359]: arr = np.zeros(6, dtype=[('month', 'U50'),
...: ('flat_type', 'U50'),
...: ('resale_price', 'f8')])
In [360]: arr['month'] = ['2017-01', '2017-01', '2017-01', '2021-03', '2021-10', '2021-12']
Since the interest is in the first four characters, we can do:
In [362]: np.char.startswith(arr['month'],'2021')
Out[362]: array([False, False, False, True, True, True])
which effectively is:
In [364]: [s.startswith('2021') for s in arr['month']]
Out[364]: [False, False, False, True, True, True]
The list comprehension is faster, though for a better comparison let's get the indices:
In [366]: timeit np.nonzero([s.startswith('2021') for s in arr['month']])
15.1 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [367]: timeit np.nonzero(np.char.startswith(arr['month'],'2021'))
16.7 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But astype is a relatively quick way of truncating string dtypes, effectively the [:4] type of string slice:
In [371]: arr['month'].astype('U4')
Out[371]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [372]: arr['month'].astype('U4')=='2021'
Out[372]: array([False, False, False, True, True, True])
In [374]: timeit np.nonzero(arr['month'].astype('U4')=='2021')
6.47 µs ± 7.53 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
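Applied to the question, the truncated strings give a boolean mask that indexes the whole structured array. A minimal sketch, assuming a small hand-made sample in place of the CSV (the rows and values are made up for illustration):

```python
import numpy as np

# Hypothetical sample standing in for the array loaded from the CSV.
dtype = [('month', 'U50'), ('flat_type', 'U50'), ('resale_price', 'f8')]
dataresale = np.array(
    [('2017-01', '3 ROOM', 250000.0),
     ('2021-03', '4 ROOM', 410000.0),
     ('2021-12', '5 ROOM', 530000.0)],
    dtype=dtype)

# Casting to 'U4' keeps only the first four characters of each string.
years = dataresale['month'].astype('U4')

# The mask has one entry per record, so it selects complete rows.
rows_2021 = dataresale[years == '2021']
print(rows_2021['month'])
```

Note that astype makes a truncated copy of the field, so this costs some memory on a large array, but the original records are left untouched.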
Another option is to convert the strings to datetime64:
In [376]: arr['month'].astype('datetime64[Y]')
Out[376]:
array(['2017', '2017', '2017', '2021', '2021', '2021'],
dtype='datetime64[Y]')
With the conversion time:
In [379]: timeit np.nonzero(arr['month'].astype('datetime64[Y]')==np.array('2021','datetime64[Y]'))
17.5 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And if we can justify doing the conversion ahead of time:
In [380]: %%timeit yrs = arr['month'].astype('datetime64[Y]')
...: np.nonzero(yrs==np.array('2021','datetime64[Y]'))
6.2 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
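The datetime64 route can also be used to pull the 2021 rows out of the structured array. A rough sketch on made-up sample data; as a cautious variation on the one-step cast above, it parses to month resolution first and then truncates to years:

```python
import numpy as np

# Hypothetical sample with the same field layout as the question.
arr = np.zeros(4, dtype=[('month', 'U50'),
                         ('flat_type', 'U50'),
                         ('resale_price', 'f8')])
arr['month'] = ['2017-01', '2021-03', '2021-10', '2022-02']

# Parse the 'YYYY-MM' strings at month resolution, then drop the month.
months = arr['month'].astype('datetime64[M]')
yrs = months.astype('datetime64[Y]')

# Compare against a year-resolution scalar to build the row mask.
mask = yrs == np.datetime64('2021', 'Y')
selected = arr[mask]
print(selected['month'])
```

Keeping the intermediate months array around is useful if you later need to filter on specific months as well as years.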
For comparison, here is the char_slice function from the other answer:
In [396]: char_slice(arr['month'],0,4)
Out[396]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [397]: char_slice(arr['month'],0,4)=='2021'
Out[397]: array([False, False, False, True, True, True])
In [398]: timeit np.nonzero(char_slice(arr['month'],0,4)=='2021')
37.2 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Upvotes: 4
Reputation: 114230
If you insist on using numpy, you can extract the data you want to integer format using slicing:
strs = dataresale['month'].copy()[:, None].view('U1')
year = strs[:, :4].view('U4').astype(int).ravel()
month = strs[:, 5:7].view('U2').astype(int).ravel()
The conversion to 2D, followed by the ravel at the end, allows the 'U1' view to expand into columns. The copy is necessary because the data is not completely contiguous (though it is in the column dimension).
Any mask you construct from these arrays will be applicable to the original, e.g.:
dataresale[year == 2021]
PS
The copy is really bothering me, since the original data is clearly "contiguous enough" to avoid it. If the elements of the string were not in a contiguous block, it would be understandable. I therefore propose the following alternative for string slicing, which is actually a lot cheaper and simpler in some ways:
yoffset = dataresale.dtype.fields['month'][1]
year = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=yoffset, strides=dataresale.strides, dtype='U4').astype(int)
moffset = dataresale.dtype.fields['month'][1] + dataresale.dtype.fields['month'][0].itemsize // 50 * 5
month = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=moffset, strides=dataresale.strides, dtype='U2').astype(int)
PPS
It's really bothering me that there isn't a generic string slicing method in numpy. It seems pretty simple to implement given the example above. So here's a more general solution:
def char_slice(a, start, stop=None):
    """
    Apply `slice` to each element of string array `a`.

    Parameters
    ----------
    a : array-like
        Must contain ASCII or unicode elements.
    start, stop : int
        The limits of the slice. Only contiguous one-directional
        slices are supported. If only `start` is provided, the
        slice is interpreted as `(0, start)`, as usual in python.
        Bounds past the ends of the string are silently truncated,
        as with most python slicing. Negative slice values are
        interpreted relative to the end of the datatype, not
        necessarily contents of the individual elements. `start` is
        inclusive, while `stop` is exclusive. `start <= stop` is
        required after all other adjustments have been made.

    Return
    ------
    slice : np.ndarray
        A view of the original data, sliced to show strings of the
        required size. The dimensions of the array will be the same
        as those of the input, and the datatype will be `SN` or
        `UN`, with the same byte order and code as the input, but
        with `N = stop - start`.

    Note
    ----
    There are two circumstances under which a view can not be
    returned. The simplest is when the data is not in a suitable
    format, such as a list or other array-like. As a rule of thumb,
    anything that `numpy.asanyarray` would copy becomes a copy. The
    second circumstance is when the original base of the array `a`
    is non-contiguous. `a` itself does not have to be contiguous for
    a view to be successfully constructed.
    """
    a = np.asanyarray(a)
    dtype = a.dtype
    if dtype.char not in 'US':
        raise TypeError(f'Only U and S string datatypes supported. Found {dtype.char}')
    length = int(dtype.str[2:])
    # Adjust the bounds using a slice object
    if stop is None:
        start, stop = 0, start
    start, stop, step = slice(start, stop).indices(length)
    if start > stop or step != 1:
        raise ValueError('Invalid start-stop combination. Start <= stop required after adjustment.')
    # Get the real dtype information
    charsize = dtype.itemsize // length
    # Find the real base array
    base = a
    while base.base is not None:
        base = base.base
    realoffset = a.__array_interface__['data'][0] - base.__array_interface__['data'][0]
    newoffset = start * charsize + realoffset
    newdtype = np.dtype(f'{dtype.str[:2]}{stop - start}')
    try:
        newarray = np.ndarray(buffer=base, offset=newoffset, shape=a.shape, strides=a.strides, dtype=newdtype)
    except ValueError as e:
        if str(e) == 'ndarray is not contiguous':
            # Fall back to a contiguous copy when the base cannot be used as a buffer
            a = a.copy()
            newarray = np.ndarray(buffer=a, offset=start * charsize, shape=a.shape, strides=a.strides, dtype=newdtype)
        else:
            raise
    return newarray
Now all you need to do to get the year out is
year = char_slice(dataresale['month'], 4).astype(int)
Upvotes: 2
Reputation: 95873
So, in general, numpy.ndarray objects have limited support for string operations. Notably, string slicing seems to be absent. If you look at similar questions, you can hack a slice from the front at least using a view (with a small N for the UN type). However, since your array is a structured dtype, it doesn't like creating views.
In this particular case, though, you can use the np.char.startswith function.
Some example data (please always provide this in the future, you are coming here asking for help, don't make people work to make your own question easy to answer, it's actually part of the rules, but it is also just common courtesy):
(py39) Juans-MBP:workspace juan$ cat resale.csv
2017-01,foo,4560.0
2019-01,bar,3432.34
2017-01,baz,34199.5
2019-01,baz,3232.34
2017-01,bar,932.34
Ok, so using that above:
In [1]: import numpy as np
In [2]: resale = "resale.csv"
In [3]: data = np.loadtxt(resale,dtype=[('month','U50'),('flat_type','U50'),
...: ('resale_price','f8')],delimiter=',')
In [4]: data
Out[4]:
array([('2017-01', 'foo', 4560. ), ('2019-01', 'bar', 3432.34),
('2017-01', 'baz', 34199.5 ), ('2019-01', 'baz', 3232.34),
('2017-01', 'bar', 932.34)],
dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])
In [5]: np.char.startswith(data['month'], "2019")
Out[5]: array([False, True, False, True, False])
In [6]: data[np.char.startswith(data['month'], "2019")]
Out[6]:
array([('2019-01', 'bar', 3432.34), ('2019-01', 'baz', 3232.34)],
dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])
Alternatively, though, in this case you are working with dates, which is a supported type in numpy, so you can use the following dtype: 'datetime64[D]', which will be a datetime64 but parsed by filling in the days for you:
In [14]: data = np.loadtxt(resale,dtype=[('month','datetime64[D]'),('flat_type','U50'),
...: ('resale_price','f8')],delimiter=',')
In [8]: data
Out[8]:
array([('2017-01-01', 'foo', 4560. ), ('2019-01-01', 'bar', 3432.34),
('2017-01-01', 'baz', 34199.5 ), ('2019-01-01', 'baz', 3232.34),
('2017-01-01', 'bar', 932.34)],
dtype=[('month', '<M8[D]'), ('flat_type', '<U50'), ('resale_price', '<f8')])
Then you can use something like:
In [9]: data['month'] >= np.datetime64("2019")
Out[9]: array([False, True, False, True, False])
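Note that the >= comparison alone also keeps every later year. To restrict the selection to a single year such as 2021, one option is a half-open date range. A minimal sketch, assuming made-up data with the same datetime64[D] field layout (built in memory here rather than loaded from the file):

```python
import numpy as np

# Hypothetical data mirroring the datetime64[D] layout above.
data = np.zeros(5, dtype=[('month', 'datetime64[D]'),
                          ('flat_type', 'U50'),
                          ('resale_price', 'f8')])
data['month'] = np.array(
    ['2017-01', '2019-01', '2019-12', '2021-03', '2021-12'],
    dtype='datetime64[D]')

# Half-open interval [2021-01-01, 2022-01-01) covers every month of 2021.
start = np.datetime64('2021')
stop = np.datetime64('2022')
in_2021 = (data['month'] >= start) & (data['month'] < stop)

rows_2021 = data[in_2021]
print(rows_2021['month'])
```

The same pattern works for any span of months by changing the two bounds, for example '2021-04' and '2021-07' for the second quarter.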
Upvotes: 2
Reputation: 164
I think you can do it with the help of pandas.Series as follows:
dataresale[pd.Series(dataresale['month']).str.match(r'^2021-')]
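A slightly fuller sketch of the same idea, assuming pandas is installed and using a made-up sample array. It uses str.startswith instead of a regex and converts the result to a plain boolean array before indexing, which is a choice made here for explicitness and not part of the one-liner above:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the array loaded from the CSV.
dataresale = np.zeros(3, dtype=[('month', 'U50'),
                                ('flat_type', 'U50'),
                                ('resale_price', 'f8')])
dataresale['month'] = ['2017-01', '2021-03', '2021-12']

# Vectorized prefix test on the month strings, returned as a NumPy bool array.
mask = pd.Series(dataresale['month']).str.startswith('2021-').to_numpy(dtype=bool)
rows_2021 = dataresale[mask]
print(rows_2021['month'])
```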
Upvotes: 0