Kenny Soon

Reputation: 29

Slicing of Python Array

I have an array that I loaded from a CSV file using np.loadtxt:

dataresale = np.loadtxt(
    resale, skiprows=1, usecols=(0,2,10),
    dtype=[('month', 'U50'),
           ('flat_type', 'U50'),
           ('resale_price', 'f8')], delimiter=',')

print(dataresale['month'])

Below is the output:

['2017-01' '2017-01' '2017-01' ... '2021-03' '2021-10' '2021-12']

I would like to extract only the rows from 2021 (all months).

Below is the script I used to select rows by year in another array, but in this dataset the year is combined with the month:

x = datap[datax['year'] == 2019]

Is there a way I can modify the script above to take out all 2021 data?

Upvotes: 2

Views: 328

Answers (5)

Mad Physicist

Reputation: 114230

Starting with numpy version 1.23.0, you can create views of non-contiguous arrays (PR #20722) [1]. The change was motivated by this question, among others, so I'm posting a second answer. Now you can do:

dataresale['month'][:, None].view('U1')[:, :4].view('U4').squeeze().astype(int)

This is still annoying, so I added another layer, np.char.slice_, based on my other answer here. PR #20694 [1] is still currently a WIP, pending a change to np.lib.stride_tricks.as_strided that is unrelated to this question. If/when that goes through, you will be able to do this instead:

np.char.slice_(dataresale['month'], 4).astype(int)

The code in the PR is completely functional, but has some hacky workarounds, so I do not recommend using it just yet.

[1] I am the author. This is a shameless plug, not spam.

Upvotes: 0

hpaulj

Reputation: 231335

Construct a sample array:

In [359]: arr = np.zeros(6, dtype=[('month', 'U50'),
     ...:            ('flat_type', 'U50'),
     ...:            ('resale_price', 'f8')])
In [360]: arr['month']=['2017-01', '2017-01', '2017-01','2021-03', '2021-10', '2021-12']

startswith

Since the interest is in the first four characters, we can do:

In [362]: np.char.startswith(arr['month'],'2021')
Out[362]: array([False, False, False,  True,  True,  True])

which effectively is:

In [364]: [s.startswith('2021') for s in arr['month']]
Out[364]: [False, False, False, True, True, True]

The list comprehension is faster, though for a better comparison let's get the indices:

In [366]: timeit np.nonzero([s.startswith('2021') for s in arr['month']])
15.1 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [367]: timeit  np.nonzero(np.char.startswith(arr['month'],'2021'))
16.7 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

astype truncation

But astype is a relatively quick way of truncating string dtypes, effectively the [:4] type of string slice:

In [371]: arr['month'].astype('U4')
Out[371]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [372]: arr['month'].astype('U4')=='2021'
Out[372]: array([False, False, False,  True,  True,  True])

In [374]: timeit np.nonzero(arr['month'].astype('U4')=='2021')
6.47 µs ± 7.53 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
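
Applied to the question, the truncation can be used directly as a row mask. This is a minimal sketch; the sample rows below are made up to stand in for the asker's `dataresale`, and only `astype` and boolean indexing are used:

```python
import numpy as np

# Hypothetical sample data with the same dtype as the question's array.
dtype = [('month', 'U50'), ('flat_type', 'U50'), ('resale_price', 'f8')]
dataresale = np.array(
    [('2017-01', '3 ROOM', 250000.0),
     ('2021-03', '4 ROOM', 410000.0),
     ('2021-12', '5 ROOM', 560000.0)],
    dtype=dtype)

# Casting U50 -> U4 keeps only the first four characters of each string.
years = dataresale['month'].astype('U4')

# The boolean mask selects whole records from the structured array.
rows_2021 = dataresale[years == '2021']
print(rows_2021['month'])
```

Note that `astype` makes a copy of the column, so this trades a little memory for simplicity.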

datetime[Y]

Another option is to convert the strings to datetime64:

In [376]: arr['month'].astype('datetime64[Y]')
Out[376]: 
array(['2017', '2017', '2017', '2021', '2021', '2021'],
      dtype='datetime64[Y]')

With the conversion time:

In [379]: timeit np.nonzero(arr['month'].astype('datetime64[Y]')==np.array('2021','datetime64[Y]'))
17.5 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And if we can justify doing the conversion ahead of time:

In [380]: %%timeit yrs = arr['month'].astype('datetime64[Y]')
     ...: np.nonzero(yrs==np.array('2021','datetime64[Y]'))
6.2 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
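
Outside of the timing context, the same idea might look like the sketch below. The `months` array is invented sample data, and the comparison value is built with `np.datetime64('2021', 'Y')` rather than the `np.array(...)` form used in the timings:

```python
import numpy as np

# Made-up 'YYYY-MM' strings standing in for dataresale['month'].
months = np.array(['2017-01', '2021-03', '2021-10', '2021-12'])

# Converting to year resolution discards the month part.
years = months.astype('datetime64[Y]')

# Compare against a year-resolution scalar to build the mask.
mask = years == np.datetime64('2021', 'Y')
print(months[mask])
```

If the array is filtered repeatedly, keeping `years` around avoids redoing the string-to-date conversion each time.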

char_slice

In [396]: char_slice(arr['month'],0,4)
Out[396]: array(['2017', '2017', '2017', '2021', '2021', '2021'], dtype='<U4')
In [397]: char_slice(arr['month'],0,4)=='2021'
Out[397]: array([False, False, False,  True,  True,  True])
In [398]: timeit np.nonzero(char_slice(arr['month'],0,4)=='2021')
37.2 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 4

Mad Physicist

Reputation: 114230

If you insist on using numpy, you can extract the data you want to integer format using slicing:

strs = dataresale['month'].copy()[:, None].view('U1')
year = strs[:, :4].view('U4').astype(int).ravel()
month = strs[:, 5:7].view('U2').astype(int).ravel()

The conversion to 2D followed by the ravel at the end allows the 'U1' view to expand into columns. The copy is necessary because the data is not completely contiguous (though it is in the column dimension).

Any mask you construct from these arrays will be applicable to the original, e.g.:

dataresale[year == 2021]

PS

The copy is really bothering me, since the original data is clearly "contiguous enough" to avoid it. If the elements of the string were not in a contiguous block, it would be understandable. I therefore propose the following alternative for string slicing, which is actually a lot cheaper and simpler in some ways:

yoffset = dataresale.dtype.fields['month'][1]
year = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=yoffset, strides=dataresale.strides, dtype='U4').astype(int)
moffset = dataresale.dtype.fields['month'][1] + dataresale.dtype.fields['month'][0].itemsize // 50 * 5
month = np.ndarray(buffer=dataresale, shape=dataresale.shape, offset=moffset, strides=dataresale.strides, dtype='U2').astype(int)
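
To make the offset arithmetic concrete, here is a small self-contained sketch of the same approach on an invented six-row array. The names `field_dtype`, `field_offset` and `charsize` are illustrative only, and the hard-coded 50 assumes the 'U50' field from the question:

```python
import numpy as np

# Made-up sample with the same layout as the question's array.
arr = np.zeros(6, dtype=[('month', 'U50'),
                         ('flat_type', 'U50'),
                         ('resale_price', 'f8')])
arr['month'] = ['2017-01', '2017-01', '2017-01', '2021-03', '2021-10', '2021-12']

# dtype.fields maps each field name to (dtype, byte offset within a record).
field_dtype, field_offset = arr.dtype.fields['month']
charsize = field_dtype.itemsize // 50   # bytes per character of the 'U50' field

# Reinterpret the first 4 characters of every 'month' entry, reusing the
# record stride so each element starts at the same place in its record.
year = np.ndarray(buffer=arr, shape=arr.shape, offset=field_offset,
                  strides=arr.strides, dtype='U4').astype(int)

# Skip 'YYYY-' (5 characters) to reach the two-digit month.
month = np.ndarray(buffer=arr, shape=arr.shape,
                   offset=field_offset + 5 * charsize,
                   strides=arr.strides, dtype='U2').astype(int)

print(year, month)
```

Only the final `astype(int)` allocates new data; the intermediate `U4` and `U2` arrays share memory with `arr`.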

PPS

It's really bothering me that there isn't a generic string slicing method in numpy. It seems pretty simple to implement given the example above. So here's a more general solution:

def char_slice(a, start, stop=None):
    """
    Apply `slice` to each element of string array `a`.

    Parameters
    ----------
    a : array-like
        Must contain ASCII or Unicode string elements.
    start, stop : int
        The limits of the slice. Only contiguous one-directional
        slices are supported. If only `start` is provided, the
        slice is interpreted as `(0, start)`, as usual in python.
        Bounds past the ends of the string are silently truncated,
        as with most python slicing. Negative slice values are
        interpreted relative to the end of the datatype, not
        necessarily contents of the individual elements. `start` is
        inclusive, while `stop` is exclusive. `start <= stop` is
        required after all other adjustments have been made.

    Return
    ------
    slice : np.ndarray
        A view of the original data, sliced to show strings of the
        required size. The dimensions of the array will be the same
        as those of the input, and the datatype will be `SN` or
        `UN`, with the same byte order and code as the input, but
        with `N = stop - start`.

    Note
    ----
    There are two circumstances under which a view can not be
    returned. The simplest is when the data is not in a suitable
    format, such as a list or other array-like. As a rule of thumb,
    anything that `numpy.asanyarray` would copy becomes a copy. The
    second circumstance is when the original base of the array `a`
    is non-contiguous. `a` itself does not have to be contiguous for
    a view to be successfully constructed.
    """
    a = np.asanyarray(a)
    dtype = a.dtype

    if dtype.char not in 'US':
        raise TypeError(f'Only U and S string datatypes supported. Found {dtype.char}')

    length = int(dtype.str[2:])

    # Adjust the bounds using a slice object
    if stop is None:
        start, stop = 0, start
    start, stop, step = slice(start, stop).indices(length)
    if start > stop or step != 1:
        raise ValueError('Invalid start-stop combination. Start <= stop required after adjustment.')

    # Get the real dtype information
    charsize = dtype.itemsize // length

    # Find the real base array
    base = a
    while base.base is not None:
        base = base.base
    realoffset = a.__array_interface__['data'][0] - base.__array_interface__['data'][0]

    newoffset = start * charsize + realoffset
    newdtype = np.dtype(f'{dtype.str[:2]}{stop - start}')
    try:
        newarray = np.ndarray(buffer=base, offset=newoffset, shape=a.shape, strides=a.strides, dtype=newdtype)
    except ValueError as e:
        if str(e) == 'ndarray is not contiguous':
            a = a.copy()
            newarray = np.ndarray(buffer=a, offset=start * charsize, shape=a.shape, strides=a.strides, dtype=newdtype)
        else:
            raise
    return newarray

Now all you need to do to get the year out is

year = char_slice(dataresale['month'], 4).astype(int)

Upvotes: 2

juanpa.arrivillaga

Reputation: 95873

So, in general, numpy.ndarray objects have limited support for string operations. Notably, string slicing seems to be absent. If you look at similar questions, you can hack a slice from the front at least using a view (with a small N for the UN type). However, since your array is a structured dtype, it doesn't like creating views.

In this particular case, though, you can use the np.char.startswith function.

Some example data (please always provide this in the future, you are coming here asking for help, don't make people work to make your own question easy to answer, it's actually part of the rules, but it is also just common courtesy):

(py39) Juans-MBP:workspace juan$ cat resale.csv
2017-01,foo,4560.0
2019-01,bar,3432.34
2017-01,baz,34199.5
2019-01,baz,3232.34
2017-01,bar,932.34

Ok, so using that above:

In [1]: import numpy as np

In [2]: resale = "resale.csv"

In [3]: data = np.loadtxt(resale,dtype=[('month','U50'),('flat_type','U50'),
   ...:                                       ('resale_price','f8')],delimiter=',')

In [4]: data
Out[4]:
array([('2017-01', 'foo',  4560.  ), ('2019-01', 'bar',  3432.34),
       ('2017-01', 'baz', 34199.5 ), ('2019-01', 'baz',  3232.34),
       ('2017-01', 'bar',   932.34)],
      dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])

In [5]: np.char.startswith(data['month'], "2019")
Out[5]: array([False,  True, False,  True, False])

In [6]: data[np.char.startswith(data['month'], "2019")]
Out[6]:
array([('2019-01', 'bar', 3432.34), ('2019-01', 'baz', 3232.34)],
      dtype=[('month', '<U50'), ('flat_type', '<U50'), ('resale_price', '<f8')])

Alternatively, though, in this case you are working with dates, which is a supported type in numpy, so you can use the following dtype: 'datetime64[D]' which will be a datetime64 but parsed by filling in the days for you:

In [14]: data = np.loadtxt(resale,dtype=[('month','datetime64[D]'),('flat_type','U50'),
    ...:                                       ('resale_price','f8')],delimiter=',')

In [8]: data
Out[8]:
array([('2017-01-01', 'foo',  4560.  ), ('2019-01-01', 'bar',  3432.34),
       ('2017-01-01', 'baz', 34199.5 ), ('2019-01-01', 'baz',  3232.34),
       ('2017-01-01', 'bar',   932.34)],
      dtype=[('month', '<M8[D]'), ('flat_type', '<U50'), ('resale_price', '<f8')])

Then you can use something like:

In [9]: data['month'] >= np.datetime64("2019")
Out[9]: array([False,  True, False,  True, False])
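
Note that `>=` alone also matches every later year. To pick out a single year, one option is a half-open range, sketched here on invented dates (the `months` array and the `start`/`stop` names are assumptions for illustration):

```python
import numpy as np

# Made-up dates at day resolution, as produced by the 'datetime64[D]' dtype above.
months = np.array(['2017-01', '2019-01', '2021-03', '2021-12'],
                  dtype='datetime64[D]')

# Year-resolution bounds; comparing against day-resolution data treats
# them as 2021-01-01 and 2022-01-01 respectively.
start = np.datetime64('2021')
stop = np.datetime64('2022')

# Half-open interval [start, stop) selects exactly the rows in 2021.
mask = (months >= start) & (months < stop)
print(months[mask])
```

The same mask can be used to index the full structured array, e.g. `data[mask]`.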

Upvotes: 2

Arun

Reputation: 164

I think you can do it with the help of pandas.Series as follows:

dataresale[pd.Series(dataresale['month']).str.match(r'^2021-')]

Upvotes: 0
