Ben

Reputation: 1581

Multi-dimensional slicing a list of strings with numpy

Say I have the following:

my_list = np.array(["abc", "def", "ghi"])

and I'd like to get:

np.array(["ef", "hi"])

I tried:

my_list[1:,1:]

But then I get:

IndexError: too many indices for array

Does Numpy support slicing strings?

Upvotes: 2

Views: 704

Answers (5)

Mad Physicist

Reputation: 114320

As of NumPy 1.23.0, which includes a mechanism I added for changing the dtype of views of non-contiguous arrays, you can view your array as individual characters, slice it however you like, and then put it back together. Before that, this required a copy, as @hpaulj's answer shows.

>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])

I'm working on another layer of abstraction specifically for string arrays, called np.char.slice_ (currently a work in progress in PR #20694, but the code is functional). If it is accepted, you will be able to write:

>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])

Upvotes: 0

hpaulj

Reputation: 231385

Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.

In [116]: my_list
Out[116]: 
array(['abc', 'def', 'ghi'], 
      dtype='|S3')

An 'S1,S2' dtype views each element as two strings, of 1 and 2 characters respectively:

In [115]: my_list.view('S1,S2')
Out[115]: 
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')], 
     dtype=[('f0', 'S1'), ('f1', 'S2')])

Select the second field to get an array with the desired characters:

In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]: 
array(['ef', 'hi'], 
      dtype='|S2')
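The session above is from Python 2, where string arrays default to the one-byte 'S' dtype. A rough Python 3 sketch of the same field-view idea follows, assuming the default unicode 'U' dtype (4 bytes per character) and items of equal declared width; the variable names are only illustrative.

```python
import numpy as np

my_list = np.array(["abc", "def", "ghi"])  # dtype '<U3' on Python 3

# Number of characters per item (3 here); each 'U' character is 4 bytes.
width = my_list.dtype.itemsize // np.dtype("U1").itemsize

# Split every item into a 1-character field and a (width - 1)-character field.
# The default field names are 'f0' and 'f1'.
fields = my_list.view(f"U1,U{width - 1}")

# Skip the first item and keep only the second field. No data is copied.
tail = fields[1:]["f1"]
print(tail)  # ['ef' 'hi']
```

The result is still a view into the original buffer, so writing to it would modify my_list.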

My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:

In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)

In [49]: my_2dstrings
Out[49]: 
array([['a', 'b', 'c'],
       ['d', 'e', 'f'],
       ['g', 'h', 'i']], 
      dtype='|S1')

This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).

In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]: 
array(['ef', 'hi'], 
      dtype='|S2')
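For Python 3, where the array would have a 'U' dtype instead of 'S', the same steps can be sketched as below. This is a minimal version under the assumption that a copy is acceptable; np.ascontiguousarray stands in for flatten as the step that produces a fresh contiguous buffer.

```python
import numpy as np

my_list = np.array(["abc", "def", "ghi"])  # dtype '<U3'

# Characters per item: total item size divided by the size of one character.
width = my_list.dtype.itemsize // np.dtype("U1").itemsize

# View the buffer as single characters, one row per original string.
chars = my_list.view("U1").reshape(len(my_list), width)

# Slice rows and columns like any 2D array.
sub = chars[1:, 1:]

# Copy into a contiguous buffer, then merge each row back into one string.
result = np.ascontiguousarray(sub).view(f"U{sub.shape[1]}").ravel()
print(result)  # ['ef' 'hi']
```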

If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.

Some timings with the 1000 x 64 list that wflynny tests:

In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop   # my computer is slower

In [99]: timeit np.array(my_list_64).view('S1').reshape(-1,64)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop

In [100]: %%timeit arr = np.array(my_list_64)
   .....: arr.view('S1').reshape(-1,64)[1:,1:].flatten().view('S63')
   .....: 
10000 loops, best of 3: 23.2 us per loop

Creating the array from the list is slow, but once created the view approach is much faster.


See my edit history for my earlier notes on np.char.

Upvotes: 1

wflynny

Reputation: 18521

As per Joe Kington here, Python is very good at string manipulation, and generator expressions and list comprehensions are fast and flexible for string operations. Unless you need NumPy later in your pipeline, I would advise against using it for this.

[s[1:] for s in my_list[1:]]

is fast:

In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase) 
                                 for _ in range(randint(2, 64))])
                        for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
                      for i in range(1000)]

In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop

Using numpy just adds overhead.
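To reproduce the comparison outside IPython, here is a small sketch using the standard-library timeit module. The function names and the test data are made up for illustration, and the NumPy variant is one possible character-view implementation, not the only one. Timings depend on the machine and on the NumPy version, so none are quoted.

```python
import timeit

import numpy as np

# Illustrative data: 3000 equal-length strings.
words = ["abc", "def", "ghi"] * 1000


def with_comprehension(items):
    """Plain Python: drop the first item, then the first character of each."""
    return [s[1:] for s in items[1:]]


def with_numpy(items):
    """NumPy: includes the cost of converting the list to an array."""
    arr = np.array(items)
    width = arr.dtype.itemsize // np.dtype("U1").itemsize
    chars = arr.view("U1").reshape(len(arr), width)
    return np.ascontiguousarray(chars[1:, 1:]).view(f"U{width - 1}").ravel()


# Both approaches must agree before their timings mean anything.
assert with_comprehension(words) == with_numpy(words).tolist()

for fn in (with_comprehension, with_numpy):
    seconds = timeit.timeit(lambda: fn(words), number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} µs per call")
```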

Upvotes: 0

rth

Reputation: 11201

No, you cannot do that. To NumPy, np.array(["abc", "def", "ghi"]) is a 1D array of strings, so you cannot use 2D slicing.

You could either define your array as a 2D array of characters or simply use a list comprehension for the slicing:

In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]: 
array(['ef', 'hi'], dtype='|S2')
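The first option, a 2D array of characters, might look like the sketch below. It assumes all strings have the same length (otherwise the rows would be ragged), and the names grid and joined are only illustrative.

```python
import numpy as np

words = ["abc", "def", "ghi"]

# The original array is one-dimensional, which is why [1:, 1:] fails on it.
assert np.array(words).ndim == 1

# One row per string, one column per character: shape (3, 3), dtype '<U1'.
grid = np.array([list(w) for w in words])

# Ordinary 2D slicing now works.
sub = grid[1:, 1:]

# Join each row back into a string.
joined = ["".join(row) for row in sub]
print(joined)  # ['ef', 'hi']
```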

Upvotes: 2

rassa45

Reputation: 3550

Your slicing syntax is incorrect: my_list is one-dimensional, so it accepts only one index. my_list[1:] selects every element after the first. If you want those elements repeated twice in a list, you can do something = list(my_list[1:]) * 2

Upvotes: -2
