Reputation: 1581
Say I have the following:
my_list = np.array(["abc", "def", "ghi"])
and I'd like to get:
np.array(["ef", "hi"])
I tried:
my_list[1:,1:]
But then I get:
IndexError: too many indices for array
Does Numpy support slicing strings?
Upvotes: 2
Views: 704
Reputation: 114320
Starting with numpy 1.23.0, I added a mechanism to change the dtype of views of non-contiguous arrays. That means you can view your array as individual characters, slice it how you like, and then build it back together. Before that, this required a copy, as @hpaulj's answer clearly shows.
>>> my_list = np.array(["abc", "def", "ghi"])
>>> my_list[:, None].view('U1')[1:, 1:].view('U2').squeeze()
array(['ef', 'hi'])
I'm working on another layer of abstraction, specifically for string arrays, called np.char.slice_ (currently work-in-progress in PR #20694, but the code is functional). If it gets accepted, you will be able to do
>>> np.char.slice_(my_list[1:], 1)
array(['ef', 'hi'])
Upvotes: 0
Reputation: 231385
Your array of strings stores the data as a contiguous block of characters, using the 'S3' dtype to divide it into strings of length 3.
In [116]: my_list
Out[116]:
array(['abc', 'def', 'ghi'],
      dtype='|S3')
An 'S1,S2' dtype views each element as 2 strings, with 1 and 2 chars each:
In [115]: my_list.view('S1,S2')
Out[115]:
array([('a', 'bc'), ('d', 'ef'), ('g', 'hi')],
      dtype=[('f0', 'S1'), ('f1', 'S2')])
Select the 2nd field to get an array with the desired characters:
In [114]: my_list.view('S1,S2')[1:]['f1']
Out[114]:
array(['ef', 'hi'],
      dtype='|S2')
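The session above comes from Python 2, where a str array has the bytes dtype 'S3'. As a minimal sketch of the same field-view idea, assuming Python 3 (where the array dtype is '<U3' instead) and reusing the variable name my_list from the question:

```python
import numpy as np

# In Python 3, an array of str has a unicode dtype ('<U3'), not 'S3'.
my_list = np.array(["abc", "def", "ghi"])

# Reinterpret each 3-character element as a (1-char, 2-char) record.
# The item size is unchanged, so this is a view and copies no data.
split = my_list.view('U1,U2')

# Drop the first element, then pick the 2-character field.
result = split[1:]['f1']
print(result)
```

The field names 'f0' and 'f1' are the defaults numpy assigns to a comma-separated dtype string.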
My first attempt with view was to split the array into single byte strings, and play with the resulting 2d array:
In [48]: my_2dstrings = my_list.view(dtype='|S1').reshape(3,-1)
In [49]: my_2dstrings
Out[49]:
array([['a', 'b', 'c'],
       ['d', 'e', 'f'],
       ['g', 'h', 'i']],
      dtype='|S1')
This array can then be sliced in both dimensions. I used flatten to remove a dimension, and to force a copy (to get a new contiguous buffer).
In [50]: my_2dstrings[1:,1:].flatten().view(dtype='|S2')
Out[50]:
array(['ef', 'hi'],
      dtype='|S2')
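The split, slice, and rejoin steps can be wrapped in a small function. This is only a sketch under assumptions: drop_leading_chars is a hypothetical helper name, it targets Python 3 unicode arrays ('U' dtypes rather than the 'S' dtypes shown above), and it always makes one copy so that it does not depend on the numpy version.

```python
import numpy as np

def drop_leading_chars(arr, start):
    """Return a copy of a 1-D unicode array with the first `start`
    characters removed from every element (hypothetical helper)."""
    arr = np.ascontiguousarray(arr)
    # Number of characters per element, e.g. 3 for dtype '<U3'.
    width = arr.dtype.itemsize // np.dtype('U1').itemsize
    if not 0 <= start < width:
        raise ValueError("start must satisfy 0 <= start < string width")
    # View the buffer as single characters: shape (n, width).
    chars = arr.view('U1').reshape(arr.shape[0], width)
    # The column slice is non-contiguous, so copy it into a fresh buffer
    # before viewing each row as one string again.
    trimmed = np.ascontiguousarray(chars[:, start:])
    return trimmed.view(f'U{width - start}').reshape(-1)

my_list = np.array(["abc", "def", "ghi"])
result = drop_leading_chars(my_list[1:], 1)
print(result)
```

Row slicing (my_list[1:]) is left to the caller because it is ordinary 1-D indexing; only the per-character slice needs the view trick.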
If the strings are already in an array (as opposed to a list) then this approach is much faster than the list comprehension approaches.
Some timings with the 1000 x 64 list that wflynny tests:
In [98]: timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 173 us per loop # my computer is slower
In [99]: timeit np.array(my_list_64).view('S1').reshape(-1,64)[1:,1:].flatten().view('S63')
1000 loops, best of 3: 213 us per loop
In [100]: %%timeit arr = np.array(my_list_64)
   .....: arr.view('S1').reshape(-1,64)[1:,1:].flatten().view('S63')
   .....:
10000 loops, best of 3: 23.2 us per loop
Creating the array from the list is slow, but once created the view approach is much faster.
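To repeat this comparison on your own machine, something like the following sketch could be used. It assumes Python 3 (so the dtypes are 'U1' and 'U63' rather than 'S1' and 'S63'), and the names words, with_comprehension and with_view are made up for illustration. No timings are quoted here because they depend on hardware and numpy version.

```python
import timeit
from random import choice
from string import ascii_lowercase

import numpy as np

# 1000 random strings of 64 lowercase letters, as in the timings above.
words = ["".join(choice(ascii_lowercase) for _ in range(64)) for _ in range(1000)]
arr = np.array(words)  # built once, outside the timed code; dtype '<U64'

def with_comprehension():
    # Plain Python: skip the first string, drop the first character of the rest.
    return [s[1:] for s in words[1:]]

def with_view():
    # View as single characters, slice both axes, copy, and rejoin.
    chars = arr.view('U1').reshape(len(arr), -1)
    return np.ascontiguousarray(chars[1:, 1:]).view('U63').reshape(-1)

# Both approaches should produce the same strings.
same = with_comprehension() == with_view().tolist()
print("results match:", same)

for fn in (with_comprehension, with_view):
    seconds = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {seconds / 200 * 1e6:.1f} us per call")
```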
See my edit history for my earlier notes on np.char.
Upvotes: 1
Reputation: 18521
As per Joe Kington here, Python is very good at string manipulation, and generator/list comprehensions are fast and flexible for string operations. Unless you need to use numpy later in your pipeline, I would advise against it.
[s[1:] for s in my_list[1:]] is fast:
In [1]: from string import ascii_lowercase
In [2]: from random import randint, choice
In [3]: my_list_rand = [''.join([choice(ascii_lowercase)
   ...:                          for _ in range(randint(2, 64))])
   ...:                 for i in range(1000)]
In [4]: my_list_64 = [''.join([choice(ascii_lowercase) for _ in range(64)])
   ...:               for i in range(1000)]
In [5]: %timeit [s[1:] for s in my_list_rand[1:]]
10000 loops, best of 3: 47.6 µs per loop
In [6]: %timeit [s[1:] for s in my_list_64[1:]]
10000 loops, best of 3: 45.3 µs per loop
Using numpy just adds overhead.
Upvotes: 0
Reputation: 11201
No, you cannot do that. For numpy, np.array(["abc", "def", "ghi"]) is a 1D array of strings, therefore you cannot use 2D slicing.
You could either define your array as a 2D array of characters, or simply use a list comprehension for slicing:
In [4]: np.asarray([el[1:] for el in my_list[1:]])
Out[4]:
array(['ef', 'hi'], dtype='|S2')
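The other option mentioned, a 2D array of characters, might look like the sketch below. It assumes all strings have the same length (otherwise numpy cannot build a rectangular array), and the names words, chars and joined are chosen here for illustration.

```python
import numpy as np

words = ["abc", "def", "ghi"]

# One row per string, one column per character: shape (3, 3).
chars = np.array([list(w) for w in words])

# Ordinary 2D slicing now works: skip the first row and the first column.
sub = chars[1:, 1:]

# Join each row back into a single string.
joined = np.array(["".join(row) for row in sub])
print(joined)
```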
Upvotes: 2
Reputation: 3550
Your slicing syntax is incorrect. my_list[1:] drops the first element, but it does not trim the strings themselves, so it returns ['def', 'ghi']. If you want those elements repeated twice in a list, you can do something = list(my_list[1:]) * 2
Upvotes: -2