AJG519
AJG519

Reputation: 3379

Pandas - identifying dataframe values that start with value in a list

Say I have the following dataframe:

>>> import pandas as pd
>>> d=pd.DataFrame()
>>> d['A']=['12345','12354','76','4']
>>> d['B']=['4442','2345','33','5']
>>> d['C']=['5553','4343','33','5']
>>> d
       A     B     C
0  12345  4442  5553
1  12354  2345  4343
2     76    33    33
3      4     5     5

And say I have 3 values of interest:

>>> vals=['123','76']

I am interested in determining which values in my dataframe start with any of the values in my list. There are 3 cases in my example: (0,A) starts with 123; (1,A) starts with 123; and (2,A) starts with 76.

Is there a way I can do this without looping through each of my values?

If I were interested in matching values exactly I could just do:

>>> d.isin(vals)
       A      B      C
0  False  False  False
1  False  False  False
2   True  False  False
3  False  False  False
>>> 

And if I was interested in whether the values start with 1 particular value I could do:

>>> d.applymap(lambda x:x.startswith('123'))
       A      B      C
0   True  False  False
1   True  False  False
2  False  False  False
3  False  False  False
>>> 

But how can I combine these two to find any value that starts with any value in my list?

Upvotes: 3

Views: 1850

Answers (4)

EdChum
EdChum

Reputation: 394179

You can construct a regex pattern and test each column in turn using apply with a lambda calling str.contains:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.apply(lambda x: x.str.contains('|'.join(v)))

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

The resulting regex pattern:

In [10]:
'|'.join(v)

Out[10]:
'^123|^76'

update

Actually you can do this using stack and unstack so that you initially stack all columns into a single column, call str.contains with the regex pattern and then unstacking back to the original form:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.stack().str.contains('|'.join(v)).unstack()

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

This is a cleaner way of doing it compared to using apply

Upvotes: 3

jezrael
jezrael

Reputation: 863226

A bit complicated, but it seems the fastest solution.

You can use str.startswith which works only with Series, so use list comprehension and concat output. But need check more items in list vals, so use another list comprehension with numpy.logical_or with reduce which works with numpy array - convert output of concat and last create Dataframe with same column and index as original and data are output of numpy.logical_or:

print ([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals])
[array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool), array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [False, False, False]], dtype=bool)]

print (np.logical_or.reduce(
      [pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]))
[[ True False False]
 [ True False False]
 [ True False False]
 [False False False]]

print(pd.DataFrame(np.logical_or.reduce(
      [pd.concat([d[col].strstartswith(i) for col in d], 
                 axis=1).values for i in vals]), 
                   index=d.index, columns=d.columns))

       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

Timings:

#[40000 rows x 3 columns]
d = pd.concat([d]*10000).reset_index(drop=True)


In [77]: %timeit (d.applymap(lambda x: any([x.startswith(v) for v in vals])))
1 loop, best of 3: 228 ms per loop

In [78]: %timeit (d.apply(lambda x: x.str.contains('|'.join(['^' + x for x in vals]))))
10 loops, best of 3: 147 ms per loop

In [79]: %timeit (d.astype(str).stack().str.match('^(?:{})'.format('|'.join(vals))).unstack())
10 loops, best of 3: 172 ms per loop

In [80]: %timeit (pd.DataFrame(np.logical_or.reduce([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]), index=d.index, columns=d.columns))
10 loops, best of 3: 116 ms per loop

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210912

Alternative solution, which doesn't use .apply():

In [66]: search_re = '^(?:{})'.format('|'.join(vals))

In [67]: search_re
Out[67]: '^(?:123|76)'

In [69]: df.astype(str).stack().str.match(search_re).unstack()
Out[69]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

Upvotes: 1

Julien Spronck
Julien Spronck

Reputation: 15433

You could do this:

d.applymap(lambda x: any([x.startswith(v) for v in vals]))

Upvotes: 3

Related Questions