Reputation: 3379
Say I have the following dataframe:
>>> import pandas as pd
>>> d=pd.DataFrame()
>>> d['A']=['12345','12354','76','4']
>>> d['B']=['4442','2345','33','5']
>>> d['C']=['5553','4343','33','5']
>>> d
A B C
0 12345 4442 5553
1 12354 2345 4343
2 76 33 33
3 4 5 5
And say I have 3 values of interest:
>>> vals=['123','76']
I am interested in determining which values in my dataframe start with any of the values in my list. There are 3 cases in my example: (0,A) starts with 123; (1,A) starts with 123; and (2,A) starts with 76.
Is there a way I can do this without looping through each of my values?
If I were interested in matching values exactly I could just do:
>>> d.isin(vals)
A B C
0 False False False
1 False False False
2 True False False
3 False False False
>>>
And if I was interested in whether the values start with 1 particular value I could do:
>>> d.applymap(lambda x:x.startswith('123'))
A B C
0 True False False
1 True False False
2 False False False
3 False False False
>>>
But how can I combine these two to find any value that starts with any value in my list?
Upvotes: 3
Views: 1850
Reputation: 394179
You can construct a regex pattern and test each column in turn using apply
with a lambda calling str.contains
:
In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.apply(lambda x: x.str.contains('|'.join(v)))
Out[9]:
A B C
0 True False False
1 True False False
2 True False False
3 False False False
The resulting regex pattern:
In [10]:
'|'.join(v)
Out[10]:
'^123|^76'
update
Actually you can do this using stack
and unstack
so that you initially stack all columns into a single column, call str.contains
with the regex pattern and then unstack
ing back to the original form:
In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.stack().str.contains('|'.join(v)).unstack()
Out[9]:
A B C
0 True False False
1 True False False
2 True False False
3 False False False
This is a cleaner way of doing it compared to using apply
Upvotes: 3
Reputation: 863226
A bit complicated, but it seems the fastest solution.
You can use str.startswith
which works only with Series, so use list
comprehension and concat
output. But need check more items in list vals
, so use another list comprehension with numpy.logical_or
with reduce
which works with numpy array
- convert output of concat
and last create Dataframe
with same column and index as original and data are output of numpy.logical_or
:
print ([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals])
[array([[ True, False, False],
[ True, False, False],
[False, False, False],
[False, False, False]], dtype=bool), array([[False, False, False],
[False, False, False],
[ True, False, False],
[False, False, False]], dtype=bool)]
print (np.logical_or.reduce(
[pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]))
[[ True False False]
[ True False False]
[ True False False]
[False False False]]
print(pd.DataFrame(np.logical_or.reduce(
[pd.concat([d[col].strstartswith(i) for col in d],
axis=1).values for i in vals]),
index=d.index, columns=d.columns))
A B C
0 True False False
1 True False False
2 True False False
3 False False False
Timings:
#[40000 rows x 3 columns]
d = pd.concat([d]*10000).reset_index(drop=True)
In [77]: %timeit (d.applymap(lambda x: any([x.startswith(v) for v in vals])))
1 loop, best of 3: 228 ms per loop
In [78]: %timeit (d.apply(lambda x: x.str.contains('|'.join(['^' + x for x in vals]))))
10 loops, best of 3: 147 ms per loop
In [79]: %timeit (d.astype(str).stack().str.match('^(?:{})'.format('|'.join(vals))).unstack())
10 loops, best of 3: 172 ms per loop
In [80]: %timeit (pd.DataFrame(np.logical_or.reduce([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]), index=d.index, columns=d.columns))
10 loops, best of 3: 116 ms per loop
Upvotes: 0
Reputation: 210912
Alternative solution, which doesn't use .apply()
:
In [66]: search_re = '^(?:{})'.format('|'.join(vals))
In [67]: search_re
Out[67]: '^(?:123|76)'
In [69]: df.astype(str).stack().str.match(search_re).unstack()
Out[69]:
A B C
0 True False False
1 True False False
2 True False False
3 False False False
Upvotes: 1
Reputation: 15433
You could do this:
d.applymap(lambda x: any([x.startswith(v) for v in vals]))
Upvotes: 3