Reputation: 71
Suppose I have a data set df like below:
x | y
----|--------
foo | 1.foo-ya
bar | 2.bar-ga
baz | 3.ha-baz
qux | None
I want to filter the rows where y contains x strictly in the middle (neither at the beginning nor the end, i.e. y matching the pattern '^.+' + x + '.+$', hitting rows 1 & 2), excluding None/NaN:
x | y
----|-----
foo | 1.foo-ya
bar | 2.bar-ga
It's a typical pair-wise comparison which is easy in SQL (e.g. with MySQL's regexp):
select x, y from df where y regexp concat('^.+', x, '.+$');
or in R:
library(dplyr)
library(stringr)
library(glue)
df %>% filter(str_detect(y, glue('^.+{x}.+$')))
But since I am not an expert in pandas, it seems there is no similarly simple "vectorized" pair-wise regex matching method in pandas? I applied a lambda approach:
import pandas as pd
import re

df.loc[df.apply(lambda row: bool(re.search(
           '^.+' + row.x + '.+$', row.y))
       if row.x and row.y else False, axis=1), :]
Are there any more elegant methods in pandas to get it done?
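(For reference, the same row-wise match can be written without apply as a plain comprehension over zip — a sketch, assuming the four-row frame above; re.escape is added here in case x ever contains regex metacharacters, which the lambda version does not guard against:)

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["foo", "bar", "baz", "qux"],
    "y": ["1.foo-ya", "2.bar-ga", "3.ha-baz", None],
})

# True where y contains x with at least one character on each side
mask = [
    isinstance(yy, str) and bool(re.search(".+" + re.escape(xx) + ".+", yy))
    for xx, yy in zip(df["x"], df["y"])
]
df[mask]
```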
Moreover, I want to extract the leading numbers (1, 2, ...) from the matched records yielded in Part I:
x | y | z
----|----------|---
foo | 1.foo-ya | 1
bar | 2.bar-ga | 2
In R, I can do a straightforward pipe wrangling:
df %>%
  filter(str_detect(y, glue('^.+{x}.+$'))) %>%
  mutate(z = str_replace(y, glue('^(\\d+)\\.{x}.+$'), '\\1') %>%
           as.numeric)
But in pandas, I am only aware of lambda approach. Are there any "better" approaches than it?
a = df.loc[df.apply(lambda row: bool(
re.search('^.+' + row.x + '.+$', row.y))
if row.x and row.y else False, axis=1),
['x', 'y']]
a['z'] = a.apply(lambda row: re.sub(
r'^(\d+)\.' + row.x + '.+$', r'\1', row.y), axis=1).astype('int')
a
BTW, the assign method fails to work:
df.loc[df.apply(lambda row: bool(re.search(
           '^.+' + row.x + '.+$', row.y))
       if row.x and row.y else False, axis=1),
       ['x', 'y']].assign(z=lambda row: re.sub(
           r'^(\d+)\.' + row.x + '.+$', r'\1', row.y))
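(For the record, assign fails here because its callable receives the whole filtered DataFrame, not one row at a time, so re.sub is handed Series objects. Iterating over the column pairs inside the callable does work — a sketch, assuming the sample frame from the question:)

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["foo", "bar", "baz", "qux"],
    "y": ["1.foo-ya", "2.bar-ga", "3.ha-baz", None],
})

# Part I: keep rows where y contains x with a character on each side
matched = df.loc[[
    isinstance(yy, str) and bool(re.search(".+" + re.escape(xx) + ".+", yy))
    for xx, yy in zip(df["x"], df["y"])
], ["x", "y"]]

# assign's callable gets the whole DataFrame `d`, so loop over pairs inside it
result = matched.assign(
    z=lambda d: [
        int(re.sub(r"^(\d+)\." + re.escape(xx) + ".+$", r"\1", yy))
        for xx, yy in zip(d["x"], d["y"])
    ]
)
result
```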
Thank you!
Upvotes: 2
Views: 889
Reputation: 3825
Is this the way you wanted? It pretty much replicates what you did in R:
>>> from numpy import vectorize
>>> from pipda import register_func
>>> from datar.all import f, tribble, filter, grepl, paste0, mutate, sub, as_numeric
[2021-06-24 17:27:16][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>>
>>> df = tribble(
... f.x, f.y,
... "foo", "1.foo-ya",
... "bar", "2.bar-ga",
... "baz", "3.ha-baz",
... "qux", None
... )
>>>
>>> @register_func(None)
... @vectorize
... def str_detect(text, pattern):
... return grepl(pattern, text)
...
>>> @register_func(None)
... @vectorize
... def str_replace(text, pattern, replacement):
... return sub(pattern, replacement, text)
...
>>> df >> \
... filter(str_detect(f.y, paste0('^.+', f.x, '.+$'))) >> \
... mutate(z=as_numeric(str_replace(f.y, paste0(r'^(\d+)\.', f.x, '.+$'), r'\1')))
x y z
<object> <object> <float64>
0 foo 1.foo-ya 1.0
1 bar 2.bar-ga 2.0
Disclaimer: I am the author of the datar package.
Upvotes: 1
Reputation: 71
Thanks for all the inspiring replies. I have to say that, although Python excels in many areas, I prefer R when it comes to such vectorized operations. So I reinvented the wheel for this case.
import re
from typing import List

import numpy as np
import pandas as pd


def str_detect(string: pd.Series, pattern: pd.Series) -> List[bool]:
    """mimic str_detect in R
    """
    string, pattern = list(string), list(pattern)
    if len(string) > len(pattern):
        # recycle the last pattern up to the length of string
        pattern.extend([pattern[-1]] * (len(string) - len(pattern)))
    elif len(string) < len(pattern):
        # truncate the pattern to the length of string
        pattern = pattern[:len(string)]
    return [bool(re.match(y, x)) if x and y else False
            for x, y in zip(string, pattern)]


def str_extract(string: pd.Series, pattern: pd.Series) -> List[str]:
    """mimic str_extract in R
    """
    string, pattern = list(string), list(pattern)
    if len(string) > len(pattern):
        pattern.extend([pattern[-1]] * (len(string) - len(pattern)))
    elif len(string) < len(pattern):
        pattern = pattern[:len(string)]
    o = [re.search(y, x) if x and y else None
         for x, y in zip(string, pattern)]
    return [x.group() if x else np.nan for x in o]
then
df.loc[str_detect(
    df['y'], '^.+' + df['x'] + '.+$'), ['x', 'y']]

(df
 .assign(z=str_extract(df['y'], r'^(\d+)(?=\.' + df['x'] + ')'))
 .dropna(subset=['z'])
 .loc[:, ['x', 'y', 'z']])
Upvotes: 0
Reputation: 28644
pandas string operations are built on Python's string and re modules. Have a go at this and see if it is what you want:
import re

import numpy as np

# find out if values in column x are in column y,
# according to the pattern in the question (skipping non-string y)
pattern = [re.match(fr'^.+{a}.+$', b) if isinstance(b, str) else None
           for a, b in zip(df.x.str.strip(),
                           df.y.str.strip())]
match = [ent.group() if ent is not None else np.nan for ent in pattern]

# extract the digits immediately preceding the value in col x
ext = [re.search(fr'\d+(?=\.{a})', b) if isinstance(b, str) else None
       for a, b in zip(df.x.str.strip(),
                       df.y.str.strip())]
extract = [ent.group() if ent is not None else np.nan for ent in ext]

df['match'], df['extract'] = match, extract
x y match extract
1 foo 1.foo-ya 1.foo-ya 1
2 bar 2.bar-ga 2.bar-ga 2
3 baz 3.ha-baz NaN NaN
4 qux None NaN NaN
Upvotes: 1