Reputation: 4278
I have 2 dataframes: one (A
) with some whitelist hostnames in regex form (ie (.*)microsoft.com
, (*.)go.microsoft.com
...) and another (B
) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe with the regex text of the Whitelist (1st) dataframe. However, it appears that Pandas' .replace()
method doesn't care about what order items are in for its to_replace
and value
args.
My data looks like this:
In [1] A
Out[1]:
wildcards \
42 (.*)activation.playready.microsoft.com
35 (.*)v10.vortex-win.data.microsoft.com
40 (.*)settings-win.data.microsoft.com
43 (.*)smartscreen.microsoft.com
39 (.*).playready.microsoft.com
38 (.*)go.microsoft.com
240 (.*)i.microsoft.com
238 (.*)microsoft.com
regex
42 re.compile('^(.*)activation.playready.microsof...
35 re.compile('^(.*)v10.vortex-win.data.microsoft...
40 re.compile('^(.*)settings-win.data.microsoft.c...
43 re.compile('^(.*)smartscreen.microsoft.com$')
39 re.compile('^(.*).playready.microsoft.com$')
38 re.compile('^(.*)go.microsoft.com$')
240 re.compile('^(.*)i.microsoft.com$')
238 re.compile('^(.*)microsoft.com$')
In [2] B.head()
Out[2]:
server_hostname
146 mobile.pipe.aria.microsoft.com
205 settings-win.data.microsoft.com
341 nav.smartscreen.microsoft.com
406 v10.vortex-win.data.microsoft.com
667 www.microsoft.com
Notice that A
has a column of compiled regexes in similar form to the wildcards
column. I want to add a wildcard
column to B
like this:
B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())
But the problem is, all of B
's wildcard values become (.*)microsoft.com
. This happens no matter the order of A
's wildcard values. It appears .replace()
aims to use the to_replace
regex's by shortest value first rather than the order provided.
How can I provide a list of to_replace
values so that I ultimately get the most details hostname wildcards
value associated with B
's server_hostname
values?
Upvotes: 9
Views: 627
Reputation: 4278
Most answers use apply()
which is known to be slower than built-in vector function solutions. My hope in using .replace()
was that it would be fast since it is such a built in vector function. @vlemaistre's answer was the only one to not use .apply()
as is my solution here, which instead of compiling each wildcard into a regex, it treats it as a right-hand substring to use logic: "If server_hostname
ends with wildcard
, then it's a match". So long as I sort my wildcards by length, then it works just fine.
My function which does this is:
def match_to_whitelist(accepts_df, whitelist_df):
""" Adds `whitelists` column to accepts_df showing which (if any) whitelist entry it matches with """
accepts_df.loc[:, 'wildcards'] = None
for wildcard in whitelist_df['wildcards']:
accepts_df.loc[(accepts_df['wildcards'].isnull()) & (
accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
rows_matched = len(accepts_df['wildcards'].notnull())
matched {rows_matched}")
return accepts_df
Here, accepts_df
is like B
from before, and whitelist_df
is like A
before, but with 2 differences:
regex
columnwildcards
values are no longer in glob/regex format (ie "(.*)microsoft.com" becomes "microsoft.com"To benchmark answers on my machine, I'll use mine as a baseline, taking 27secs to process 100k accepts_df
rows with 400 whitelist_df
rows. Using the same dataset, here are the times for other solutions (i was lazy: if they didn't run out the gate, I didn't debug much to find out):
A
is sorted): over 6mins, so quit...lastgroup
: didn't test because I can't (quickly) build a long RE programatically. Series.str.match()
: Error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"Ultimately, none of our answers say how to use .replace()
as desired, so for the time being, I'll leave this unanswered for a few weeks in case someone can provide an answer to better use .replace()
, or at least some other fast vector-based solution. Until then, I'll keep with what I have, or maybe use aws_apprentice's after I verify results.
EDIT
I improved my matcher by adding a "domain" column to both DFs, which consists of the last 2 parts of each wildcard/server_hostname (ie www.microsoft.com becomes "microsoft.com"). I then used groupby('domain')
on both DFs, iterated through the whitelist groups of domains, fetched the same domain-group from the server_hostname DF (B) and did my matching just using the subset of wildcards/server_hostnames from each group. This cut my processing time to match in half.
Upvotes: 1
Reputation: 1096
The pandas documentation describes the .replace()
method as:
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
This implies that the method will iterate over all cells in the dataframe and replace what it can for each query provided in the to_replace
argument. A quick example to demonstrate this:
df = pd.DataFrame({'A':['a','c'],'B':['b','d']})
df.replace(['a','b'],['b','c'])
Output:
A B
0 c c
1 c d
In your example, each regex rule overwrites previous replacements when there is a new match, which is how you end up with a vector of (.*)microsoft.com
results.
You could use the .apply()
method instead. For instance, by sorting your whitelist (A) descending by length, iterate over each row of your value DataFrame (B) and return each first match:
import pandas as pd
import re
# Using the definitions for A and B from your question,
# where A is sorted descending by length.
def first_match(x):
for index, row in A.iterrows():
if bool(re.search(row['wildcards'], x['server_hostname'])) is True:
return row['wildcards']
B['wildcards'] = B.apply(first_match, axis=1)
B
Output:
server_hostname wildcards
0 mobile.pipe.aria.microsoft.com (.*)microsoft.com
1 settings-win.data.microsoft.com (.*)settings-win.data.microsoft.com
2 nav.smartscreen.microsoft.com (.*)smartscreen.microsoft.com
3 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
4 www.microsoft.com (.*)microsoft.com
It might also be worth reading up on the split-apply-combine pattern for more advanced strategies. I hope that helps.
Upvotes: 0
Reputation: 83
The purest pandas approach that I could find involves running Series.str.match()
on B.server_hostname for each regex, then taking the first match from each column with idxmax()
.
# Create input data
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
# Ensure B has a unique index
B = B.reset_index(drop=True)
# Check which regexes match each hostname
df_match = A.regex.apply(lambda x: B.server_hostname.str.match(x))
df_match.index= A.wildcards
df_match.columns=B.server_hostname
# Get first match for each hostname
df_first_match = df_match.idxmax().rename('wildcards').reset_index()
Output:
print(df_match)
print(df_first_match)
server_hostname v10.vortex-win.data.microsoft.com www.microsoft.com
wildcards
(.*)activation.playready.microsoft.com False False
(.*)v10.vortex-win.data.microsoft.com True False
(.*)i.microsoft.com False False
(.*)microsoft.com True True
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
1 www.microsoft.com (.*)microsoft.com
That said, this seems to be a bit slower than other solutions posted earlier.
Upvotes: 0
Reputation: 375685
An alternative tack, which unfortunately still needs an apply, is to use lastgroup
. This entails compiling a single regex and then looking up the name of the matched group (row):
In [11]: regex = re.compile("|".join([f"(?P<i{i}>{regex})" for i, regex in s["wildcards"].items()]))
In [12]: regex
Out[12]:
re.compile(r'(?P<i42>(.*)activation.playready.microsoft.com)|(?P<i35>(.*)v10.vortex-win.data.microsoft.com)|(?P<i40>(.*)settings-win.data.microsoft.com)|(?P<i43>(.*)smartscreen.microsoft.com)|(?P<i39>(.*).playready.microsoft.com)|(?P<i38>(.*)go.microsoft.com)|(?P<i240>(.*)i.microsoft.com)|(?P<i238>(.*)microsoft.com)',
re.UNICODE)
In [13]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:]))
Out[13]:
146 238
205 40
341 43
406 35
667 238
Name: server_hostname, dtype: int64
In [14]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:])).map(s.wildcards)
Out[14]:
146 (.*)microsoft.com
205 (.*)settings-win.data.microsoft.com
341 (.*)smartscreen.microsoft.com
406 (.*)v10.vortex-win.data.microsoft.com
667 (.*)microsoft.com
Name: server_hostname, dtype: object
This attribute isn't exposed by pandas (but it might be possible to do something clever with the internals)...
Upvotes: 0
Reputation: 14226
Here is another approach using apply
. There is no pure pandas
way to do this as far as I know. I also borrowed the data that @vlemaistre provided.
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
pats = set(A.regex)
def max_match(hostname):
d = {}
for pat in pats:
maybe_result = pat.search(hostname)
if maybe_result:
p = pat.pattern
d[len(p)] = p
return d.get(max([*d]))
B['wildcards'] = B['server_hostname'].apply(max_match)
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com ^(.*)v10.vortex-win.data.microsoft.com$
1 www.microsoft.com ^(.*)microsoft.com$
Upvotes: 0
Reputation: 2308
One alternate way would be to use SequenceMatcher and re.match.
Taken data from the answer given by @vlemaistre
from difflib import SequenceMatcher
import pandas as pd
import re
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com', 'www.i.microsoft.com']})
def regex_match(x):
match = None
ratio = 0
for w, r in A[['wildcards', 'regex']].to_numpy():
if re.match(r, x) is not None:
pct = SequenceMatcher(None, w, x).ratio()
if ratio < pct: ratio = pct; match = w
return match
B['wildcards'] = B.server_hostname.apply(regex_match)
# print(B.wildcards)
0 (.*)v10.vortex-win.data.microsoft.com
1 (.*)microsoft.com
2 (.*)i.microsoft.com
Name: server_hostname, dtype: object
Upvotes: 0
Reputation: 3331
Here is a way to do this using a double list comprehension and the re.sub()
function :
import re
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
'(.*)v10.vortex-win.data.microsoft.com',
'(.*)i.microsoft.com', '(.*)microsoft.com'],
'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
re.compile('^(.*)v10.vortex-win.data.microsoft.com$'),
re.compile('^(.*)i.microsoft.com$'),
re.compile('^(.*)microsoft.com$')]})
B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
'www.microsoft.com']})
# For each server_hostname we try each regex and keep the longest matching one
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
in A[['regex', 'wildcards']].values
if re.sub(to_replace, value, x)!=x], key=len)
for x in B['server_hostname']]
Output :
server_hostname wildcards
0 v10.vortex-win.data.microsoft.com (.*)v10.vortex-win.data.microsoft.com
1 www.microsoft.com (.*)microsoft.com
Upvotes: 0