hamx0r
hamx0r

Reputation: 4278

How to use pandas .replace() with list of regexs while honoring list order?

I have 2 dataframes: one (A) with some whitelist hostnames in regex form (ie (.*)microsoft.com, (*.)go.microsoft.com...) and another (B) with actual full hostnames of sites. I want to add a new column to this 2nd dataframe with the regex text of the Whitelist (1st) dataframe. However, it appears that Pandas' .replace() method doesn't care about what order items are in for its to_replace and value args.

My data looks like this:

In [1] A
Out[1]: 
                                  wildcards  \
42   (.*)activation.playready.microsoft.com   
35    (.*)v10.vortex-win.data.microsoft.com   
40      (.*)settings-win.data.microsoft.com   
43            (.*)smartscreen.microsoft.com   
39             (.*).playready.microsoft.com   
38                     (.*)go.microsoft.com   
240                     (.*)i.microsoft.com   
238                       (.*)microsoft.com   
                                                 regex  
42   re.compile('^(.*)activation.playready.microsof...  
35   re.compile('^(.*)v10.vortex-win.data.microsoft...  
40   re.compile('^(.*)settings-win.data.microsoft.c...  
43       re.compile('^(.*)smartscreen.microsoft.com$')  
39        re.compile('^(.*).playready.microsoft.com$')  
38                re.compile('^(.*)go.microsoft.com$')  
240                re.compile('^(.*)i.microsoft.com$')  
238                  re.compile('^(.*)microsoft.com$')  


In [2] B.head()
Out[2]: 
                       server_hostname
146     mobile.pipe.aria.microsoft.com
205    settings-win.data.microsoft.com
341      nav.smartscreen.microsoft.com
406  v10.vortex-win.data.microsoft.com
667                  www.microsoft.com

Notice that A has a column of compiled regexes in similar form to the wildcards column. I want to add a wildcard column to B like this:

B.loc[:,'wildcards'] = B['server_hostname'].replace(A['regex'].tolist(), A['wildcards'].tolist())

But the problem is, all of B's wildcard values become (.*)microsoft.com. This happens no matter the order of A's wildcard values. It appears .replace() aims to use the to_replace regex's by shortest value first rather than the order provided.

How can I provide a list of to_replace values so that I ultimately get the most details hostname wildcards value associated with B's server_hostname values?

Upvotes: 9

Views: 627

Answers (7)

hamx0r
hamx0r

Reputation: 4278

Most answers use apply() which is known to be slower than built-in vector function solutions. My hope in using .replace() was that it would be fast since it is such a built in vector function. @vlemaistre's answer was the only one to not use .apply() as is my solution here, which instead of compiling each wildcard into a regex, it treats it as a right-hand substring to use logic: "If server_hostname ends with wildcard, then it's a match". So long as I sort my wildcards by length, then it works just fine.

My function which does this is:

def match_to_whitelist(accepts_df, whitelist_df):
    """ Adds `whitelists` column to accepts_df showing which (if any) whitelist entry it matches with """
    accepts_df.loc[:, 'wildcards'] = None
    for wildcard in whitelist_df['wildcards']:
        accepts_df.loc[(accepts_df['wildcards'].isnull()) & (
            accepts_df['server_hostname'].str.endswith(wildcard)), 'wildcards'] = wildcard
    rows_matched = len(accepts_df['wildcards'].notnull())
matched {rows_matched}")
    return accepts_df

Here, accepts_df is like B from before, and whitelist_df is like A before, but with 2 differences:

  1. no regex column
  2. the wildcards values are no longer in glob/regex format (ie "(.*)microsoft.com" becomes "microsoft.com"

To benchmark answers on my machine, I'll use mine as a baseline, taking 27secs to process 100k accepts_df rows with 400 whitelist_df rows. Using the same dataset, here are the times for other solutions (i was lazy: if they didn't run out the gate, I didn't debug much to find out):

  • @vlemaistre - List Comprehension with vector functions: 193sec
  • @user214 - SequenceMatcher: 234sec
  • @aws_apprentice - Compare RE search result lengths: 24sec
  • @fpersyn - First match (will be best match if A is sorted): over 6mins, so quit...
  • @Andy Hayden - lastgroup: didn't test because I can't (quickly) build a long RE programatically.
  • @capelastegui - Series.str.match(): Error: "pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects"

Ultimately, none of our answers say how to use .replace() as desired, so for the time being, I'll leave this unanswered for a few weeks in case someone can provide an answer to better use .replace(), or at least some other fast vector-based solution. Until then, I'll keep with what I have, or maybe use aws_apprentice's after I verify results.

EDIT I improved my matcher by adding a "domain" column to both DFs, which consists of the last 2 parts of each wildcard/server_hostname (ie www.microsoft.com becomes "microsoft.com"). I then used groupby('domain') on both DFs, iterated through the whitelist groups of domains, fetched the same domain-group from the server_hostname DF (B) and did my matching just using the subset of wildcards/server_hostnames from each group. This cut my processing time to match in half.

Upvotes: 1

fpersyn
fpersyn

Reputation: 1096

The pandas documentation describes the .replace() method as:

Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

This implies that the method will iterate over all cells in the dataframe and replace what it can for each query provided in the to_replace argument. A quick example to demonstrate this:

df = pd.DataFrame({'A':['a','c'],'B':['b','d']})
df.replace(['a','b'],['b','c'])

Output:
    A   B
0   c   c
1   c   d

In your example, each regex rule overwrites previous replacements when there is a new match, which is how you end up with a vector of (.*)microsoft.com results.

You could use the .apply() method instead. For instance, by sorting your whitelist (A) descending by length, iterate over each row of your value DataFrame (B) and return each first match:

import pandas as pd
import re

# Using the definitions for A and B from your question, 
# where A is sorted descending by length.

def first_match(x):
    for index, row in A.iterrows():
        if bool(re.search(row['wildcards'], x['server_hostname'])) is True:
            return row['wildcards']
B['wildcards'] = B.apply(first_match, axis=1)
B

Output:
    server_hostname                     wildcards
0   mobile.pipe.aria.microsoft.com      (.*)microsoft.com
1   settings-win.data.microsoft.com     (.*)settings-win.data.microsoft.com
2   nav.smartscreen.microsoft.com       (.*)smartscreen.microsoft.com
3   v10.vortex-win.data.microsoft.com   (.*)v10.vortex-win.data.microsoft.com
4   www.microsoft.com                   (.*)microsoft.com

It might also be worth reading up on the split-apply-combine pattern for more advanced strategies. I hope that helps.

Upvotes: 0

capelastegui
capelastegui

Reputation: 83

The purest pandas approach that I could find involves running Series.str.match() on B.server_hostname for each regex, then taking the first match from each column with idxmax().

# Create input data
A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com']})

# Ensure B has a unique index
B = B.reset_index(drop=True)

# Check which regexes match each hostname
df_match = A.regex.apply(lambda x: B.server_hostname.str.match(x))
df_match.index= A.wildcards
df_match.columns=B.server_hostname

# Get first match for each hostname
df_first_match = df_match.idxmax().rename('wildcards').reset_index()

Output:

print(df_match)
print(df_first_match)

server_hostname                         v10.vortex-win.data.microsoft.com  www.microsoft.com
wildcards                                                                                   
(.*)activation.playready.microsoft.com                              False              False
(.*)v10.vortex-win.data.microsoft.com                                True              False
(.*)i.microsoft.com                                                 False              False
(.*)microsoft.com                                                    True               True

                     server_hostname                              wildcards
0  v10.vortex-win.data.microsoft.com  (.*)v10.vortex-win.data.microsoft.com
1                  www.microsoft.com                      (.*)microsoft.com

That said, this seems to be a bit slower than other solutions posted earlier.

Upvotes: 0

Andy Hayden
Andy Hayden

Reputation: 375685

An alternative tack, which unfortunately still needs an apply, is to use lastgroup. This entails compiling a single regex and then looking up the name of the matched group (row):

In [11]: regex = re.compile("|".join([f"(?P<i{i}>{regex})" for i, regex in s["wildcards"].items()]))

In [12]: regex
Out[12]:
re.compile(r'(?P<i42>(.*)activation.playready.microsoft.com)|(?P<i35>(.*)v10.vortex-win.data.microsoft.com)|(?P<i40>(.*)settings-win.data.microsoft.com)|(?P<i43>(.*)smartscreen.microsoft.com)|(?P<i39>(.*).playready.microsoft.com)|(?P<i38>(.*)go.microsoft.com)|(?P<i240>(.*)i.microsoft.com)|(?P<i238>(.*)microsoft.com)',
re.UNICODE)

In [13]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:]))
Out[13]:
146    238
205     40
341     43
406     35
667    238
Name: server_hostname, dtype: int64

In [14]: B.server_hostname.apply(lambda s: int(re.match(regex, s).lastgroup[1:])).map(s.wildcards)
Out[14]:
146                        (.*)microsoft.com
205      (.*)settings-win.data.microsoft.com
341            (.*)smartscreen.microsoft.com
406    (.*)v10.vortex-win.data.microsoft.com
667                        (.*)microsoft.com
Name: server_hostname, dtype: object

This attribute isn't exposed by pandas (but it might be possible to do something clever with the internals)...

Upvotes: 0

gold_cy
gold_cy

Reputation: 14226

Here is another approach using apply. There is no pure pandas way to do this as far as I know. I also borrowed the data that @vlemaistre provided.

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com']})

pats = set(A.regex)

def max_match(hostname):
    d = {}
    for pat in pats:
        maybe_result = pat.search(hostname)
        if maybe_result:
            p = pat.pattern
            d[len(p)] = p
    return d.get(max([*d]))

B['wildcards'] = B['server_hostname'].apply(max_match)

                     server_hostname                                wildcards
0  v10.vortex-win.data.microsoft.com  ^(.*)v10.vortex-win.data.microsoft.com$
1                  www.microsoft.com                      ^(.*)microsoft.com$

Upvotes: 0

Anjith
Anjith

Reputation: 2308

One alternate way would be to use SequenceMatcher and re.match.

Taken data from the answer given by @vlemaistre

from difflib import SequenceMatcher
import pandas as pd
import re

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com', 'www.i.microsoft.com']})

def regex_match(x):
    match = None
    ratio = 0
    for w, r in A[['wildcards', 'regex']].to_numpy():
        if re.match(r, x) is not None:
            pct = SequenceMatcher(None, w, x).ratio()
            if ratio < pct: ratio = pct; match = w
    return match

B['wildcards'] = B.server_hostname.apply(regex_match)

# print(B.wildcards)
0    (.*)v10.vortex-win.data.microsoft.com
1                        (.*)microsoft.com
2                      (.*)i.microsoft.com
Name: server_hostname, dtype: object

Upvotes: 0

vlemaistre
vlemaistre

Reputation: 3331

Here is a way to do this using a double list comprehension and the re.sub() function :

import re

A = pd.DataFrame({'wildcards' : ['(.*)activation.playready.microsoft.com',
                                 '(.*)v10.vortex-win.data.microsoft.com',
                                 '(.*)i.microsoft.com', '(.*)microsoft.com'],
                  'regex' : [re.compile('^(.*)activation.playready.microsoft.com$'),
                             re.compile('^(.*)v10.vortex-win.data.microsoft.com$'), 
                             re.compile('^(.*)i.microsoft.com$'), 
                             re.compile('^(.*)microsoft.com$')]})

B = pd.DataFrame({'server_hostname' : ['v10.vortex-win.data.microsoft.com',
                                       'www.microsoft.com']})
# For each server_hostname we try each regex and keep the longest matching one
B['wildcards'] = [max([re.sub(to_replace, value, x) for to_replace, value
                       in A[['regex', 'wildcards']].values
                       if re.sub(to_replace, value, x)!=x], key=len) 
                  for x in B['server_hostname']]

Output : 
                     server_hostname                              wildcards
0  v10.vortex-win.data.microsoft.com  (.*)v10.vortex-win.data.microsoft.com
1                  www.microsoft.com                      (.*)microsoft.com

Upvotes: 0

Related Questions