How to match a regex pattern between two columns in pandas dataframe efficiently?

Question

My question is about using a regex pattern efficiently to find matches between two pandas df extracted from excel files

First, thanks for taking the time to look at my issue. I am very new to python, and even more so when using it to deal with large datasets. I know that I should use stack as a very last resort.

Now I'm hitting my head against a wall for something that's probably simple.

The problem : I have two excel files :

One is only 1 columns * 300 rows =original urls

The other one can be very large from 20k and more translated urls

The purpose : I have original urls and I have to find their translated counterpart that are in the big excel file.

The only common point between the two urls is an 8 digits number somewhere in the url

I have thought about merging but it won't work since it is a partial match (to the best of my knowledge.) I have found a potentially interesting solution : .where which could allows me to do exactly what I want. However I get this error:

master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url) TypeError: expected string or buffer

As I understand this error, the issue could come from regex that doesn't seem to take anything else but a string.

In fact when I compare two strings by extracting the match objects in function like this it works.

def skl_finder(master_list,crawl_fr):
    skl_search=re.compile("\d{8}")
    if skl_search.search(master_list).group(0) == 
    skl_search.search(crawl_fr).group(0):
        return skl_search.search(master_list).group(0)

I think this question is very close to what I want to do but it did not have any reply: Pandas: Comparing two dataframes with identical data structure but differences in data using pattern matching

import regex as re
import pandas as pd
crawl_loc="translated_file_set.xlsx"

master_list_loc="original_fileset.xlsx"

crawlfr=pd.read_excel(crawl_loc,parse_cols="E")

master_list=pd.read_excel(master_list_loc)

number_search=re.compile("\d{8}")

master_list["translated"] = 
crawlfr.url.where(number_search.search(master_list).group(0) == 
number_search.search(crawl_fr).group(0), master_list.url)


master_list.to_excel("result_file.xlsx")

Kyle · Accepted Answer

Make sure your url cols are dtype str.

Try creating new cols with the 8 digit id for each:

crawlfr['url_id'] = crawlfr.url.str.extract("\d{8}")
master_list['url_id'] = master_list.url.str.extract("\d{8}")

Then join on url_id:

crawlfr.join(master_list, on='url_id')

How to match a regex pattern between two columns in pandas dataframe efficiently?

Answers (1)

Related Questions