Reputation: 87
My question is about using a regex pattern efficiently to find matches between two pandas df extracted from excel files
First, thanks for taking the time to look at my issue. I am very new to python, and even more so when using it to deal with large datasets. I know that I should use stack as a very last resort.
Now I'm hitting my head against a wall for something that's probably simple.
The problem : I have two excel files :
One is only 1 columns * 300 rows =original urls
The other one can be very large from 20k and more translated urls
The purpose : I have original urls and I have to find their translated counterpart that are in the big excel file.
The only common point between the two urls is an 8 digits number somewhere in the url
I have thought about merging but it won't work since it is a partial match (to the best of my knowledge.) I have found a potentially interesting solution : .where which could allows me to do exactly what I want. However I get this error:
master_list["translated"] = crawlfr.url.where(number_search.search(master_list).group(0) == number_search.search(crawl_fr).group(0), master_list.url) TypeError: expected string or buffer
As I understand this error, the issue could come from regex that doesn't seem to take anything else but a string.
In fact when I compare two strings by extracting the match objects in function like this it works.
def skl_finder(master_list,crawl_fr):
skl_search=re.compile("\d{8}")
if skl_search.search(master_list).group(0) ==
skl_search.search(crawl_fr).group(0):
return skl_search.search(master_list).group(0)
I think this question is very close to what I want to do but it did not have any reply: Pandas: Comparing two dataframes with identical data structure but differences in data using pattern matching
import regex as re
import pandas as pd
crawl_loc="translated_file_set.xlsx"
master_list_loc="original_fileset.xlsx"
crawlfr=pd.read_excel(crawl_loc,parse_cols="E")
master_list=pd.read_excel(master_list_loc)
number_search=re.compile("\d{8}")
master_list["translated"] =
crawlfr.url.where(number_search.search(master_list).group(0) ==
number_search.search(crawl_fr).group(0), master_list.url)
master_list.to_excel("result_file.xlsx")
Upvotes: 0
Views: 2837
Reputation: 2904
Make sure your url cols are dtype str
.
Try creating new cols with the 8 digit id for each:
crawlfr['url_id'] = crawlfr.url.str.extract("\d{8}")
master_list['url_id'] = master_list.url.str.extract("\d{8}")
Then join
on url_id
:
crawlfr.join(master_list, on='url_id')
Upvotes: 1