RajB_007

Reputation: 55

How to compare list items to a pandas DataFrame value

I want to match each value in the loaded_list DataFrame against the items in domain_list. If an email in loaded_list contains a domain from domain_list, it should be added to match_list.

I have tried several methods, such as contains(domain_list), loaded_list == domain_list (with [row] and the DataFrame column header name), and the isin method from pandas. None of them worked.

import pandas as pd

loaded_list = []
match_list = []
domain_list = ['@hotmail.co.uk', '@gmail.com']

# Convert the domain list to a DataFrame
domain_list = pd.DataFrame(domain_list, columns=['Email Address'])
with open(self.breach_file, 'r', encoding='utf-8-sig') as breach_file:
    found_reader = pd.read_csv(breach_file, sep=':', names=['Email Address'], engine='c')
    loaded_list = found_reader
    print("List Parsed... Enumerating Content Types")
    # No explicit close() is needed: the with block closes the file on exit


match_list = ???
print(f"Match:\n {match_list}")

The expected outcome is that match_list contains the emails from loaded_list that contain a domain from domain_list.

The methods I tried (isin, contains()) raised many errors. I don't want to use for loops because I'll be processing large amounts of data.

List Examples

loaded_list:
    [email protected]
    [email protected]
    [email protected]
    [email protected]
    [email protected]

domain_list:
    @gmail.com
    @hotmail.co.uk

Upvotes: 1

Views: 75

Answers (1)

Lawis

Reputation: 125

Did you try generating a regex from your domain_list by concatenating the values, separated by "|", and then filtering loaded_list with the generated pattern?

Example:

In[1]: loaded_list=pd.Series([
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]",
    "[email protected]"
])


In[2]: domain_list=pd.Series([
    "@gmail.com",
    "@hotmail.co.uk"
])
In[3]: import re
In[4]: match_list = loaded_list[loaded_list.str.contains(domain_list.apply(re.escape).str.cat(sep="|"))]
In[5]: match_list
Out[5]:
0        [email protected]
2    [email protected]
dtype: object

I escaped all special characters in domain_list with re.escape (to avoid problems with regex metacharacters such as "."), then used str.cat to join the escaped domains into a single pattern with multiple alternatives.
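
If your data lives in a DataFrame column, as in the question, the same idea might look like the sketch below. The sample addresses are made-up placeholders, not the ones from the question. The trailing "$" anchor and na=False are my own additions: the anchor restricts matches to addresses that end with one of the domains, and na=False treats missing values as non-matches.

```python
import re
import pandas as pd

# Hypothetical sample data; these addresses are placeholders for illustration.
loaded_list = pd.DataFrame({"Email Address": [
    "alice@gmail.com",
    "bob@yahoo.com",
    "carol@hotmail.co.uk",
    "dave@gmail.com.example.org",  # contains "@gmail.com" but does not end with it
    None,                          # missing value, e.g. from an empty CSV line
]})
domain_list = ["@hotmail.co.uk", "@gmail.com"]

# Escape each domain so "." is matched literally, join the alternatives with "|",
# and anchor the group at the end of the string (assumption: the domain must be
# the suffix of the address, not just appear somewhere inside it).
pattern = "(?:" + "|".join(re.escape(d) for d in domain_list) + ")$"

# Vectorised match over the column; na=False makes missing values count as no match.
mask = loaded_list["Email Address"].str.contains(pattern, regex=True, na=False)
match_list = loaded_list[mask]

print(f"Match:\n{match_list}")
```

Drop the "$" if a plain "contains" check is what you actually want. Either way, no Python-level for loop runs over the rows of the DataFrame.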

Upvotes: 1
