Reputation: 673
I have two dataframes:
df_A:
SrNo Statement
1 My name is Liam and my son is Olivia and Noah
2 My name is Emma and my son is Oliver and Eva
3 My name is William and my son is Sophia, Elijha
4 My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia
df_B:
SrNo Statement Value1 Parent
1 Son of Olivia is Lucas and Mason Olivia
2 Son of Noah is Ethan and Evelyn Noah
3 Son of Oliver is Harper and Mia Oliver
4 Son of Eva is Aria and Mila Eva
5 Son of Sophia is Sofia and Avery Sophia
6 Son of James is Logan and Jackson James
7 Son of Charlotte is Sebastian and Jack Charlotte
I need to populate the SrNo from df_A to df_B, column Parent but i do not have any common columns in those two dataframes. Also there is no same format in Statement column of df_A and df_B but the individual values will always be there.
I was trying to find the value from Value1 of df_B in Statement column of df_A so that i can relate the two dataframes, but have not been successful. Is there something i can do to achieve this?
Due to operational limitations i cannot modify df_A.
Expected output:
df_C:
SrNo Statement Value1 Parent
1 Son of Olivia is Lucas and Mason Olivia 1
2 Son of Noah is Ethan and Evelyn Noah 1
3 Son of Oliver is Harper and Mia Oliver 2
4 Son of Eva is Aria and Mila Eva 2
5 Son of Sophia is Sofia and Avery Sophia 3
6 Son of James is Logan and Jackson James 4
7 Son of Charlotte is Sebastian and Jack Charlotte 4
Upvotes: 0
Views: 91
Reputation: 11067
Create a dictionary from df_A that maps different names to their associated df_A SrNo IDs.
The text in df_A seems to follow a pattern - all the interesting names come after the text "and my son is". Additionally, the names are separated by either commas or the word and, so after setting up your df's:
import pandas as pd
df_A = pd.DataFrame( {
(1, "My name is Liam and my son is Olivia and Noah"),
(2 , "My name is Emma and my son is Oliver and Eva"),
(3 , "My name is William and my son is Sophia, Elijha"),
(4 , "My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia")}, columns=["SrNo", "Statement"])
Ensure that the dataframe's index is set to "SrNo" that we'll use as an ID.
df_A.set_index("SrNo", inplace=True)
and then create the function that will extract the values we'll use to inform your relation:
import re
def extract_names(text, ignore_before):
start=text.find(ignore_before,0)
extract_content = text[start+len(ignore_before):].strip()
return [n.strip() for n in re.split(r"(?:,|and)", extract_content)]
This creates a function that will extract the names as a list from your text patterns - it uses re.split
to extract those names into a list, using commas and the word "and" as separators[1]. If you want to change that for alternate matches, read up on regex, or try out some examples live at this useful website www.regex101.com - tailoring the regex to select any delimiters you want to identify and split your names by. The function will ignore any text that comes before the ignore_before
parameter.
So for example, if I call that function as follows:
extract_names("My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia", "and my son is")
It returns a list of extracted names:
['James', 'Charlotte', 'Benjamin', 'Amelia']
Using that function, you can then create a lookup dictionary that translates the names extracted into their index lookup values (the value of SrNo from df_A)
lookup_dict = {n:k for k,v in df_A.to_dict()['Statement'].items() for n in extract_names(v, "and my son is")}
The contents of this would be :
{'James': 4,
'Charlotte': 4,
'Benjamin': 4,
'Amelia': 4,
'Sophia': 3,
'Elijha': 3,
'Oliver': 2,
'Eva': 2,
'Olivia': 1,
'Noah': 1}
I've written it here as the product of a list (dict) comprehension that loops over df_A, does the extract and maps it onto the dictionary key.
Finally, to apply this lookup dictionary to the content of df_B, you can make use of pandas' apply
method as so:
df_B['Parent']=df_B['Value1'].apply(lambda x : lookup_dict.get(x))
This will update/overwrite the "Parent" column with the appropriate value extracted from the lookup_dict
using whatever name is found in "Value1".
Value1 Parent
James 4
Olivia 1
Elijha 3
Amelia 4
Oliver 2
Charlotte 4
Names that aren't matched will return a value of None, so keep an eye out for those.
You might encounter problems if any of your names contain the word "and"[1], or include a comma - additionally, this makes the assumption that all the names are unique.
[1] Fix this by tweaking the regex used to do the splitting to r"(?:,|\band\b)"
which forces that "and" to be a word on its own, allowing you to match names like "Ferdinand" for example.
Upvotes: 1