RSM
RSM

Reputation: 673

link two dataframes without a common column to form a relation

I have two dataframes:

df_A:

SrNo    Statement       
1       My name is Liam and my son is Olivia    and Noah
2       My name is Emma and my son is Oliver and Eva
3       My name is William and my son is Sophia, Elijha
4       My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia

df_B:

SrNo    Statement                               Value1  Parent
1   Son of Olivia is Lucas and Mason            Olivia      
2   Son of Noah is Ethan and Evelyn             Noah 
3   Son of Oliver is Harper and Mia             Oliver
4   Son of Eva is Aria and Mila                 Eva
5   Son of Sophia is Sofia and Avery            Sophia 
6   Son of James is Logan and Jackson           James
7   Son of Charlotte is Sebastian and Jack     Charlotte 

I need to populate the SrNo from df_A to df_B, column Parent but i do not have any common columns in those two dataframes. Also there is no same format in Statement column of df_A and df_B but the individual values will always be there.

I was trying to find the value from Value1 of df_B in Statement column of df_A so that i can relate the two dataframes, but have not been successful. Is there something i can do to achieve this?

Due to operational limitations i cannot modify df_A.

Expected output:

df_C:

SrNo    Statement                               Value1  Parent
1   Son of Olivia is Lucas and Mason            Olivia      1   
2   Son of Noah is Ethan and Evelyn             Noah        1
3   Son of Oliver is Harper and Mia             Oliver      2
4   Son of Eva is Aria and Mila                 Eva         2
5   Son of Sophia is Sofia and Avery            Sophia      3
6   Son of James is Logan and Jackson           James       4
7   Son of Charlotte is Sebastian and Jack     Charlotte    4

Upvotes: 0

Views: 91

Answers (1)

Thomas Kimber
Thomas Kimber

Reputation: 11067

Create a dictionary from df_A that maps different names to their associated df_A SrNo IDs.

The text in df_A seems to follow a pattern - all the interesting names come after the text "and my son is". Additionally, the names are separated by either commas or the word and, so after setting up your df's:

import pandas as pd
df_A = pd.DataFrame( {
(1,       "My name is Liam and my son is Olivia    and Noah"),
(2 ,      "My name is Emma and my son is Oliver and Eva"),
(3 ,      "My name is William and my son is Sophia, Elijha"),
(4 ,      "My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia")}, columns=["SrNo", "Statement"])

Ensure that the dataframe's index is set to "SrNo" that we'll use as an ID.

df_A.set_index("SrNo", inplace=True)

and then create the function that will extract the values we'll use to inform your relation:

import re
def extract_names(text, ignore_before):
    start=text.find(ignore_before,0)
    extract_content = text[start+len(ignore_before):].strip()
    return [n.strip() for n in re.split(r"(?:,|and)", extract_content)]

This creates a function that will extract the names as a list from your text patterns - it uses re.split to extract those names into a list, using commas and the word "and" as separators[1]. If you want to change that for alternate matches, read up on regex, or try out some examples live at this useful website www.regex101.com - tailoring the regex to select any delimiters you want to identify and split your names by. The function will ignore any text that comes before the ignore_before parameter.

So for example, if I call that function as follows:

extract_names("My name is Isabella and my son is James, Charlotte, Benjamin ,Amelia", "and my son is")

It returns a list of extracted names:

['James', 'Charlotte', 'Benjamin', 'Amelia']

Using that function, you can then create a lookup dictionary that translates the names extracted into their index lookup values (the value of SrNo from df_A)

lookup_dict = {n:k for k,v in df_A.to_dict()['Statement'].items() for n in extract_names(v, "and my son is")}

The contents of this would be :

 {'James': 4,
 'Charlotte': 4,
 'Benjamin': 4,
 'Amelia': 4,
 'Sophia': 3,
 'Elijha': 3,
 'Oliver': 2,
 'Eva': 2,
 'Olivia': 1,
 'Noah': 1}

I've written it here as the product of a list (dict) comprehension that loops over df_A, does the extract and maps it onto the dictionary key.

Finally, to apply this lookup dictionary to the content of df_B, you can make use of pandas' apply method as so:

df_B['Parent']=df_B['Value1'].apply(lambda x : lookup_dict.get(x))

This will update/overwrite the "Parent" column with the appropriate value extracted from the lookup_dict using whatever name is found in "Value1".

        Value1      Parent
        James       4
        Olivia      1
        Elijha      3
        Amelia      4
        Oliver      2
        Charlotte   4

Names that aren't matched will return a value of None, so keep an eye out for those.

You might encounter problems if any of your names contain the word "and"[1], or include a comma - additionally, this makes the assumption that all the names are unique.


[1] Fix this by tweaking the regex used to do the splitting to r"(?:,|\band\b)" which forces that "and" to be a word on its own, allowing you to match names like "Ferdinand" for example.

Upvotes: 1

Related Questions