matching between two columns and taking value from another in pandas

Question

First of all I am sorry if this question is already answered clearly. I have seen there are very similar answers, but I couldn't use it. So my problem is to match between two sets of columns (UsedFName==FName and UsedLName==LName) and then fill the Usedid Column with the ids from 'id' column when it fully matches.

So here is a toy data set

>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain     NaN
1         Nadia      Alam  2002    Tanvir   Hossain     NaN
2           Pia     Naime  2003    Tanvir   Hossain     NaN
3        Koethe  Talukdar  2004    Koethe  Talukdar     NaN
4        Manual   Hausman  2005    Koethe  Talukdar     NaN
5   Constantine      Pape   NaN       Max     Weber     NaN
6       Andreas       Kai  2006       Max     Weber     NaN
7           Max     Weber  2007    Manual   Hausman     NaN
8         Weber       Mac  2008    Manual   Hausman     NaN
9         Plank      Ingo  2009    Manual   Hausman     NaN
10       Tanvir   Hossain  2001       Pia     Naime     NaN
11        Weber       Mac  2008       Pia     Naime     NaN
12       Manual   Hausman  2005    Tanvir   Hossain     NaN
13          Max     Weber  2007    Tanvir   Hossain     NaN
14        Nadia      Alam  2002    Manual   Hausman     NaN
15        Weber       Mac  2008    Manual   Hausman     NaN
16          Pia     Naime  2003    Koethe  Talukdar     NaN
17          Pia     Naime  2003    Koethe  Talukdar     NaN
18  Constantine      Pape   NaN    Koethe  Talukdar     NaN
19       Koethe  Talukdar  2004    Koethe  Talukdar     NaN
20       Koethe  Talukdar  2005    Manual   Hausman     NaN
21          NaN       NaN   NaN    Manual   Hausman     NaN
22          NaN       NaN   NaN    Manual   Hausman     NaN
23          NaN       NaN   NaN    Manual   Hausman     NaN
24          NaN       NaN   NaN    Manual   Hausman     NaN
25          NaN       NaN   NaN    Manual   Hausman     NaN
26          NaN       NaN   NaN    Manual   Hausman     NaN
27          NaN       NaN   NaN    Manual   Hausman     NaN

This is the output

>>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

Actually I was able to do it using nested for loops, here is the code:

for i in df['UsedFName'].index:
    for j in df['FName'].index:
        if df['UsedFName'][i]==df['FName'][j] & df['UsedLName'][i]==df['LName'][j]:
            df.ix[i,'Usedid'] = df.ix[j,'id']

But using nested for loops here is computationally very expensive. I have a huge data set. Is it possible to use it without nested loops? Is there any simple Pythonic ways or Pandas/Numpy ways that I can use here?

Many thanks in advance for the help...looking forward to learn Python.

Mike M&#252;ller · Accepted Answer

Solution

This works:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

Explanation

First we find ids for the FName and LName:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])

They look like this:

FName        LName   
Andreas      Kai         2006
Constantine  Pape         NaN
Koethe       Talukdar    2005
Manual       Hausman     2005
Max          Weber       2007
Nadia        Alam        2002
Pia          Naime       2003
Plank        Ingo        2009
Tanvir       Hossain     2001
Weber        Mac         2008
Name: id, dtype: float64

Here groupby() groups by two columns, the first and the last names. To "see" anything, you need to "do" something with it. Let's convert all ids per group into a list:

>>> df.groupby(['FName', 'LName']).id.apply(list)

FName        LName   
Andreas      Kai                         [2006.0]
Constantine  Pape                      [nan, nan]
Koethe       Talukdar    [2004.0, 2004.0, 2005.0]
Manual       Hausman             [2005.0, 2005.0]
Max          Weber               [2007.0, 2007.0]
Nadia        Alam                [2002.0, 2002.0]
Pia          Naime       [2003.0, 2003.0, 2003.0]
Plank        Ingo                        [2009.0]
Tanvir       Hossain             [2001.0, 2001.0]
Weber        Mac         [2008.0, 2008.0, 2008.0]
Name: id, dtype: object

Since we have NaN, the datatype is float.

We want only the last id per group. So, instead of list() we use a lambda function:

lambda x: list(x)[-1]

In the second step we use our ids:

df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

We apply a function to the dataframe going line by line (axis=1). Here x is a line. We use the values in the columns UsedFName and UsedLName to get the appropriate id and assign it to the result column with df.Usedid =.

Output

df looks like this:

          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

matching between two columns and taking value from another in pandas

Answers (2)

Solution

Explanation

Output

Related Questions