Shaikh Tanvir Hossain
Shaikh Tanvir Hossain

Reputation: 101

matching between two columns and taking value from another in pandas

First of all I am sorry if this question is already answered clearly. I have seen there are very similar answers, but I couldn't use it. So my problem is to match between two sets of columns (UsedFName==FName and UsedLName==LName) and then fill the Usedid Column with the ids from 'id' column when it fully matches.

So here is a toy data set

>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain     NaN
1         Nadia      Alam  2002    Tanvir   Hossain     NaN
2           Pia     Naime  2003    Tanvir   Hossain     NaN
3        Koethe  Talukdar  2004    Koethe  Talukdar     NaN
4        Manual   Hausman  2005    Koethe  Talukdar     NaN
5   Constantine      Pape   NaN       Max     Weber     NaN
6       Andreas       Kai  2006       Max     Weber     NaN
7           Max     Weber  2007    Manual   Hausman     NaN
8         Weber       Mac  2008    Manual   Hausman     NaN
9         Plank      Ingo  2009    Manual   Hausman     NaN
10       Tanvir   Hossain  2001       Pia     Naime     NaN
11        Weber       Mac  2008       Pia     Naime     NaN
12       Manual   Hausman  2005    Tanvir   Hossain     NaN
13          Max     Weber  2007    Tanvir   Hossain     NaN
14        Nadia      Alam  2002    Manual   Hausman     NaN
15        Weber       Mac  2008    Manual   Hausman     NaN
16          Pia     Naime  2003    Koethe  Talukdar     NaN
17          Pia     Naime  2003    Koethe  Talukdar     NaN
18  Constantine      Pape   NaN    Koethe  Talukdar     NaN
19       Koethe  Talukdar  2004    Koethe  Talukdar     NaN
20       Koethe  Talukdar  2005    Manual   Hausman     NaN
21          NaN       NaN   NaN    Manual   Hausman     NaN
22          NaN       NaN   NaN    Manual   Hausman     NaN
23          NaN       NaN   NaN    Manual   Hausman     NaN
24          NaN       NaN   NaN    Manual   Hausman     NaN
25          NaN       NaN   NaN    Manual   Hausman     NaN
26          NaN       NaN   NaN    Manual   Hausman     NaN
27          NaN       NaN   NaN    Manual   Hausman     NaN

This is the output

>>> df
          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

Actually I was able to do it using nested for loops, here is the code:

for i in df['UsedFName'].index:
    for j in df['FName'].index:
        if df['UsedFName'][i]==df['FName'][j] & df['UsedLName'][i]==df['LName'][j]:
            df.ix[i,'Usedid'] = df.ix[j,'id']

But using nested for loops here is computationally very expensive. I have a huge data set. Is it possible to use it without nested loops? Is there any simple Pythonic ways or Pandas/Numpy ways that I can use here?

Many thanks in advance for the help...looking forward to learn Python.

Upvotes: 3

Views: 4907

Answers (2)

Padraic Cunningham
Padraic Cunningham

Reputation: 180391

You will have to think of a more pandaesque may to add the hashing logic, but this matches your expected output and is a lot more efficient, all you want is to use the ids matching UsedFName and "UsedLName" to FName and LNames:

import pandas as pd

# Create dict where each  key is tuple -> (FName,Lname)
# with the corresponding id as the value
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))

# Do a lookup in d using a tuple -> (UsedFName, UsedLName) to get the correct id for each pairing
df["Usedid"] = [d[(f, l)] for f,l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])

Output:

0     2001
1     2001
2     2001
3     2005
4     2005
5     2007
6     2007
7     2005
8     2005
9     2005
10    2003
11    2003
12    2001
13    2001
14    2005
15    2005
16    2005
17    2005
18    2005
19    2005
20    2005
21    2005
22    2005
23    2005
24    2005
25    2005
26    2005
27    2005
Name: Useid, dtype: float64

If some names may not be used you can use a default value using dict.get.

This is faster than the suggested groupby:

In [12]: %%timeit                                 
df = pd.read_csv("in.csv")
ids = df.groupby(['FName', 'LName']).id.apply(list)
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName][-1]), axis=1)
   ....: 
100 loops, best of 3: 5.27 ms per loop

In [13]: %%timeit                                 
df = pd.read_csv("in.csv")
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
df["Usedid"] = [d[(f, l)] for f, l in zip(df["UsedFName"], df["UsedLName"])]
   ....: 

1000 loops, best of 3: 1.41 ms per loop

Upvotes: 2

Mike Müller
Mike Müller

Reputation: 85442

Solution

This works:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

Explanation

First we find ids for the FName and LName:

ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])

They look like this:

FName        LName   
Andreas      Kai         2006
Constantine  Pape         NaN
Koethe       Talukdar    2005
Manual       Hausman     2005
Max          Weber       2007
Nadia        Alam        2002
Pia          Naime       2003
Plank        Ingo        2009
Tanvir       Hossain     2001
Weber        Mac         2008
Name: id, dtype: float64

Here groupby() groups by two columns, the first and the last names. To "see" anything, you need to "do" something with it. Let's convert all ids per group into a list:

>>> df.groupby(['FName', 'LName']).id.apply(list)

FName        LName   
Andreas      Kai                         [2006.0]
Constantine  Pape                      [nan, nan]
Koethe       Talukdar    [2004.0, 2004.0, 2005.0]
Manual       Hausman             [2005.0, 2005.0]
Max          Weber               [2007.0, 2007.0]
Nadia        Alam                [2002.0, 2002.0]
Pia          Naime       [2003.0, 2003.0, 2003.0]
Plank        Ingo                        [2009.0]
Tanvir       Hossain             [2001.0, 2001.0]
Weber        Mac         [2008.0, 2008.0, 2008.0]
Name: id, dtype: object

Since we have NaN, the datatype is float.

We want only the last id per group. So, instead of list() we use a lambda function:

lambda x: list(x)[-1]

In the second step we use our ids:

df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)

We apply a function to the dataframe going line by line (axis=1). Here x is a line. We use the values in the columns UsedFName and UsedLName to get the appropriate id and assign it to the result column with df.Usedid =.

Output

df looks like this:

          FName     LName    id UsedFName UsedLName  Usedid
0        Tanvir   Hossain  2001    Tanvir   Hossain    2001
1         Nadia      Alam  2002    Tanvir   Hossain    2001
2           Pia     Naime  2003    Tanvir   Hossain    2001
3        Koethe  Talukdar  2004    Koethe  Talukdar    2005
4        Manual   Hausman  2005    Koethe  Talukdar    2005
5   Constantine      Pape   NaN       Max     Weber    2007
6       Andreas       Kai  2006       Max     Weber    2007
7           Max     Weber  2007    Manual   Hausman    2005
8         Weber       Mac  2008    Manual   Hausman    2005
9         Plank      Ingo  2009    Manual   Hausman    2005
10       Tanvir   Hossain  2001       Pia     Naime    2003
11        Weber       Mac  2008       Pia     Naime    2003
12       Manual   Hausman  2005    Tanvir   Hossain    2001
13          Max     Weber  2007    Tanvir   Hossain    2001
14        Nadia      Alam  2002    Manual   Hausman    2005
15        Weber       Mac  2008    Manual   Hausman    2005
16          Pia     Naime  2003    Koethe  Talukdar    2005
17          Pia     Naime  2003    Koethe  Talukdar    2005
18  Constantine      Pape   NaN    Koethe  Talukdar    2005
19       Koethe  Talukdar  2004    Koethe  Talukdar    2005
20       Koethe  Talukdar  2005    Manual   Hausman    2005
21          NaN       NaN   NaN    Manual   Hausman    2005
22          NaN       NaN   NaN    Manual   Hausman    2005
23          NaN       NaN   NaN    Manual   Hausman    2005
24          NaN       NaN   NaN    Manual   Hausman    2005
25          NaN       NaN   NaN    Manual   Hausman    2005
26          NaN       NaN   NaN    Manual   Hausman    2005
27          NaN       NaN   NaN    Manual   Hausman    2005

Upvotes: 1

Related Questions