Reputation: 101
First of all I am sorry if this question is already answered clearly. I have seen there are very similar answers, but I couldn't use it. So my problem is to match between two sets of columns (UsedFName==FName and UsedLName==LName)
and then fill the Usedid
Column with the ids from 'id' column when it fully matches.
So here is a toy data set
>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain NaN
1 Nadia Alam 2002 Tanvir Hossain NaN
2 Pia Naime 2003 Tanvir Hossain NaN
3 Koethe Talukdar 2004 Koethe Talukdar NaN
4 Manual Hausman 2005 Koethe Talukdar NaN
5 Constantine Pape NaN Max Weber NaN
6 Andreas Kai 2006 Max Weber NaN
7 Max Weber 2007 Manual Hausman NaN
8 Weber Mac 2008 Manual Hausman NaN
9 Plank Ingo 2009 Manual Hausman NaN
10 Tanvir Hossain 2001 Pia Naime NaN
11 Weber Mac 2008 Pia Naime NaN
12 Manual Hausman 2005 Tanvir Hossain NaN
13 Max Weber 2007 Tanvir Hossain NaN
14 Nadia Alam 2002 Manual Hausman NaN
15 Weber Mac 2008 Manual Hausman NaN
16 Pia Naime 2003 Koethe Talukdar NaN
17 Pia Naime 2003 Koethe Talukdar NaN
18 Constantine Pape NaN Koethe Talukdar NaN
19 Koethe Talukdar 2004 Koethe Talukdar NaN
20 Koethe Talukdar 2005 Manual Hausman NaN
21 NaN NaN NaN Manual Hausman NaN
22 NaN NaN NaN Manual Hausman NaN
23 NaN NaN NaN Manual Hausman NaN
24 NaN NaN NaN Manual Hausman NaN
25 NaN NaN NaN Manual Hausman NaN
26 NaN NaN NaN Manual Hausman NaN
27 NaN NaN NaN Manual Hausman NaN
This is the output
>>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain 2001
1 Nadia Alam 2002 Tanvir Hossain 2001
2 Pia Naime 2003 Tanvir Hossain 2001
3 Koethe Talukdar 2004 Koethe Talukdar 2005
4 Manual Hausman 2005 Koethe Talukdar 2005
5 Constantine Pape NaN Max Weber 2007
6 Andreas Kai 2006 Max Weber 2007
7 Max Weber 2007 Manual Hausman 2005
8 Weber Mac 2008 Manual Hausman 2005
9 Plank Ingo 2009 Manual Hausman 2005
10 Tanvir Hossain 2001 Pia Naime 2003
11 Weber Mac 2008 Pia Naime 2003
12 Manual Hausman 2005 Tanvir Hossain 2001
13 Max Weber 2007 Tanvir Hossain 2001
14 Nadia Alam 2002 Manual Hausman 2005
15 Weber Mac 2008 Manual Hausman 2005
16 Pia Naime 2003 Koethe Talukdar 2005
17 Pia Naime 2003 Koethe Talukdar 2005
18 Constantine Pape NaN Koethe Talukdar 2005
19 Koethe Talukdar 2004 Koethe Talukdar 2005
20 Koethe Talukdar 2005 Manual Hausman 2005
21 NaN NaN NaN Manual Hausman 2005
22 NaN NaN NaN Manual Hausman 2005
23 NaN NaN NaN Manual Hausman 2005
24 NaN NaN NaN Manual Hausman 2005
25 NaN NaN NaN Manual Hausman 2005
26 NaN NaN NaN Manual Hausman 2005
27 NaN NaN NaN Manual Hausman 2005
Actually I was able to do it using nested for loops, here is the code:
for i in df['UsedFName'].index:
for j in df['FName'].index:
if df['UsedFName'][i]==df['FName'][j] & df['UsedLName'][i]==df['LName'][j]:
df.ix[i,'Usedid'] = df.ix[j,'id']
But using nested for loops here is computationally very expensive. I have a huge data set. Is it possible to use it without nested loops? Is there any simple Pythonic ways or Pandas/Numpy ways that I can use here?
Many thanks in advance for the help...looking forward to learn Python.
Upvotes: 3
Views: 4907
Reputation: 180391
You will have to think of a more pandaesque may to add the hashing logic, but this matches your expected output and is a lot more efficient, all you want is to use the ids matching UsedFName
and "UsedLName"
to FName
and LNames
:
import pandas as pd
# Create dict where each key is tuple -> (FName,Lname)
# with the corresponding id as the value
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
# Do a lookup in d using a tuple -> (UsedFName, UsedLName) to get the correct id for each pairing
df["Usedid"] = [d[(f, l)] for f,l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])
Output:
0 2001
1 2001
2 2001
3 2005
4 2005
5 2007
6 2007
7 2005
8 2005
9 2005
10 2003
11 2003
12 2001
13 2001
14 2005
15 2005
16 2005
17 2005
18 2005
19 2005
20 2005
21 2005
22 2005
23 2005
24 2005
25 2005
26 2005
27 2005
Name: Useid, dtype: float64
If some names may not be used you can use a default value using dict.get.
This is faster than the suggested groupby:
In [12]: %%timeit
df = pd.read_csv("in.csv")
ids = df.groupby(['FName', 'LName']).id.apply(list)
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName][-1]), axis=1)
....:
100 loops, best of 3: 5.27 ms per loop
In [13]: %%timeit
df = pd.read_csv("in.csv")
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
df["Usedid"] = [d[(f, l)] for f, l in zip(df["UsedFName"], df["UsedLName"])]
....:
1000 loops, best of 3: 1.41 ms per loop
Upvotes: 2
Reputation: 85442
This works:
ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)
First we find ids for the FName
and LName
:
ids = df.groupby(['FName', 'LName']).id.apply(lambda x: list(x)[-1])
They look like this:
FName LName
Andreas Kai 2006
Constantine Pape NaN
Koethe Talukdar 2005
Manual Hausman 2005
Max Weber 2007
Nadia Alam 2002
Pia Naime 2003
Plank Ingo 2009
Tanvir Hossain 2001
Weber Mac 2008
Name: id, dtype: float64
Here groupby()
groups by two columns, the first and the last names. To "see" anything, you need to "do" something with it. Let's convert all ids per group into a list:
>>> df.groupby(['FName', 'LName']).id.apply(list)
FName LName
Andreas Kai [2006.0]
Constantine Pape [nan, nan]
Koethe Talukdar [2004.0, 2004.0, 2005.0]
Manual Hausman [2005.0, 2005.0]
Max Weber [2007.0, 2007.0]
Nadia Alam [2002.0, 2002.0]
Pia Naime [2003.0, 2003.0, 2003.0]
Plank Ingo [2009.0]
Tanvir Hossain [2001.0, 2001.0]
Weber Mac [2008.0, 2008.0, 2008.0]
Name: id, dtype: object
Since we have NaN
, the datatype is float
.
We want only the last id per group. So, instead of list()
we use a lambda
function:
lambda x: list(x)[-1]
In the second step we use our ids
:
df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName]), axis=1)
We apply a function to the dataframe going line by line (axis=1
). Here x
is a line. We use the values in the columns UsedFName
and UsedLName
to get the appropriate id and assign it to the result column with df.Usedid =
.
df
looks like this:
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain 2001
1 Nadia Alam 2002 Tanvir Hossain 2001
2 Pia Naime 2003 Tanvir Hossain 2001
3 Koethe Talukdar 2004 Koethe Talukdar 2005
4 Manual Hausman 2005 Koethe Talukdar 2005
5 Constantine Pape NaN Max Weber 2007
6 Andreas Kai 2006 Max Weber 2007
7 Max Weber 2007 Manual Hausman 2005
8 Weber Mac 2008 Manual Hausman 2005
9 Plank Ingo 2009 Manual Hausman 2005
10 Tanvir Hossain 2001 Pia Naime 2003
11 Weber Mac 2008 Pia Naime 2003
12 Manual Hausman 2005 Tanvir Hossain 2001
13 Max Weber 2007 Tanvir Hossain 2001
14 Nadia Alam 2002 Manual Hausman 2005
15 Weber Mac 2008 Manual Hausman 2005
16 Pia Naime 2003 Koethe Talukdar 2005
17 Pia Naime 2003 Koethe Talukdar 2005
18 Constantine Pape NaN Koethe Talukdar 2005
19 Koethe Talukdar 2004 Koethe Talukdar 2005
20 Koethe Talukdar 2005 Manual Hausman 2005
21 NaN NaN NaN Manual Hausman 2005
22 NaN NaN NaN Manual Hausman 2005
23 NaN NaN NaN Manual Hausman 2005
24 NaN NaN NaN Manual Hausman 2005
25 NaN NaN NaN Manual Hausman 2005
26 NaN NaN NaN Manual Hausman 2005
27 NaN NaN NaN Manual Hausman 2005
Upvotes: 1