Reputation: 153
I have a data frame which has 3 columns (called all_names). The first column is called ID, the second column is 'First_names' and the third is 'Last_names' - the data frame has 1 million rows. I have a different data frame (called combos) which has 2 rows: 'First' and 'Last'. (the data frames also have an index column). I need to check the First_names and Last_names column at the same time to see if they contain the combination of first and last in the other data frame.
Currently, I have:
all_names['First_names'] = all_names.First_names.astype(str) #setting column to string data type
all_names['Last_names'] = all_names.Last_names.astype(str)
combos['First'] = combos.First.astype(str)
combos['Last'] = combos.Last.astype(str) #setting column to string data type
for index, row in combos.iterrows():
correct_IDS = all_names.loc[all_names.First_names.str.contains(row.First)] & all_names.loc[all_names.Last_names.str.contains(row.Last), 'ID']
print(correct_tiles)
However, this doesn't work and is messy as has to iterate through all rows. any help would be great
The all_names looks like this (when opened in notepad):
,ID,First_names,Last_names
0,5231,Harry,Smith
1,2745,Mark,Hammond
The combos looks like this (when opened in notepad):
,First,Last
0,Liam,Bradnam
1,James,Beckham
Upvotes: 1
Views: 213
Reputation: 153
Using Ben.T's idea of merging the columns then doing the check. I'm sure his code works but in my instance, I was getting memory error so I tried this which solved my question:
combos['both'] = combos['First'].map(str) + combos['Last']
all_names['both_main'] = all_names['First_names'].map(str) + all_names['Last_names']
both = combos['both'].tolist()
name_IDS = all_names.loc[all_names.both_main.isin(both)]
name_IDS = name_IDS['ID']
Upvotes: 1
Reputation: 173
First create a sample and combine first and last names:
l = [[1, 'Harry', 'M.'], [2, 'H.', 'Maguire'], [3, 'chun', 'ior'], [4, 'Harry', 'ior']]
df1 = pd.DataFrame(l, columns=['ID', 'First_name', 'Last_name'])
df2 = df1.iloc[0:3, 1:]
df2.columns = ['First', 'Last']
# combine first and last names
df1['Combined'] = df1['First_name'] + df1['Last_name']
df2['Combined'] = df2['First'] + df2['Last']
The dataframes look like this:
ID First_name Last_name Combined
0 1 Harry M. HarryM.
1 2 H. Maguire H.Maguire
2 3 chun ior chunior
3 4 Harry ior Harryior
and
First Last Combined
0 Harry M. HarryM.
1 H. Maguire H.Maguire
2 chun ior chunior
Filtering by using isin()
for the combined name yields:
print(df1[df1['Combined'].isin(df2['Combined'])])
ID First_name Last_name Combined
0 1 Harry M. HarryM.
1 2 H. Maguire H.Maguire
2 3 chun ior chunior
while filtering for any possible combination returns:
print(df1[df1['First_name'].isin(df2['First']) & df1['Last_name'].isin(df2['Last'])])
ID First_name Last_name Combined
0 1 Harry M. HarryM.
1 2 H. Maguire H.Maguire
2 3 chun ior chunior
3 4 Harry ior Harryior
Upvotes: 0
Reputation: 29635
Your problem can be solved using merge
. Let's say we have
all_names = pd.DataFrame({'First_names':['John','John','Bob','Robert'],
'Last_names':['Do','Smith','Do','Smith'],'ID':[1,2,3,4]})
combos = pd.DataFrame({'First':['John','Bob','Robert'],'Last':['Smith','Do','Do']})
Then if you use rename
in the merge
, with how='inner'
to keep common couple (First, Last) between both dataframes:
combos.merge(all_names.rename(columns={'First_names':'First','Last_names':'Last'}),how='inner')
and you get
First Last ID
0 John Smith 2
1 Bob Do 3
Now if you want only a list of ID's, you do
list_ID = combos.merge(all_names.rename(columns={'First_names':'First','Last_names':'Last'})
,how='inner')['ID'].tolist()
and you have list_ID
equal to [2, 3]
Upvotes: 2