Reputation: 581
I am using R to match names in two different data sets. I would like to compare strings. I have basically two data frames of strings, both containing a location ID (not unique) as well as the full names of people. One data frame, has full names that may contain two last names, for some people. The other data frame has the same location code (not unique) but the last names will be only have one of the two (always random which of the two).
What I want to do is do a grep()
, line by line of the first data frame to and get outputted search results of the second. My way of going about it is to do the following:
using the paste()
function, paste the location id and the first name. This will help with the matching. But I really need to match on the last name (can be any one of the last names). Let's call this new vector location_first
use the function strsplit()
on the last name column. Some elements of the list will only have one item, while others (i.e. those individuals with two last names), will have two items in that element. We can call this list strsplit_ln
.
I would then do a second paste in the form of a loop: paste the first element of the strsplit_ln
with location_first
, do a grep on that, then move to the next element of strplit_ln
and do a grep on that. I want to print out the entirety of the grep
search results on my console on sinked text file.
Here is a step by step process of what I would like to do in the form of a loop (or a nested loop)
# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez", "Mar", "Yoon"), stringsAsFactors = F)
names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"), stringsAsFactors = F)
# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step.
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)
# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list.
last_name_strsplit = strsplit(names_df1$last_name, split = " ")
# these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
# paste(location_name_df1[i], last_name_strsplit[[i]][v])
paste(location_name_df1[1], last_name_strsplit[[1]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][1])
paste(location_name_df1[2], last_name_strsplit[[2]][2])
paste(location_name_df1[3], last_name_strsplit[[3]][1])
paste(location_name_df1[3], last_name_strsplit[[3]][2])
paste(location_name_df1[4], last_name_strsplit[[4]][1])
paste(location_name_df1[5], last_name_strsplit[[5]][1])
# this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strplit
names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row
names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful
names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful
I am pretty sure I have to do a nested loop structure where I go through each element of a list (i), then through each of its subelements (v). However, when I do a nested loop, what tends to happen is that I duplicate a lot of the pasting and the search itself goes awry.
Can someone please give me some pointers as to how to create a loop structure with the above steps? Again I am using R/RStudio to match the data.
Thanks!
Upvotes: 0
Views: 270
Reputation: 145755
Here's a simpler approach. First we do a full join on both location and first name, then we use stringr::str_detect
(which, unlike grep
is vectorized over both the string and the pattern) to filter out rows where the last single last name isn't one of the possibly double last names:
full = merge(names_df1, names_df2, by = c("location", "first_name"))
library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches
# location first_name last_name.x last_name.y
# 1 1530 Axel Williams Williams
# 2 1530 Carlos Lopez Gutierrez Lopez
# 3 1967 Jong Yoon Yoon
# 4 6801 Bill Johnson Clarke Clarke
# 5 6801 Flavio Mar Mar
If you prefer dplyr
, you can do it this way:
library(dplyr)
full_join(names_df1, names_df2, by = c("location", "first_name")) %>%
filter(str_detect(string = last_name.x, pattern = fixed(last_name.y))
Upvotes: 1