im2wddrf
im2wddrf

Reputation: 581

How to Construct Nested For Loops in R

I am using R to match names in two different data sets. I would like to compare strings. I have basically two data frames of strings, both containing a location ID (not unique) as well as the full names of people. One data frame, has full names that may contain two last names, for some people. The other data frame has the same location code (not unique) but the last names will be only have one of the two (always random which of the two).

What I want to do is do a grep(), line by line of the first data frame to and get outputted search results of the second. My way of going about it is to do the following:

  1. using the paste() function, paste the location id and the first name. This will help with the matching. But I really need to match on the last name (can be any one of the last names). Let's call this new vector location_first

  2. use the function strsplit() on the last name column. Some elements of the list will only have one item, while others (i.e. those individuals with two last names), will have two items in that element. We can call this list strsplit_ln.

  3. I would then do a second paste in the form of a loop: paste the first element of the strsplit_ln with location_first, do a grep on that, then move to the next element of strplit_ln and do a grep on that. I want to print out the entirety of the grep search results on my console on sinked text file.

Here is a step by step process of what I would like to do in the form of a loop (or a nested loop)

# prepare the test data
names_df1 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Johnson Clarke", "Lopez Gutierrez",  "Mar", "Yoon"), stringsAsFactors = F)

names_df2 = data.frame(location = c(1530, 6801, 1530, 6801, 1967),
                       first_name = c("Axel", "Bill", "Carlos", "Flavio", "Jong"),
                       last_name = c("Williams", "Clarke", "Lopez", "Mar", "Yoon"), stringsAsFactors = F)


# Step 1: paste id and first name. Location ID and First Name are identical in both data frames. I will paste the last name in the second step. 
location_name_df1 = paste(names_df1$location, names_df1$first_name)
location_name_df2 = paste(names_df2$location, names_df2$first_name, names_df2$last_name)


# Step 2: string split the last names in df1. I want a loop to go through each element and subelement of this list. 
last_name_strsplit = strsplit(names_df1$last_name, split = " ")


          # these are what I would be searching. Note that in the loop, I go search through each sub element v of the ith element in the list.
          # paste(location_name_df1[i], last_name_strsplit[[i]][v])
          paste(location_name_df1[1], last_name_strsplit[[1]][1])

          paste(location_name_df1[2], last_name_strsplit[[2]][1])
          paste(location_name_df1[2], last_name_strsplit[[2]][2])

          paste(location_name_df1[3], last_name_strsplit[[3]][1])
          paste(location_name_df1[3], last_name_strsplit[[3]][2])

          paste(location_name_df1[4], last_name_strsplit[[4]][1])

          paste(location_name_df1[5], last_name_strsplit[[5]][1])


    # this is the actual search I would like to do. I paste the location_name_df1 with the last names in last_name_strsplit, going through each element (i), as well as each sub element (v)
    names_df1[grep(paste(location_name_df1[1], last_name_strsplit[[1]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][1]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. Loop should jump to the second sub element of last_name_strplit
    names_df1[grep(paste(location_name_df1[2], last_name_strsplit[[2]][2]),location_name_df2),] # This search result was successful

    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][1]),location_name_df2),] # search result successful
    names_df1[grep(paste(location_name_df1[3], last_name_strsplit[[3]][2]),location_name_df2),] # search result NOT successful. Note that this part of the list has two elements. End of sub elements, move on to the next row

    names_df1[grep(paste(location_name_df1[4], last_name_strsplit[[4]][1]),location_name_df2),] # search result successful

    names_df1[grep(paste(location_name_df1[5], last_name_strsplit[[5]][1]),location_name_df2),] # search result successful

I am pretty sure I have to do a nested loop structure where I go through each element of a list (i), then through each of its subelements (v). However, when I do a nested loop, what tends to happen is that I duplicate a lot of the pasting and the search itself goes awry.

Can someone please give me some pointers as to how to create a loop structure with the above steps? Again I am using R/RStudio to match the data.

Thanks!

Upvotes: 0

Views: 270

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 145755

Here's a simpler approach. First we do a full join on both location and first name, then we use stringr::str_detect (which, unlike grep is vectorized over both the string and the pattern) to filter out rows where the last single last name isn't one of the possibly double last names:

full = merge(names_df1, names_df2, by = c("location", "first_name"))

library(stringr)
matches = full[str_detect(string = full$last_name.x, pattern = fixed(full$last_name.y)), ]
matches           
#   location first_name     last_name.x last_name.y
# 1     1530       Axel        Williams    Williams
# 2     1530     Carlos Lopez Gutierrez       Lopez
# 3     1967       Jong            Yoon        Yoon
# 4     6801       Bill  Johnson Clarke      Clarke
# 5     6801     Flavio             Mar         Mar

If you prefer dplyr, you can do it this way:

library(dplyr)
full_join(names_df1, names_df2, by = c("location", "first_name")) %>% 
  filter(str_detect(string = last_name.x, pattern = fixed(last_name.y))

Upvotes: 1

Related Questions