Jakob
Jakob

Reputation: 1453

vectorizing functions that include iterative loops inside a data.table

I have a dataset where I iteratively need to extract postcodes based on regularities in how they are structured. I first need to detect "AA00 0AA" before I can detect "A00 0AA" because the second would also detect the first if I have not already excluded it. I'm not looking for a workaround (like improving the regular expressions by adding a space before it or something like that), I'm trying to understand the following problem intuitively because many more functions with similar issues are coming up in my work:

Data:

library("data.table")
library("stringr")
testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
                                   "1 Some Street, sometown, A00 0AA"))

My function looks like this:

my_postcode_fun <- function(myscalar){
 allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]", 
                   # this is AA00 0AA
                  "[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]" 
                   # this is A00 0AA
                  )# these are the patterns I'm looking for

 for(i in allpatterns){
 if(!is.na(str_extract(myscalar, regex(i))[1])){ # only run the next steps, if found at least once
  
     output <- list(postcode = str_extract(myscalar, regex(i)),     # extract the postcode
               leftover = str_replace(myscalar, regex(i), ""))      # extract old string without the postcode
     break # stop if you've found one
 }
 }
 return(output) # return both elements
}

This function works for a scalar:

my_postcode_fun(testdata[1])

and I can loop it through a vector:

for(i in 1:2){
  print(my_postcode_fun(testdata[i,]))
}

it also works when I force data.table to go by row:

new <- testdata[, c("postcode", "Address"):= 
                  my_postcode_fun(Address), 
                  by = seq_len(nrow(testdata))]
new

But this is very inefficient. I can't get this function to work within a data.table using its "everything is done to each i at the same time" logic.

testdata[, c("postcode", "Address"):= 
           my_postcode_fun(Address)]

There are two problems I think, please correct me if I'm wrong.

The first problem is that data.table does the entire operation for every row, then goes to the next step of the loop. You can see this by commenting out the "break" in the function in which case it only returns the second step of the loop (and has presumably overwritten the first step).

The second problem I think is that the "break" will stop the loop for all rows. At least that is my intuition of how data.table approaches loops. I would like it to continue if it hasn't found anything for a particular row. First step for all elements, then second step for all elements etc.

So, how can I vectorize this AND use it in data.table AND have an iterative(!) loop that needs the first step to have happened before the second step is allowed to happen? I tried to make the example simpler but I'm afraid to miss important pieces because I don't fully understand it.

Upvotes: 1

Views: 86

Answers (2)

r2evans
r2evans

Reputation: 160417

First, the if conditional requires a single logical, not a vector, and if myscalar has a length over 1 (or just 0) then it will fail. Further, the if should really have a vectorized comparison in a sense, because you are likely to have one pattern match one but not all, etc.

For this, one might think ifelse as a vectorized if/else alternative, but I think a different method is a reductive method, where the try each pattern on any not-yet-matched inputs, and stop processing when everything has a match.


my_postcode_fun <- function(myvector){
  allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]", 
                   # this is AA00 0AA
                   "[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]" 
                   # this is A00 0AA
                   )# these are the patterns I'm looking for

  post <- rep(NA_character_, length(myvector))
  
  for (ptn in allpatterns) {
    isna <- is.na(post)
    if (!any(isna)) break
    post[isna] <- str_extract(myvector[isna], regex(ptn))
  }

  isna <- is.na(post)
  if (any(!isna)) {
    myvector[!isna] <- mapply(sub, post[!isna], "", myvector[!isna])
  }
  list(postcode = post, leftover = myvector)
}

testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
                                   "1 Some Street, sometown, A00 0AA",
                                   "1 Some Street, sometown, "))

testdata[, c("post","left") := my_postcode_fun(Address)][]
#                              Address     post                      left
#                               <char>   <char>                    <char>
# 1: 1 Some Street, sometown, AA00 0AA AA00 0AA 1 Some Street, sometown, 
# 2:  1 Some Street, sometown, A00 0AA  A00 0AA 1 Some Street, sometown, 
# 3:         1 Some Street, sometown,      <NA> 1 Some Street, sometown, 

(The post and left strings can definitely be cleaned up, perhaps trimws, but that's a different task.)

Quick walk-through:

  • We start by generating a post vector as long as the input myscalar, but all NA, this is an intentional starting-point.
  • For each ptn in the patterns, first we only want to look at those in post that are NA (yes, the first pass is always all-true, but the point of this state-machine is that it doesn't need to know if it's on the first or nth pass). For this, we define isna as a variable tracking which of post are NA, meaning which of myscalar have not yet had a match.
  • Quick check: if everything has a match (i.e., !any(isna)), then stop processing. This is a nice break point, since it means that if we have 1M inputs and the first pattern matches for all of them, then we do not have to proceed with any of the other patterns.
  • str_extract the remaining inputs with this ptn, and store in the not-yet-matches post output.
  • After the for loop, we populate isna one more time so that we can remove the postcode from the original input. sub is not vectorized on its pattern (just on its input x= vector), so I use mapply to vectorize it in a sense (there are other methods for doing so).
    • Note: there is a very slight risk here: if an input has two apparent postcodes in it (however unlikely), and they contain the same matching substring, then the first will be extracted into post, and sub will remove the first occurrence. Neither of these steps will acknowledge or remove the second. I doubt it's likely, but I wanted to identify this possibility.
  • BTW: while I named the function output with postcode and leftover (as in your question), those names are being dropped in this case by the c("post","left") := within data.table. I named them differently to demonstrate this. The function could easily have returned just list(post, leftover) and it works just as well (though I think the names are good, declarative, and if this function is ever used outside of data.table, the names may be quite helpful).

Upvotes: 2

Waldi
Waldi

Reputation: 41220

For the vectorized version to work with a double data.table column assignement, the function should return a double list, one for the addresses and one for the postcodes.
As your original function returns a list of postcode+address, I had to transpose this list to get the expected format.
This works, but I'm not sure it will be more efficient than the other solutions you tested.
Another point is to loop over the Address vector you get from data.table when using :=

library(data.table)
library(stringr)

testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
                                   "1 Some Street, sometown, A00 0AA"))

test.pattern <- function(myscalar,patterns) {
  output <- NA
  for(i in patterns){
    if(!is.na(str_extract(myscalar, regex(i))[1])){ # only run the next steps, if found at least once
      
      output <- list(postcode = str_extract(myscalar, regex(i)),     # extract the postcode
                     leftover = str_replace(myscalar, regex(i), ""))      # extract old string without the postcode
      break # stop if you've found one
    }
  }
  return(output)
}

my_postcode_fun <- function(myscalar){
  allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]", 
                   # this is AA00 0AA
                   "[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]" 
                   # this is A00 0AA
  )# these are the patterns I'm looking for
  l <- lapply(myscalar, test.pattern, patterns = allpatterns)
  purrr::transpose(l)
}

testdata[, c("postcode", "Address"):= 
           my_postcode_fun(Address)]

testdata
#>                      Address postcode
#> 1: 1 Some Street, sometown,  AA00 0AA
#> 2: 1 Some Street, sometown,   A00 0AA

Upvotes: 1

Related Questions