Reputation: 1453
I have a dataset where I iteratively need to extract postcodes based on regularities in how they are structured. I first need to detect "AA00 0AA" before I can detect "A00 0AA" because the second would also detect the first if I have not already excluded it. I'm not looking for a workaround (like improving the regular expressions by adding a space before it or something like that), I'm trying to understand the following problem intuitively because many more functions with similar issues are coming up in my work:
Data:
library("data.table")
library("stringr")
testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
"1 Some Street, sometown, A00 0AA"))
My function looks like this:
my_postcode_fun <- function(myscalar){
allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]",
# this is AA00 0AA
"[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]"
# this is A00 0AA
)# these are the patterns I'm looking for
for(i in allpatterns){
if(!is.na(str_extract(myscalar, regex(i))[1])){ # only run the next steps, if found at least once
output <- list(postcode = str_extract(myscalar, regex(i)), # extract the postcode
leftover = str_replace(myscalar, regex(i), "")) # extract old string without the postcode
break # stop if you've found one
}
}
return(output) # return both elements
}
This function works for a scalar:
my_postcode_fun(testdata[1])
and I can loop it through a vector:
for(i in 1:2){
print(my_postcode_fun(testdata[i,]))
}
it also works when I force data.table to go by row:
new <- testdata[, c("postcode", "Address"):=
my_postcode_fun(Address),
by = seq_len(nrow(testdata))]
new
But this is very inefficient. I can't get this function to work within a data.table using its "everything is done to each i at the same time" logic.
testdata[, c("postcode", "Address"):=
my_postcode_fun(Address)]
There are two problems I think, please correct me if I'm wrong.
The first problem is that data.table does the entire operation for every row, then goes to the next step of the loop. You can see this by commenting out the "break" in the function in which case it only returns the second step of the loop (and has presumably overwritten the first step).
The second problem I think is that the "break" will stop the loop for all rows. At least that is my intuition of how data.table approaches loops. I would like it to continue if it hasn't found anything for a particular row. First step for all elements, then second step for all elements etc.
So, how can I vectorize this AND use it in data.table AND have an iterative(!) loop that needs the first step to have happened before the second step is allowed to happen? I tried to make the example simpler but I'm afraid to miss important pieces because I don't fully understand it.
Upvotes: 1
Views: 86
Reputation: 160417
First, the if
conditional requires a single logical, not a vector, and if myscalar
has a length over 1 (or just 0) then it will fail. Further, the if
should really have a vectorized comparison in a sense, because you are likely to have one pattern match one but not all, etc.
For this, one might think ifelse
as a vectorized if
/else
alternative, but I think a different method is a reductive method, where the try each pattern on any not-yet-matched inputs, and stop processing when everything has a match.
my_postcode_fun <- function(myvector){
allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]",
# this is AA00 0AA
"[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]"
# this is A00 0AA
)# these are the patterns I'm looking for
post <- rep(NA_character_, length(myvector))
for (ptn in allpatterns) {
isna <- is.na(post)
if (!any(isna)) break
post[isna] <- str_extract(myvector[isna], regex(ptn))
}
isna <- is.na(post)
if (any(!isna)) {
myvector[!isna] <- mapply(sub, post[!isna], "", myvector[!isna])
}
list(postcode = post, leftover = myvector)
}
testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
"1 Some Street, sometown, A00 0AA",
"1 Some Street, sometown, "))
testdata[, c("post","left") := my_postcode_fun(Address)][]
# Address post left
# <char> <char> <char>
# 1: 1 Some Street, sometown, AA00 0AA AA00 0AA 1 Some Street, sometown,
# 2: 1 Some Street, sometown, A00 0AA A00 0AA 1 Some Street, sometown,
# 3: 1 Some Street, sometown, <NA> 1 Some Street, sometown,
(The post
and left
strings can definitely be cleaned up, perhaps trimws
, but that's a different task.)
Quick walk-through:
post
vector as long as the input myscalar
, but all NA
, this is an intentional starting-point.ptn
in the patterns, first we only want to look at those in post
that are NA
(yes, the first pass is always all-true, but the point of this state-machine is that it doesn't need to know if it's on the first or n
th pass). For this, we define isna
as a variable tracking which of post
are NA
, meaning which of myscalar
have not yet had a match.!any(isna)
), then stop processing. This is a nice break point, since it means that if we have 1M inputs and the first pattern matches for all of them, then we do not have to proceed with any of the other patterns.str_extract
the remaining inputs with this ptn
, and store in the not-yet-matches post
output.for
loop, we populate isna
one more time so that we can remove the postcode from the original input. sub
is not vectorized on its pattern (just on its input x=
vector), so I use mapply
to vectorize it in a sense (there are other methods for doing so).
post
, and sub
will remove the first occurrence. Neither of these steps will acknowledge or remove the second. I doubt it's likely, but I wanted to identify this possibility.postcode
and leftover
(as in your question), those names are being dropped in this case by the c("post","left") :=
within data.table
. I named them differently to demonstrate this. The function could easily have returned just list(post, leftover)
and it works just as well (though I think the names are good, declarative, and if this function is ever used outside of data.table
, the names may be quite helpful).Upvotes: 2
Reputation: 41220
For the vectorized version to work with a double data.table
column assignement, the function should return a double list, one for the addresses and one for the postcodes.
As your original function returns a list of postcode+address, I had to transpose this list to get the expected format.
This works, but I'm not sure it will be more efficient than the other solutions you tested.
Another point is to loop over the Address
vector you get from data.table
when using :=
library(data.table)
library(stringr)
testdata <- data.table(Address = c("1 Some Street, sometown, AA00 0AA",
"1 Some Street, sometown, A00 0AA"))
test.pattern <- function(myscalar,patterns) {
output <- NA
for(i in patterns){
if(!is.na(str_extract(myscalar, regex(i))[1])){ # only run the next steps, if found at least once
output <- list(postcode = str_extract(myscalar, regex(i)), # extract the postcode
leftover = str_replace(myscalar, regex(i), "")) # extract old string without the postcode
break # stop if you've found one
}
}
return(output)
}
my_postcode_fun <- function(myscalar){
allpatterns <- c("[[:alpha:]][[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]",
# this is AA00 0AA
"[[:alpha:]][[:digit:]][[:digit:]][[:space:]][[:digit:]][[:alpha:]][[:alpha:]]"
# this is A00 0AA
)# these are the patterns I'm looking for
l <- lapply(myscalar, test.pattern, patterns = allpatterns)
purrr::transpose(l)
}
testdata[, c("postcode", "Address"):=
my_postcode_fun(Address)]
testdata
#> Address postcode
#> 1: 1 Some Street, sometown, AA00 0AA
#> 2: 1 Some Street, sometown, A00 0AA
Upvotes: 1