Reputation: 169
I am running a regex query using R
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street
I am attempting to sort these addresses into two columns
I expected:
strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)
would split up the numbers on the left and the rest of the address on the right.
The result I am getting is:
[1] "" " Fake Street"
[1] "" " Fake Street"
[1] "" " M" " Ln"
[1] "" " Fake Street"
[1] "" " Fake Street"
The strsplit function appears to be delete the field used to split the string. Is there any way I can preserve it?
Thanks
Upvotes: 0
Views: 50
Reputation: 626870
You are almost there, just append \\K\\s*
to your regex and prepend with the ^
, start of string anchor:
df<- c("955 - 959 Fake Street","95-99 Fake Street","4-9 M4 Ln","95 - 99 Fake Street","99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl=T)
The \K
is a match reset operator that discards the text msatched so far, so after matching 1+ digits, optionally followed with -
enclosed with 0+ whitespaces and 1+ digits at the start of the string, this whole text is dropped. Ony 0+ whitespaces get it into the match value, and they will be split on.
See the R demo outputting:
[[1]]
[1] "955 - 959" "Fake Street"
[[2]]
[1] "95-99" "Fake Street"
[[3]]
[1] "4-9" "M4 Ln"
[[4]]
[1] "95 - 99" "Fake Street"
[[5]]
[1] "99" "Fake Street"
Upvotes: 2
Reputation: 1790
You could use lookbehinds and lookaheads to split at the space between a number and the character:
strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959" "Fake Street"
#
# [[2]]
# [1] "95-99" "Fake Street"
#
# [[3]]
# [1] "4-9" "M4" "Ln"
#
# [[4]]
# [1] "95 - 99" "Fake Street"
#
# [[5]]
# [1] "99" "Fake Street"
This, however also splits at the space between "M4"
and "Ln"
. If your addresses are always of the format "number (possible range) followed by rest of the address" you could extract the two parts separately (as @d.b suggested):
splitDf <- data.frame(
numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))
splitDf
# numberPart rest
# 1 955 - 959 Fake Street
# 2 95-99 Fake Street
# 3 4-9 M4 Ln
# 4 95 - 99 Fake Street
# 5 99 Fake Street
Upvotes: 1