Jamzy

Reputation: 169

Issue with strsplit not storing searched field

I am running a regex query in R on the following vector:

df <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln", "95 - 99 Fake Street", "99 Fake Street")

955 - 959 Fake Street
95-99 Fake Street
4-9 M4 Ln
95 - 99 Fake Street
99 Fake Street

I am attempting to split these addresses into two columns: the leading number (or number range) and the rest of the address.

I expected:

strsplit(df, "\\d+(\\s*-\\s*\\d+)?", perl=T)

would split up the numbers on the left and the rest of the address on the right.

The result I am getting is:

[1] ""             " Fake Street"
[1] ""             " Fake Street"
[1] ""    " M"  " Ln"
[1] ""             " Fake Street"
[1] ""             " Fake Street"

The strsplit function appears to delete the text matched by the split pattern. Is there any way I can preserve it?

Thanks

Upvotes: 0

Views: 50

Answers (2)

Wiktor Stribiżew

Reputation: 626870

You are almost there: append \\K\\s* to your regex and prepend ^, the start-of-string anchor:

df <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln", "95 - 99 Fake Street", "99 Fake Street")
strsplit(df, "^\\d+(\\s*-\\s*\\d+)?\\K\\s*", perl = TRUE)

\K is a match reset operator that discards the text matched so far. The pattern first matches 1+ digits at the start of the string, optionally followed by a - surrounded by 0+ whitespace characters and another 1+ digits, and \K then drops all of that from the match. Only the 0+ whitespace characters after \K end up in the match value, so they are the only text strsplit splits on (and removes).

The R demo outputs:

[[1]]
[1] "955 - 959"   "Fake Street"

[[2]]
[1] "95-99"       "Fake Street"

[[3]]
[1] "4-9"   "M4 Ln"

[[4]]
[1] "95 - 99"     "Fake Street"

[[5]]
[1] "99"          "Fake Street"
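
To see the effect of \K in isolation, here is a minimal sketch that compares the reported match with and without it, using regexpr() and regmatches(). The sample string and the names x, m1 and m2 are made up for illustration and are not part of the answer above.

```r
# Hypothetical sample string, chosen only for illustration
x <- "95-99 Fake Street"

# Without \K: the reported match includes the leading number range
m1 <- regmatches(x, regexpr("^\\d+(\\s*-\\s*\\d+)?\\s*", x, perl = TRUE))

# With \K: everything matched before \K is dropped from the reported match,
# so only the trailing whitespace is left
m2 <- regmatches(x, regexpr("^\\d+(\\s*-\\s*\\d+)?\\K\\s*", x, perl = TRUE))

print(m1)  # "95-99 "
print(m2)  # " "
```

Because strsplit removes only the reported match, the number range survives as the first piece.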

Upvotes: 2

ikop

Reputation: 1790

You could use a lookbehind and a lookahead to split at a whitespace character that sits between a digit and a letter:

strsplit(df, "(?<=\\d)\\s(?=[[:alpha:]])", perl = TRUE)
# [[1]]
# [1] "955 - 959"   "Fake Street"
# 
# [[2]]
# [1] "95-99"       "Fake Street"
# 
# [[3]]
# [1] "4-9" "M4"  "Ln" 
# 
# [[4]]
# [1] "95 - 99"     "Fake Street"
# 
# [[5]]
# [1] "99"          "Fake Street"

This, however, also splits at the space between "M4" and "Ln". If your addresses always have the format "number (possibly a range) followed by the rest of the address", you could extract the two parts separately (as @d.b suggested):

# \\1 is the leading number (or range); \\3 is everything after it
splitDf <- data.frame(
        numberPart = sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\1", df),
        rest = trimws(sub("(\\d+(\\s*-\\s*\\d+)?)(.*)", "\\3", df)))

splitDf
#   numberPart        rest
# 1  955 - 959 Fake Street
# 2      95-99 Fake Street
# 3        4-9       M4 Ln
# 4    95 - 99 Fake Street
# 5         99 Fake Street
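
As a possible variation on the same idea, both parts can be captured in a single pass with regexec() and regmatches() instead of calling sub() twice. This is only a sketch: the names addresses, pattern, matches and splitDf2 are introduced here for illustration, and it assumes an R version where regexec() accepts perl = TRUE (3.3.0 or later).

```r
addresses <- c("955 - 959 Fake Street", "95-99 Fake Street", "4-9 M4 Ln",
               "95 - 99 Fake Street", "99 Fake Street")

# Group 1: leading number or number range; group 2: the rest of the address.
# (?:...) keeps the optional range from creating a capture group of its own.
pattern <- "^(\\d+(?:\\s*-\\s*\\d+)?)\\s*(.*)$"

# One list element per address: c(full match, group 1, group 2)
matches <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

splitDf2 <- data.frame(
  numberPart = vapply(matches, `[`, character(1), 2),
  rest       = vapply(matches, `[`, character(1), 3),
  stringsAsFactors = FALSE
)

print(splitDf2)
```

The pattern is anchored with ^ and $, so an address that does not start with a number yields NA in both columns rather than a partial match.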

Upvotes: 1
