Reputation: 113
So I have a dataset with street addresses that are formatted very differently. For example:
d <- c("street1234", "Street 423", "Long Street 12-14", "Road 18A", "Road 12 - 15", "Road 1/2")
From this I want to create two columns: 1. X, with the street name, and 2. Y, with the number plus everything that follows. Like this:
X            Y
Street       1234
Street       423
Long Street  12-14
Road         18A
Road         12 - 15
Road         1/2
So far I have tried strsplit and followed some similar questions here, for example: strsplit(d, split = "(?<=[a-zA-Z])(?=[0-9])", perl = T). I just can't seem to find the correct regular expression.
Any help is highly appreciated. Thank you in advance!
Upvotes: 9
Views: 5785
Reputation: 51582
An approach that avoids a splitting regex: use str_locate from stringr to locate the first digit in each string, then split at that position, i.e.
library(stringr)
ind <- str_locate(d, '[0-9]+')[,1]
# Cut each string just before its first digit, trim whitespace, keep the first two pieces
setNames(data.frame(do.call(rbind, Map(function(x, y)
    trimws(substring(x, seq(1, nchar(x), y-1), seq(y-1, nchar(x), nchar(x)-y+1))),
    d, ind))[,1:2]), c('X', 'Y'))
#            X       Y
#1      street    1234
#2      Street     423
#3 Long Street   12-14
#4        Road     18A
#5        Road 12 - 15
#6        Road     1/2
Note that you receive a (harmless) warning. It comes from the split of the "Road 12 - 15" string, which yields three pieces instead of two: [1] "Road" "12 - 15" ""
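For comparison, the same locate-then-split idea can be sketched in base R without the seq() arithmetic, which also avoids the warning. This is only a minimal sketch, not the code above: split_at_first_digit is a hypothetical helper name, and returning NA in Y for strings that contain no digit is an assumption about the desired behaviour.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Hypothetical helper: split each string at the position of its first digit.
split_at_first_digit <- function(x) {
  pos <- as.integer(regexpr("[0-9]", x))   # -1 when there is no digit
  has_digit <- pos > 0
  data.frame(
    # Everything before the first digit, with surrounding whitespace removed
    X = trimws(ifelse(has_digit, substring(x, 1, pos - 1), x)),
    # First digit onward; NA when no digit was found (assumption)
    Y = ifelse(has_digit, substring(x, pos), NA_character_),
    stringsAsFactors = FALSE
  )
}

res <- split_at_first_digit(d)
print(res)
```

Because substring() is vectorised over its start and stop arguments, no Map or rbind step is needed here.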
Upvotes: 3
Reputation: 887008
We can use read.csv with sub from base R:
read.csv(text=sub("^([A-Za-z ]+)\\s*([0-9]+.*)", "\\1,\\2", d),
         header=FALSE, col.names = c("X", "Y"), stringsAsFactors=FALSE,
         strip.white=TRUE)  # drop the trailing space that the first group captures
#            X       Y
#1      street    1234
#2      Street     423
#3 Long Street   12-14
#4        Road     18A
#5        Road 12 - 15
#6        Road     1/2
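If an address could itself contain a comma, going through a CSV round trip might misplace fields. One possible alternative, sketched here under the assumption that every element contains at least one digit, is to pull the two capture groups out directly with regexec and regmatches from base R. The pattern below is a variation for illustration, not the one from this answer.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

# Group 1: everything before the first digit (lazy, so whitespace is left out);
# group 2: the first digit and everything after it.
m <- regmatches(d, regexec("^(.*?)\\s*([0-9].*)$", d, perl = TRUE))

# Each list element is c(full match, group 1, group 2); stack them into a matrix.
parts <- do.call(rbind, m)
res <- data.frame(X = parts[, 2], Y = parts[, 3], stringsAsFactors = FALSE)
print(res)
```

Elements without a match come back as character(0) and would be dropped by rbind, so the rows would no longer line up with d; that case needs separate handling.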
Upvotes: 2
Reputation: 23101
This will also work:
do.call(rbind,strsplit(sub('([[:alpha:]]+)\\s*([[:digit:]]+)', '\\1$\\2', d), split='\\$'))
# [,1] [,2]
#[1,] "street" "1234"
#[2,] "Street" "423"
#[3,] "Long Street" "12-14"
#[4,] "Road" "18A"
#[5,] "Road" "12 - 15"
#[6,] "Road" "1/2"
Upvotes: 3
Reputation: 626738
There may be whitespace between the letter and the digit, so add \s* (zero or more whitespace symbols) between the lookarounds:
> strsplit(d, split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE)
[[1]]
[1] "street" "1234"
[[2]]
[1] "Street" "423"
[[3]]
[1] "Long Street" "12-14"
[[4]]
[1] "Road" "18A"
[[5]]
[1] "Road" "12 - 15"
[[6]]
[1] "Road" "1/2"
And if you want to create columns based on that, you can use separate from the tidyr package:
> library(tidyr)
> separate(data.frame(A = d), col = "A" , into = c("X", "Y"), sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
            X       Y
1      street    1234
2      Street     423
3 Long Street   12-14
4        Road     18A
5        Road 12 - 15
6        Road     1/2
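To see what the \s* changes, and to build the two columns without tidyr, something like the following base R sketch may help. The variable names no_ws and with_ws are made up for illustration, and the stopifnot check assumes every address splits into exactly two pieces.

```r
d <- c("street1234", "Street 423", "Long Street 12-14",
       "Road 18A", "Road 12 - 15", "Road 1/2")

no_ws   <- "(?<=[a-zA-Z])(?=[0-9])"       # pattern from the question
with_ws <- "(?<=[a-zA-Z])\\s*(?=[0-9])"   # same pattern with optional whitespace

# Without \\s* the space sits between the letter and the digit, so nothing matches
# and the string comes back in one piece.
unsplit <- strsplit("Street 423", no_ws, perl = TRUE)[[1]]
print(unsplit)   # "Street 423"

pieces <- strsplit(d, with_ws, perl = TRUE)
stopifnot(all(lengths(pieces) == 2))   # guard before stacking into a matrix

res <- setNames(
  as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE),
  c("X", "Y")
)
print(res)
```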
Upvotes: 11