Reputation: 669
I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:
73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London
Fortunately for me in 95% of cases the person inputing the data has used commas to separate the text I want, which 100% of the time ends ", London" (ie comma space London). To state things clearly therefore my goal is to extract the text BEFORE ", London" and AFTER the preceding comma
My desired output is:
Wandsworth
Lambeth
I can manage to extract the part before:
df$extraction <- sub('.*,\\s*','',address)
and after
df$extraction <- sub('.*,\\s*','',address)
But not the middle part that I need. Can someone please help?
Many Thanks!
Upvotes: 2
Views: 3877
Reputation: 10422
Here are two options that aren't dependent on the city name being the same. The first uses a regex pattern with stringr::str_extract()
:
raw_address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London",
"Street, District, City"
)
df <- data.frame(raw_address, stringsAsFactors = FALSE)
df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')
> df
raw_address distict
1 73 Greenhill Gardens, Wandsworth, London Wandsworth
2 22 Acacia Heights, Lambeth, London Lambeth
3 Street, District, City District
The second uses strsplit()
and makes getting the other elements of the address easier:
df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1)
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)
> df
raw_address address distict city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2 22 Acacia Heights, Lambeth, London 22 Acacia Heights Lambeth London
3 Street, District, City Street District City
The split is done on ,\\s*
in case there is no space or are multiple spaces after a comma.
Upvotes: 3
Reputation: 99331
You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv()
, taking advantage of the fact that colClasses
can be used to drop columns.
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Or we could use fread()
. Its select
argument is nice and it strips white space automatically.
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Upvotes: 9
Reputation: 1324
You could try this
(?<=, )(.+?),
Works with any data set location doesn't have to be in london.
Upvotes: 1
Reputation: 17611
Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
Or
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
Upvotes: 6