Reputation: 669

Regex extraction of text data between 2 commas in R

I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:

73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London

Fortunately for me in 95% of cases the person inputing the data has used commas to separate the text I want, which 100% of the time ends ", London" (ie comma space London). To state things clearly therefore my goal is to extract the text BEFORE ", London" and AFTER the preceding comma

My desired output is:

Wandsworth
Lambeth

I can manage to extract the part before:

df$extraction <- sub('.*,\\s*','',address)

and after

df$extraction <- sub('.*,\\s*','',address)

But not the middle part that I need. Can someone please help?

Many Thanks!

Upvotes: 2

Answers (4)

sbha

Reputation: 10432

Here are two options that aren't dependent on the city name being the same. The first uses a regex pattern with stringr::str_extract():

raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London", 
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

df <- data.frame(raw_address, stringsAsFactors = FALSE)

df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')

> df
                               raw_address     distict
1 73 Greenhill Gardens, Wandsworth, London  Wandsworth
2       22 Acacia Heights, Lambeth, London     Lambeth
3                   Street, District, City    District

The second uses strsplit() and makes getting the other elements of the address easier:

df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1) 
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)

> df
                               raw_address              address    distict   city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2       22 Acacia Heights, Lambeth, London    22 Acacia Heights    Lambeth London
3                   Street, District, City               Street   District   City

The split is done on ,\\s* in case there is no space or are multiple spaces after a comma.

Upvotes: 3

Rich Scriven

Reputation: 99391

You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.

address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"

Or we could use fread(). Its select argument is nice and it strips white space automatically.

data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"

Upvotes: 9

Karakuchi

Reputation: 1324

You could try this

(?<=, )(.+?),

Works with any data set location doesn't have to be in london.

Upvotes: 1

Jota

Reputation: 17621

Here are a couple of approaches:

# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"

# target the whole string, but use a capture group 
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"

Upvotes: 6

Regex extraction of text data between 2 commas in R

Answers (4)

Related Questions