katdataecon

Reputation: 185

Extracting first word after a specific expression in R

I have a column that contains thousands of descriptions like these (example):

Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA

I'd like to create a column with the first word after "city of", like this:

Description                                    City
Building a hospital in the city of LA, USA     LA
Building a school in the city of NYC, USA      NYC
Building shops in the city of Chicago, USA     Chicago

I tried the following code after seeing the topic "Extracting string after specific word", but my column is filled only with missing values:

library(stringr)

df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))

df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))

I took a look at the dput() output, and it is the same as the descriptions I see in the data frame directly.

Upvotes: 1

Views: 1107

Answers (1)

Edo

Reputation: 7818

Solution

This should do the trick for the data you showed:

df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")

df
#>                                  Description    city
#> 1 Building a hospital in the city of LA, USA      LA
#> 2  Building a school in the city of NYC, USA     NYC
#> 3 Building shops in the city of Chicago, USA Chicago
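
As a side note on why the attempt in the question returned only missing values: its lookbehind asks for "city of:" with a colon, and none of the sample descriptions contain one. The sketch below uses a single sample string to show the difference; the variable names `x`, `with_colon`, and `without_colon` are made up for illustration. Wrapping the result in data.frame() is also unnecessary, since str_extract() already returns a character vector.

```r
library(stringr)

x <- "Building a hospital in the city of LA, USA"

# Pattern from the question: the lookbehind requires "city of:" (with a colon),
# which does not occur in the text, so nothing matches
with_colon <- str_extract(x, "(?<=city of:\\s)[^;]+")
is.na(with_colon)
#> [1] TRUE

# Same lookbehind without the colon; \\w+ then stops at the comma
without_colon <- str_extract(x, "(?<=city of\\s)\\w+")
without_colon
#> [1] "LA"
```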

Alternative

However, if you want the whole string up to the comma (for example, for cities with a space in the name), you can use:

df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")

Check out the following example:

df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA",
                                 "Building a church in the city of Salt Lake City, USA"))

str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA"      "NYC"     "Chicago" "Salt"   

str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA"             "NYC"            "Chicago"        "Salt Lake City"
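
One caveat, sketched under the assumption that some descriptions contain more than one comma after the city (the sample string `y` below is hypothetical and not from the question): `.+` is greedy, so the lookahead ends up matching the last comma. A negated character class such as `[^,]+` stops at the first comma instead.

```r
library(stringr)

# Hypothetical description with two commas after the city name
y <- "Building a park in the city of Salt Lake City, Utah, USA"

# Greedy .+ runs up to the LAST comma
greedy <- str_extract(y, "(?<=city of )(.+)(?=,)")
greedy
#> [1] "Salt Lake City, Utah"

# [^,]+ consumes everything up to the FIRST comma
first_comma <- str_extract(y, "(?<=city of )[^,]+")
first_comma
#> [1] "Salt Lake City"
```

For the data shown in the question, which has a single comma per description, both patterns give the same result.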

Documentation

Check out ?regex:

Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
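
Since ?regex documents base R's regular expressions, the same lookbehind can also be tried without stringr. This is a minimal sketch, assuming Perl-compatible matching is switched on with perl = TRUE; the names `x`, `m`, and `city` are illustrative.

```r
x <- c("Building a hospital in the city of LA, USA",
       "Building a school in the city of NYC, USA")

# perl = TRUE enables lookaround assertions in base R
m <- regexpr("(?<=city of )\\w+", x, perl = TRUE)

# regmatches() returns only the matched pieces; elements without a match
# are dropped, so the result can be shorter than x
city <- regmatches(x, m)
city
#> [1] "LA"  "NYC"
```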

Upvotes: 3
