Silhouettes

Reputation: 175

How do I extract specific parts of text from a string in R?

In R, I have a variable that contains a large string of text for each row. From these strings, I'd like to extract specific parts and add them to my data frame as separate variables. For example, one string value looks like this:

"identification"":""138""city"":""New-York"":COMMENT""text"":""Very good!""COMMENT""text"":""It was delicious""guests"":""2""

Desired result:

    city      comment_text_1  comment_text_2
1   New-York  Very good!      It was delicious

The strings differ in length, and punctuation marks are used throughout them. There are also some minor differences between the strings. For example, there might be another piece of text in between city"":"" and COMMENT""text"":""

A possible starting point: the text I need is always the text after city"":"", after the first COMMENT""text"":"" and after the second COMMENT""text"":"". Furthermore, the text I need always ends with two quotation marks ""

Upvotes: 1

Views: 879

Answers (1)

Martin Gal

Reputation: 16978

As @Mark Neal mentioned, this is a task you can solve by using regular expressions. I'm not very skilled in using regex, but perhaps I can give you some insights:

library(tidyverse)  # attaches stringr (str_extract) and the %>% pipe
text <- c('"identification"":""138""city"":""New-York"":COMMENT""text"":""Very good!""COMMENT""text"":""It was delicious""guests"":""2""')

# city: everything between city"":"" and "":COMMENT""
city <- text %>% str_extract('(?<=city"":"").*(?="":COMMENT"")')
# first comment: everything between COMMENT""text"":"" and ""COMMENT""
comment_1 <- text %>% str_extract('(?<=COMMENT""text"":"").*(?=""COMMENT"")')
# second comment: two passes, see the explanation below
comment_2 <- text %>% str_extract('(?<=COMMENT""text"":"").*(?=""guests"")') %>% str_extract('(?<=COMMENT""text"":"").*')

df <- data.frame(city=city, comment_1=comment_1, comment_2=comment_2)

What did I do?

city

str_extract('(?<=city"":"").*(?="":COMMENT"")')

I search for city"":"" (as a lookbehind) and "":COMMENT"" (as a lookahead) and return everything in between:

[1] "New-York"
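The question mentions that other text may sit between the city and the first COMMENT. In that case the greedy pattern above would return too much. A minimal sketch of a lazy alternative, assuming (as the question states) that the wanted text always ends at the next pair of double quotes. The string text2, including its zip field, is invented for illustration:

```r
library(stringr)

# Invented example: an extra field sits between the city and the first COMMENT
text2 <- '"identification"":""139""city"":""Boston""zip"":""02108"":COMMENT""text"":""Nice""guests"":""4""'

# Greedy pattern from above: .* runs on until "":COMMENT"" and swallows the zip field
greedy <- str_extract(text2, '(?<=city"":"").*(?="":COMMENT"")')

# Lazy variant: .*? stops at the first pair of double quotes after the city
lazy <- str_extract(text2, '(?<=city"":"").*?(?="")')

print(greedy)
print(lazy)
```

Here greedy still contains the zip field, while lazy holds only the city name.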

comment 1

comment_1 <- text %>% str_extract('(?<=COMMENT""text"":"").*(?=""COMMENT"")')

The same approach with COMMENT""text"":"" and ""COMMENT"" yields:

[1] "Very good!"

comment 2

Since I couldn't figure out how to get the desired result with a single regex, I applied str_extract twice.

comment_2 <- text %>% str_extract('(?<=COMMENT""text"":"").*(?=""guests"")') %>% str_extract('(?<=COMMENT""text"":"").*')

The first step, matching between COMMENT""text"":"" and ""guests"", returns

[1] "Very good!\"\"COMMENT\"\"text\"\":\"\"It was delicious"

because .* is greedy, i.e. it matches the longest possible string. This intermediate result contains COMMENT""text"":"" only once, so a second step that keeps everything after it returns just the desired last comment:

[1] "It was delicious"
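The two-step extraction can probably be avoided with a lazy quantifier and str_extract_all, which returns every match instead of only the first. The following is a sketch under the assumption from the question that each wanted text ends at the next pair of double quotes. The second input string is invented for illustration, and the column names follow the desired result in the question:

```r
library(stringr)

# First string is from the question; the second one is invented for illustration
text <- c(
  '"identification"":""138""city"":""New-York"":COMMENT""text"":""Very good!""COMMENT""text"":""It was delicious""guests"":""2""',
  '"identification"":""139""city"":""Boston""zip"":""02108"":COMMENT""text"":""Nice""COMMENT""text"":""Too loud""guests"":""4""'
)

# Lazy match: stop at the first pair of double quotes after city"":""
city <- str_extract(text, '(?<=city"":"").*?(?="")')

# All comments at once; simplify = TRUE gives a character matrix,
# one row per input string and one column per match
comments <- str_extract_all(text, '(?<=COMMENT""text"":"").*?(?="")', simplify = TRUE)

# Assumes at least one string holds two comments, otherwise column 2 does not exist
df <- data.frame(
  city = city,
  comment_text_1 = comments[, 1],
  comment_text_2 = comments[, 2]
)
print(df)
```

Rows with fewer comments than the widest row are padded with empty strings in the matrix, so they may need to be converted to NA afterwards.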

Upvotes: 2
