Javier

Reputation: 1550

Extract text between certain symbols using Regular Expression in R

I have a series of expressions such as:

"<i>the text I need to extract</i></b></a></div>"

I need to extract the text between the <i> and </i> "symbols". That is, the result should be:

"the text I need to extract"

At the moment I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression to extract the text between <i> and </i>?

Thanks.

Upvotes: 12

Views: 14848

Answers (5)

Rich Scriven

Reputation: 99331

If this is HTML (which it looks like it is), you should probably use an HTML parser. The XML package can do this:

library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"

On an entire HTML document, you can use:

doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)

Upvotes: 7

Sven Hohenstein

Reputation: 81693

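If you don't know the number of matches in a string, you can use the following approach with gregexpr and regmatches. The lookbehind and lookahead keep the <i> and </i> tags out of the result, and they require perl = TRUE.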

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")

regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
# 
# [[2]]
# [1] "another text"     "and another text"

Upvotes: 5

Tyler Rinker

Reputation: 109874

This approach uses qdapRegex, a package I maintain. It isn't a hand-written regex, but it may be of use to you or future searchers. The function rm_between lets the user extract text between a left and right boundary and optionally include the boundaries. This approach is easy in that you don't have to think of a specific regex, just the exact left and right boundaries:

library(qdapRegex)

x <- "<i>the text I need to extract</i></b></a></div>"

rm_between(x, "<i>", "</i>", extract=TRUE)

## [[1]]
## [1] "the text I need to extract"

I would point out that it may be more reliable to use an html parser for this job.

Upvotes: 11

G. Grothendieck

Reputation: 269644

If there is only one <i>...</i>, as in the example, then match everything up to <i> and everything from </i> onward, and replace both with the empty string:

x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)

giving:

[1] "the text I need to extract"

If there could be multiple occurrences in the same string then try:

library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)

giving the same in this example.
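
To see why the single-occurrence pattern is limited to one <i>...</i> per string, here is a small base R sketch. The input x2 and the variable names are made up for illustration, and the gregexpr/regmatches route is only one possible base R alternative, not part of gsubfn:

```r
# Hypothetical input with two <i>...</i> spans.
x2 <- "abc <i>another text</i> def <i>and another text</i> ghi"

# The greedy ".*<i>" consumes everything up to the LAST <i>,
# so only the final tagged span survives the substitution.
single <- gsub(".*<i>|</i>.*", "", x2)

# A lazy match found with gregexpr() keeps every span instead;
# the tags are stripped afterwards.
all_matches <- regmatches(x2, gregexpr("<i>.*?</i>", x2, perl = TRUE))[[1]]
all_text <- gsub("</?i>", "", all_matches)

print(single)
print(all_text)
```

The gsub call silently drops the first span, which is why the multiple-occurrence case needs a pattern that captures each match separately.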

Upvotes: 22

vks

Reputation: 67968

<i>((?:(?!<\/i>).)*)<\/i>

This should do it for you.
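
A minimal sketch of how this pattern might be applied in R, assuming base R's gregexpr and regmatches with perl = TRUE (needed for the lookahead). The "/" does not need escaping inside an R string, so the backslashes are dropped here; the sample vector and variable names are illustrative only:

```r
x <- c("<i>the text I need to extract</i></b></a></div>",
       "abc <i>another text</i> def <i>and another text</i> ghi")

# Consume one character at a time, as long as it does not
# begin a closing </i> tag, then require the closing tag.
pattern <- "<i>((?:(?!</i>).)*)</i>"

# gregexpr() returns the full matches, tags included ...
full <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# ... so remove the leading <i> and trailing </i> from each match.
inner <- lapply(full, function(m) sub("^<i>", "", sub("</i>$", "", m)))

print(inner)
```

Because regmatches returns whole matches rather than capture groups, the tags are stripped in a second step; strings with no match simply yield an empty character vector.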

Upvotes: 4
