Ryan Warnick

Reputation: 1099

Removing html tags from a string in R

I'm reading web page source into R and processing it as strings. I want to extract the paragraphs and remove the HTML tags from the paragraph text, but I'm running into the following problem:

I tried implementing a function to remove the html tags:

cleanFun = function(fullStr) {
  # find location of tags and citations
  tagLoc = cbind(str_locate_all(fullStr, "<")[[1]][, 2],
                 str_locate_all(fullStr, ">")[[1]][, 1])

  # create storage for tag strings
  tagStrings = list()

  # extract and store tag strings
  for (i in 1:dim(tagLoc)[1]) {
    tagStrings[i] = substr(fullStr, tagLoc[i, 1], tagLoc[i, 2])
  }

  # remove tag strings from paragraph
  newStr = fullStr
  for (i in 1:length(tagStrings)) {
    newStr = str_replace_all(newStr, tagStrings[[i]][1], "")
  }
  return(newStr)
}

This works for some tags but not all. An example where it fails is the following string:

test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"

The goal would be to obtain:

cleanFun(test)="junk junk junk junk"

However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

Upvotes: 44

Views: 39548

Answers (7)

Scott Ritchie

Reputation: 10543

This can be achieved simply through regular expressions and the grep family:

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

This will also work with multiple html tags in the same string!

This finds any instance of the pattern <.*?> in htmlString and replaces it with the empty string "". The ? in .*? makes the match non-greedy, so if you have multiple tags (e.g., <a> junk </a>) it matches <a> and </a> separately instead of the whole string.
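As a quick sanity check, here is the same pattern applied to a string with several tags (the example string is my own, not from the question):

```r
# Base R only: each tag is matched and removed separately thanks to the
# non-greedy quantifier
html <- "<b>junk</b> junk<a href=\"x\"> junk</a> junk"
gsub("<.*?>", "", html)
# [1] "junk junk junk junk"
```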

Upvotes: 91

David Robinson

Reputation: 78610

You can also do this with two functions in the rvest package:

library(rvest)

strip_html <- function(s) {
    html_text(read_html(s))
}

Example output:

> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Note that you should not use regexes to parse HTML.

Upvotes: 34

user1609452

Reputation: 4444

It is best not to parse HTML using regular expressions; see the well-known answer "RegEx match open tags except XHTML self-contained tags".

Use a package like XML. Read the HTML in, parse it (using, for example, htmlParse), and then use XPath expressions to find the quantities relevant to you.

UPDATE:

To answer the OP's question

require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)
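For the OP's test string specifically, a sketch along these lines might look as follows (the asText = TRUE argument and the //body XPath are my own choices, not from the answer):

```r
library(XML)

test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
# asText = TRUE parses the string itself; libxml2 wraps the fragment in
# an <html><body> document
doc <- htmlParse(test, asText = TRUE)
# xmlValue concatenates all text nodes under the body, i.e. everything
# outside the tags
xpathSApply(doc, "//body", xmlValue)
# [1] "junk junk junk junk"
```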

Upvotes: 4

Peyton

Reputation: 7396

Another approach uses tm.plugin.webmining, which relies on XML internally.

> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

Upvotes: 11

PAC

Reputation: 5366

It may be easier with sub or gsub?

> test  <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"

Upvotes: 2

Tyler Rinker

Reputation: 109894

An approach using the qdap package:

library(qdap)
bracketX(test, "angle")

## > bracketX(test, "angle")
## [1] "junk junk junk junk"

Upvotes: 7

Hong Ooi

Reputation: 57686

First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.

Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
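To make that concrete: the tag strings the OP extracts contain parentheses, which stringr interprets as regex grouping rather than literal characters, so the pattern silently fails to match. Wrapping the pattern in stringr's fixed() is one way to sidestep this (a sketch of the failure mode, not the only fix):

```r
library(stringr)

tag <- "<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\">"
s   <- paste0("junk junk", tag, " junk junk")

# Treated as a regex, "(mathematics)" matches "mathematics" WITHOUT the
# parentheses, so the pattern never matches and s comes back unchanged:
str_replace_all(s, tag, "")

# fixed() matches the tag literally, so it is actually removed:
str_replace_all(s, fixed(tag), "")
# [1] "junk junk junk junk"
```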

Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.

Upvotes: 1
