Reputation: 181

Remove all punctuation except backslash in R

I am trying to extract URLs from a data set. I am using strsplit and then grep to find the substrings that contain the links, but the results have unwanted characters at the beginning or the end of the string. How can I extract only the part of the string that matches the desired pattern?

Here is what I am currently doing.

1) I split a chunk of text using strsplit with " " (space) as the delimiter

2) Next I grep the result of strsplit to find the pattern

e.g. grep("https:\/\/support.google.com\/blogger\/topic\/[0-9]",r)

3) A few variations of the result are shown below:

https://support.google.com/blogger/topic/12457 
https://support.google.com/blogger/topic/12457.
[https://support.google.com/blogger/topic/12457]  
<<https://support.google.com/blogger/topic/12457>>
https://support.google.com/blogger/topic/12457,
https://support.google.com/blogger/topic/12457),
xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta
etc...

How can I extract just "https://support.google.com/blogger/topic/12457", or, after extracting the dirty data, how can I remove the unwanted punctuation?

Thanks in advance.

Upvotes: 1

Answers (3)

Jim

Reputation: 4767

Using rex may make this type of task a little simpler.

# generate dataset
x <- c(
"https://support.google.com/blogger/topic/12457
https://support.google.com/blogger/topic/12457.
https://support.google.com/blogger/topic/12457] 
<<https://support.google.com/blogger/topic/12457>>
https://support.google.com/blogger/topic/12457,
https://support.google.com/blogger/topic/12457),
xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta")

# extract urls
# note you don't have to worry about escaping the URL string yourself
library(rex)    
re <- rex(
  capture(name = "url",
    "https://support.google.com/blogger/topic/",
    digits
    ))

re_matches(x, re, global = TRUE)[[1]]
#>                                             url
#>1 https://support.google.com/blogger/topic/12457
#>2 https://support.google.com/blogger/topic/12457
#>3 https://support.google.com/blogger/topic/12457
#>4 https://support.google.com/blogger/topic/12457
#>5 https://support.google.com/blogger/topic/12457
#>6 https://support.google.com/blogger/topic/12457
#>7 https://support.google.com/blogger/topic/12457
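
For comparison, the same extraction can be sketched in base R without an extra package. This is a minimal sketch, not part of the original answer: the sample vector `x` and the variable names `pattern` and `urls` are made up for illustration, and the pattern assumes the topic id is always a run of digits.

```r
# Illustrative input: a few of the "dirty" variants from the question,
# plus one element with no link at all.
x <- c(
  "see https://support.google.com/blogger/topic/12457.",
  "[https://support.google.com/blogger/topic/12457]",
  "xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta",
  "no link here"
)

# Dots are escaped so they match literally; [0-9]+ consumes the whole topic id
# and stops at the first non-digit, which drops trailing punctuation.
pattern <- "https://support\\.google\\.com/blogger/topic/[0-9]+"

# gregexpr() locates every match in each element;
# regmatches() pulls out just the matched text.
urls <- unlist(regmatches(x, gregexpr(pattern, x)))
print(urls)
```

Because only the matched text is returned, there is no need to split on spaces first or to strip punctuation afterwards; elements without a match simply contribute nothing to the result.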

Upvotes: 0

maloneypatr

Reputation: 3622

The qdapRegex package has an awesome function called rm_url that is perfect for this example.

install.packages('qdapRegex')
library(qdapRegex)

urls <- YOUR_VECTOR_OF_URLS
rm_url(urls, extract = TRUE)

Upvotes: 1

Nick DiQuattro

Reputation: 739

If the data is HTML at some point, you could try this:

library(XML)
urls <- getNodeSet(htmlParse(htmldata), "//a[contains(@href, 'support.google.com')]/@href")

Upvotes: 0
