blue-sky

Reputation: 53826

Modify regex to exclude characters that occur at beginning

Using the code below, I'm extracting a generated HTML link:

mystr <- c("/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq")

library(stringr)
str_extract(mystr, "^.*(?=(&sa))")

This returns:

[1] "/url?q=http://www.mypage.html"

How can I modify the regex to exclude /url?q=, so that just http://www.mypage.html is returned?

Upvotes: 1

Views: 43

Answers (2)

Wiktor Stribiżew

Reputation: 626853

You may also use a base R sub solution that matches everything up to the first http, then captures http followed by any chars other than &:

sub(".*?(http[^&]*).*", "\\1", mystr)

You can make the pattern more precise by adding q= after .*? so that the captured part must directly follow q=.

Details

  • .*? - any 0+ chars as few as possible,
  • (http[^&]*) - capturing group #1 matching http and then any zero or more chars other than &
  • .* - the rest of the string.

The \1 is a replacement backreference to the Group 1 value.
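As a minimal sketch of the more precise variant mentioned above, the pattern below requires q= right before the captured group. It assumes the mystr value from the question; the variable name result is only illustrative.

```r
# Input string taken from the question
mystr <- "/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq"

# .*?q=        - skip lazily up to and including the first "q="
# (http[^&]*)  - capture "http" plus everything up to the next "&"
# .*           - consume the rest so that sub() replaces the whole string
result <- sub(".*?q=(http[^&]*).*", "\\1", mystr)
print(result)
```

Note that sub returns the input unchanged when the pattern does not match, so it may be worth checking with grepl first if some strings might lack a q=http part.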

Upvotes: 1

Sotos

Reputation: 51592

You can replace the start-of-string anchor (i.e. ^) in your pattern with http:

stringr::str_extract(mystr, "http.*(?=(&sa))") 
#[1] "http://www.mypage.html"
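Because .* is greedy, the pattern above runs up to the last &sa in the string, which is fine for this input but could over-match if &sa appeared more than once. A hedged sketch of two alternatives, assuming the link never contains & and always follows q= (the names by_class and by_lookbehind are only illustrative):

```r
library(stringr)

# Input string taken from the question
mystr <- "/url?q=http://www.mypage.html&sa=U&ved=0ahUKEwjgyMPj2pXXAhWB5CYKHXysDlsQqQIIKSgAMAg&usg=AOvVaw1VCvT8iznodM3l4xvc8CVq"

# Stop at the first "&" instead of looking ahead for "&sa"
by_class <- str_extract(mystr, "http[^&]*")

# Lookbehind: match whatever follows "q=" up to the first "&",
# without including "q=" in the result
by_lookbehind <- str_extract(mystr, "(?<=q=)[^&]+")

print(by_class)
print(by_lookbehind)
```

Unlike sub, str_extract returns NA when nothing matches, which makes missing links easy to detect.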

Upvotes: 1
