Beans On Toast

Reputation: 1091

Retain string up to second slash in regex?

I am trying to retain only the first section of each string (which may include - and digits), up to and including the forward slash that follows it.

I have the following character vector:

x <- c('/youtube.com/videos/cats', '/google.com/images/dogs', 'bbc.com/movies')

/youtube.com/videos/cats
/google.com/images/dogs
bbc.com/movies

So it would look like this

/youtube.com/
/google.com/
bbc.com/

For reference I am using R 3.6

I have tried positive lookbehinds and the closest I got was this: ^\/[^\/]*

Any help appreciated

Note that in the bbc.com/movies example the string does not start with a forward slash /, but I still want to keep the bbc.com part in the match.

Upvotes: 0

Views: 219

Answers (4)

Wiktor Stribiżew

Reputation: 627110

You can use sub here, since only a single regex replacement per string is needed:

sub('^(/?[^/]*/).*', '\\1', x)

See the regex demo.

Details

  • ^ - start of string
  • (/?[^/]*/) - Capturing group 1 (\1 in the replacement pattern): an optional /, then 0 or more chars other than / and then a /
  • .* - any zero or more chars, as many as possible.

See an R test online:

test <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")
sub('^(/?[^/]*/).*', '\\1', test)
# => [1] "/youtube.com/" "/google.com/"  "bbc.com/"   
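
Since the question mentions hostnames that contain hyphens and digits, here is a small sketch that runs the same pattern on a few made-up inputs (the hostnames below are hypothetical, not taken from the question). Because [^/]* matches any character other than a slash, the pattern should not need any change for them. Note that sub() returns a string unchanged when it contains no slash to match.

```r
# Hypothetical inputs: hyphens, digits, several dots, and one string with no slash
extra <- c("/my-site1.co.uk/a/b", "news24.org/x", "noslash")

# Same pattern as above: optional leading /, then everything up to and
# including the next /; the rest of the string is dropped
trimmed <- sub("^(/?[^/]*/).*", "\\1", extra)

# "/my-site1.co.uk/a/b" becomes "/my-site1.co.uk/"
# "news24.org/x"        becomes "news24.org/"
# "noslash"             has no match and is returned as-is
print(trimmed)
```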

Upvotes: 1

Karthik S

Reputation: 11596

Using base R (note that this pattern assumes every domain ends in .com):

gsub('(\\/?.*\\.com\\/).*', '\\1', x)
[1] "/youtube.com/" "/google.com/"  "bbc.com/"     
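
A quick sketch of what the hard-coded .com means in practice. The second input is a hypothetical non-.com host, not from the question. The pattern above leaves it untouched, whereas a pattern that simply stops at the first slash after the host does not depend on the top-level domain.

```r
# One host from the question and one hypothetical non-.com host
y <- c("/google.com/images/dogs", "/example.org/images/dogs")

# Pattern from this answer: only matches when ".com/" is present
com_only <- gsub("(\\/?.*\\.com\\/).*", "\\1", y)

# TLD-agnostic alternative: optional leading /, then up to the next /
any_host <- sub("^(/?[^/]*/).*", "\\1", y)

print(com_only)  # second element is returned unchanged (no ".com/" in it)
print(any_host)  # both elements are cut after the host
```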

Upvotes: 0

DPH

Reputation: 4344

An alternative would be to use the rebus package:

library(rebus)
library(stringi)

x <- c("/youtube.com/videos/cats", "/google.com/images/dogs", "bbc.com/movies")

# Optional leading slash(es), letters, a literal dot, letters, optional trailing slash(es)
pattern <- zero_or_more("/") %R% one_or_more(ALPHA) %R% DOT %R% one_or_more(ALPHA) %R% zero_or_more("/")

stringi::stri_extract_first_regex(x, pattern)

[1] "/youtube.com/" "/google.com/"  "bbc.com/"

Upvotes: -1

NotThatKindODr

Reputation: 719

First, great username. Try this: you can leverage the fact that str_extract only pulls out the first match. Assuming all URLs follow the form letters.letters, this pattern should work. Let me know if any of them contain numbers.

library(stringr) 
c("/youtube.com/videos/cats",
  "/google.com/images/dogs",
  "bbc.com/movies") %>% 
   str_extract(., "/?\\w+\\.\\w+/")

produces

"/youtube.com/" "/google.com/"  "bbc.com/"  

Upvotes: 1
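
If the hostnames can contain hyphens or more than one dot, the "first match only" idea from the str_extract answer can be sketched in base R as well. This is only an illustration under assumptions: the second input is hypothetical, and the character class [\w.-] is my own choice of what a hostname may contain. The same class should also work inside str_extract, though that is not tested here.

```r
# One hypothetical host with a hyphen, a digit and two dots, plus two from the question
x <- c("/youtube.com/videos/cats", "/my-site1.co.uk/a/b", "bbc.com/movies")

# regexpr() locates only the first match in each string;
# perl = TRUE enables \w inside a character class
m <- regexpr("/?[\\w.-]+/", x, perl = TRUE)

# regmatches() extracts those matches; strings without a match would be dropped
hosts <- regmatches(x, m)
print(hosts)
```

One design note: unlike sub(), which returns non-matching strings unchanged, regmatches() omits them, so the result can be shorter than the input.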
