IRNotSmart

Reputation: 371

Grabbing part of a link from a URL in R

I have parts of links pertaining to baseball players in my character vector:

teamplayerlinks <- c(
    "/players/i/iannech01.shtml", 
    "/players/l/lindad01.shtml",  
    "/players/c/canoro01.shtml"
)

I would like to isolate the letters/numbers after the third /, and before the .shtml portion. I want the resulting strings to read:

desiredlinks
# [1] "iannech01" "lindad01"  "canoro01"

I assume this may be a job for sub, but after many trials and errors I'm having a very tough time with the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions (roughly the approach sketched below), but I'd rather solve it in a way that dynamically handles other links.
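For reference, the two-call version I have in mind is roughly this:

# strip everything up to and including the last /, then drop the extension
sub("\\..*$", "", sub("^.*/", "", teamplayerlinks))
# [1] "iannech01" "lindad01"  "canoro01"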

Thank you in advance to anyone who replies - I'm still learning R and trying to get better every day.

Upvotes: 1

Views: 66

Answers (3)

akrun

Reputation: 887088

This can also be done without regex: basename() keeps only the part after the last /, and tools::file_path_sans_ext() then drops the extension.

tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01"  "canoro01" 

Upvotes: 2

Ven Yao

Reputation: 3710

The basename function is useful here.

gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01"  "canoro01"

Upvotes: 2

Rich Scriven

Reputation: 99331

You could try

gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01"  "canoro01" 

Here we have

  • .*/ remove everything up to and including the last /
  • | or
  • \\..*$ remove the . and everything after it, through to the end of the string
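If you prefer extracting to deleting, the same pattern can also be phrased as a single sub call with a capture group (just an equivalent sketch of the idea above):

sub(".*/([^/.]+)\\..*$", "\\1", teamplayerlinks)
# [1] "iannech01" "lindad01"  "canoro01"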

By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package retrosheet for downloading data sets from retrosheet.com. It's also on CRAN. Check it out!
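If you do go the Lahman route, the lookup might be as simple as this (a minimal sketch, assuming these Baseball-Reference style IDs line up with Lahman's playerID column and that your version of the package has the People table):

# install.packages("Lahman")
library(Lahman)

ids <- c("iannech01", "lindad01", "canoro01")
# assumes the IDs match Lahman's playerID values
subset(People, playerID %in% ids, select = c(playerID, nameFirst, nameLast))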

Upvotes: 3
