Reputation: 371
I have parts of links pertaining to baseball players in my character vector:
teamplayerlinks <- c(
"/players/i/iannech01.shtml",
"/players/l/lindad01.shtml",
"/players/c/canoro01.shtml"
)
I would like to isolate the letters/numbers after the third / and before the .shtml portion. I want my resulting vector to read:
desiredlinks
# [1] "iannech01" "lindad01" "canoro01"
I assume this may be a job for sub, but after much trial and error, I'm having a very tough time learning the escape and character sequences. I know it can be done with two sub calls to remove the front and back portions, but I'd rather do it in one step that dynamically handles other links.
Thank you in advance to anyone who replies - I'm still learning R and trying to get better every day.
Upvotes: 1
Views: 66
Reputation: 887088
This can also be done without regex:
tools::file_path_sans_ext(basename(teamplayerlinks))
#[1] "iannech01" "lindad01" "canoro01"
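In case it helps to see what each function contributes, here is a small sketch that splits the one-liner into two steps. The intermediate names files and ids are just illustrative; the input is the vector from the question.

```r
# Same input vector as in the question
teamplayerlinks <- c(
  "/players/i/iannech01.shtml",
  "/players/l/lindad01.shtml",
  "/players/c/canoro01.shtml"
)

# Step 1: basename() drops everything up to and including the last "/",
# leaving "iannech01.shtml", "lindad01.shtml", "canoro01.shtml"
files <- basename(teamplayerlinks)

# Step 2: file_path_sans_ext() strips the trailing extension,
# leaving "iannech01", "lindad01", "canoro01"
ids <- tools::file_path_sans_ext(files)
ids
```

Since neither step hard-codes "shtml", this should also cope with links that end in a different extension.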
Upvotes: 2
Reputation: 3710
The basename function is useful here.
gsub("\\.shtml", "", basename(teamplayerlinks))
# [1] "iannech01" "lindad01" "canoro01"
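A slightly stricter variant, offered as a sketch rather than a correction: anchoring the pattern with $ and using sub instead of gsub removes .shtml only when it is the very end of the string. The name ids is just illustrative.

```r
# Same input vector as in the question
teamplayerlinks <- c(
  "/players/i/iannech01.shtml",
  "/players/l/lindad01.shtml",
  "/players/c/canoro01.shtml"
)

# "\\.shtml$" matches a literal ".shtml" only at the end of each string;
# sub() replaces at most one match per element, which is all that is needed here
ids <- sub("\\.shtml$", "", basename(teamplayerlinks))
ids
```

For the links in the question both versions give the same result; the anchor only matters if ".shtml" could appear somewhere in the middle of a file name.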
Upvotes: 2
Reputation: 99331
You could try
gsub(".*/|\\..*$", "", teamplayerlinks)
# [1] "iannech01" "lindad01" "canoro01"
Here we have
.*/ : remove everything up to and including the last /
| : or
\\..*$ : remove the . and everything after it, through the end of the string

By the way, these look a bit like player IDs given in the Lahman baseball data sets. If so, you can use the Lahman package in R and not have to scrape the web. It has numerous baseball data sets. It can be installed with install.packages("Lahman"). I also wrote a package, retrosheet, for downloading data sets from retrosheet.org. It's also on CRAN. Check it out!
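Going back to the regex: if it helps, here is a sketch that applies each half of the pattern on its own, followed by a single sub call with a capture group as an alternative. The capture-group pattern is my own variation, not the call shown above, and the extra .html link is a hypothetical example added only to show that the extension is not hard-coded.

```r
x <- "/players/i/iannech01.shtml"

# First alternative alone: strips the path, keeps the extension
front_removed <- sub(".*/", "", x)      # "iannech01.shtml"

# Second alternative alone: strips from the "." to the end, keeps the path
back_removed <- sub("\\..*$", "", x)    # "/players/i/iannech01"

# Capture-group variant: keep only the run of characters between
# the last "/" and the following "."
links <- c(x, "/players/a/aaronha01.html")   # second link is hypothetical
ids <- sub("^.*/([^/.]+)\\.[^/]*$", "\\1", links)
ids                                     # "iannech01" "aaronha01"
```

The capture group says what to keep rather than what to delete, which some people find easier to read once the links get more varied.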
Upvotes: 3