Reputation: 4239
I want to do the following extraction in R.
I have a column which has links like these http://www.imdb.com/title/tt2569314/companycredits
I want to extract the tt2569314 out of this and store it in a new column.
The way I want to do it is, say, take substring of column where start position is LEN(http://www.imdb.com/) and end position is dynamic based on when the first '/' is found after the start position.
I want this to be kind of a mixture of SUBSTR and INSTR in SQL.
Please advise.
Upvotes: 2
Views: 2898
Reputation: 99361
If all the links are similar in path structure, you can use the dirname
x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"
Or you can paste together a regular expression with the base URL
y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"
Or you may even be able to get away with this:
basename(dirname(x))
# [1] "tt2569314"
It's a bit more drawn out if you use the substring. But stringr
has a couple of helpful functions.
library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
Upvotes: 2
Reputation: 174796
You may try this also,
> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"
Upvotes: 0
Reputation: 887571
You could try:
str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
#[1] "tt2569314"
Upvotes: 1
Reputation: 17611
You could try this:
a<-"http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+","\\1" ,a)
#[1] "tt2569314"
Upvotes: 2