user3422637

Reputation: 4239

Extract substring in R from string with fixed start position and end point as a character found

I want to do the following extraction in R.

I have a column containing links like this: http://www.imdb.com/title/tt2569314/companycredits

I want to extract tt2569314 from each link and store it in a new column.

My idea is to take a substring of the column where the start position is LEN(http://www.imdb.com/) and the end position is dynamic, determined by where the first '/' is found after the start position.

In other words, I want something like a combination of SUBSTR and INSTR in SQL.

Please advise.

Upvotes: 2

Views: 2898

Answers (4)

Rich Scriven

Reputation: 99361

If all the links have a similar path structure, you can use dirname():

x <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("(.*)[/]", "", dirname(x))
# [1] "tt2569314"

Or you can paste together a regular expression with the base URL:

y <- "http://www.imdb.com"
sub(paste0(y, "[/](.*)[/](.*)[/](.*)"), "\\2", x)
# [1] "tt2569314"

Or you may even be able to get away with this:

basename(dirname(x))
# [1] "tt2569314"

It's a bit more drawn out if you use substr(), but stringr has a couple of helpful functions:

library(stringr)
s1 <- str_locate_all(x, "[/]")[[1]]
s2 <- str_locate(x, "http://www.imdb.com/title")
m <- match(s2[,2]+1, s1[,1])
substr(x, s1[m,1]+1, s1[m+1,1]-1)
# [1] "tt2569314"
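For the SUBSTR/INSTR-style approach described in the question, here is a minimal base R sketch applied to a whole column. The data frame `df`, its column names `link` and `id`, and the second URL are made up for illustration. It also assumes the fixed prefix includes "title/", since the ID comes after that segment rather than directly after the host.

```r
# Hypothetical data frame; `link` and `id` are assumed column names,
# and the second URL is a made-up example row.
df <- data.frame(
  link = c("http://www.imdb.com/title/tt2569314/companycredits",
           "http://www.imdb.com/title/tt0000001/fullcredits"),
  stringsAsFactors = FALSE
)

prefix <- "http://www.imdb.com/title/"

# Fixed start: first character after the prefix (the LEN() part).
start <- nchar(prefix) + 1
rest  <- substr(df$link, start, nchar(df$link))

# Dynamic end: position of the first "/" in the remainder (the INSTR part).
slash <- regexpr("/", rest, fixed = TRUE)

# Keep everything before that "/" and store it in a new column.
df$id <- substr(rest, 1, slash - 1)
df$id
```

Both substr() and regexpr() are vectorised, so no loop is needed. If a link has no "/" after the prefix, regexpr() returns -1 and this sketch yields an empty string for that row, so you may want to handle that case separately.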

Upvotes: 2

Avinash Raj

Reputation: 174796

You could also try this:

> x <- "http://www.imdb.com/title/tt2569314/companycredits"
> m <- regexpr("^http://www.imdb.com/[^/]*/\\K[^/]+", x, perl=TRUE)
> regmatches(x, m)
[1] "tt2569314"
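One thing to watch when applying this to a column: with regexpr() match data, regmatches() returns only the matched substrings and drops elements that did not match, so the result can be shorter than the input. A possible sketch that keeps the result aligned with the input is below. The vector `links` and its second, non-matching URL are hypothetical, and the dots in the pattern are escaped here, which is slightly stricter than the pattern above.

```r
# Hypothetical input; the second element is a made-up non-matching URL.
links <- c("http://www.imdb.com/title/tt2569314/companycredits",
           "http://example.com/not-imdb")

# \\K discards everything matched so far, so only the ID segment is kept.
m <- regexpr("^http://www\\.imdb\\.com/[^/]*/\\K[^/]+", links, perl = TRUE)

# Start with NA everywhere, then fill in only the positions that matched.
id <- rep(NA_character_, length(links))
id[m != -1] <- regmatches(links, m)
id
```

Rows that do not match stay NA, so `id` can be assigned directly as a new column.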

Upvotes: 0

akrun

Reputation: 887571

You could try:

str1 <- "http://www.imdb.com/title/tt2569314/companycredits"
library(httr)
gsub("^[^/]*\\/|\\/[^/]*", "", parse_url(str1)$path)
# [1] "tt2569314"

Upvotes: 1

Jota

Reputation: 17611

You could try this:

a <- "http://www.imdb.com/title/tt2569314/companycredits"
sub("http://www.imdb.com/.+/(.+)/.+", "\\1", a)
# [1] "tt2569314"

Upvotes: 2
