mundos

Reputation: 459

Is there a regex to find a string between two forward slashes and after a specific string? [R]

I have a data frame with a column that contains URLs like this:

https://www.facebook.com/nameofpage/posts/13142894231

I am trying to extract only the nameofpage portion of this column into a new column. I cannot figure out how to extract the string at that exact position. The string sometimes contains a literal ".", text, and numbers.

I have been trying to use strsplit and separate from tidyr with limited success.

The tidyr code looks like this:

  separate(Link, c(NA, NA, NA, "target"), sep = "/")

However, this really does not work at all.

I would expect to extract the nameofpage into the column, but sometimes the output is actually another piece of the URL.

Upvotes: 2

Views: 3904

Answers (5)

G. Grothendieck

Reputation: 269905

There is some question regarding exactly what we know about the position of the desired field, but if we know it is the 4th /-separated field or the 3rd from last, we can use (1) or (2) respectively. (If neither of these can be assumed, please clarify exactly how we know which field is desired.)

1) read.table Using the character vector ss in the Note below as input we can use read.table if we know that the desired field is between the third and fourth slash.

read.table(text = ss, sep = "/", fill = TRUE, as.is = TRUE)[[4]]
## [1] "nameofpage" "nameofpage"

1a) Using separate:

library(tidyr)

separate(data.frame(ss), ss, c(NA, NA, NA, "target"), sep = "/", extra = "drop")
##       target
## 1 nameofpage
## 2 nameofpage

2) dirname/basename We can use dirname and basename if we know that the desired field is the third from last:

basename(dirname(dirname(ss)))
## [1] "nameofpage" "nameofpage"

Note

s <- "https://www.facebook.com/nameofpage/posts/13142894231"
ss <- c(s, s)

Upvotes: 0

NM_

Reputation: 2009

You can write a custom function to work on your strings:

get.nameofpage = function(string){
  # split on "/" and keep the 4th piece (expects a single string)
  (unlist(strsplit(string, "\\/")))[4]
}

# Example
my.string = "https://www.facebook.com/nameofpage/posts/13142894231"
get.nameofpage(my.string)
## [1] "nameofpage"
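
The function above handles one string at a time, because unlist flattens the pieces of every link into a single vector. A minimal sketch for a whole column, assuming a data frame named df (a hypothetical name) with the Link column from the question and a made-up second link: it takes the 4th piece of each link separately with vapply.

```r
# Hypothetical data frame; only the first link comes from the question.
df <- data.frame(
  Link = c(
    "https://www.facebook.com/nameofpage/posts/13142894231",
    "https://www.facebook.com/another.page2/posts/999"
  ),
  stringsAsFactors = FALSE
)

# strsplit returns one character vector per link; keep the 4th piece of each.
pieces <- strsplit(df$Link, "/", fixed = TRUE)
df$target <- vapply(pieces, function(p) p[4], character(1))
df$target
```

Links with fewer than four pieces yield NA rather than an error, which makes bad rows easy to spot.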

Upvotes: 1

Andrew

Reputation: 5138

You could use gsub. The capture group ([^/]+) matches one or more (+) characters that are not a forward slash ([^/]) immediately after ".com/", and the replacement "\\1" keeps only that group:

link <- "https://www.facebook.com/nameofpage/posts/13142894231"

gsub("^.*\\.com/([^/]+).*", "\\1", link)
## [1] "nameofpage"

Note: this will only work for a url with ".com" (i.e., it would not work for other domains .edu, .org, etc.)
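
If other domains need to be covered, one option is to anchor on the scheme and host instead of on ".com". This is a sketch under the assumption that every link starts with a scheme such as http:// or https://; the second link is a made-up example.

```r
# Hypothetical sample links; only the first comes from the question.
links <- c(
  "https://www.facebook.com/nameofpage/posts/13142894231",
  "http://example.org/some.page1/posts/42"
)

# Skip the scheme and the host, then capture everything up to the next slash.
page <- sub("^[A-Za-z]+://[^/]+/([^/]+).*$", "\\1", links)
page
```

When a link does not match the pattern, sub returns it unchanged, so unexpected inputs are worth checking separately.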

Upvotes: 0

jspcal

Reputation: 51914

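In addition, there's also str_match from the stringr package, which returns the groups captured by a regular expression. Column 1 of the result holds the full match, so the second capture group (the page name) is in column 3:

library(stringr)

# url is the character vector of links
str_match(url, "://(.*?)/(.*?)(/|$)")[, 3]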

Upvotes: 1

G5W

Reputation: 37661

You can use str_split from the stringr package.

URL = "https://www.facebook.com/nameofpage/posts/13142894231"

library(stringr)

str_split(URL, "/")
## [[1]]
## [1] "https:"           ""                 "www.facebook.com" "nameofpage"
## [5] "posts"            "13142894231"

str_split(URL, "/")[[1]][4]
## [1] "nameofpage"
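
When the links sit in a vector or a data frame column, indexing with [[1]] only reaches the first element. A minimal sketch, assuming stringr is installed and using a hypothetical second link: with simplify = TRUE, str_split returns a character matrix, so the fourth column holds the page name for every link.

```r
library(stringr)

# Hypothetical input; only the first link comes from the question.
URLs <- c(
  "https://www.facebook.com/nameofpage/posts/13142894231",
  "https://www.facebook.com/another.page2/posts/999"
)

# One row per link, one column per "/"-separated piece.
parts <- str_split(URLs, "/", simplify = TRUE)
target <- parts[, 4]
target
```

Shorter links are padded with empty strings in the matrix, so an empty result signals a link with fewer than four pieces.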

Upvotes: 2
