Reputation: 459
I have a dataframe with a column that contains URLs like this:
https://www.facebook.com/nameofpage/posts/13142894231
I am trying to extract only the nameofpage portion of this column into a new column. I cannot figure out how to extract the string at that exact position. The string sometimes contains a literal ".", text, and numbers.
I have been trying to use strsplit and separate from tidyr, with limited success.
The tidyr code looks like this:
separate(Link, c(NA, NA, NA, "target"), sep = "/")
However, this does not work reliably.
I would expect nameofpage to be extracted into the new column, but sometimes the output is actually another piece of the URL.
Upvotes: 2
Views: 3904
Reputation: 269905
It is not entirely clear what we know about the position of the desired field, but if we know it is the 4th /-separated field, or the 3rd from last, we can use (1) or (2) respectively. (If neither of these can be assumed, please clarify exactly how we know which field is desired.)
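To illustrate why the two assumptions differ, here is a small base R sketch. It assumes, purely for illustration, that some links lack the "https://" prefix; the question does not confirm this, and the vector links below is made up.

```r
# Hypothetical inputs: the second link has no "https://" prefix (an assumption).
links <- c("https://www.facebook.com/nameofpage/posts/13142894231",
           "www.facebook.com/nameofpage/posts/13142894231")

# One character vector of "/"-separated fields per link.
parts <- strsplit(links, "/", fixed = TRUE)

# Counting from the start: field 4 shifts when the scheme is missing.
from_start <- vapply(parts, function(p) p[4], character(1))

# Counting from the end: the third-from-last field is the same for both forms.
from_end <- vapply(parts, function(p) p[length(p) - 2], character(1))
```

If the links in the real data do vary like this, counting from the end may be the safer choice; if the tail of the URL varies instead, counting from the start is.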
1) read.table Using the character vector ss in the Note below as input, we can use read.table if we know that the desired field is between the third and fourth slash.
read.table(text = ss, sep = "/", fill = TRUE, as.is = TRUE)[[4]]
## [1] "nameofpage" "nameofpage"
1a) separate Under the same assumption, separate from tidyr also works:
library(tidyr)
separate(data.frame(ss), ss, c(NA, NA, NA, "target"), sep = "/", extra = "drop")
## target
## 1 nameofpage
## 2 nameofpage
2) dirname/basename We can use dirname and basename if we know that the desired field is the third-from-last field:
basename(dirname(dirname(ss)))
## [1] "nameofpage" "nameofpage"
Note
s <- "https://www.facebook.com/nameofpage/posts/13142894231"
ss <- c(s, s)
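Since the question asks for a new column, approach (2) could be applied to a data frame roughly as follows. This is only a sketch: the data frame df and its second URL are made up, and the column name Link is taken from the question's separate call.

```r
# Hypothetical data frame; only the column name Link comes from the question.
df <- data.frame(
  Link = c("https://www.facebook.com/nameofpage/posts/13142894231",
           "https://www.facebook.com/another.page1/posts/987"),
  stringsAsFactors = FALSE
)

# Third-from-last path component, as in (2); dirname/basename are vectorized.
df$target <- basename(dirname(dirname(df$Link)))
```

This stays in base R and keeps the original Link column untouched.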
Upvotes: 0
Reputation: 2009
You can write a custom function to work on your strings:
get.nameofpage = function(string){
(unlist(strsplit(string, "\\/")))[4]
}
# Example
my.string = "https://www.facebook.com/nameofpage/posts/13142894231"
> get.nameofpage(my.string)
[1] "nameofpage"
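Note that because of unlist, the function above returns a single value even when given several URLs. A vectorized variant for a whole column might look like the sketch below; the name get.nameofpage.vec and the example URLs are hypothetical.

```r
# Hypothetical vectorized variant: one result per input URL.
get.nameofpage.vec <- function(strings) {
  parts <- strsplit(strings, "/", fixed = TRUE)
  # p[4] is NA when a URL has fewer than four "/"-separated fields.
  vapply(parts, function(p) p[4], character(1))
}

urls <- c("https://www.facebook.com/nameofpage/posts/13142894231",
          "https://www.facebook.com/other.page/posts/42",
          "https://www.facebook.com")
pages <- get.nameofpage.vec(urls)
```

The result can be assigned directly to a new data frame column, since it has the same length as the input.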
Upvotes: 1
Reputation: 5138
You could use gsub. The pattern captures one or more (+) characters that are not a forward slash ([^/]) immediately after ".com/", and the replacement "\\1" returns only that captured group:
link <- "https://www.facebook.com/nameofpage/posts/13142894231"
gsub("^.*\\.com/([^/]+).*", "\\1", link)
[1] "nameofpage"
Note: this will only work for a url with ".com" (i.e., it would not work for other domains .edu, .org, etc.)
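If other domains need to be handled, one possible variation is to anchor on the scheme and host rather than on ".com". This is a sketch under the assumption that every link starts with a lowercase scheme such as http:// or https://; the example.org link is made up.

```r
# Hypothetical links with different top-level domains.
links <- c("https://www.facebook.com/nameofpage/posts/13142894231",
           "http://example.org/some.page/posts/1")

# scheme://host/ followed by the first path segment (captured), then the rest.
page <- sub("^[a-z]+://[^/]+/([^/]+).*$", "\\1", links)
```

As with the gsub version, an input that does not match the pattern is returned unchanged, so it may be worth checking for that case separately.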
Upvotes: 0
Reputation: 51914
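In addition, there's also str_match from stringr, which returns the matched groups of a regular expression as matrix columns (column 1 is the full match, so the second group is in column 3):
library(stringr)
str_match(url, "://(.*?)/(.*?)(/|$)")[, 3]  # url: character vector of links; group 1 is the host, group 2 the page name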
Upvotes: 1
Reputation: 37661
You can use str_split from the stringr package.
URL = "https://www.facebook.com/nameofpage/posts/13142894231"
library(stringr)
str_split(URL, "/")
[[1]]
[1] "https:" "" "www.facebook.com" "nameofpage"
[5] "posts" "13142894231"
str_split(URL, "/")[[1]][4]
[1] "nameofpage"
Upvotes: 2