Reputation: 525
Here are the sample rows that I have. All I want is to extract the website names, such as 3dhubs or adludio. How can I do that? Cheers,
URL
https://www.3dhubs.com/
https://adludio.com/
https://aircall.io/
https://www.andjaro.com/en/home/
Result
3dhubs
adludio
aircall
andjaro
When I ran this code,
suffix_extract(domain(df$URL))
I got the result shown below. When I tried to assign it to a column, though, it looked different. How can I get the domain and assign it to a column?
host subdomain domain suffix
www.3dhubs.com www 3dhubs com
adludio.com <NA> adludio com
Upvotes: 1
Views: 886
Reputation: 23598
The urltools library should work. If your data is in an object called urls, this is what it returns.
library(urltools)
df1 <- suffix_extract(domain(urls))
df1
host subdomain domain suffix
1 www.3dhubs.com www 3dhubs com
2 adludio.com <NA> adludio com
3 aircall.io <NA> aircall io
4 www.andjaro.com www andjaro com
df1$domain
[1] "3dhubs" "adludio" "aircall" "andjaro"
A dplyr / tidyr option is as follows. It uses url_parse from urltools
to pull the host out of each URL before splitting it into parts.
library(dplyr)
library(tidyr)
df <- tibble(urls)  # data_frame() is deprecated in favour of tibble()
df %>%
  mutate(url_parsed = urltools::url_parse(urls)$domain) %>%
  separate(url_parsed, into = c("subdomain", "domain", "suffix"), fill = "left")
# A tibble: 4 x 4
urls subdomain domain suffix
<chr> <chr> <chr> <chr>
1 https://www.3dhubs.com/ www 3dhubs com
2 https://adludio.com/ NA adludio com
3 https://aircall.io/ NA aircall io
4 https://www.andjaro.com/en/home/ www andjaro com
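One caveat with splitting on dots: it assumes every suffix is a single label. The sketch below compares the two approaches on hosts with a two-part suffix. The `example.co.uk` hosts and the `by_dots` / `by_list` names are made up for illustration and are not from the question; `extra = "drop"` is added only to silence the warning about the discarded fourth piece. Since suffix_extract() works from a public suffix list, it should treat `co.uk` as one suffix, but that is worth checking against your installed urltools version.

```r
library(urltools)
library(tidyr)

# Hypothetical hosts with a two-part public suffix (not from the question)
hosts <- data.frame(host = c("example.co.uk", "www.example.co.uk"),
                    stringsAsFactors = FALSE)

# Splitting on "." gives 3 and 4 pieces, so the columns no longer line up:
# the first row puts "co" in the domain column
by_dots <- separate(hosts, host, into = c("subdomain", "domain", "suffix"),
                    fill = "left", extra = "drop", remove = FALSE)
by_dots$domain

# suffix_extract() matches the suffix against a public suffix list instead
by_list <- suffix_extract(hosts$host)
by_list$domain
```

If your URLs only use single-label suffixes such as `.com` or `.io`, as in the sample data, both approaches give the same domain column.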
data:
urls <- c("https://www.3dhubs.com/", "https://adludio.com/", "https://aircall.io/",
"https://www.andjaro.com/en/home/")
Upvotes: 0
Reputation: 206197
It would probably be safest to use a proper URL parser like the one from the urltools
package. For example:
dd$domain <- urltools::url_parse(dd$URL)$domain
Tested with
dd <- read.table(text = "URL
https://www.3dhubs.com/
https://adludio.com/
https://aircall.io/
https://www.andjaro.com/en/home/", header = TRUE, stringsAsFactors = FALSE)
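Note that the `domain` column returned by url_parse() holds the full host (for example `www.3dhubs.com`), so getting the bare name asked for in the question takes one more step. A minimal sketch, assuming urltools is installed; the data frame is rebuilt here so the snippet runs on its own, and the column names `host` and `name` are only illustrative:

```r
library(urltools)

# Same sample URLs as in the question
dd <- data.frame(
  URL = c("https://www.3dhubs.com/", "https://adludio.com/",
          "https://aircall.io/", "https://www.andjaro.com/en/home/"),
  stringsAsFactors = FALSE
)

# url_parse() splits each URL into components; its `domain` column is the host
dd$host <- url_parse(dd$URL)$domain

# suffix_extract() splits the host into subdomain, domain and suffix;
# keep only the domain part as the website name
dd$name <- suffix_extract(dd$host)$domain
dd$name
```

Keeping the intermediate `host` column is optional; it is there to make it easy to see what each step contributes.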
Upvotes: 1
Reputation: 545588
How to get domain and assign to a column?
Using {urltools}, the following works for me:
library(urltools)
# Take only the `domain` column of the data frame that suffix_extract() returns
df$domain = suffix_extract(domain(df$URL))$domain
Upvotes: 0