kimi

Reputation: 525

Extract part of string from URL in R

Here are the sample rows that I have. All I want is to extract the website names, such as 3dhubs or adludio. How can I do that? Cheers,

URL
https://www.3dhubs.com/
https://adludio.com/
https://aircall.io/
https://www.andjaro.com/en/home/

Result

3dhubs
adludio
aircall
andjaro

Once I typed this code,

suffix_extract(domain(df$URL))

I got the result below. When I tried to assign it to a column, it looked different. How do I get the domain and assign it to a column?

host            subdomain  domain   suffix
www.3dhubs.com  www        3dhubs   com
adludio.com     <NA>       adludio  com

Upvotes: 1

Views: 886

Answers (3)

phiver

Reputation: 23598

The urltools package should work. If your data is in a character vector named urls, this is what it returns.

library(urltools)
df1 <- suffix_extract(domain(urls))
df1
             host subdomain  domain suffix
1  www.3dhubs.com       www  3dhubs    com
2     adludio.com      <NA> adludio    com
3      aircall.io      <NA> aircall     io
4 www.andjaro.com       www andjaro    com

df1$domain
[1] "3dhubs"  "adludio" "aircall" "andjaro"

A dplyr / tidyr option is as follows, using url_parse from urltools to pull the host out of each URL before splitting it.

library(dplyr)
library(tidyr)

df <- tibble(urls)  # data_frame() is deprecated; tibble() is re-exported by dplyr

df %>% 
  mutate(url_parsed = urltools::url_parse(urls)$domain) %>% 
  separate(url_parsed, into = c("subdomain", "domain", "suffix"), fill = "left")

# A tibble: 4 x 4
  urls                             subdomain domain  suffix
  <chr>                            <chr>     <chr>   <chr> 
1 https://www.3dhubs.com/          www       3dhubs  com   
2 https://adludio.com/             NA        adludio com   
3 https://aircall.io/              NA        aircall io    
4 https://www.andjaro.com/en/home/ www       andjaro com   

data:

urls <- c("https://www.3dhubs.com/", "https://adludio.com/", "https://aircall.io/", 
          "https://www.andjaro.com/en/home/")

Upvotes: 0

MrFlick

Reputation: 206197

It would probably be safest to use a proper URL parser like the one from the urltools package. For example:

dd$domain <- urltools::url_parse(dd$URL)$domain

Tested with

dd <- read.table(text = "URL
https://www.3dhubs.com/
https://adludio.com/
https://aircall.io/
https://www.andjaro.com/en/home/", header = TRUE, stringsAsFactors = FALSE)
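Note that the domain column returned by url_parse holds the full host (for example www.3dhubs.com), so one more step is needed to get the bare name. If installing urltools is not an option, a rough base R sketch could look like the following. The helper extract_site_name and the column name are made up for illustration and are not part of any package. It assumes a single-label suffix such as .com or .io, so it would give the wrong result for hosts like example.co.uk or for subdomains other than www.

```r
# Hypothetical helper (not from urltools): pull the site name out of a URL
# with base R regular expressions only.
# Assumption: the suffix is a single label (.com, .io) and the only
# subdomain to strip is "www".
extract_site_name <- function(url) {
  host <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)  # drop the scheme, e.g. "https://"
  host <- sub("[/:?#].*$", "", host)                    # drop port, path, query, fragment
  host <- sub("^www\\.", "", host)                      # drop a leading "www."
  sub("\\..*$", "", host)                               # keep only the first remaining label
}

dd <- data.frame(
  URL = c("https://www.3dhubs.com/", "https://adludio.com/",
          "https://aircall.io/", "https://www.andjaro.com/en/home/"),
  stringsAsFactors = FALSE
)

# Assign the extracted names to a new column
dd$name <- extract_site_name(dd$URL)
dd$name
```

For anything beyond simple hosts like these, suffix_extract from urltools is the safer choice because it consults a list of public suffixes instead of guessing from the dots.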

Upvotes: 1

Konrad Rudolph

Reputation: 545588

How to get domain and assign to a column?

Using {urltools}, the following works for me:

df$domain = suffix_extract(domain(df$URL))$domain

Upvotes: 0
