Neil

Reputation: 8247

How to extract the "domain" from an email address

I have the following pattern in my column:

[email protected]
[email protected]

Now I want to extract the text after the @ and before the first ., i.e. gmail and hotmail. I am able to extract the text after the @ with the following code:

sub(".*@", "", email)

How can I modify the above to fit my use case?

Upvotes: 5

Views: 5851

Answers (4)

xaviescacs

Reputation: 349

This is @hrbrmstr's function with stringr:

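library(magrittr)  # provides the %>% pipe used below

# Locate the "@", take everything after it, and split it into host parts
stringr::str_locate_all(email,"@") %>% purrr::map_int(~ .[2]) %>%
purrr::map2_df(email, ~ {
  stringr::str_sub(.y, .x+1, nchar(.y)) %>%
    urltools::suffix_extract()
})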

Upvotes: 1

hrbrmstr

Reputation: 78832

You:

  1. really need to read Section 3 of RFC 3696 (TLDR: the @ can appear in multiple places)
  2. seem not to have considered that an email address can look like "[email protected]" or "[email protected]" (i.e. naively assuming that a bare domain follows the @ could come back to bite you at some point in this analysis)
  3. should be aware that if you're really looking for the email "domain name" then you also have to consider what really constitutes a domain name and a proper suffix.

So, unless you know for sure that you have and always will have simple email addresses, might I suggest:

library(stringi)
library(urltools)
library(dplyr)
library(purrr)

emails <- c("[email protected]", "[email protected]",
            "[email protected]",
            "[email protected]",
            "[email protected]")

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()
  })
##                         host    subdomain      domain suffix
## 1                  gmail.com         <NA>       gmail    com
## 2                hotmail.com         <NA>     hotmail    com
## 3     department.example.com   department     example    com
## 4 yet.another.department.com  yet.another  department    com
## 5             froodyco.co.uk         <NA>   froodyorg  co.uk

Note the proper splitting of subdomain, domain & suffix, especially for the last one.

Knowing this, we can then change the code to:

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail"                   "hotmail"               
## [3] "department.example"      "yet.another.department"
## [5] "froodyorg"
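To illustrate point 1 above with base R only, here is a minimal sketch. The addresses are hypothetical; the second one has a quoted local part that itself contains an @, which is the kind of case RFC 3696 describes. It compares stripping up to the first @ with stripping up to the last one:

```r
# Hypothetical addresses; the second has a quoted local part containing "@"
addrs <- c("jane@example.com", "\"Abc@def\"@example.com")

# Non-greedy match: removes everything up to the FIRST "@"
first_at <- sub("^.*?@", "", addrs, perl = TRUE)

# Greedy match: removes everything up to the LAST "@"
last_at <- sub("^.*@", "", addrs)

print(first_at)  # second element still carries part of the local part
print(last_at)   # both elements reduce to "example.com"
```

The greedy `.*@` from the question already anchors on the last @, which is the same position `stri_locate_last_fixed()` finds. What it does not do is split the host into subdomain, domain and suffix.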

Upvotes: 8

Jan

Reputation: 43189

You can use:

emails <- c("[email protected]", "[email protected]")
emails_new <- sub("^.*@(.+)$", "\\1", emails)  # drop everything up to and including the @
emails_new
# [1] "gmail.com"   "hotmail.com"

See a demo on ideone.com.

Upvotes: 3

akrun

Reputation: 887741

We can use gsub

gsub(".*@|\\..*", "", email)
#[1] "gmail"   "hotmail"
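As a rough check of how this pattern behaves beyond the simple case, here is a small sketch on made-up addresses (all hypothetical) that include a dotted local part, a subdomain and a two-part suffix:

```r
# Hypothetical addresses, not taken from the question
email <- c("j.doe@gmail.com",
           "bob@department.example.com",
           "amy@example.co.uk")

# ".*@" removes the local part and the "@";
# "\\..*" removes everything from the first remaining dot onwards
first_label <- gsub(".*@|\\..*", "", email)
print(first_label)
```

The result is always the first label after the @, so it is the registered domain only when the host has no subdomain. For the second address it is the subdomain rather than "example".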

Upvotes: 5
