I'll-Be-Back

Reputation: 10828

Regex get domain name from email

I am learning regex and am having trouble extracting google from an email address.

String

[email protected]

I just want to get google, not google.com

Regex:

[^@].+(?=\.)

Result: https://regex101.com/r/wA5eX5/1

My understanding is that it ignores the @ and then finds the string after it up to the . (dot), using (?=\.)

What did I do wrong?

Upvotes: 31

Views: 57975

Answers (12)

1 Minutes Fact

Reputation: 1

Extract Domains and Count Unique:

awk -F'@' '{print $2}' your_csv_file | sort | uniq -c > your_results_file
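For readers who prefer to stay in one script, the same count can be sketched in Python. This is a rough equivalent under a few assumptions: `count_domains` and the sample addresses are hypothetical, the domain is taken after the last @ (awk's `$2` takes the field after the first one), and domains are lowercased before counting, which the awk pipeline does not do.

```python
from collections import Counter

def count_domains(addresses):
    """Count how often each domain (the text after the last '@') occurs."""
    counts = Counter()
    for address in addresses:
        _local, sep, domain = address.strip().rpartition("@")
        if sep and domain:  # skip lines that contain no '@'
            counts[domain.lower()] += 1  # lowercasing is an added assumption
    return counts

# Hypothetical sample data, not taken from the question.
sample = ["alice@google.com", "bob@example.org", "carol@Google.com", "not-an-email"]
counts = count_domains(sample)
print(counts.most_common())
```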

Upvotes: 0

Tristan Bailey

Reputation: 437

If you have a reliable source for email addresses, such as an email application API, the following regular expression will allow the domain to be extracted:

(?<Domain>[\w-]{2,63})\.(?:[\w-]{2,63}?)(?:\.(?:[a-z]{2}))?$

Here, you can use the named capture, 'Domain'. This takes subdomains into account by ignoring them, and only focuses on the end of every line. Note that the regular expression doesn't actually focus on a subdomain at all, if one exists, but it does need to match a top level domain (and potentially a country code) separately.

Don't use this regular expression if you are processing an email address directly from user input. This solution was inspired by m1m1k's answer, and m1m1k's answer should be used instead if the email address needs to be sanitised.
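A minimal sketch of the named capture in Python, for illustration only. Python's re module writes named groups as (?P<name>...), so the group is respelled that way; the rest of the pattern is unchanged. The helper name `registered_name` and the addresses are hypothetical.

```python
import re

# Same pattern as above, with the named group in Python's (?P<name>...) form.
DOMAIN_RE = re.compile(
    r"(?P<Domain>[\w-]{2,63})\.(?:[\w-]{2,63}?)(?:\.(?:[a-z]{2}))?$"
)

def registered_name(address):
    """Return the 'Domain' group, or None when the pattern does not match."""
    match = DOMAIN_RE.search(address)
    return match.group("Domain") if match else None

# Hypothetical addresses for illustration.
print(registered_name("user@google.com"))          # google
print(registered_name("user@mail.example.co.uk"))  # example
```

The pattern is anchored only at the end of the string, which is why the subdomain `mail` is skipped without being matched explicitly.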

Upvotes: 0

Güney Saramalı

Reputation: 799

For those looking for something compatible with Go (golang) to extract the domain name from an email address with regex:

[^\@][a-zA-Z0-9$&+,;=?#|'<>.^*()%!-]+$

Upvotes: 0

m1m1k

Reputation: 1435

Thanks everyone for your great responses, I took what you had and expanded it with labelled match-groups for easy extraction of separate parts.

Caveat : Regex.Speed = Slow

Another post mentioned how SLOW and nonperformant regexes are, and that is a fair point to remember. My particular need is targeting my own background/slow/reporting processes and therefore it doesn't matter how long it takes. But it's good to remember whenever possible Regex should NOT be used in any sort of web page load or "needs-to-be-quick" kind of application. In that case you're much better off using substring to algorithmically strip down the inputs and throw away all the junk that I'm optionally matching/allowing/including here.

https://regex101.com/r/ZnU3OC/1

One Regex to rule them all...

  • Subdomain/Domain/TopLevelDomain/CountryCode extraction for Emails, domain lists, & URLs
  • Also handles ?Querystring=junk, Slashes/With/Paths, #anchors
  • Now with more broth, batteries not included
^(?<Email>.*@)?(?<Protocol>\w+:\/\/)?(?<SubDomain>(?:[\w-]{2,63}\.){0,127}?)?(?<DomainWithTLD>(?<Domain>[\w-]{2,63})\.(?<TopLevelDomain>[\w-]{2,63}?)(?:\.(?<CountryCode>[a-z]{2}))?)(?:[:](?<Port>\d+))?(?<Path>(?:[\/]\w*)+)?(?<QString>(?<QSParams>(?:[?&=][\w-]*)+)?(?:[#](?<Anchor>\w*))*)?$

not overly complicated at all... why would you even say that? Jex-Regex-Visualization

Substitution / Outputs

EXAMPLE INPUT: "https://www.stackoverflow.co.uk/path/2?q=mysearch&and=more#stuff"
EXAMPLE OUTPUT:
{
  Protocol:            "https://"
  SubDomain:           "www"
  DomainWithTLD:       "stackoverflow.co.uk"
  Domain:              "stackoverflow"
  TopLevelDomain:      "co"
  CountryCode:         "uk"
  Path:                "/path/2"
  QString:             "?q=mysearch&and=more#stuff"
}

Allowed/Compliant Domains : Should ALL MATCH

www.bankofamerica.com
bankofamerica.com.securersite.regexr.com
bankofamerica.co.uk.blahblahblah.secure.com.it
dashes-bad-for-seo.but-technically-still-allowed.not-in-front-or-end
bit.ly
is.gd
foo.biz.pl
google.com.cn
stackoverflow.co.uk
level_three.sub_domain.example.com
www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com
https://www.stackoverflow.co.uk?q=mysearch&and=more
foo://5th.4th.3rd.example.com:8042/over/there
foo://subdomain.example.com:8042/over/there?name=ferret#nose
example.com
www.example.com
example.co.uk
trailing-slash.com/
trailing-pound.com#
trailing-question.com?
probably-not-valid.com.cn?&#
probably-not-valid.com.cn/?&#
example.com/page
example.com?key=value

* NOTE: Punycode (Unicode in URLs) is handled just fine with \w, no extra sauce needed
xn--fsqu00a.xn--0zwm56d.com
xn--diseolatinoamericano-66b.com

Emails : Should ALL MATCH

[email protected]
[email protected],
[email protected],
[email protected]
[email protected]
[email protected]
[email protected]

Non-Compliant Domains : Should NOT MATCH

  • either not long-enough (domain min length 2), or too long (64)
v.gd
thing.y
0123456789012345678901234567890123456789012345678901234567891234.com
its-sixty-four-instead-of-sixty-three!.com
[email protected]
symbols-not-allowed#.com
symbols-not-allowed$.com
symbols-not-allowed%.com
symbols-not-allowed^.com
symbols-not-allowed&.com
symbols-not-allowed*.com
symbols-not-allowed(.com
symbols-not-allowed).com
symbols-not-allowed+.com
symbols-not-allowed=.com

TBD Not handled:

* a dash at the start or end of a label is disallowed (check dropped from the regex for readability)
-junk-.com 
* is underscore allowed? i donno... (but it simplifies the regex using \w instead of [a-zA-Z0-9\-] everywhere)
symbols-not-allowed_.com

* special case localhost?
.localhost

also see:

Domain Name Rules :: Super handy ASCII Diagram of a URL


  • Side NOTE: lazy load '?' for subdomains{0,127}? currently needed for any of the cases with country codes... (example: stackoverflow.co.uk)

  • Matches these, but does NOT grab $NLevelSubdomains in a match group; it can grab the 3rd level only.

Upvotes: 2

punit choudhary

Reputation: 1

I used this regular expression to get the complete domain name: '.*@+(.*)'. Here .* skips every character before the @ (matched by @+), and the parentheses capture the complete domain name, i.e. everything after it (except line-break characters).

Upvotes: 0

Roddo

Reputation: 61

I was working on getting the domain name from email addresses, and none of the answers did what I needed:

  • Do not catch subdomains
  • Match country-code top-level domains (like .com.ar or .co.jp)

For example, in [email protected] I need to match domain.com.mx

So I made this one:

[^.@]*?\.\w{2,}$|[^.@]*?\.com?\.\w{2}$

Here is a link to regex101 to illustrate the regex: https://regex101.com/r/vE8rP9/59

You can get just the domain name (without the top-level domain, e.g. .com or .com.mx) by adding lookaround operators (but it will match twice in [email protected]):

[^.@]*?(?=\.\w{2,}$)|[^.@]*?(?=\.com?\.\w{2}$)
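A small Python sketch that exercises the first pattern above (the one that keeps the top-level domain). The helper name `domain_with_suffix` and the addresses are made up for illustration.

```python
import re

# First pattern from this answer: domain plus its suffix, no subdomains.
FULL_RE = re.compile(r"[^.@]*?\.\w{2,}$|[^.@]*?\.com?\.\w{2}$")

def domain_with_suffix(address):
    """Return the matched domain with its suffix, or None if nothing matches."""
    match = FULL_RE.search(address)
    return match.group(0) if match else None

# Hypothetical addresses for illustration.
print(domain_with_suffix("user@google.com"))         # google.com
print(domain_with_suffix("user@sub.domain.com.mx"))  # domain.com.mx
print(domain_with_suffix("user@example.co.jp"))      # example.co.jp
```

The second alternative only covers suffixes of the form .co.xx or .com.xx, so other two-part suffixes would need their own branch.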

Upvotes: 5

Renel Chesak

Reputation: 697

This is a relatively simple regex, and it grabs everything between the @ and the final domain extension (e.g. .com, .org). It allows domain names that are made up of non-word characters, which exist in real-world data.

>>> regex = re.compile(r"^.+@(.+)\.[\w]+$")

>>> regex.findall('[email protected]')
['my-bank']

>>> regex.findall('[email protected]')
['spam']

>>> regex.findall('[email protected]')
['sandnes.district']

Upvotes: 0

Stephen

Reputation: 1148

I used the solution's regex for my task, but realized that some of the emails weren't that easy: [email protected], [email protected], and [email protected]

To anyone who came here wanting the subdomain as well (or whose result is being cut off by it), here's the regex:

(?<=@)[^.]*.[^.]*(?=\.)
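A quick Python check of how this pattern behaves, using hypothetical addresses. Note that the dot in the middle is unescaped, so it matches any character; when there is no subdomain, the engine backtracks and the match falls back to the single label.

```python
import re

# Pattern from this answer, unchanged (the middle dot is unescaped).
SUB_RE = re.compile(r"(?<=@)[^.]*.[^.]*(?=\.)")

def host_before_tld(address):
    """Return the text between '@' and the last matched dot, or None."""
    match = SUB_RE.search(address)
    return match.group(0) if match else None

# Hypothetical addresses for illustration.
print(host_before_tld("user@mail.google.com"))  # mail.google
print(host_before_tld("user@google.com"))       # google
```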

Upvotes: 9

Rahul Desai

Reputation: 15491

Updated answer:
Use a capturing group and keep it simple :)

@(\w+)

Explanation by splitting it up
( capturing group for extraction )
\w stands for word character [A-Za-z0-9_]
+ is a quantifier for one or more occurrences of \w

Regex explanation and demo on Regex101
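In Python, the capturing group can be read back with group(1). A minimal sketch with hypothetical addresses; `first_label` is an illustrative name. Because \w does not include the hyphen, a hyphenated domain is cut short.

```python
import re

def first_label(address):
    """Return the run of word characters right after '@', or None."""
    match = re.search(r"@(\w+)", address)
    return match.group(1) if match else None

# Hypothetical addresses for illustration.
print(first_label("user@google.com"))  # google
print(first_label("user@my-bank.no"))  # my  (\w does not match '-')
```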

Upvotes: 21

Hector Buelta

Reputation: 141

Maybe not strictly a "full regex answer", but a more flexible option (in case the part before the @ is not "first.last") is to use cut:

cut -d @ -f 2 | cut -d . -f 1 

The first cut isolates the part after the @, and the second one extracts what you want. This also works for other kinds of email patterns: [email protected], xxx.yyy.zzz@server.com, and so on.
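The same two-step split can be sketched in Python without any regex. This only roughly mirrors the cut pipeline: `domain_label` is a hypothetical helper, and it returns None for input without an @, whereas cut would pass such a line through unchanged.

```python
def domain_label(address):
    """Take field 2 when splitting on '@', then field 1 when splitting on '.'."""
    parts = address.strip().split("@")
    if len(parts) < 2:
        return None  # no '@' present (assumption: treat as "no domain")
    return parts[1].split(".")[0]

# Hypothetical addresses for illustration.
print(domain_label("first.last@google.com"))   # google
print(domain_label("xxx.yyy.zzz@server.com"))  # server
```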

Upvotes: 2

Israel Unterman

Reputation: 13510

This should be the regex:

(?<=@)[^.]+

(?<=@) - places the search right after the @
[^.]+ - takes all the characters that are not a dot (stops at the first dot)

So it extracts google from the email address.

Upvotes: 3

Sergey Kalinichenko

Reputation: 726489

[^@] means "match one symbol that is not an @ sign". That is not what you are looking for. Use a lookbehind (?<=@) for the @ and your (?=\.) lookahead for the dot to extract the server name in the middle:

(?<=@)[^.]+(?=\.)

The middle portion [^.]+ means "one or more non-dot characters".

Demo.
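To see the difference concretely, here is a short Python comparison of the question's pattern and the corrected one. The address is a hypothetical stand-in for the one in the question.

```python
import re

address = "user@google.com"  # hypothetical address

# Question's pattern: one non-@ character, then a greedy .+ up to the last dot.
original = re.search(r"[^@].+(?=\.)", address)

# Corrected pattern: start right after '@', stop before the first dot.
fixed = re.search(r"(?<=@)[^.]+(?=\.)", address)

print(original.group(0))  # user@google
print(fixed.group(0))     # google
```

The original matches from the first character because [^@] consumes only a single symbol and .+ is then free to run across the @.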

Upvotes: 33
