yasein
yasein

Reputation: 107

Python - Regular Expression For Domain Names

I am trying use the following regular expression to extract domain name from a text, but it just produce nothing, what's wrong with it?

I don't know if this is suitable to ask this "fix code" question, maybe I should read more.

I just want to save some time.

Thanks.

pat_url = re.compile(r'''

            (?:https?://)*

            (?:[\w]+[\-\w]+[.])*

            (?P<domain>[\w\-]*[\w.](com|net)([.](cn|jp|us))*[/]*)

            ''')

print re.findall(pat_url,"http://www.google.com/abcde")

I want the output to be google.com.

Upvotes: 3

Views: 7604

Answers (4)

user3272884
user3272884

Reputation: 871

([a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])\.(COMMUNITY|DIRECTORY|EDUCATION|EQUIPMENT|INSTITUTE|MARKETING|SOLUTIONS|XN--J1AMH|XN--L1ACC|BARGAINS|BOUTIQUE|BUILDERS|CATERING|CLEANING|CLOTHING|COMPUTER|DEMOCRAT|DIAMONDS|GRAPHICS|HOLDINGS|LIGHTING|PARTNERS|PLUMBING|TRAINING|VENTURES|XN--P1AI|ACADEMY|CAREERS|COMPANY|CRUISES|DOMAINS|EXPOSED|FLIGHTS|FLORIST|GALLERY|GUITARS|HOLIDAY|KITCHEN|RECIPES|RENTALS|REVIEWS|SHIKSHA|SINGLES|SUPPORT|SYSTEMS|AGENCY|BERLIN|CAMERA|CENTER|COFFEE|CONDOS|DATING|ESTATE|EVENTS|EXPERT|FUTBOL|KAUFEN|LUXURY|MAISON|MONASH|MUSEUM|NAGOYA|PHOTOS|REPAIR|REPORT|SOCIAL|TATTOO|TIENDA|TRAVEL|VIAJES|VILLAS|VISION|VOTING|VOYAGE|BUILD|CARDS|CHEAP|CODES|DANCE|EMAIL|GLASS|HOUSE|NINJA|PARTS|PHOTO|SHOES|SOLAR|TODAY|TOKYO|TOOLS|WATCH|WORKS|AERO|ARPA|ASIA|BIKE|BLUE|BUZZ|CAMP|CLUB|COOL|COOP|FARM|GIFT|GURU|INFO|JOBS|KIWI|LAND|LIMO|LINK|MENU|MOBI|MODA|NAME|PICS|PINK|POST|QPON|RICH|RUHR|SEXY|TIPS|WANG|WIEN|ZONE|BIZ|CAB|CAT|CEO|COM|EDU|GOV|INT|KIM|MIL|NET|ONL|ORG|PRO|RED|TEL|UNO|WED|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN|MO|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|ZA|ZM|ZW)(?![-0-9a-z])(?!\.[a-z0-9])

This Regex uses all current valid TLDs found http://data.iana.org/TLD/tlds-alpha-by-domain.txt it will take a list of text and only return the domain.tld

Eg. Feed it

Will return

  • google.com
  • hotmail.com
  • yahoo.net

This isn't ideal, as the regex is quite long but it worked for what I needed at the time, hope it was helpful.

Upvotes: 0

piotr
piotr

Reputation: 5745

This is the only correct way to parse an url with a regex:

It's in C++ but you'll find trivial to convert to python by removing additional \. And with an enum for the captures.

Also see RFC3986 as original source for the regexp.

static const char* const url_regex[] = {
    /* RE_URL */
    "^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\\?[^#]*)?(#.*)?",
};

enum {
    URL = 0,
    SCHEME_CLN = 1,
    SCHEME  = 2,
    DSLASH_AUTH = 3,
    AUTHORITY = 4,
    PATH    = 5,
    QUERY   = 6,
    FRAGMENT = 7
};

Upvotes: 3

Ignacio Vazquez-Abrams
Ignacio Vazquez-Abrams

Reputation: 799250

The first is that you're missing the re.VERBOSE flag in the call to re.compile(). The second is that you should use the methods on the returned object. The third is that you're using a regular expression where an appropriate parser already exists in the stdlib.

Upvotes: 3

Amber
Amber

Reputation: 527298

Don't use regex for this. Use the urlparse standard library instead. It's far more straightforward and easier to read/maintain.

http://docs.python.org/library/urlparse.html

Upvotes: 8

Related Questions