Reputation: 107
I am trying to use the following regular expression to extract a domain name from some text, but it produces nothing. What's wrong with it?
I don't know whether a "fix my code" question like this is suitable here; maybe I should read more.
I just want to save some time.
Thanks.
pat_url = re.compile(r'''
(?:https?://)*
(?:[\w]+[\-\w]+[.])*
(?P<domain>[\w\-]*[\w.](com|net)([.](cn|jp|us))*[/]*)
''')
print re.findall(pat_url,"http://www.google.com/abcde")
I want the output to be google.com.
Upvotes: 3
Views: 7604
Reputation: 871
([a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])\.(COMMUNITY|DIRECTORY|EDUCATION|EQUIPMENT|INSTITUTE|MARKETING|SOLUTIONS|XN--J1AMH|XN--L1ACC|BARGAINS|BOUTIQUE|BUILDERS|CATERING|CLEANING|CLOTHING|COMPUTER|DEMOCRAT|DIAMONDS|GRAPHICS|HOLDINGS|LIGHTING|PARTNERS|PLUMBING|TRAINING|VENTURES|XN--P1AI|ACADEMY|CAREERS|COMPANY|CRUISES|DOMAINS|EXPOSED|FLIGHTS|FLORIST|GALLERY|GUITARS|HOLIDAY|KITCHEN|RECIPES|RENTALS|REVIEWS|SHIKSHA|SINGLES|SUPPORT|SYSTEMS|AGENCY|BERLIN|CAMERA|CENTER|COFFEE|CONDOS|DATING|ESTATE|EVENTS|EXPERT|FUTBOL|KAUFEN|LUXURY|MAISON|MONASH|MUSEUM|NAGOYA|PHOTOS|REPAIR|REPORT|SOCIAL|TATTOO|TIENDA|TRAVEL|VIAJES|VILLAS|VISION|VOTING|VOYAGE|BUILD|CARDS|CHEAP|CODES|DANCE|EMAIL|GLASS|HOUSE|NINJA|PARTS|PHOTO|SHOES|SOLAR|TODAY|TOKYO|TOOLS|WATCH|WORKS|AERO|ARPA|ASIA|BIKE|BLUE|BUZZ|CAMP|CLUB|COOL|COOP|FARM|GIFT|GURU|INFO|JOBS|KIWI|LAND|LIMO|LINK|MENU|MOBI|MODA|NAME|PICS|PINK|POST|QPON|RICH|RUHR|SEXY|TIPS|WANG|WIEN|ZONE|BIZ|CAB|CAT|CEO|COM|EDU|GOV|INT|KIM|MIL|NET|ONL|ORG|PRO|RED|TEL|UNO|WED|XXX|AC|AD|AE|AF|AG|AI|AL|AM|AN|AO|AQ|AR|AS|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|CR|CU|CV|CW|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|IO|IQ|IR|IS|IT|JE|JM|JO|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MK|ML|MM|MN|MO|MP|MQ|MR|MS|MT|MU|MV|MW|MX|MY|MZ|NA|NC|NE|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SX|SY|SZ|TC|TD|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|YE|YT|ZA|ZM|ZW)(?![-0-9a-z])(?!\.[a-z0-9])
This regex uses all currently valid TLDs listed at http://data.iana.org/TLD/tlds-alpha-by-domain.txt. It takes a list of text and returns only the domain.tld.
Eg. Feed it
Will return
This isn't ideal, as the regex is quite long, but it worked for what I needed at the time. Hope it was helpful.
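As a rough illustration of how a pattern like this could be applied from Python, here is a minimal sketch. The short TLDS list and the find_domains helper are hypothetical stand-ins for this example (the real list is the IANA file linked above), and re.IGNORECASE is assumed because the TLDs are upper case while the label part is lower case.

```python
import re

# Hypothetical, heavily shortened TLD list; the answer uses the full IANA list.
TLDS = ["COM", "NET", "ORG", "CN", "JP", "US"]

# Same shape as the regex above: one label, a dot, a TLD, then two lookaheads
# that reject matches which continue into a longer name.
# Longer TLDs are tried first, mirroring the ordering of the original list.
DOMAIN_RE = re.compile(
    r"(?:[a-z0-9][-a-z0-9]*[a-z0-9]|[a-z0-9])"
    r"\.(?:" + "|".join(re.escape(t) for t in sorted(TLDS, key=len, reverse=True)) + r")"
    r"(?![-0-9a-z])(?!\.[a-z0-9])",
    re.IGNORECASE,
)

def find_domains(text):
    """Return every domain.tld substring found in text."""
    return [m.group(0) for m in DOMAIN_RE.finditer(text)]

print(find_domains("see http://www.google.com/abcde and foo.net"))
```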
Upvotes: 0
Reputation: 5745
This is the only correct way to parse a URL with a regex:
It's in C++, but you'll find it trivial to convert to Python by removing the additional \. An enum names the captures.
Also see RFC 3986, the original source for the regexp.
static const char* const url_regex[] = {
/* RE_URL */
"^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\\?[^#]*)?(#.*)?",
};
enum {
URL = 0,
SCHEME_CLN = 1,
SCHEME = 2,
DSLASH_AUTH = 3,
AUTHORITY = 4,
PATH = 5,
QUERY = 6,
FRAGMENT = 7
};
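For reference, a Python conversion of the snippet above might look like the following sketch. The split_url helper and the constant names are illustrative choices, not part of the original answer; the pattern itself is the same one with the C string escaping removed.

```python
import re

# Same pattern as the C++ string above, written as a Python raw string.
URL_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*)|///)?([^?#]*)(\?[^#]*)?(#.*)?')

# Capture-group indices, mirroring the enum above.
SCHEME, AUTHORITY, PATH, QUERY, FRAGMENT = 2, 4, 5, 6, 7

def split_url(url):
    """Split url into its components; parts that are absent come back as None."""
    m = URL_RE.match(url)
    return {
        "scheme": m.group(SCHEME),
        "authority": m.group(AUTHORITY),
        "path": m.group(PATH),
        "query": m.group(QUERY),
        "fragment": m.group(FRAGMENT),
    }

parts = split_url("http://www.google.com/abcde")
print(parts["authority"])  # www.google.com
```

Note that this yields the full authority (www.google.com), so reducing it to google.com still needs a separate step.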
Upvotes: 3
Reputation: 799250
The first is that you're missing the re.VERBOSE flag in the call to re.compile(). The second is that you should use the methods on the returned object. The third is that you're using a regular expression where an appropriate parser already exists in the stdlib.
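A minimal sketch of the first two points, assuming Python 3: the pattern is compiled with re.VERBOSE, and the search is done through a method on the compiled object. The pattern is a simplified rewrite of the one in the question rather than the original, and extract_domain is a hypothetical helper name.

```python
import re

# re.VERBOSE makes the regex engine ignore the whitespace and newlines
# inside the triple-quoted pattern, and allows the inline comments.
pat_url = re.compile(r'''
    (?:https?://)?                                        # optional scheme
    (?:[\w-]+\.)*?                                        # leading subdomains such as "www."
    (?P<domain>[\w-]+\.(?:com|net)(?:\.(?:cn|jp|us))?)    # the part we want
    (?![\w-])                                             # do not stop in the middle of a label
''', re.VERBOSE)

def extract_domain(text):
    """Return the first domain found in text, or None if there is none."""
    match = pat_url.search(text)   # method on the compiled pattern object
    return match.group('domain') if match else None

print(extract_domain("http://www.google.com/abcde"))  # google.com
```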
Upvotes: 3
Reputation: 527298
Don't use regex for this. Use the urlparse standard library module instead. It's far more straightforward and easier to read/maintain.
http://docs.python.org/library/urlparse.html
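A short sketch of that approach, assuming Python 3, where the urlparse module lives at urllib.parse. The parser returns the full host name, so the last step below (keeping the final two labels) is a naive assumption added for this example; it would give the wrong answer for hosts such as example.com.cn.

```python
from urllib.parse import urlparse  # in Python 2: from urlparse import urlparse

def host_of(url):
    """Return the host name part of url, without port or credentials."""
    return urlparse(url).hostname

host = host_of("http://www.google.com/abcde")
print(host)  # www.google.com

# Naive reduction to the last two labels; only valid for simple cases.
domain = ".".join(host.split(".")[-2:])
print(domain)  # google.com
```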
Upvotes: 8