kumar

Reputation: 2796

More than one top level domain?

In a normal URL, you have a protocol, optional subdomains, a domain name, a top-level domain (TLD), and subdirectories.

For example: http://www.google.com/path. Here www is the subdomain, google is the domain name, and com is the TLD; path is a subdirectory. Parsing this is a simple programming task.

But the problem comes when the TLD has more than one part. For example: www.google.co.in/path. Here co.in is the TLD. But I see that a website named www.co.in also exists.
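
To make the problem concrete, here is a minimal sketch of the "simple" parse, assuming the TLD is always the last dot-separated label. The function name `naive_split` is made up for illustration, and the code assumes the host has at least two labels.

```python
from urllib.parse import urlsplit

def naive_split(url):
    """Split a URL, assuming the TLD is always the last label of the host."""
    # urlsplit only recognises the host when a scheme or "//" is present.
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    return {
        "subdomain": ".".join(labels[:-2]),
        "domain": labels[-2],
        "tld": labels[-1],
        "path": parts.path,
    }

print(naive_split("http://www.google.com/path"))
print(naive_split("www.google.co.in/path"))
```

For http://www.google.com/path this gives the split described above, but for www.google.co.in/path it reports co as the domain and in as the TLD, which is exactly the ambiguity in question.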

My doubts are:

Upvotes: 9

Views: 8769

Answers (2)

creed

Reputation: 182

A very slow yet comprehensive regex you could use (sourced from Wikipedia and Mozilla):

[a-z0-9-]{1,63}(.ab.ca|.bc.ca|.mb.ca|.nb.ca|.nf.ca|.nl.ca|.ns.ca|.nt.ca|.nu.ca|.on.ca|.pe.ca|.qc.ca|.sk.ca|.yk.ca|.co.cc|.com.cd|.net.cd|.org.cd|.co.ck|.ac.cn|.com.cn|.edu.cn|.gov.cn|.net.cn|.org.cn|.ah.cn|.bj.cn|.cq.cn|.fj.cn|.gd.cn|.gs.cn|.gz.cn|.gx.cn|.ha.cn|.hb.cn|.he.cn|.hi.cn|.hl.cn|.hn.cn|.jl.cn|.js.cn|.jx.cn|.ln.cn|.nm.cn|.nx.cn|.qh.cn|.sc.cn|.sd.cn|.sh.cn|.sn.cn|.sx.cn|.tj.cn|.xj.cn|.xz.cn|.yn.cn|.zj.cn|.us.com|.com.cu|.edu.cu|.org.cu|.net.cu|.gov.cu|.inf.cu|.gov.cx|.com.dz|.org.dz|.net.dz|.gov.dz|.edu.dz|.asso.dz|.pol.dz|.art.dz|.com.ec|.info.ec|.net.ec|.fin.ec|.med.ec|.pro.ec|.org.ec|.edu.ec|.gov.ec|.mil.ec|.com.ee|.org.ee|.fie.ee|.pri.ee|.com.es|.nom.es|.org.es|.gob.es|.edu.es|.aland.fi|.tm.fr|.asso.fr|.nom.fr|.prd.fr|.presse.fr|.com.fr|.gouv.fr|.com.ge|.edu.ge|.gov.ge|.org.ge|.mil.ge|.net.ge|.pvt.ge|.co.gg|.net.gg|.org.gg|.com.gi|.ltd.gi|.gov.gi|.mod.gi|.edu.gi|.org.gi|.com.gp|.net.gp|.edu.gp|.asso.gp|.org.gp|.com.gr|.edu.gr|.net.gr|.org.gr|.gov.gr|.com.hk|.edu.hk|.gov.hk|.idv.hk|.net.hk|.org.hk|.com.hn|.edu.hn|.org.hn|.net.hn|.mil.hn|.gob.hn|.iz.hr|.from.hr|.name.hr|.com.hr|.com.ht|.net.ht|.firm.ht|.shop.ht|.info.ht|.pro.ht|.adult.ht|.org.ht|.art.ht|.pol.ht|.rel.ht|.asso.ht|.perso.ht|.coop.ht|.med.ht|.edu.ht|.gouv.ht|.gov.ie|.co.in|.firm.in|.net.in|.org.in|.gen.in|.ind.in|.nic.in|.ac.in|.edu.in|.res.in|.gov.in|.mil.in|.ac.ir|.co.ir|.gov.ir|.net.ir|.org.ir|.sch.ir|.co.je|.net.je|.org.je|.com.jo|.org.jo|.net.jo|.edu.jo|.gov.jo|.mil.jo|.co.kr|.or.kr|.edu.ky|.gov.ky|.com.ky|.org.ky|.net.ky|.gov.lk|.sch.lk|.net.lk|.int.lk|.com.lk|.org.lk|.edu.lk|.ngo.lk|.soc.lk|.web.lk|.ltd.lk|.assn.lk|.grp.lk|.hotel.lk|.gov.lt|.mil.lt|.gov.lu|.mil.lu|.org.lu|.net.lu|.com.lv|.edu.lv|.gov.lv|.org.lv|.mil.lv|.id.lv|.net.lv|.asn.lv|.conf.lv|.com.ly|.net.ly|.gov.ly|.plc.ly|.edu.ly|.sch.ly|.med.ly|.org.ly|.id.ly|.co.ma|.net.ma|.gov.ma|.org.ma|.tm.mc|.asso.mc|.org.mg|.nom.mg|.gov.mg|.prd.mg|.tm.mg|.com.mg|.edu.mg|.mil.mg|.com.mk|.org.mk|.com.mo|.net.mo|.org.mo|.edu.mo|.gov.mo|.or
g.mt|.com.mt|.gov.mt|.edu.mt|.net.mt|.com.mu|.co.mu|.gov.nr|.edu.nr|.biz.nr|.info.nr|.com.nr|.net.nr|.com.pf|.org.pf|.edu.pf|.com.ph|.gov.ph|.com.pk|.net.pk|.edu.pk|.org.pk|.fam.pk|.biz.pk|.web.pk|.gov.pk|.gob.pk|.gok.pk|.gon.pk|.gop.pk|.gos.pk|.com.pl|.biz.pl|.net.pl|.art.pl|.edu.pl|.org.pl|.ngo.pl|.gov.pl|.info.pl|.mil.pl|.waw.pl|.warszawa.pl|.wroc.pl|.wroclaw.pl|.krakow.pl|.poznan.pl|.lodz.pl|.gda.pl|.gdansk.pl|.slupsk.pl|.szczecin.pl|.lublin.pl|.bialystok.pl|.olsztyn.pl.torun.pl|.biz.pr|.com.pr|.edu.pr|.gov.pr|.info.pr|.isla.pr|.name.pr|.net.pr|.org.pr|.pro.pr|.edu.ps|.gov.ps|.sec.ps|.plo.ps|.com.ps|.org.ps|.net.ps|.com.pt|.edu.pt|.gov.pt|.int.pt|.net.pt|.nome.pt|.org.pt|.publ.pt|.com.ro|.org.ro|.tm.ro|.nt.ro|.nom.ro|.info.ro|.rec.ro|.arts.ro|.firm.ro|.store.ro|.www.ro|.com.ru|.net.ru|.org.ru|.pp.ru|.msk.ru|.int.ru|.ac.ru|.gov.rw|.net.rw|.edu.rw|.ac.rw|.com.rw|.co.rw|.int.rw|.mil.rw|.gouv.rw|.com.sc|.gov.sc|.net.sc|.org.sc|.edu.sc|.com.sd|.net.sd|.org.sd|.edu.sd|.med.sd|.tv.sd|.gov.sd|.info.sd|.org.se|.pp.se|.tm.se|.brand.se|.parti.se|.press.se|.komforb.se|.kommunalforbund.se|.komvux.se|.lanarb.se|.lanbib.se|.naturbruksgymn.se|.sshn.se|.fhv.se|.fhsk.se|.fh.se|.mil.se|.ab.se|.c.se|.d.se|.e.se|.f.se|.g.se|.h.se|.i.se|.k.se|.m.se|.n.se|.o.se|.s.se|.t.se|.u.se|.w.se|.x.se|.y.se|.z.se|.ac.se|.bd.se|.com.sg|.net.sg|.org.sg|.gov.sg|.edu.sg|.per.sg|.idn.sg|.ac.tj|.biz.tj|.com.tj|.co.tj|.edu.tj|.int.tj|.name.tj|.net.tj|.org.tj|.web.tj|.gov.tj|.go.tj|.mil.tj|.gov.to|.gov.tp|.co.tt|.com.tt|.org.tt|.net.tt|.biz.tt|.info.tt|.pro.tt|.name.tt|.edu.tt|.gov.tt|.gov.tv|.edu.tw|.gov.tw|.mil.tw|.com.tw|.net.tw|.org.tw|.idv.tw|.game.tw|.ebiz.tw|.club.tw|.com.ua|.gov.ua|.net.ua|.edu.ua|.org.ua|.cherkassy.ua|.ck.ua|.chernigov.ua|.cn.ua|.chernovtsy.ua|.cv.ua|.crimea.ua|.dnepropetrovsk.ua|.dp.ua|.donetsk.ua|.dn.ua|.ivano-frankivsk.ua|.if.ua|.kharkov.ua|.kh.ua|.kherson.ua|.ks.ua|.khmelnitskiy.ua|.km.ua|.kiev.ua|.kv.ua|.kirovograd.ua|.kr.ua|.lugansk.ua|.lg.ua|.lutsk.ua|.lviv.ua|.nikolaev.
ua|.mk.ua|.odessa.ua|.od.ua|.poltava.ua|.pl.ua|.rovno.ua|.rv.ua|.sebastopol.ua|.sumy.ua|.ternopil.ua|.te.ua|.uzhgorod.ua|.vinnica.ua|.vn.ua|.zaporizhzhe.ua|.zp.ua|.zhitomir.ua|.zt.ua|.co.ug|.ac.ug|.sc.ug|.go.ug|.ne.ug|.or.ug|.ak.us|.al.us|.ar.us|.az.us|.ca.us|.co.us|.ct.us|.dc.us|.de.us|.dni.us|.fed.us|.fl.us|.ga.us|.hi.us|.ia.us|.id.us|.il.us|.in.us|.isa.us|.kids.us|.ks.us|.ky.us|.la.us|.ma.us|.md.us|.me.us|.mi.us|.mn.us|.mo.us|.ms.us|.mt.us|.nc.us|.nd.us|.ne.us|.nh.us|.nj.us|.nm.us|.nsn.us|.nv.us|.ny.us|.oh.us|.ok.us|.or.us|.pa.us|.ri.us|.sc.us|.sd.us|.tn.us|.tx.us|.ut.us|.vt.us|.va.us|.wa.us|.wi.us|.wv.us|.wy.us|.com.vi|.org.vi|.edu.vi|.gov.vi|.com.vn|.net.vn|.org.vn|.edu.vn|.gov.vn|.int.vn|.ac.vn|.biz.vn|.info.vn|.name.vn|.pro.vn|.health.vn|.com|.org|.net|.int|.edu|.gov|.mil|.arpa|.ac|.ad|.ae|.af|.ag|.ai|.al|.am|.an|.ao|.aq|.ar|.as|.at|.au|.aw|.ax|.az|.ba|.bb|.bd|.be|.bf|.bg|.bh|.bi|.bj|.bm|.bn|.bo|.br|.bs|.bt|.bw|.by|.bz|.ca|.cc|.cd|.cf|.cg|.ch|.ci|.ck|.cl|.cm|.cn|.co|.cr|.cu|.cv|.cw|.cx|.cy|.cz|.de|.dj|.dk|.dm|.do|.dz|.ec|.ee|.eg|.es|.et|.eu|.fi|.fj|.fk|.fm|.fo|.fr|.ga|.gd|.ge|.gf|.gg|.gh|.gi|.gl|.gm|.gn|.gp|.gq|.gr|.gs|.gt|.gu|.gw|.gy|.hk|.hm|.hn|.hr|.ht|.hu|.id|.ie|.il|.im|.in|.io|.iq|.ir|.is|.it|.je|.jm|.jo|.jp|.ke|.kg|.kh|.ki|.km|.kn|.kp|.kr|.kw|.ky|.kz|.la|.lb|.lc|.li|.lk|.lr|.ls|.lt|.lu|.lv|.ly|.ma|.mc|.md|.me|.mg|.mh|.mk|.ml|.mm|.mn|.mo|.mp|.mq|.mr|.ms|.mt|.mu|.mv|.mw|.mx|.my|.mz|.na|.nc|.ne|.nf|.ng|.ni|.nl|.no|.np|.nr|.nu|.nz|.om|.pa|.pe|.pf|.pg|.ph|.pk|.pl|.pm|.pn|.pr|.ps|.pt|.pw|.py|.qa|.re|.ro|.rs|.ru|.rw|.sa|.sb|.sc|.sd|.se|.sg|.sh|.si|.sk|.sl|.sm|.sn|.so|.sr|.ss|.st|.su|.sv|.sx|.sy|.sz|.tc|.td|.tf|.tg|.th|.tj|.tk|.tl|.tm|.tn|.to|.tr|.tt|.tv|.tw|.tz|.ua|.ug|.us|.uy|.uz|.va|.vc|.ve|.vg|.vi|.vn|.vu|.wf|.ws|.ye|.yt|.za|.zm|.zw|.dz|.am|.bh|.bd|.by|.bg|.cn|.cn|.eg|.eu|.ge|.gr|.hk|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.in|.ir|.iq|.jo|.kz|.mo|.mo|.my|.mr|.mn|.ma|.mk|.om|.pk|.ps|.qa|.ru|.sa|.rs|.sg|.sg|.kr|.lk|.lk|.sd|.sy|.tw|.tw|.th|.tn|
.ua|.ae|.ye|.academy|.accountant|.adult|.aero|.africa|.agency|.apartments|.app|.archi|.associates|.audio|.auto|.bar|.bargains|.bible|.bike|.biz|.black|.blackfriday|.blog|.blue|.builders|.cam|.cam|.camera|.camp|.cancerresearch|.car|.cards|.cars|.center|.cheap|.christmas|.church|.click|.clothing|.cloud|.club|.codes|.coffee|.college|.coop|.country|.dance|.date|.dating|.design|.dev|.diet|.directory|.download|.eco|.education|.email|.events|.exchange|.exposed|.faith|.farm|.flowers|.game|.gdn|.gift|.glass|.global|.gop|.green|.guitars|.guru|.help|.hiphop|.hiv|.holdings|.hosting|.house|.info|.ink|.international|.jobs|.kim|.land|.lgbt|.life|.lighting|.link|.live|.loan|.lol|.love|.map|.market|.med|.meet|.menu|.mobi|.moe|.mom|.movie|.museum|.music|.name|.new|.NGO_and_.ONG|.org_(top-level_domain)|.one|.one|.onl|.ooo|.organic|.pharmacy|.photo|.photos|.pics|.pink|.pizza|.plumbing|.porn|.post|.pro|.properties|.property|.realtor|.rich|.rocks|.sale|.science|.sex|.sexy|.shop|.singles|.social|.solar|.stream|.sucks|.support|.tattoo|.tel|.today|.top|.travel|.ventures|.video|.voting|.wedding|.wiki|.win|.work|.wtf|.xxx|.XYZ|.kaufen|.desi|.shiksha|.moda|.futbol|.juegos|.uno|.africa|.asia|.krd|.taipei|.tokyo|.alsace|.amsterdam|.bcn|.barcelona|.berlin|.brussels|.bzh|.cat|.cymru|.eus|.frl|.gal|.gent|.irish|.istanbul|.istanbul|.london|.paris|.saarland|.scot|.swiss|.wales|.wien|.miami|.nyc|.quebec|.vegas|.kiwi|.melbourne|.sydney|.lat|.rio|.ru|.aaa|.abb|.aeg|.afl|.aig|.airtel|.bbc|.bentley|.example|.invalid|.local|.localhost|.onion|.testa)$
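
Note that in a pattern written this way an unescaped `.` matches any character, and alternatives are tried left to right. As a hedged sketch (not the pattern above, just the same idea), a regex of this kind could be generated from a list instead of being typed by hand. `SUFFIXES` here is a hypothetical, heavily shortened sample, and the function names are made up for illustration.

```python
import re

# Hypothetical, heavily shortened suffix list; a real one would be loaded
# from a maintained source such as the Wikipedia/Mozilla lists.
SUFFIXES = ["co.in", "in", "com", "co.uk", "uk"]

def build_suffix_regex(suffixes):
    # Longest suffixes first, so that "co.in" is tried before "in".
    ordered = sorted(set(suffixes), key=len, reverse=True)
    # re.escape turns "co.in" into "co\.in", so dots match only literal dots.
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(r"([a-z0-9-]{1,63})\.(" + alternation + r")$", re.IGNORECASE)

SUFFIX_RE = build_suffix_regex(SUFFIXES)

def split_domain(host):
    """Return (domain label, suffix), or None if no listed suffix matches."""
    match = SUFFIX_RE.search(host)
    return (match.group(1), match.group(2)) if match else None

print(split_domain("www.google.co.in"))
print(split_domain("example.com"))
```

With the sample list, www.google.co.in splits into google and co.in, and a host without a listed suffix gives None.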

Upvotes: 0

peter_the_oak

Reputation: 3710

If I had to write an algorithm that decides that "www.co.in" belongs to the India top-level domain (TLD) and "www.google.co.in" belongs to an India second-level domain (SLD), I would go here and grab the list:

https://wiki.mozilla.org/TLD_List

Then, I would process my URL like this:

  1. Compare the last part of the URL to all TLDs in the list and find a matching one. [www.google.co.in -> in, www.co.in -> in]
  2. If no TLD was found, the URL is invalid.
  3. If a TLD was found and the URL has three parts or fewer, return the TLD as result and exit.
  4. If a TLD was found and the URL has more than three parts, do a second search in the list of SLDs. Compare the end of the URL against the pattern ".SLD.TLD".
  5. If no entry was found, return the TLD as result and exit.
  6. If an entry was found, return SLD.TLD as result and exit.
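
A minimal sketch of these steps, assuming the lists have already been loaded into sets. `TLDS` and `SLDS` below are tiny hypothetical samples, not the real list, and `find_suffix` is a made-up name.

```python
# Hypothetical sample data; the real lists would come from the page linked above.
TLDS = {"in", "com", "uk"}
SLDS = {"co.in", "net.in", "co.uk"}

def find_suffix(host):
    """Return the TLD or SLD.TLD of a host name, following steps 1 to 6."""
    labels = host.lower().split(".")
    tld = labels[-1]                      # step 1: look at the last part
    if tld not in TLDS:                   # step 2: unknown TLD means invalid
        raise ValueError("invalid host: " + host)
    if len(labels) <= 3:                  # step 3: three parts or fewer
        return tld
    candidate = ".".join(labels[-2:])     # step 4: check the ".SLD.TLD" ending
    if candidate in SLDS:
        return candidate                  # step 6: SLD.TLD found
    return tld                            # step 5: no SLD entry

print(find_suffix("www.google.co.in"))  # co.in
print(find_suffix("www.co.in"))         # in
```

One consequence of step 3 as written: a three-part host such as google.co.in also returns in, because the SLD list is only consulted for hosts with more than three parts.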

Upvotes: 2
