Assaf Lavie
Assaf Lavie

Reputation: 76063

JavaScript function to match only Google URLs

Need a function like:

function isGoogleURL(url) { ... }

that returns true iff URL belongs to Google. No false positives; no false negatives.

Luckily there's this as a reference:

.google.com .google.ad .google.ae .google.com.af .google.com.ag .google.com.ai .google.am .google.it.ao .google.com.ar .google.as .google.at .google.com.au .google.az .google.ba .google.com.bd .google.be .google.bg .google.com.bh .google.bi .google.com.bn .google.com.bo .google.com.br .google.bs .google.co.bw .google.com.by .google.com.bz .google.ca .google.cd .google.cg .google.ch .google.ci .google.co.ck .google.cl .google.cn .google.com.co .google.co.cr .google.com.cu .google.cz .google.de .google.dj .google.dk .google.dm .google.com.do .google.dz .google.com.ec .google.ee .google.com.eg .google.es .google.com.et .google.fi .google.com.fj .google.fm .google.fr .google.ge .google.gg .google.com.gh .google.com.gi .google.gl .google.gm .google.gp .google.gr .google.com.gt .google.gy .google.com.hk .google.hn .google.hr .google.ht .google.hu .google.co.id .google.ie .google.co.il .google.im .google.co.in .google.is .google.it .google.je .google.com.jm .google.jo .google.co.jp .google.co.ke .google.com.kh .google.ki .google.kg .google.co.kr .google.kz .google.la .google.li .google.lk .google.co.ls .google.lt .google.lu .google.lv .google.com.ly .google.co.ma .google.md .google.mn .google.ms .google.com.mt .google.mu .google.mv .google.mw .google.com.mx .google.com.my .google.co.mz .google.com.na .google.com.nf .google.com.ng .google.com.ni .google.nl .google.no .google.com.np .google.nr .google.nu .google.co.nz .google.com.om .google.com.pa .google.com.pe .google.com.ph .google.com.pk .google.pl .google.pn .google.com.pr .google.pt .google.com.py .google.com.qa .google.ro .google.ru .google.rw .google.com.sa .google.com.sb .google.sc .google.se .google.com.sg .google.sh .google.si .google.sk .google.sn .google.sm .google.st .google.com.sv .google.co.th .google.com.tj .google.tk .google.tl .google.tm .google.to .google.com.tr .google.tt .google.com.tw .google.co.tz .google.com.ua .google.co.ug .google.co.uk .google.com.uy .google.co.uz .google.com.vc .google.co.ve .google.vg .google.co.vi .google.com.vn .google.vu .google.ws .google.rs .google.co.za .google.co.zm .google.co.zw .google.cat

Any ideas how to do this elegantly?

Some Clarifications:

Upvotes: 2

Views: 1695

Answers (9)

friol
friol

Reputation: 7096

I wouldn't do this client-side.

The list of Google domains doesn't change so frequently, so you could store a list server-side and then dynamically generate the .js to check it.

Upvotes: 0

Matthew Crumley
Matthew Crumley

Reputation: 102735

All the domains end in either "google.xx", "google.co.xx", or "google.com.xx" except "google.it.ao" and "google.com", so if you just look at the domain, this regular expression should work for most cases (it's not perfect, but it accepts all the listed domains, and rejects most other valid domains that happen to include "google"):

/^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i

As a function you could do something like this:

function isGoogleUrl(url) {
    url = url.replace(/^https?:\/\//i, ''); // Strip "http://" from the beginning
    url = url.replace(/\/.*/, ''); // Strip off the path
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i.test(url);
}

You could simplify it if you use window.location.hostname:

function isGoogleUrl() {
    return /^(\w+\.)*google\.((com\.|co\.|it\.)?([a-z]{2})|com)$/i.test(window.location.hostname);
}

The only way this should allow a false positive is if there is a "google.(some other TLD)". For example, "google.tv", is not on the list (it redirects to google.com), but it would pass.

Edit: Like Wimmel pointed out, it also accepts invalid domains like "google.com.fr" which are not listed. It will basically accept any "google.whatever" domain name.

Upvotes: 2

wimh
wimh

Reputation: 15232

Here is an updated version of Prestaul's answer which solves the two problems I mentioned in the comment there.

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    // get the 2nd level domain from the url
    var domain = /^https?:\/\/[^\///]*(google\.[^\/\\]+)\//i.exec(url);
    if(!domain) return false;

    domain = '.'+domain[1];
    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

alert(isGoogleUrl('http://www.google.ba/the/page.html')); // true
alert(isGoogleUrl('http://some_mal_site.com/http://www.google.ba/')); // false
alert(isGoogleUrl('https://google.com.au/')); // true
alert(isGoogleUrl('http://www.google.com.some_mal_site.com/')); // false
alert(isGoogleUrl('http://yahoo.com/')); // false

Upvotes: 6

Prestaul
Prestaul

Reputation: 85174

I agree that you probably shouldn't do this... However, if you are going to do it (and you aren't content with the previously offered solutions that just check for a google-like pattern) then this is how I would approach it:

var GOOGLE_DOMAINS = ([
    '.google.com',
    '.google.ad',
    '.google.ae',
    '.google.com.af',
    '.google.com.ag',
    '.google.com.ai',
    '.google.am',
    '.google.it.ao',
    '.google.com.ar',
    '.google.as',
    '.google.at',
    '.google.com.au',
    '.google.az',
    '.google.ba',
    '.google.com.bd'
]).join('\n');

function isGoogleUrl(url) {
    var url = 'http://www.google.ba/the/page.html';

    // get the domain from the url
    var domain = /\.google\.[^\/\\]+/i.exec(url) + '';
    if(!domain) return false;

    // create a regex to check to see if the domain is supported
    var re = new RegExp('^' + domain.replace(/\./g, '\\.') + '$', 'mi');
    return re.test(GOOGLE_DOMAINS);
}

This creates a regex based on the domain your url and uses it to test the list of domains.

Note: The GOOGLE_DOMAINS variable is just a string that holds the contents returned from the url you posted. There is no way for you to retrieve that string via AJAX or iframe because you cannot make such a request across domains. You'll have to hard code it or make a request server-side to retrieve that list.

Upvotes: 1

theraccoonbear
theraccoonbear

Reputation: 4337

You could use a regular expression like....

^https?://[-A-Za-z0-9\.]+(\.google\.com|\.google\.ad|\.google\.ae|\.google\.com\.af|\.google\.com\.ag|\.google\.com\.ai|\.google\.am|\.google\.it\.ao|\.google\.com\.ar|\.google\.as|\.google\.at|\.google\.com\.au|\.google\.az|\.google\.ba|\.google\.com\.bd|\.google\.be|\.google\.bg|\.google\.com\.bh|\.google\.bi|\.google\.com\.bn|\.google\.com\.bo|\.google\.com\.br|\.google\.bs|\.google\.co\.bw|\.google\.com\.by|\.google\.com\.bz|\.google\.ca|\.google\.cd|\.google\.cg|\.google\.ch|\.google\.ci|\.google\.co\.ck|\.google\.cl|\.google\.cn|\.google\.com\.co|\.google\.co\.cr|\.google\.com\.cu|\.google\.cz|\.google\.de|\.google\.dj|\.google\.dk|\.google\.dm|\.google\.com\.do|\.google\.dz|\.google\.com\.ec|\.google\.ee|\.google\.com\.eg|\.google\.es|\.google\.com\.et|\.google\.fi|\.google\.com\.fj|\.google\.fm|\.google\.fr|\.google\.ge|\.google\.gg|\.google\.com\.gh|\.google\.com\.gi|\.google\.gl|\.google\.gm|\.google\.gp|\.google\.gr|\.google\.com\.gt|\.google\.gy|\.google\.com\.hk|\.google\.hn|\.google\.hr|\.google\.ht|\.google\.hu|\.google\.co\.id|\.google\.ie|\.google\.co\.il|\.google\.im|\.google\.co\.in|\.google\.is|\.google\.it|\.google\.je|\.google\.com\.jm|\.google\.jo|\.google\.co\.jp|\.google\.co\.ke|\.google\.com\.kh|\.google\.ki|\.google\.kg|\.google\.co\.kr|\.google\.kz|\.google\.la|\.google\.li|\.google\.lk|\.google\.co\.ls|\.google\.lt|\.google\.lu|\.google\.lv|\.google\.com\.ly|\.google\.co\.ma|\.google\.md|\.google\.mn|\.google\.ms|\.google\.com\.mt|\.google\.mu|\.google\.mv|\.google\.mw|\.google\.com\.mx|\.google\.com\.my|\.google\.co\.mz|\.google\.com\.na|\.google\.com\.nf|\.google\.com\.ng|\.google\.com\.ni|\.google\.nl|\.google\.no|\.google\.com\.np|\.google\.nr|\.google\.nu|\.google\.co\.nz|\.google\.com\.om|\.google\.com\.pa|\.google\.com\.pe|\.google\.com\.ph|\.google\.com\.pk|\.google\.pl|\.google\.pn|\.google\.com\.pr|\.google\.pt|\.google\.com\.py|\.google\.com\.qa|\.google\.ro|\.google\.ru|\.google\.rw|\.google\.com\.sa|\.google\.com\.sb|\.google\.sc|\.google\.se|\.google\.com\.sg|\.google\.sh|\.google\.si|\.google\.sk|\.google\.sn|\.google\.sm|\.google\.st|\.google\.com\.sv|\.google\.co\.th|\.google\.com\.tj|\.google\.tk|\.google\.tl|\.google\.tm|\.google\.to|\.google\.com\.tr|\.google\.tt|\.google\.com\.tw|\.google\.co\.tz|\.google\.com\.ua|\.google\.co\.ug|\.google\.co\.uk|\.google\.com\.uy|\.google\.co\.uz|\.google\.com\.vc|\.google\.co\.ve|\.google\.vg|\.google\.co\.vi|\.google\.com\.vn|\.google\.vu|\.google\.ws|\.google\.rs|\.google\.co\.za|\.google\.co\.zm|\.google\.co\.zw|\.google\.cat)

and I'd imagine generating that in JavaScript (or whatever language you choose) from an array or some other data set would be relatively easy.

Upvotes: 0

Berzemus
Berzemus

Reputation: 3658

If you don't need the test to be 100% accurate, this simple regex would do for all the domains you posted above:

"(http://)?([\w]+)?\.google\.([\w]{2,3})"

Just testing the presence of ".google." would suffice in most cases, although it could easily be fooled by adding a "google" domain in the url (not so easy though, nor quickly done).

Or just wait for google to buy their own google TLD.

Upvotes: 1

luiscubal
luiscubal

Reputation: 25141

A regular expression may be what you need. An example is:

<script>
var elem = document.getElementById("a");
var regex = new RegExp("(http://)?(www\\.)?google\\.com");

elem.innerHTML = regex.test(elem.innerHTML);
</script>

This would get the content of a span element "a", and would change it to "true" if google.com, and "false" otherwise. Note that it doesn't consider all other URLs(although the regex could easily be modified to do so), and "pages.google.com", for example, wouldn't match.

Also, your URLs all have a "." before them(".google.com" instead of "google.com"). Does this have any reason or is it just a mistake?

Upvotes: 0

Echilon
Echilon

Reputation: 10264

Without a regex to individually match each and every TLD, there isn't really an 'elegant way of doing it'.

Upvotes: -1

Jon Skeet
Jon Skeet

Reputation: 1502076

Do you count other Google properties as "belonging to Google"? FeedBurner, Blogger etc?

Can I ask what the purpose of this is? There may be a better way of doing what you want... and if it's reasonable I can ask internally for you.

Upvotes: 1

Related Questions