Reputation: 10049

Javascript: Regex help needed please

When it comes to regex I am dumber than a doornail, so when making a Firefox extension I asked a friend for help and he gave me this:

if( doc.location.href.match(/(www\.google.*?[?&]q=[^&]+)/i) )

but the AMO editor rejected it as too broad (for instance, it would match http://uptime.netcraft.com/up/graph?site=www.google.com). Can someone give me a regex that matches the following? According to the editor it should basically match this: http(s)://www.google.tld/q=*

So, for example, it should match http or https (normal and secure) as well as any TLD after Google (like .ru, .se, .fr, .in, etc.).

In other words it should only match Google search.

Thanks in advance for your help!

/Ryan

Upvotes: 1

Answers (6)

ʞɔıu

Reputation: 48416

Try

/^https?:\/\/(?:www\.)?google(?:\.[a-z]{2,3}){1,2}\/.*[&\?]q=[^&]+?/i

The (?:\.[a-z]{2,3}){1,2} part matches one or two domain labels after "google", so it also covers two-part endings such as .com.au and .co.uk.
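
As a quick sanity check, the pattern above can be run against a few URLs. This is only a sketch: the sample URLs are hypothetical, picked to cover a plain search, a two-part TLD, a host without www, the Netcraft URL from the question, and a Google URL without a q parameter.

```javascript
// The pattern from this answer, unchanged.
const googleSearch = /^https?:\/\/(?:www\.)?google(?:\.[a-z]{2,3}){1,2}\/.*[&\?]q=[^&]+?/i;

// Hypothetical sample URLs, chosen only for illustration.
const samples = [
  "http://www.google.com/search?q=regex",                    // plain search
  "https://www.google.co.uk/search?hl=en&q=regex",           // two-part TLD, q not first
  "https://google.ru/?q=regex",                              // no "www."
  "http://uptime.netcraft.com/up/graph?site=www.google.com", // the editor's counterexample
  "http://www.google.com/search?hl=en"                       // no q parameter
];

// No "g" flag is used, so test() keeps no state between calls.
const results = samples.map(function (url) {
  return googleSearch.test(url);
});

samples.forEach(function (url, i) {
  console.log(results[i], url);
});
```

The first three URLs match and the last two do not, so the Netcraft URL that got the original pattern rejected is no longer accepted.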

Upvotes: 2

Mike Samuel

Reputation: 120516

Don't try to tailor a regular expression. It will be unmaintainable -- if you can't find the problem with it today, what hope does the maintainer have to find a problem with it tomorrow?

Parse the URL properly, perhaps with a regular expression that won't need to be maintained because the core URL syntax doesn't change.

From RFC 3986:

The following line is the regular expression for breaking-down a well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to
 http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
 $1 = http:
 $2 = http
 $3 = //www.ics.uci.edu
 $4 = www.ics.uci.edu
 $5 = /pub/ietf/uri/
 $6 = <undefined>
 $7 = <undefined>
 $8 = #Related
 $9 = Related

Using that, you can check your URL in JavaScript by doing the following:

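// url is the string to check, e.g. doc.location.href
var match = url.match(/^(([^:/?#]+):)?(\/\/([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$/);
if (!match) { throw new Error('not a URL'); }
// Stored under a different name so the input string is not overwritten.
var parts = {
  protocol: match[2],
  authority: match[4],  // host, port, username, password
  path: match[5],
  query: match[6],      // includes the leading "?"
  fragment: match[8]    // includes the leading "#"
};
if (parts.protocol !== 'http' && parts.protocol !== 'https') {
  throw new Error('bad protocol');
}
// The dots are escaped so they match a literal "." rather than any character.
if (!/^www\.google\.[a-z]+$/.test(parts.authority || '')) {
  throw new Error('bad host');
}
if (!/[?&]q=/.test(parts.query || '')) {
  throw new Error('bad query');
}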

It's more code, but it's much easier to debug and maintain, and as a bonus you can tailor your explanation of why the URL is problematic.
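
The same parse-then-check idea can also be sketched with the standard URL constructor instead of the RFC 3986 regex. This is a different technique from the one in this answer, and it assumes the code runs somewhere that provides URL (modern browsers and Node.js do). The function name isGoogleSearchUrl is made up for illustration.

```javascript
// Hypothetical helper: true for http(s) URLs on www.google.<tld> that have a q parameter.
function isGoogleSearchUrl(href) {
  var parsed;
  try {
    parsed = new URL(href);        // throws a TypeError if href is not an absolute URL
  } catch (e) {
    return false;
  }
  // URL.protocol includes the trailing colon.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return false;
  }
  // Same host rule as above: www.google. followed by a single label of letters.
  if (!/^www\.google\.[a-z]+$/.test(parsed.hostname)) {
    return false;
  }
  // Mirrors the [?&]q= check: the parameter only has to be present.
  return parsed.searchParams.has('q');
}

console.log(isGoogleSearchUrl('https://www.google.fr/search?hl=fr&q=test'));
console.log(isGoogleSearchUrl('http://uptime.netcraft.com/up/graph?site=www.google.com'));
```

Like the host check above, this accepts only a single label after "google", so hosts such as www.google.co.uk would need a looser rule.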

Upvotes: 2

Milad Naseri

Reputation: 4118

var regex = /^https?:\/\/(www\.)?google\.[a-z]{2,3}\/([^/]*[\&]|[\?])q=.+$/i;

Upvotes: 1

Wouter J

Reputation: 41934

Something like this?

/https?:\/\/(www\.)?google\.[a-z]{2,3}\/[?&]q=.+/

Upvotes: 0

bjelli

Reputation: 10090

Add ^https?:// to the front of the pattern you already have

^ anchors the pattern to the beginning of the string
http is just http
s? means 1 or 0 s's
: is just itself
forward slashes need to be escaped with a backslash (\/) inside a regex literal

so this is the whole pattern:

(^https?:\/\/www\.google.*?[?&]q=[^&]+)

What I like about the pattern you have: it does not assume that TLDs are two or three characters long.
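
To see what the anchor changes, here is a small sketch that runs the whole pattern against a few URLs. The i flag is carried over from the question's original pattern, and all sample URLs except the Netcraft one are hypothetical.

```javascript
// The whole pattern from this answer, written as a literal with the question's "i" flag.
const anchored = /(^https?:\/\/www\.google.*?[?&]q=[^&]+)/i;

// Hypothetical sample URLs, chosen only for illustration.
const checks = {
  plain: anchored.test("http://www.google.com/search?q=regex"),
  longTld: anchored.test("https://www.google.museum/search?q=regex"),
  netcraft: anchored.test("http://uptime.netcraft.com/up/graph?site=www.google.com"),
  lookalike: anchored.test("http://www.google.evil.example/?q=regex")
};

console.log(checks);
```

The Netcraft URL is rejected because the string must now start with the scheme followed by www.google, and the long TLD is accepted because nothing limits its length. One thing to keep in mind: the .*? after google is unconstrained, so the lookalike host in the last check also matches.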

Upvotes: 1

eis

Reputation: 53482

^https?://www\.google\.[a-z]{2,3}/q=

Assuming just 2-3 letters for the TLD would be OK. If you're using it between forward slashes (/), as a regex literal, you'd want to escape the slashes in this regex.
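
A minimal sketch of the two ways to write this pattern in JavaScript: as a literal, where the forward slashes are escaped, and through the RegExp constructor, where they are not but every backslash is doubled inside the string. The variable names and sample URLs are made up for illustration.

```javascript
// Regex literal: each "/" inside the pattern is escaped as "\/".
const fromLiteral = /^https?:\/\/www\.google\.[a-z]{2,3}\/q=/;

// RegExp constructor: "/" needs no escaping, but "\." becomes "\\." in a string.
const fromString = new RegExp("^https?://www\\.google\\.[a-z]{2,3}/q=");

// Hypothetical sample URLs, chosen only for illustration.
const urls = [
  "http://www.google.com/q=test",
  "https://www.google.se/q=test",
  "http://www.google.com/search?q=test"
];

const literalResults = urls.map(function (u) { return fromLiteral.test(u); });
const stringResults = urls.map(function (u) { return fromString.test(u); });

console.log(literalResults, stringResults);
```

Both forms behave the same way. The pattern follows the question's http(s)://www.google.tld/q=* literally, so the third URL, where q= follows /search?, does not match.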

Upvotes: 2
