shah khan

Reputation: 137

Regex to match URLs generating partially invalid output

I am trying to use the following regex in my Ruby application to match HTTP links, but it produces partially invalid output: it appends a trailing period, and sometimes a period plus the following word, to each link, which then turns out to be invalid when tested on the web.

URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i
<input>.to_s.scan( URL_PATTERN ).uniq
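
For illustration, a minimal reproduction of the symptom (the sample text is made up):

URL_PATTERN = Regexp.new %r{http://[\w/.%-]+}i

"Check http://example.com/page. More text.".scan(URL_PATTERN)
# => ["http://example.com/page."]      (trailing period captured)
"Check http://example.com/page.More text.".scan(URL_PATTERN)
# => ["http://example.com/page.More"]  (period plus the following word captured)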

Is there some problem with the above code for scanning the links?

Code from the app:

require 'bundler/setup'
require 'twitter'

RECORD_LIMIT = 100
URL_PATTERN  = Regexp.new %r{http://[\w/.%-]+}i

def usage
  warn "Usage: ruby #{File.basename $0} <hashtag>"  
  exit 64
end

# Ensure that the hashtag has a hash symbol. This makes the leading '#'
# optional, which avoids the need to quote or escape it on the command line.
def format_hashtag(hashtag)  
  (hashtag.scan(/^#/).empty?) ? "##{hashtag}" : hashtag
end

# Return a sorted list of unique URLs found in the list of tweets.
def uniq_urls(tweets)  
  tweets.map(&:text).grep( %r{http://}i ).to_s.scan( URL_PATTERN ).uniq
end

def search(hashtag)  
  Twitter.search(hashtag, rpp: RECORD_LIMIT, result_type: 'recent')
end

if __FILE__ == $0
  usage unless ARGV.size >= 1
  hashtag = format_hashtag(ARGV[0])
  tweets  = search(hashtag)
  puts uniq_urls(tweets)
end

Upvotes: 0

Views: 416

Answers (3)

the Tin Man

Reputation: 160551

Rather than reinvent the wheel, why not use Ruby's URI.extract? It's bundled with Ruby.

From the documentation:

Synopsis

URI::extract(str[, schemes][,&blk])

Args

str     String to extract URIs from.
schemes Limit URI matching to specific schemes.

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
Usage

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.")
# => ["http://foo.example.com/bla", "mailto:[email protected]"]

If you only want HTTP URLs:

[3] (pry) main: 0> URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.", %w[http])
=> ["http://foo.example.org/bla"]

Upvotes: 1

kopischke

Reputation: 3413

The problem is that your regex will include a trailing period, because it indiscriminately matches an arbitrary sequence of word characters, slashes, percent signs, hyphens (aka “minus”) and periods. When the URL sits at the end of a sentence, that catches the closing period, which is really punctuation, and, if people omit the space after the period, anything that follows it as well – as CodeGnome correctly stated. You can partly alleviate the issue by excluding trailing punctuation like this (note this will still catch punctuation directly followed by non-URL text):

http://\w+(?:[./%-]\w+)+$

However, this will still miss a large proportion of existing URLs and catch a lot of invalid stuff: URLs are quite complex beasts. If you want a near-perfect match, John Gruber posted a regex that matches just about anything used as a URL today, not just http(s) ones. For a closer match on the large crop of web-only URLs, including the HTTPS variant, making sure there is a well-formed domain at the start, and catching query strings and fragment identifiers, the regex should look something like this:

https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?

– this will still catch invalid stuff and exclude quite a few existing URLs (and an even larger proportion of valid URLs – see the RFC I linked to above), but it will get you closer.
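
For illustration, here is how that pattern behaves with String#scan (the sample text is invented):

pattern = %r{https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?}

"See https://example.com/path/page?x=1&y=2#frag and http://sub.example.org/a.b. Done."
  .scan(pattern)
# => ["https://example.com/path/page?x=1&y=2#frag", "http://sub.example.org/a.b"]

Note that the trailing period after “a.b” is no longer swallowed, because every separator in the pattern must be followed by at least one word character.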

Upvotes: 1

Todd A. Jacobs

Reputation: 84343

TL;DR

People post bad links all the time. Links are also subject to bit-rot.

The Likely Answer

Have you verified the Tweets manually? Are you sure that the original Tweet doesn't contain a malformed URL? If someone posts:

http://foo.Any more toast?

then you're certainly going to get an invalid result, because the regex keeps matching word characters and periods until it reaches a character outside its class (such as whitespace). If you want to prune invalid results, then you will need to use a link-checker that can handle redirects to validate each link you find.
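
As an illustration only (the helper name and error handling here are mine, not part of the original extractor), a minimal redirect-following check using the standard library's Net::HTTP might look like this:

require 'net/http'
require 'uri'

# Sketch of a link check: HEAD the URL and follow a bounded number of redirects.
# Relative Location headers, timeouts and many other edge cases are not handled.
def link_alive?(url, redirect_limit = 5)
  return false if redirect_limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end

  case response
  when Net::HTTPSuccess     then true
  when Net::HTTPRedirection then link_alive?(response['location'], redirect_limit - 1)
  else false
  end
rescue SocketError, Errno::ECONNREFUSED, URI::InvalidURIError
  false
end

You could then filter the extracted list with something like uniq_urls(tweets).select { |url| link_alive?(url) }.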

Author's Disclaimer

The code you're posting is mine, from CodeGnome/twitter_url_extractor. I deliberately left out link-checking, because I was interested in extracting URLs, not validating them.

"It works for me; your mileage may vary."℠

Upvotes: 3
