Reputation: 137
I am using the following regex in my Ruby application to match HTTP links, but it sometimes produces invalid output: the match ends with a trailing period, or a period and a word, appended to the link. When I open such a link in a browser, it fails.
URL_PATTERN = Regexp.new %r{http://[\w/.%-]+}i
<input>.to_s.scan( URL_PATTERN ).uniq
Is there some problem with the above code for scanning the links?
Code from the app:
require 'bundler/setup'
require 'twitter'
RECORD_LIMIT = 100
URL_PATTERN = Regexp.new %r{http://[\w/.%-]+}i
def usage
  warn "Usage: ruby #{File.basename $0} <hashtag>"
  exit 64
end

# Ensure that the hashtag has a hash symbol. This makes the leading '#'
# optional, which avoids the need to quote or escape it on the command line.
def format_hashtag(hashtag)
  (hashtag.scan(/^#/).empty?) ? "##{hashtag}" : hashtag
end

# Return a list of unique URLs found in the list of tweets.
def uniq_urls(tweets)
  tweets.map(&:text).grep( %r{http://}i ).to_s.scan( URL_PATTERN ).uniq
end

def search(hashtag)
  Twitter.search(hashtag, rpp: RECORD_LIMIT, result_type: 'recent')
end

if __FILE__ == $0
  usage unless ARGV.size >= 1
  hashtag = format_hashtag(ARGV[0])
  tweets = search(hashtag)
  puts uniq_urls(tweets)
end
Upvotes: 0
Views: 416
Reputation: 160551
Rather than reinvent the wheel, why not use Ruby's URI.extract? It's bundled with Ruby.
From the documentation:
Synopsis
  URI::extract(str[, schemes][,&blk])
Args
  str: String to extract URIs from.
  schemes: Limit URI matching to specific schemes.
Description
  Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
Usage
  require "uri"
  URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.")
  # => ["http://foo.example.org/bla", "mailto:[email protected]"]
If you only want HTTP URLs:
[3] (pry) main: 0> URI.extract("text here http://foo.example.org/bla and here mailto:[email protected] and here also.", %w[http])
=> ["http://foo.example.org/bla"]
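A minimal sketch of how this could replace the hand-written regex in the question. The helper name `extract_http_urls` and the sample text are made up for illustration, and the trailing-punctuation strip is an extra step added here in case the extractor keeps a sentence-ending period or comma. Note that newer Ruby versions mark `URI.extract` as obsolete and warn about it in verbose mode, although the method is still available.

```ruby
require 'uri'

# Hypothetical helper (not part of the original app): pull http/https URLs
# out of free text, then drop punctuation that most likely belongs to the
# surrounding sentence rather than to the URL.
def extract_http_urls(text)
  URI.extract(text, %w[http https])
     .map { |url| url.sub(/[.,;:!?]+\z/, '') }
     .uniq
end

sample = 'See http://foo.example.org/bla. Also https://example.com/a?b=1, thanks.'
puts extract_http_urls(sample)
```

The strip is deliberately conservative: it only removes a run of common sentence punctuation at the very end of each match, so periods inside the host or path are untouched.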
Upvotes: 1
Reputation: 3413
The problem is that your regex will include a trailing period, as you are indiscriminately checking for an arbitrary sequence of word characters, slashes, percent signs, hyphens (aka “minus”) and periods. This will catch a trailing period that is in fact punctuation when the URL is at the end of a sentence, and, if people omit the space following the period, anything after that – as CodeGnome correctly stated. You can partly alleviate this issue by excluding trailing punctuation like this (note this will still catch punctuation directly followed by non-URL stuff):
http://\w+(?:[./%-]\w+)+$
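To make the difference concrete, here is a small comparison of the pattern from the question with the one above. One assumption: the `$` anchor is left out of the second pattern, because with `scan` on running text it would only allow matches that end at the end of a line. The constant names and the sample text are invented for this example.

```ruby
# Pattern from the question: '.' is part of the character class, so a
# sentence-ending period is swallowed into the match.
ORIGINAL_PATTERN = %r{http://[\w/.%-]+}i

# The pattern above without the '$' anchor (an assumption made here so that
# scan can find URLs in the middle of a line). Every separator must be
# followed by at least one word character, so a final '.' is left out.
TRIMMED_PATTERN = %r{http://\w+(?:[./%-]\w+)+}i

text = 'Read http://example.com/page. Then http://foo.Any more toast?'

p text.scan(ORIGINAL_PATTERN)
# => ["http://example.com/page.", "http://foo.Any"]
p text.scan(TRIMMED_PATTERN)
# => ["http://example.com/page", "http://foo.Any"]
```

The second match shows the caveat mentioned above: when the space after the period is missing, both patterns still pick up the following word.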
However, this will still miss a large proportion of existing URLs and catch a lot of invalid stuff: URLs are quite complex beasts. If you want a perfect match, John Gruber posted a regex that matches about anything that is used as a URL today, not just http(s) ones. For a closer match of a large crop of web-only URLs, including the HTTPS variant, making sure you have a well formed domain at the start, and catching queries and fragment identifiers, the regex should look something like this:
https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?
This will still catch invalid input and will miss quite a few existing URLs (and an even larger proportion of valid URLs; see the RFC I linked to above), but it will get you closer.
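A quick sketch of that longer pattern used with `scan`, in case it helps to see what it keeps and what it drops. The constant name, the helper `scan_web_urls` and the sample text are assumptions made for this example only.

```ruby
# The web-only pattern from above, wrapped in a constant for reuse.
WEB_URL_PATTERN = %r{https?://[\w-]+(?:\.[\w-]+)+(?:/[\w-]+)*(?:(?:[./%?=&#-]\w+)+)?}i

# Hypothetical helper: return the unique matches found in a piece of text.
def scan_web_urls(text)
  text.scan(WEB_URL_PATTERN).uniq
end

text = 'Docs: https://example.com/docs/page.html?x=1#top. Also http://foo.example.org/bla.'
p scan_web_urls(text)
# => ["https://example.com/docs/page.html?x=1#top", "http://foo.example.org/bla"]

# One of the limits: a trailing slash is not followed by a word character,
# so it is dropped from the match.
p scan_web_urls('http://example.com/path/')
# => ["http://example.com/path"]
```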
Upvotes: 1
Reputation: 84343
People post bad links all the time. Links are also subject to bit-rot.
Have you verified the Tweets manually? Are you sure that the original Tweet doesn't contain a malformed URL? If someone posts:
http://foo.Any more toast?
then you're certainly going to get an invalid result, because the regex relies on whitespace (or some other character outside its character class) to mark the end of the URL. If you want to prune invalid results, you will need a link-checker that can handle redirects to validate each link you find.
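For what it's worth, a link-checker along those lines could look roughly like the sketch below. It is not part of the extractor; the method name `link_alive?`, the redirect limit and the timeouts are all arbitrary choices made for illustration, and it treats any error as "dead", which may be too strict for some uses.

```ruby
require 'net/http'
require 'uri'

MAX_REDIRECTS = 5 # arbitrary limit chosen for this sketch

# Hypothetical helper: true if the URL answers with a 2xx status,
# following up to MAX_REDIRECTS redirects. Any failure counts as dead.
def link_alive?(url, redirects_left = MAX_REDIRECTS)
  return false if redirects_left.negative?

  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 5, read_timeout: 5) do |http|
    http.head(uri.request_uri)
  end

  case response
  when Net::HTTPSuccess
    true
  when Net::HTTPRedirection
    location = response['location']
    return false unless location

    # Resolve relative redirect targets against the current URL.
    link_alive?(URI.join(url, location).to_s, redirects_left - 1)
  else
    false
  end
rescue StandardError
  # Malformed URLs, DNS failures, timeouts and TLS errors all end up here.
  false
end
```

It would slot in after extraction, for example `urls.select { |u| link_alive?(u) }`. Some servers do not answer HEAD requests properly, so falling back to GET may be needed in practice.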
The code you're posting is mine, from CodeGnome/twitter_url_extractor. I deliberately left out link-checking, because I was interested in extracting URLs, not validating them.
"It works for me; your mileage may vary."℠
Upvotes: 3