tapioco123

Reputation: 3545

How to extract URLs from text

How do I extract all URLs from a plain text file in Ruby?

I tried some libraries but they fail in some cases. What's the best way?

Upvotes: 27

Views: 19317

Answers (6)

Jan Klimo

Reputation: 4950

If your input looks similar to this:

"http://i.imgur.com/c31IkbM.gifv;http://i.imgur.com/c31IkbM.gifvhttp://i.imgur.com/c31IkbM.gifv"

i.e. URLs do not necessarily have white space around them, can be delimited by any delimiter, or have no delimiter between them at all, you can use the following approach:

def process_images(raw_input)
  return [] if raw_input.nil?
  urls = raw_input.split('http')
  urls.shift
  urls.map { |url| "http#{url}".strip.split(/[\s\,\;]/)[0] }
end
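A quick usage sketch of the method above, run on the sample input. The method is repeated only so the snippet runs on its own; the `nested` string is a made-up input added to show one limitation of splitting on 'http'.

```ruby
# Same method as above, repeated so the example is self-contained.
def process_images(raw_input)
  return [] if raw_input.nil?
  urls = raw_input.split('http')
  urls.shift
  urls.map { |url| "http#{url}".strip.split(/[\s\,\;]/)[0] }
end

sample = "http://i.imgur.com/c31IkbM.gifv;http://i.imgur.com/c31IkbM.gifvhttp://i.imgur.com/c31IkbM.gifv"
result = process_images(sample)
puts result.inspect
# prints three copies of "http://i.imgur.com/c31IkbM.gifv"

# Hypothetical input: a URL that itself contains "http" is cut in two.
nested = "http://example.com/redirect?to=http://example.org/page"
nested_result = process_images(nested)
puts nested_result.inspect
```

So this works for inputs like the one shown, but a URL that carries another URL in its query string comes back as two entries.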

Hope it helps!

Upvotes: 0

santervo

Reputation: 544

I've used the twitter-text gem:

require "twitter-text"
class UrlParser
    include Twitter::Extractor
end

urls = UrlParser.new.extract_urls("http://stackoverflow.com")
puts urls.inspect

Upvotes: 14

Keon Cummings

Reputation: 1811

require 'uri'
foo = URI.parse("http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg")
# foo is a URI::HTTP object; to_s returns the plain URL string
foo.to_s
# => "http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg"

edit: explanation

For those who are having problems parsing URIs from JSON responses or from a scraping tool like Nokogiri or Mechanize: if you end up with a URI object instead of a string, calling to_s on it gives you the plain URL. This worked for me.

Upvotes: -2

behe

Reputation: 1368

If you like using what's already provided for you in Ruby:

require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]

Read more: http://railsapi.com/doc/ruby-v1.8/classes/URI.html#M004495
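If you only want web URLs, URI.extract also takes an optional list of schemes as its second argument. A minimal sketch; the sample text and the mailto address are made up for illustration. Note that newer Ruby versions mark URI.extract as obsolete and warn about it in verbose mode, so treat this as a quick approach rather than a long-term one.

```ruby
require "uri"

# Hypothetical sample text containing one web URL and one mailto URI.
text = "text here http://foo.example.org/bla and here mailto:someone@example.com and here also."

# Restrict extraction to the http and https schemes.
http_urls = URI.extract(text, %w[http https])
puts http_urls.inspect
# => ["http://foo.example.org/bla"]
```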

Upvotes: 108

Chubas

Reputation: 18053

What cases are failing?

According to the library regexpert, you can use

regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

and then perform a scan on the text.

EDIT: It seems the regexp also matches the empty string. Just remove the initial (^$) alternative and you're done.
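Because the pattern is anchored with ^ and $, it only matches when a whole line is a URL. A rough sketch of using it that way, with the (^$) alternative dropped; the sample text is made up for illustration.

```ruby
# Pattern from above with the initial (^$) alternative removed.
regexp = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix

# Hypothetical input: one candidate per line.
text = <<~TEXT
  http://example.com/path
  this line is not a URL
  https://foo.example.org
TEXT

# Keep only the lines that match the pattern in full.
urls = text.each_line.map(&:strip).select { |line| line.match?(regexp) }
puts urls.inspect
# => ["http://example.com/path", "https://foo.example.org"]
```

A URL embedded in the middle of a sentence would not be picked up this way; for that you would need to drop the anchors and tighten the trailing .* part.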

Upvotes: 5

NullUserException

Reputation: 85478

You can use a regex with .scan():

string.scan(/(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/)

You can get started with that regex and adjust it according to your needs.
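One thing to watch for: when the pattern contains capturing groups, scan returns an array of group arrays rather than plain strings. A sketch of a variant with non-capturing groups, so scan returns the matched URLs directly. The pattern is a simplified adaptation of the one above, not a complete URL grammar, and the sample text is made up.

```ruby
# Hypothetical sample text.
text = "See http://example.com/a?x=1 and https://foo.example.org:8080/bar/baz.html for details"

# Scheme, host, optional port, optional path and query.
# All groups are non-capturing, so scan yields whole matches.
url_pattern = /https?:\/\/[-\w.]+(?::\d+)?(?:\/(?:[\w\/.]*(?:\?\S+)?)?)?/

urls = text.scan(url_pattern)
puts urls.inspect
# => ["http://example.com/a?x=1", "https://foo.example.org:8080/bar/baz.html"]
```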

Upvotes: 9
