Reputation: 61
I heard that URI::extract() only returns links that contain a ":". Since I am grabbing a tweet whose link does not contain a ":", I believe I have to use a regular expression. I need to find a "swoo.sh/whatever" link and store it in a variable. How can I find the first "swoo.sh/whatever" link (apparently the first match is what gets returned automatically) while keeping everything after the "/"? For example, if the tweet says
Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum
how would I grab the swoo.sh link together with everything that comes directly after the "/"?
Upvotes: 1
Views: 685
Reputation: 22315
We can use two facts: URIs can't contain spaces, and Ruby's URI.parse will turn almost anything that looks URI-ish into a URI::Generic. Then we just need to filter out non-web URIs, which I do by assuming that every web URI has to start with something like foo.bar:
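require 'uri'
require 'pathname'

# Split on whitespace, parse each token (nil if it is not a valid URI),
# then keep tokens that have a host or whose first path segment looks like foo.bar.
# u.path can be nil for opaque URIs such as mailto:, hence the to_s.
tweet.
  split.
  map { |s| URI.parse(s) rescue nil }.
  select { |u| u && (u.hostname || Pathname(u.path.to_s).each_filename.first =~ /\w\.\w/) }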
Example output
tweet = 'foo . < google.com bar swoosh.sh/blah?q=bar http://google.com/bar'
# the above returns
# [#<URI::Generic google.com>, #<URI::Generic swoosh.sh/blah?q=bar>, #<URI::HTTP http://google.com/bar>]
This can't really work in general because of ambiguity. "car.net" looks like a shortened link, but in context it could be "my neighbor threw a baseball through my window so i yanked the hubcaps off his car.net gain!!!", where it's clearly just a missing space.
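Applied to the tweet from the question, the filter above could be wrapped up roughly as follows. This is only a sketch: the method name link_candidates is made up for illustration, and the to_s on u.path is an added guard for URIs that have no path.

```ruby
require 'uri'
require 'pathname'

# Hypothetical helper wrapping the chain above; the name is illustrative only.
def link_candidates(tweet)
  tweet.split
       .map { |s| URI.parse(s) rescue nil }
       .select { |u| u && (u.hostname || Pathname(u.path.to_s).each_filename.first =~ /\w\.\w/) }
end

# First candidate in the question's tweet, converted back to a plain string.
first = link_candidates('Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum').first
puts first.to_s

# The ambiguity described above: "car.net" passes the same filter.
puts link_candidates('yanked the hubcaps off his car.net gain!!!').map(&:to_s).inspect
```

Taking .first gives the earliest link-like token in the tweet, and nil when nothing qualifies, so the caller still has to handle the empty case.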
Upvotes: 1
Reputation: 520878
Here is one approach using match:
match = /(\w+\.\w+\/\w+)/.match("Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum")
if match
  puts match[1]
else
  puts "no match"
end
If you also need to capture full URLs at the same time, this answer would have to be updated. It only addresses your immediate question.
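If the host is always swoo.sh, a narrower variant might look like the sketch below. It assumes the link ends at the next whitespace character; the constant and method names are hypothetical, not part of the answer above. Using \S+ instead of \w+ keeps characters such as "-" or "?" that can appear after the slash.

```ruby
# Assumption: the host is always "swoo.sh" and the link runs until whitespace.
SWOOSH_LINK = %r{\bswoo\.sh/(\S+)}

# Returns the first swoo.sh link in the tweet, or nil if there is none.
def first_swoosh_link(tweet)
  tweet[SWOOSH_LINK]
end

# Returns only the part after the slash of the first link, or nil.
def first_swoosh_code(tweet)
  tweet[SWOOSH_LINK, 1]
end

tweet = "Lorem ipsum lorem ipsum swoo.sh/12xfsW lorem ipsum"
puts first_swoosh_link(tweet)
puts first_swoosh_code(tweet)
```

One trade-off of \S+ is that punctuation typed directly after the link, such as a trailing period, becomes part of the match, so it may need trimming afterwards.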
Upvotes: 1