shanahobo86

Reputation: 497

Improving a regex method

I'm trying to improve this method, which searches an actor's wiki page and pulls out all of their film links. At the moment, I'm using Nokogiri to parse the page and regex to retrieve all links with the word "(film)" in the title, but that still misses the majority of the links I need. Does anyone have a suggestion for retrieving more of the relevant links?

    def find_films_by_actor(doca, out = [])
      puts "Entering find_films_by_actor with #{find_name_title(doca)}."
      # Walk every anchor that has an href attribute.
      doca.search('//a[@href]').each do |link|
        link_info = link['href']
        # Keep film links; skip category pages and php URLs.
        next unless link_info.include?("(film)")
        next if link_info.include?("Category:") || link_info.include?("php")
        out << link_info
      end
      out.uniq.collect { |link| strip_out_name(link) }
    end

Upvotes: 0

Views: 127

Answers (1)

pguardiario

Reputation: 54984

I find it cleanest to get at the links you want with a CSS selector (here doc is the parsed Nokogiri document):

links = doc.search 'a[title*="(film)"]'

You can even do nodeset math to narrow them down:

links -= doc.search 'a[title*=foo]'

To get unique names (from text):

links.map(&:text).uniq

Upvotes: 1
