improving a regex method

Question

I'm trying to improve this method, which searches an actor's wiki page and pulls out all of their film links. At the moment, I'm using Nokogiri to parse the page and regex to retrieve all links with the word "(film)" in the title, but that still misses the majority of the links I need. Does anyone have a suggestion for retrieving more of the relevant links?

    def find_films_by_actor(doca, out = [])
      puts "Entering find_films_by_actor with #{find_name_title(doca)}."
      all_links = doca.search('//a[@href]')
      all_links.each do |link|
        link_info = link['href']
        # Keep film links, skipping category pages and any URL containing "php".
        if link_info.include?("(film)") && !(link_info.include?("Category:") || link_info.include?("php"))
          out << link_info
        end
      end
      out.uniq.collect { |link| strip_out_name(link) }
    end

pguardiario · Accepted Answer

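I find it cleanest to get at the links you want with a CSS attribute selector. Here title*= matches every link whose title attribute contains the given substring: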

links = doc.search 'a[title*="(film)"]'

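You can even do node set math to narrow them down, for example by subtracting the links whose title contains a substring you don't want (foo is a placeholder here):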

links -= doc.search 'a[title*=foo]'

To get the unique names (from the link text):

links.map(&:text).uniq
