Reputation: 497
I'm trying to improve this method, which searches an actor's wiki page and pulls out all of their film links. At the moment I'm using Nokogiri to parse the page and a substring check to retrieve every link with "(film)" in its href, but that still misses the majority of the links I need. Does anyone have a suggestion for retrieving more of the relevant links?
def find_films_by_actor(doca, out = [])
  puts "Entering find_films_by_actor with #{find_name_title(doca)}."
  # Walk every anchor that has an href attribute.
  doca.search('//a[@href]').each do |link|
    link_info = link['href']
    # Keep "(film)" hrefs, skipping category pages and php links.
    next unless link_info.include?("(film)")
    next if link_info.include?("Category:") || link_info.include?("php")
    out << link_info
  end
  out.uniq.collect { |link| strip_out_name(link) }
end
Upvotes: 0
Views: 127
Reputation: 54984
I find it cleanest to get at the links you want with CSS selectors:
links = doc.search 'a[title*="(film)"]'
You can even do nodeset math to narrow them down:
links -= doc.search 'a[title*=foo]'
To get unique names (from text):
links.map(&:text).uniq
Upvotes: 1