Reputation: 1966
I am trying to extract contact information from the content pages of a set of web sites (thousands of them). I wanted to ask experts like you before scratching my head over it. All I need is the address, email IDs, phone numbers, and contact person information, if available.
I think you already understand the problem. Yes, it is the formatting... since there is no standard format that websites follow, it's really hard to pinpoint the exact information I need. Some websites are built with Flash contact-us pages, and others present the contact information as images with custom fonts.
Any hints/ideas/suggestions are most welcome...
Thank you....
Upvotes: 6
Views: 443
Reputation: 8553
This is, as you might expect, by no means a trivial task. Here is one way of approaching it:
Use an inverted indexing system such as Lucene/Solr or Sphinx to index the pages. You might need to write your own crawler/spider; Apache Nutch and other crawlers offer spidering out of the box. If the content is fairly static, download the pages to your system locally.
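For instance, a minimal Lucene indexing sketch (Lucene 8/9-era API) might look like the following. The field names, the index path contact-index, and the sample page text are my own placeholders, not part of the answer; a WhitespaceAnalyzer is used because it keeps characters such as "@" and parentheses inside tokens, which the token-level queries in the next step rely on, whereas the StandardAnalyzer would strip them:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        // WhitespaceAnalyzer keeps "@", "(" and ")" as part of tokens;
        // StandardAnalyzer would strip that punctuation away.
        IndexWriterConfig config = new IndexWriterConfig(new WhitespaceAnalyzer());
        try (FSDirectory dir = FSDirectory.open(Paths.get("contact-index"));
             IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            // "url" is stored verbatim; "contents" is tokenized and searchable.
            doc.add(new StringField("url", "http://example.com/contact", Field.Store.YES));
            doc.add(new TextField("contents",
                    "reach us at info@example.com or call (555) 123-4567",
                    Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```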
Once the content is indexed, you could query it for email addresses, telephone numbers, etc. by building Boolean queries such as:

// for email
Contents:@ AND (Contents:.COM OR Contents:.NET)

// for telephone numbers (parentheses)
Contents:"(" OR Contents:")"

Important: the foregoing queries should not be taken literally. You could get even fancier by using Lucene Regex Query & Span Query, which would let you build pretty sophisticated queries.
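As a concrete but still illustrative sketch, here is how such a Boolean query could be assembled and run with the Lucene Java API against the hypothetical contact-index/contents field from the indexing sketch above. Wildcard queries are used instead of plain term queries because whether "@" or ".com" survive as searchable tokens depends entirely on your analyzer:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.FSDirectory;

public class ContactQuery {
    public static void main(String[] args) throws Exception {
        // Roughly: contents has a token containing "@" AND a token ending in ".com" or ".net".
        BooleanQuery.Builder tld = new BooleanQuery.Builder();
        tld.add(new WildcardQuery(new Term("contents", "*.com")), BooleanClause.Occur.SHOULD);
        tld.add(new WildcardQuery(new Term("contents", "*.net")), BooleanClause.Occur.SHOULD);

        BooleanQuery.Builder email = new BooleanQuery.Builder();
        email.add(new WildcardQuery(new Term("contents", "*@*")), BooleanClause.Occur.MUST);
        email.add(tld.build(), BooleanClause.Occur.MUST);
        Query emailQuery = email.build();

        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("contact-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(emailQuery, 10).scoreDocs) {
                // Print the stored URL of each page that looks like it contains an email address.
                System.out.println(searcher.doc(hit.doc).get("url"));
            }
        }
    }
}
```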
Finally, on the result pages: (a) run a result highlighter to get the snippet(s) around the query terms, and (b) run a regex over the snippets to extract the fields of interest.
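A rough sketch of step (b), with deliberately simple patterns (real-world email and phone formats vary far more widely than this):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnippetExtractor {
    // Illustrative patterns only: a basic email shape and a US-style phone number.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern US_PHONE =
        Pattern.compile("\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}");

    public static void main(String[] args) {
        String snippet = "Contact us at info@example.com or call (555) 123-4567.";

        Matcher email = EMAIL.matcher(snippet);
        while (email.find()) {
            System.out.println("email: " + email.group());
        }
        Matcher phone = US_PHONE.matcher(snippet);
        while (phone.find()) {
            System.out.println("phone: " + phone.group());
        }
    }
}
```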
If you have a North American address data set, you could run multiple validation passes: i) check extracted addresses against a mapping provider such as Bing Maps or Google Maps; as far as I know, USPS and others also offer address look-ups, for a fee, to validate US ZIP codes and Canadian postal codes; or ii) run a DNS look-up (for example, checking MX records) on the domains of extracted email addresses, and so on....
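For the email-domain check, a minimal sketch using the JDK's built-in JNDI DNS provider (no third-party library; the domain below is just a placeholder) could look like this:

```java
import java.util.Hashtable;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

public class MxCheck {
    // Returns true if the domain publishes at least one MX record,
    // i.e. it can plausibly receive mail at all.
    static boolean hasMxRecord(String domain) {
        try {
            Hashtable<String, String> env = new Hashtable<>();
            env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
            Attributes attrs = new InitialDirContext(env).getAttributes(domain, new String[] {"MX"});
            Attribute mx = attrs.get("MX");
            return mx != null && mx.size() > 0;
        } catch (Exception e) {
            return false; // NXDOMAIN, timeouts, etc. all count as "not verified"
        }
    }

    public static void main(String[] args) {
        System.out.println(hasMxRecord("example.com"));
    }
}
```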
That should get you started.... like I said, there is no single best solution here; you will need to try multiple approaches and iterate to reach the accuracy level you desire.
Hope this helps.
Upvotes: 10
Reputation: 104
@Mikos is right, you will definitely need multiple approaches. Another possible tool to consider is Web-Harvest. It is a tool for harvesting web data that lets you collect web pages and extract the data you are interested in, all via XML configuration files. The software has a GUI as well as a command-line interface.
It lets you use text/XML manipulation techniques such as XSLT, XQuery, and regular expressions, and you can also build your own plugins. It does, however, mainly focus on HTML/XML-based websites.
Upvotes: 1
Reputation: 328
Conditional Random Fields have been used precisely for tasks like these, and have been fairly successful. You can use CRF++ or the Stanford Named Entity Recognizer. Both can be invoked from the command line without you having to write any explicit code.
In short, you first need to train these algorithms by giving them examples of names, email IDs, etc. from the web pages so that they learn to recognize these things. Once the algorithms have gotten smart (thanks to the examples you gave them), you can run them on your data and see what you get.
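If you would rather call it from code than the command line, a minimal sketch using the Stanford NER Java API might look like the following. The jar and the pre-trained 3-class English model (which tags persons, organizations, and locations out of the box) are assumed to be on the classpath; for things like email IDs and phone numbers you would train your own model as described above:

```java
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerSketch {
    public static void main(String[] args) throws Exception {
        // Path to a serialized classifier shipped with the Stanford NER distribution.
        String model = "classifiers/english.all.3class.distsim.crf.ser.gz";
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(model);

        String text = "Contact John Smith at our New York office.";
        // Wraps recognized entities in inline XML tags, e.g. <PERSON>John Smith</PERSON>.
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}
```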
Don't get scared looking at the Wikipedia page. The packages come with a lot of examples, and you should be up and running in a few hours.
Upvotes: 3