Reputation: 1795
How can I retrieve the "Contact Us" link from the footer of an arbitrary web page in Java?
E.g. by finding a footer element, an element with id="footer", or an element with a footer class?
I tried retrieving all the links from the page using Jsoup and then matching them against the regex .*contact.*
But with this approach I cannot be 100% sure that the fetched link really is the website's contact page.
Q2
Is there a more robust approach, or could I combine the footer link with the approach I already have to conclude that a page is certainly a contact page?
Upvotes: 1
Views: 226
Reputation: 43013
But I cannot be 100% sure on that the fetched link...
You will NEVER be sure.
For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.
I can see some options in your case:
Option 1: Crowdsourcing
Check whether the crowdsourcing platform you choose offers an API.
+ Work is done by humans
+ Adapts dynamically to unknown patterns
- Costs money
- Humans are bad at repetitive tasks
Option 2: AI (pattern searching)
Have a look at Weka for instance or Java-ML.
+ Automated task
+ Can perform a repetitive task for a long time
- May take time to build a robust solution
- Risk of false positives or complete misses
Option 3: Use Jsoup
This option is a never-ending task: you will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website matches none of the known patterns.
+ Automated task
+ Can perform a repetitive task for a long time
- Takes time to study, discover and add new patterns
- Risk of false positives or complete misses
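As a starting point for Option 3, here is a minimal sketch assuming the Jsoup library is on the classpath. The class name FooterContactLinks and the selector list are my own assumptions, not an exhaustive set of footer patterns; it only looks inside elements that look like a footer and keeps links whose text or href mentions "contact".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FooterContactLinks {

    // Footer patterns known so far (an assumption); extend as new layouts turn up.
    private static final String FOOTER_SELECTOR =
            "footer, [id*=footer], [class*=footer]";

    /** Returns absolute URLs of footer links whose text or href mentions "contact". */
    public static List<String> find(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element footer : doc.select(FOOTER_SELECTOR)) {
            for (Element link : footer.select("a[href]")) {
                String text = link.text().toLowerCase(Locale.ROOT);
                String href = link.attr("href").toLowerCase(Locale.ROOT);
                if (text.contains("contact") || href.contains("contact")) {
                    // absUrl resolves the href against the document's base URI.
                    String url = link.absUrl("href");
                    // Nested footer elements can match twice, so skip duplicates.
                    if (!url.isEmpty() && !urls.contains(url)) {
                        urls.add(url);
                    }
                }
            }
        }
        return urls;
    }
}
```

You would typically call it with something like FooterContactLinks.find(Jsoup.connect(url).get()). An empty result is the signal for the monitoring mentioned above: the page matched none of the known patterns and needs a new one (or a human look).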
Option 4: A mix of the three options above
You can run all three options against the websites you target.
+ Reduces the chance of false positives or complete misses
+ More confidence in the final result
- Takes time to study, discover and add new patterns
- Costs money
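To combine the footer check with the regex approach from the question, one possibility is to treat each as an independent signal and add them up. This is only a sketch: the class ContactLinkScore, the weights and the threshold are hypothetical values I picked for illustration, not tuned numbers.

```java
import java.util.Locale;

public class ContactLinkScore {

    /**
     * Adds up independent signals for one link.
     * The weights are illustrative guesses, not tuned values.
     */
    public static int score(String linkText, String href, boolean inFooter) {
        String text = linkText.toLowerCase(Locale.ROOT);
        String url = href.toLowerCase(Locale.ROOT);
        int score = 0;
        if (text.contains("contact")) score += 2; // visible label mentions "contact"
        if (url.contains("contact")) score += 2;  // URL mentions "contact"
        if (inFooter) score += 1;                 // link sits inside a footer element
        return score;
    }

    /** Links below the (hypothetical) threshold go to manual or crowdsourced review. */
    public static boolean needsReview(int score) {
        return score < 4;
    }
}
```

A footer link labelled "Contact Us" pointing at /contact-us scores high and can be accepted automatically, while a "Contact lenses" link in the page body scores low and is handed to a human. It still does not make you 100% sure, it only lets you spend the human effort where the automated signals disagree.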
Upvotes: 2