Reputation: 348
I want to index a website into my collection. Essentially, I want to index my WordPress website by looping through all of the posts' URLs (rough sketch below).
E.g.
url=http://www.szirine.com/blog/2016/02/07/anne-dunn/
Of course, ideally I would want to be able to iteratively index a whole domain or URL path, e.g.
url=http://www.szirine.com/
url=http://www.szirine.com/blog/
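Something like this loop is what I have in mind; a rough sketch only, assuming the WordPress REST API is enabled on the site, where DISCOVERY_USER, DISCOVERY_PASS, ENV_ID, and COLL_ID stand in for my Discovery credentials, and the endpoint and version date are illustrative:

    # Enumerate post URLs via the WordPress REST API, fetch each post,
    # and push the HTML into a Discovery collection.
    # Only the first 100 posts are handled; pagination is omitted.
    SITE="http://www.szirine.com"
    for url in $(curl -s "$SITE/wp-json/wp/v2/posts?per_page=100" | jq -r '.[].link'); do
      curl -s "$url" -o post.html
      # Endpoint and version date are placeholders taken from the Discovery docs.
      curl -s -u "$DISCOVERY_USER:$DISCOVERY_PASS" \
           -F "file=@post.html;type=text/html" \
           "https://gateway.watsonplatform.net/discovery/api/v1/environments/$ENV_ID/collections/$COLL_ID/documents?version=2016-12-01"
    done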
Upvotes: 0
Views: 180
Reputation: 1579
The best solution at present is to use Data Crawler, available on the Discovery Service dashboard in Bluemix.
Data Crawler as of v1.3.0 does not have a native way to crawl websites over HTTP or HTTPS. This may change in a future version of Data Crawler.
For now, though, it is possible to mimic a web crawl by using GNU wget, a widely available HTTP client with a mirroring mode and great documentation, to download a website locally, then upload the files to the Discovery service with Data Crawler's filesystem connector.
To mirror a website, use wget --mirror http://www.example.com. For more information, see the wget documentation.
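For example, a fuller mirroring run might look like the following; the flags are standard GNU wget options and the site URL is taken from the question:

    # --mirror           recursive download with timestamping, infinite depth
    # --page-requisites  also fetch the CSS/images needed to render each page
    # --adjust-extension save HTML pages with an .html suffix
    # --convert-links    rewrite links so the local copy browses correctly
    # --wait=1           pause between requests to be polite to the server
    wget --mirror --page-requisites --adjust-extension --convert-links --wait=1 \
         http://www.szirine.com/

By default the files land in a directory named after the host (www.szirine.com here), which you can then point Data Crawler's filesystem connector at.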
If native web crawling is something you very much want, open a support ticket so that we can gauge how strong the demand is for this feature.
One note: wget for Windows exists, but it is of limited use here because Data Crawler does not support Windows as of v1.3.0.
Upvotes: 2