remkohdev

Reputation: 348

Can I 'Add Document' of type URL to my Collection?

I want to index a website into my collection; essentially, I want to index my WordPress website by looping through all the posts' URLs, as sketched below.

E.g.

url=http://www.szirine.com/blog/2016/02/07/anne-dunn/

Of course, ideally I would want to be able to iteratively index a whole domain or URI, e.g.

url=http://www.szirine.com/
url=http://www.szirine.com/blog/
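
For now I could fetch each post myself and push the HTML into the collection. Here is a rough sketch of what I mean, just to illustrate the goal; the environment ID, collection ID, credentials, and API version date are placeholders for my own values:

    # Rough sketch: fetch each post and add it to a Discovery collection.
    # ENV_ID, COLL_ID, USER, and PASS are placeholders.
    ENV_ID="my-environment-id"
    COLL_ID="my-collection-id"
    USER="my-username"
    PASS="my-password"

    while read -r url; do
      # Download the post's HTML to a temporary file
      curl -s "$url" -o post.html
      # Add it to the collection via the Discovery v1 'add document' endpoint
      curl -u "$USER:$PASS" \
        -F "file=@post.html;type=text/html" \
        "https://gateway.watsonplatform.net/discovery/api/v1/environments/$ENV_ID/collections/$COLL_ID/documents?version=2016-12-01"
    done < post-urls.txt

But it would be much nicer if the service accepted the URL directly.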

Upvotes: 0

Views: 180

Answers (1)

Colin Dean

Reputation: 1579

The best solution at present is to use Data Crawler, available on the Discovery Service dashboard in Bluemix.

Data Crawler as of v1.3.0 does not have a native way to crawl websites over HTTP or HTTPS. This may change in a future version of Data Crawler.

For now, though, you can mimic a web crawl by using GNU wget, a widely available HTTP client with a mirroring mode and excellent documentation: download the website locally, then upload it to the Discovery Service using Data Crawler's filesystem connector.

To mirror a website, use wget --mirror http://www.example.com. For more information, see the wget documentation.
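
For example, here is a sketch of a polite mirroring run; the extra flags are standard wget options, and you should adjust them and the target URL for your own site:

    # Mirror the site into ./www.example.com
    #   --mirror            recursive download with timestamping
    #   --convert-links     rewrite links so the local copy is browsable
    #   --adjust-extension  save HTML pages with an .html extension
    #   --page-requisites   also fetch CSS, images, and other page assets
    #   --wait=1            pause between requests to be polite to the server
    wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 \
      http://www.example.com

    # Then point Data Crawler's filesystem connector at ./www.example.com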

If native web crawling is something you very much want, open a support ticket so that we can gauge the demand for this feature.

One note: wget for Windows exists, but it is of limited use here because Data Crawler does not support Windows as of v1.3.0.

Upvotes: 2
