Blackhawk

Reputation: 57

How to crawl specific data from a website using StormCrawler

I am crawling news websites with StormCrawler (v1.16) and storing the data in Elasticsearch (v7.5.0). My crawler-conf file is the default StormCrawler one, and I am using Kibana for visualization. My issue is the following.

EDIT: I want to add a field to the content index, so I made changes in src/main/resources/parsefilter.json, ES_IndexInit.sh, and crawler-conf.yaml. The XPath I added is correct. I added

"parse.pubDate":"//META[@itemprop=\"datePublished\"]/@content"

in parsefilter.json.
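For context, that line goes inside the params of the XPathFilter entry in parsefilter.json; a trimmed sketch (the parse.title entry is just an illustrative default):

  {
    "com.digitalpebble.stormcrawler.parse.ParseFilters": [
      {
        "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
        "name": "XPathFilter",
        "params": {
          "parse.title": "//TITLE",
          "parse.pubDate": "//META[@itemprop=\"datePublished\"]/@content"
        }
      }
    ]
  }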

In crawler-conf.yaml I added:

  parse.pubDate=PublishDate
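For context, such an entry belongs in the indexer.md.mapping list of crawler-conf.yaml, alongside the stock entries; a sketch:

  indexer.md.mapping:
  - parse.title=title
  - parse.keywords=keywords
  - parse.description=description
  - parse.pubDate=PublishDate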

PublishDate": { "type": "text", "index": false, "store": true}

in properties of ES_IndexInit.sh . But still I am not getting any field named PublishDate in kibana or elasticsearch. ES_IndexInit.sh mapping is as folows:

{
  "mapping": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "PublishDate": {
        "type": "text",
        "index": false,
        "store": true
      },
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "description": {
        "type": "text",
        "store": true
      },
      "domain": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "host": {
        "type": "keyword",
        "store": true
      },
      "keywords": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "title": {
        "type": "text",
        "store": true
      },
      "url": {
        "type": "keyword",
        "store": true
      }
    }
  }
}

Upvotes: 1

Views: 654

Answers (1)

Julien Nioche

Reputation: 4864

One approach to indexing only news pages from a site is to rely on sitemaps, but not all sites will provide these.
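With StormCrawler this typically means feeding the sitemap URLs directly as seeds, assuming the SiteMapParserBolt is part of the topology; a sketch of a seed line, where the tab-separated isSitemap=true metadata tells the bolt to treat the URL as a sitemap:

  https://www.example.com/sitemap.xml	isSitemap=true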

Alternatively, you'd need a mechanism as part of the parsing, for instance a ParseFilter, to determine that a page is a news item, and then filter during indexing based on the presence of a key/value pair in the metadata.
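A minimal sketch of such a ParseFilter, assuming the parse.pubDate metadata extracted above is used as the news signal (the class name and the isNews key are hypothetical):

  import org.w3c.dom.DocumentFragment;

  import com.digitalpebble.stormcrawler.Metadata;
  import com.digitalpebble.stormcrawler.parse.ParseFilter;
  import com.digitalpebble.stormcrawler.parse.ParseResult;

  // Flags a document as a news item when the datePublished metadata
  // extracted by the XPathFilter is present.
  public class NewsDetectionFilter extends ParseFilter {

      @Override
      public void filter(String url, byte[] content, DocumentFragment doc,
              ParseResult parse) {
          Metadata md = parse.get(url).getMetadata();
          // relies on the XPathFilter running first and producing parse.pubDate
          if (md.getFirstValue("parse.pubDate") != null) {
              md.setValue("isNews", "true");
          }
      }
  }

Registered in parsefilter.json after the XPathFilter, and combined with something like indexer.md.filter: "isNews=true" in crawler-conf.yaml, only the pages carrying that key/value pair would reach the index.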

The way it is done in the news crawl dataset from CommonCrawl is that the seed URLs are sitemaps or RSS feeds.

To avoid indexing the page content, simply comment out

  indexer.text.fieldname: "content"

in the configuration.

Upvotes: 2
