Reputation: 1
I am crawling intranet websites with StormCrawler (v2.10) and storing the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags such as:
<meta name="Article_PublishedDate" content="2023-07-14T00:00:00Z" />
<meta name="Article_Year" content="2023" />
<meta name="Article_Heading" content="AWARDS RELEASE 2023" />
<meta name="Article_Description" content="BUSINESS AWARDS RELEASE 2023" />
<meta name="Article_Type" content="PressRelease" />
I want to store these in the Elasticsearch index "crawler-content", but none of these fields show up in Kibana/Elasticsearch.
Updated index script:
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}
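With this mapping in place, one way to confirm whether any indexed document actually carries a custom field is an `exists` query. The Python sketch below only builds the request body; the helper name `build_exists_query` is made up for illustration, and the body is assumed to be sent to `crawler-content/_search`.

```python
import json


def build_exists_query(field: str, size: int = 1) -> dict:
    """Build a search body that matches documents where `field` has a value."""
    return {"size": size, "query": {"exists": {"field": field}}}


# Field name taken from the mapping above; swap in any other metatag.* field.
body = build_exists_query("metatag.article_heading")
print(json.dumps(body, indent=2))
```

If the search returns zero hits for every `metatag.*` field, the values are probably being lost before indexing (at extraction or at the metadata mapping step) rather than being hidden by Kibana.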
In jsoupfilters.json I added:
"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"
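As a sanity check that is independent of StormCrawler, the sketch below pulls the same `name`/`content` pairs out of the sample markup using only Python's standard library. It does not run the XPath expressions or JSoupFilters themselves; `MetaTagCollector`, `EXPECTED`, and `extract` are illustrative names, and the HTML wrapper around the meta tags is an assumption.

```python
from html.parser import HTMLParser

# Sample page built from the meta tags shown in the question.
SAMPLE_HTML = """
<html><head>
<meta name="Article_PublishedDate" content="2023-07-14T00:00:00Z" />
<meta name="Article_Year" content="2023" />
<meta name="Article_Heading" content="AWARDS RELEASE 2023" />
<meta name="Article_Description" content="BUSINESS AWARDS RELEASE 2023" />
<meta name="Article_Type" content="PressRelease" />
</head><body></body></html>
"""

# Metadata key -> value of the meta tag's name attribute it should come from.
EXPECTED = {
    "parse.article_description": "Article_Description",
    "parse.article_heading": "Article_Heading",
    "parse.article_publisheddate": "Article_PublishedDate",
    "parse.article_type": "Article_Type",
    "parse.article_year": "Article_Year",
}


class MetaTagCollector(HTMLParser):
    """Collect the content of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attribute values keep their case.
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("name") is not None and attr.get("content") is not None:
            self.meta[attr["name"]] = attr["content"]


def extract(html: str) -> dict:
    """Return each expected metadata key with its value, or None if the tag is missing."""
    collector = MetaTagCollector()
    collector.feed(html)
    return {key: collector.meta.get(name) for key, name in EXPECTED.items()}


extracted = extract(SAMPLE_HTML)
for key, value in extracted.items():
    print(key, "->", value)
```

Running the same check against the HTML the crawler actually receives (rather than the page as seen in a browser) shows whether the tags are present in the fetched content at all.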
In crawler-conf.yaml I added:
indexer.md.mapping:
- parse.title=title
- parse.search=search
- parse.keywords=keywords
- parse.description=description
- parse.article_description=metatag.article_description
- parse.article_heading=metatag.article_heading
- parse.article_publisheddate=metatag.article_publisheddate
- parse.article_type=metatag.article_type
- parse.article_year=metatag.article_year
- domain
- format
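To make the intended result concrete, here is a minimal Python sketch of how these rules are expected to turn page metadata into the indexed document. It is a model of the configuration, not StormCrawler code: it assumes that an entry without `=` keeps its own name as the field name, and that Elasticsearch treats a dotted target such as `metatag.article_type` as a path into the `metatag` object. `parse_mapping` and `build_document` are hypothetical helpers.

```python
RULES = [
    "parse.title=title",
    "parse.article_heading=metatag.article_heading",
    "parse.article_type=metatag.article_type",
    "parse.article_year=metatag.article_year",
    "domain",
]


def parse_mapping(rules):
    """Turn 'source=target' rules into a dict; a bare 'name' maps to itself (assumed)."""
    mapping = {}
    for rule in rules:
        source, _, target = rule.partition("=")
        mapping[source.strip()] = (target or source).strip()
    return mapping


def build_document(metadata, mapping):
    """Copy mapped metadata values into a nested dict, expanding dotted targets."""
    doc = {}
    for source, target in mapping.items():
        if source not in metadata:
            continue  # a key that was never extracted produces no field at all
        *parents, leaf = target.split(".")
        node = doc
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = metadata[source]
    return doc


# Metadata as it would look if extraction worked for the sample page.
metadata = {
    "parse.article_heading": "AWARDS RELEASE 2023",
    "parse.article_type": "PressRelease",
    "parse.article_year": "2023",
    "domain": "example.com",  # placeholder value
}

doc = build_document(metadata, parse_mapping(RULES))
print(doc)
```

Note that a missing source key silently yields no field, so an empty `metatag` object in the index points at the extraction step rather than at the mapping rules.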
Upvotes: 0
Views: 48
Reputation: 4864
I can't see anything obviously incorrect in your setup. You could run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a single URL to check the extraction. It would also be useful to test the output of the protocol on the command line; see our recent blog for an example.
Upvotes: 0