Reputation: 131
I am working with StormCrawler 1.13 and Elasticsearch 6.5.2, using the TextExtractor. I already exclude script and style tags, and I want to remove header tags in the same way. I applied the configuration below, but it does not take effect for all results. I want to keep h1, h2 and h3 and remove only the tags named header. Any suggestions?
Webpage:
<header id="section-header" class="section section-header">
</header>
<h1 class="title" id="page-title">Good Morning..</h1>
crawlerconf.yaml
textextractor.include.pattern:
- DIV[id="maincontent"]
- DIV[itemprop="articleBody"]
- ARTICLE
textextractor.exclude.tags:
- STYLE
- SCRIPT
- HEADER
- FOOTER
Upvotes: 0
Views: 350
Reputation: 5751
I could not reproduce your issue on my local machine. It may be a configuration problem on your side, or the websites you are referring to may be special cases.
Did you verify that your custom crawler-conf.yaml is properly loaded and that the textextractor.exclude.tags entries are included in the loaded configuration?
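One way to check this is to inspect the configuration map that the topology actually receives. The following is only a minimal sketch in plain Java, not StormCrawler code: the class ExcludeTagsCheck and the method excludesTag are hypothetical names, the map is assumed to hold the list under the key from the question, and the case-insensitive comparison is a convenience of the sketch rather than a statement about how TextExtractor matches tags.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
class ExcludeTagsCheck {

    // Key copied from the configuration shown in the question.
    static final String EXCLUDE_KEY = "textextractor.exclude.tags";

    /** Returns true if the conf lists the given tag under EXCLUDE_KEY. */
    static boolean excludesTag(Map<String, Object> conf, String tag) {
        Object value = conf.get(EXCLUDE_KEY);
        if (value instanceof Collection) {
            // YAML lists usually arrive as a collection of values.
            for (Object o : (Collection<?>) value) {
                if (o != null && tag.equalsIgnoreCase(o.toString().trim())) {
                    return true;
                }
            }
            return false;
        }
        // Fall back to a single scalar value, or a missing key (null).
        return value != null && tag.equalsIgnoreCase(value.toString().trim());
    }

    public static void main(String[] args) {
        // Stand-in for the conf map loaded from crawler-conf.yaml (assumption).
        Map<String, Object> conf = new HashMap<>();
        conf.put(EXCLUDE_KEY, Arrays.asList("STYLE", "SCRIPT", "HEADER", "FOOTER"));

        System.out.println("HEADER excluded: " + excludesTag(conf, "HEADER"));
        System.out.println("H1 excluded: " + excludesTag(conf, "H1"));
    }
}
```

If a check like this returns false for HEADER on the map your bolt receives, the custom file is probably not the one being loaded, or the key is nested differently than expected.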
I took the following steps to try to reproduce your issue. Using the 1.13 release sources of StormCrawler, I added the following test to TextExtractorTest.java:
@Test
public void testRemoveHeaderElements() throws IOException {
    Config conf = new Config();

    // Tags whose content should be dropped from the extracted text
    HashSet<String> excluded = new HashSet<>();
    excluded.add("HEADER");
    excluded.add("FOOTER");
    excluded.add("SCRIPT");
    excluded.add("STYLE");
    conf.put(TextExtractor.EXCLUDE_PARAM_NAME, PersistentVector.create(excluded));

    // Same include patterns as in the question's configuration
    HashSet<String> included = new HashSet<>();
    included.add("DIV[id=\"maincontent\"]");
    included.add("DIV[itemprop=\"articleBody\"]");
    included.add("ARTICLE");
    conf.put(TextExtractor.INCLUDE_PARAM_NAME, PersistentVector.create(included));

    TextExtractor extractor = new TextExtractor(conf);

    String content = "<header id=\"section-header\" class=\"section section-header\"></header><h1 class=\"title\" id=\"page-title\">Good Morning..</h1>";
    Document jsoupDoc = Parser.htmlParser().parseInput(content,
            "http://stormcrawler.net");
    String text = extractor.text(jsoupDoc.body());

    // The h1 text must survive the removal of the header element
    assertEquals("Good Morning..", text);
}
This unit test on the TextExtractor component passes. Next, I uploaded a page with the following HTML code to a locally deployed web server:
<header id="section-header" class="section section-header">
</header>
Good Morning..
The extracted text content is "Good Morning..", which should be fine according to your requirements.
Upvotes: 2