an__snatcher

Reputation: 131

Remove HEADERS from crawl

I am working with StormCrawler 1.13 and Elasticsearch 6.5.2, specifically with the TextExtractor. I already exclude script and style tags, and I want to remove header tags in the same way. I am applying the configuration below, but it is not applied to all results. I want to keep h1, h2 and h3 and remove only the tags named header. Any suggestions?

Webpage:

<header id="section-header" class="section section-header">
</header>

<h1 class="title" id="page-title">Good Morning..</h1>

crawlerconf.yaml

  textextractor.include.pattern:
   - DIV[id="maincontent"]
   - DIV[itemprop="articleBody"]
   - ARTICLE

  textextractor.exclude.tags:
   - STYLE
   - SCRIPT
   - HEADER
   - FOOTER

Upvotes: 0

Views: 350

Answers (1)

rzo1

Reputation: 5751

I could not reproduce your issue on my local machine. It may be a configuration flaw on your side, or the websites you are referring to are special.

Did you verify that your custom crawler-conf.yaml is properly loaded and that textextractor.exclude.tags is included in the loaded configuration?
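One way to make that check concrete is to inspect the configuration map once it has been loaded. The following is only a minimal sketch using plain Java collections: the class name `ExcludeTagsCheck` and the helper `excludesTag` are hypothetical and not part of StormCrawler, and the map in `main` merely stands in for whatever configuration your topology actually receives. The case-insensitive comparison is a choice made for this helper.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class ExcludeTagsCheck {

    // Key name taken from the configuration shown in the question.
    static final String EXCLUDE_KEY = "textextractor.exclude.tags";

    /**
     * Hypothetical helper: returns true if the loaded configuration lists
     * the given tag under the exclude key (compared case-insensitively).
     */
    static boolean excludesTag(Map<String, Object> conf, String tag) {
        Object value = conf.get(EXCLUDE_KEY);
        if (!(value instanceof Collection)) {
            return false; // key missing, or not a list
        }
        for (Object entry : (Collection<?>) value) {
            if (entry != null && entry.toString().equalsIgnoreCase(tag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in for the configuration as it should look after loading.
        Map<String, Object> conf = new HashMap<>();
        conf.put(EXCLUDE_KEY, Arrays.asList("STYLE", "SCRIPT", "HEADER", "FOOTER"));

        System.out.println("HEADER excluded: " + excludesTag(conf, "HEADER"));
        // An empty map models a YAML file that was never picked up.
        System.out.println("HEADER excluded (empty conf): "
                + excludesTag(new HashMap<>(), "HEADER"));
    }
}
```

If the second case is what you see for your real configuration, the YAML file is probably not being loaded at all, which would point to a configuration problem rather than to the TextExtractor.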

I did the following steps trying to reproduce your question:

  1. I checked out the 1.13 release sources of StormCrawler.
  2. I added the following unit test to TextExtractorTest.java:
    @Test
    public void testRemoveHeaderElements() throws IOException {
        Config conf = new Config();
        HashSet<String> excluded = new HashSet<>();
        excluded.add("HEADER");
        excluded.add("FOOTER");
        excluded.add("SCRIPT");
        excluded.add("STYLE");
        conf.put(TextExtractor.EXCLUDE_PARAM_NAME, PersistentVector.create(excluded));

        HashSet<String> included = new HashSet<>();
        included.add("DIV[id=\"maincontent\"]");
        included.add("DIV[itemprop=\"articleBody\"]");
        included.add("ARTICLE");
        conf.put(TextExtractor.INCLUDE_PARAM_NAME, PersistentVector.create(included));

        TextExtractor extractor = new TextExtractor(conf);

        String content = "<header id=\"section-header\" class=\"section section-header\"></header><h1 class=\"title\" id=\"page-title\">Good Morning..</h1>";

        Document jsoupDoc = Parser.htmlParser().parseInput(content,
                "http://stormcrawler.net");
        String text = extractor.text(jsoupDoc.body());

        assertEquals("Good Morning..", text);
    }

This unit test on the TextExtractor component passes. Next, I uploaded a page with the following HTML code to a locally deployed web server:

<header id="section-header" class="section section-header">
</header>



<h1 class="title" id="page-title">Good Morning..</h1>


The extracted text content is: Good Morning.., which should be fine according to your requirements.
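To illustrate why excluding HEADER should not affect h1, h2 or h3, here is a toy model of tag-based exclusion during text extraction. It is only a sketch under stated assumptions: the `Node` class and the `text` method are hypothetical stand-ins, not jsoup or the StormCrawler implementation, and the "Site menu" text inside the header is invented so that the exclusion has something visible to drop. The point is that exclusion compares whole tag names, so `header` does not match `h1`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ExcludeTagsSketch {

    /** Toy element: a tag name, its own text, and child elements. */
    static class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text, Node... kids) {
            this.tag = tag;
            this.text = text;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Collects text depth-first, skipping any element whose tag is excluded. */
    static String text(Node root, Set<String> excludedLowerCase) {
        StringBuilder sb = new StringBuilder();
        collect(root, excludedLowerCase, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, Set<String> excluded, StringBuilder sb) {
        // Whole-name comparison: "header" is excluded, "h1" is not.
        if (excluded.contains(node.tag.toLowerCase(Locale.ROOT))) {
            return; // drop the element and everything below it
        }
        if (!node.text.isEmpty()) {
            sb.append(node.text).append(' ');
        }
        for (Node child : node.children) {
            collect(child, excluded, sb);
        }
    }

    public static void main(String[] args) {
        Set<String> excluded =
                new HashSet<>(Arrays.asList("header", "footer", "script", "style"));
        Node body = new Node("body", "",
                new Node("header", "Site menu"),   // invented header content
                new Node("h1", "Good Morning.."));
        System.out.println(text(body, excluded)); // prints "Good Morning.."
    }
}
```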

Upvotes: 2
