Blackhawk

Reputation: 57

How to add more XPath expressions to parsefilter.json in StormCrawler

I am using StormCrawler (v1.16) and Elasticsearch (v7.5.0) to extract data from about 5k news websites. I have added some XPath patterns for extracting the author name to parsefilter.json, which is shown below:

{

  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
            "//*[@name=\"description\"]/@content",
            "//*[@name=\"Description\"]/@content"
         ],
        "parse.title": [
            "//TITLE",
            "//META[@name=\"title\"]/@content"
         ],
         "parse.keywords": "//META[@name=\"keywords\"]/@content",
        "parse.datePublished": "//META[@itemprop=\"datePublished\"]/@content",
        "parse.author":[
        "//META[@itemprop=\"author\"]/@content",
        "//input[@id=\"authorname\"]/@value",
        "//META[@name=\"article:author\"]/@content",
        "//META[@name=\"author\"]/@content",
        "//META[@name=\"byline\"]/@content",
        "//META[@name=\"dc.creator\"]/@content",
        "//META[@name=\"byl\"]/@content",
        "//META[@itemprop=\"authorname\"]/@content",
        "//META[@itemprop=\"article:author\"]/@content",
        "//META[@itemprop=\"byline\"]/@content",
        "//META[@itemprop=\"dc.creator\"]/@content",
        "//META[@rel=\"authorname\"]/@content",
        "//META[@rel=\"article:author\"]/@content",
        "//META[@rel=\"byline\"]/@content",
        "//META[@rel=\"dc.creator\"]/@content",
        "//META[@rel=\"author\"]/@content",
        "//META[@id=\"authorname\"]/@content",
        "//META[@id=\"byline\"]/@content",
        "//META[@id=\"dc.creator\"]/@content",
        "//META[@id=\"author\"]/@content",
        "//META[@class=\"authorname\"]/@content",
        "//META[@class=\"article:author\"]/@content",
        "//META[@class=\"byline\"]/@content",
        "//META[@class=\"dc.creator\"]/@content",
        "//META[@class=\"author\"]/@content"
        ]
      }
    },

I have also changed crawler-conf.yaml as shown below:

    indexer.md.mapping:
    - parse.author=author
    metadata.persist:
    - author

The issue I am facing: I only get results for the first pattern of "parse.author" (i.e. //META[@itemprop="author"]/@content). What changes should I make so that all patterns are taken as input?

Upvotes: 0

Views: 101

Answers (1)

Tomalak

Reputation: 338406

What changes should I make so that all patterns are taken as input?

I read this as "How can I make a single XPath expression that tries all different ways an author can appear in the document?"

Simplest approach: join all the expressions you already have into a single one with the XPath union operator |:

input[...]|meta[...]|meta[...]|meta[...]

And since this potentially selects more than one node, we could state explicitly that we only care about the first match:

(input[...]|meta[...]|meta[...]|meta[...])[1]

This probably works but it will be very long and hard to read. XPath can do better.
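Before shortening anything, the union-plus-[1] form can be sanity-checked outside the crawler. The following is a minimal sketch using the XPath 1.0 engine that ships with the JDK, not StormCrawler itself; the sample page, the class name and the lowercase element names are assumptions made for illustration (the DOM that StormCrawler hands to the filter may use different casing, as the uppercase META in the question suggests).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnionFirstMatchDemo {
    // Hypothetical sample page, kept well-formed so the JDK XML parser accepts it.
    static final String SAMPLE =
        "<html><head><meta name='author' content='Jane Doe'/></head>"
      + "<body><input id='authorname' value='J. Doe'/></body></html>";

    // Two of the alternatives from the question, joined with the union operator.
    static final String UNION =
        "//input[@id='authorname']/@value | //meta[@name='author']/@content";

    // Evaluates an XPath 1.0 expression against the XML text and returns the selected nodes.
    static NodeList select(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of nodes the expression selects.
    static int count(String xml, String expr) {
        return select(xml, expr).getLength();
    }

    // Value of the first selected node, or "" when nothing matched.
    static String first(String xml, String expr) {
        NodeList nodes = select(xml, expr);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getNodeValue();
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, UNION));               // 2: both alternatives match
        System.out.println(count(SAMPLE, "(" + UNION + ")[1]")); // 1: only the first node is kept
        System.out.println(first(SAMPLE, "(" + UNION + ")[1]")); // Jane Doe
    }
}
```

Note that a union is returned in document order, so [1] picks whichever matching node comes first in the page (here the meta in the head), not whichever alternative is written first in the expression.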

Your expressions are all pretty repetitive, which is a good starting point for reducing the size of the expression. For example, these two are the same except for the attribute value:

//meta[@class='author']/@content|//meta[@class='authorname']/@content

We could use or and it would get shorter already:

//meta[@class='author' or @class='authorname']/@content

But when you have 5 or 6 potential values, it is still pretty long. Next try, a predicate for the attribute:

//meta[@class[.='author' or .='authorname']]/@content

A little shorter, as we don't need to type @class all the time. But still pretty long with 5 or 6 potential values. How about a value list and a substring search (I'm using / as a delimiter character):

//meta[contains(
    '/author/authorname/',
    concat('/', @class, '/')
)]/@content
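To see why the delimiters matter, here is a small sketch that compares the or form, the delimited value list, and a naive list without the concat() wrapper. It runs on the JDK's built-in XPath 1.0 engine; the sample document and the class name are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ValueListDemo {
    // Hypothetical input: two wanted values, one near miss ("auth"), one meta without @class.
    static final String SAMPLE =
        "<head>"
      + "<meta class='author' content='A'/>"
      + "<meta class='authorname' content='B'/>"
      + "<meta class='auth' content='C'/>"
      + "<meta content='D'/>"
      + "</head>";

    static final String OR_FORM =
        "//meta[@class='author' or @class='authorname']/@content";

    // Value list with '/' around every entry and around the tested value.
    static final String LIST_FORM =
        "//meta[contains('/author/authorname/', concat('/', @class, '/'))]/@content";

    // Same list, but the tested value is not wrapped in delimiters.
    static final String NAIVE_FORM =
        "//meta[contains('/author/authorname/', @class)]/@content";

    // Counts the nodes an XPath 1.0 expression selects in the given XML text.
    static int count(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, OR_FORM));    // 2
        System.out.println(count(SAMPLE, LIST_FORM));  // 2, same nodes as the or form
        System.out.println(count(SAMPLE, NAIVE_FORM)); // 4, "auth" and the missing @class slip through
    }
}
```

Without the delimiters, any substring of the list matches, and a missing attribute turns into the empty string, which contains() finds in every string. With them, a missing attribute becomes '//', which never occurs in the list.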

Now we can easily expand the list of valid values, and even look at different attributes, too:

//meta[contains(
    '/author/authorname/article:author/',
    concat('/', @class|@id , '/')
)]/@content

And since we're looking for almost the same possible strings across multiple possible attributes, we could use a fixed list of values that all possible attributes are checked against:

//meta[
    contains(
        '/author/article:author/authorname/dc.creator/byline/byl/',
        concat('/', @name|@itemprop|@rel|@id|@class, '/')
    )
]/@content
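A quick way to convince yourself that this one expression covers the whole grid of attributes and values is to generate a tiny document for every combination and evaluate the expression against each. This is only a sketch on the JDK's XPath 1.0 engine; the class name, the generated documents and the content value are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeValueGridDemo {
    static final String[] ATTRS = {"name", "itemprop", "rel", "id", "class"};
    static final String[] VALUES =
        {"author", "article:author", "authorname", "dc.creator", "byline", "byl"};

    // The compact expression from the answer, written on one line.
    static final String COMPACT =
        "//meta[contains('/author/article:author/authorname/dc.creator/byline/byl/', "
      + "concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content";

    // Evaluates the expression and returns the string value of the result ("" if no match).
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds one single-meta document per attribute/value pair and counts the hits.
    static int countMatches() {
        int matched = 0;
        for (String attr : ATTRS) {
            for (String value : VALUES) {
                String xml = "<head><meta " + attr + "='" + value
                        + "' content='Jane Doe'/></head>";
                if ("Jane Doe".equals(eval(xml, COMPACT))) {
                    matched++;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        System.out.println(countMatches()); // 30, i.e. 5 attributes x 6 values
        // An unrelated meta is not picked up.
        System.out.println(
            eval("<head><meta name='description' content='x'/></head>", COMPACT).isEmpty());
    }
}
```

The grid is a superset of the 24 patterns in the question (for example, it also accepts @name='authorname'), so it may match slightly more than the original list did.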

Combined with the first two points, we could end up with this:

(
    //meta[
        contains(
            '/author/article:author/authorname/dc.creator/byline/byl/',
            concat('/', @name|@itemprop|@rel|@id|@class, '/')
        )
    ]/@content
    |
    //input[
        @id='authorname'
    ]/@value
)[1]
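The combined expression can be tried out the same way. The sketch below evaluates it with the JDK's XPath 1.0 engine against a few made-up pages, and also collapses it to a single line, since a JSON string such as the one in parsefilter.json cannot contain raw line breaks. The class name and sample pages are hypothetical, and whether StormCrawler's own DOM needs uppercase element names instead is not something this sketch can tell you.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CombinedExpressionDemo {
    // The combined expression from the answer, with its original line breaks.
    static final String COMBINED =
        "(\n"
      + "    //meta[\n"
      + "        contains(\n"
      + "            '/author/article:author/authorname/dc.creator/byline/byl/',\n"
      + "            concat('/', @name|@itemprop|@rel|@id|@class, '/')\n"
      + "        )\n"
      + "    ]/@content\n"
      + "    |\n"
      + "    //input[\n"
      + "        @id='authorname'\n"
      + "    ]/@value\n"
      + ")[1]";

    // Single-line variant; safe here because no string literal in it contains whitespace.
    static final String ONE_LINE = COMBINED.replaceAll("\\s+", " ").trim();

    // Hypothetical pages: meta plus input, input only, and no author information at all.
    static final String META_AND_INPUT =
        "<html><head><meta name='description' content='News'/>"
      + "<meta itemprop='byline' content='Jane Doe'/></head>"
      + "<body><input id='authorname' value='J. Doe'/></body></html>";
    static final String INPUT_ONLY =
        "<html><head/><body><input id='authorname' value='J. Doe'/></body></html>";
    static final String NO_AUTHOR =
        "<html><head><meta name='description' content='News'/></head><body/></html>";

    // Evaluates the expression and returns the string value of the result ("" if no match).
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(META_AND_INPUT, COMBINED)); // Jane Doe
        System.out.println(eval(INPUT_ONLY, COMBINED));     // J. Doe
        System.out.println(eval(NO_AUTHOR, COMBINED).isEmpty()); // true
        System.out.println(eval(META_AND_INPUT, ONE_LINE)); // Jane Doe, same as the multi-line form
    }
}
```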

Caveat: This only works as expected when a <meta> never carries more than one of these attributes (e.g. both @name and @rel), or when those it carries all have the same value. Otherwise concat('/', @name|@itemprop|@rel|@id|@class, '/') might pick the wrong one, because converting a node-set to a string uses only its first node. It's a calculated risk; I think it's unusual for this to happen in HTML. But you need to decide, since you're the one who knows your input data.
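If that risk turns out to matter for your data, one possible variation (not part of the expression above, just a sketch) is to move the contains() test into a predicate on the attributes, so each attribute is checked on its own. The example below runs on the JDK's XPath 1.0 engine; the sample document and class name are invented, and which attribute the original expression picks for such elements can depend on the XPath implementation, so no output is claimed for it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CaveatDemo {
    // Hypothetical metas that carry two of the attributes with different values.
    static final String SAMPLE =
        "<head>"
      + "<meta name='citation' rel='author' content='Jane Doe'/>"
      + "<meta rel='citation' name='author' content='John Roe'/>"
      + "</head>";

    static final String LIST = "'/author/article:author/authorname/dc.creator/byline/byl/'";

    // Expression from the answer: only the first attribute of the union is compared.
    static final String ORIGINAL =
        "//meta[contains(" + LIST
      + ", concat('/', @name|@itemprop|@rel|@id|@class, '/'))]/@content";

    // Variation: the predicate runs once per attribute, so any one of them may match.
    static final String PER_ATTRIBUTE =
        "//meta[(@name|@itemprop|@rel|@id|@class)[contains(" + LIST
      + ", concat('/', ., '/'))]]/@content";

    // Counts the nodes an XPath 1.0 expression selects in the given XML text.
    static int count(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // May miss one or both metas, depending on which attribute comes first.
        System.out.println(count(SAMPLE, ORIGINAL));
        // Finds both metas regardless of attribute order.
        System.out.println(count(SAMPLE, PER_ATTRIBUTE));
    }
}
```

The trade-off is a slightly longer expression in exchange for not depending on attribute order.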

Upvotes: 2
