Biju George

Reputation: 1

Storm Crawler to fetch urls with query string

I am new to StormCrawler. I have configured it to fetch and parse the URL "https://pubmed.ncbi.nlm.nih.gov/18926286/". However, what I actually need to crawl is https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed. When I supply this URL (I tried both the memory spout and URLFrontier), the output is the same as for https://pubmed.ncbi.nlm.nih.gov/18926286/. Is there a specific setting needed to accept query strings?

I was expecting the same output as when I open https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed in a browser.

As for urlfilters.json, I do not think I changed anything from the defaults. This is its content:

{
    "com.digitalpebble.stormcrawler.filtering.URLFilters": [
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
            "name": "BasicURLFilter",
            "params": {
                "maxPathRepetition": 3,
                "maxLength": 1024
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
            "name": "MaxDepthFilter",
            "params": {
                "maxDepth": 0
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
            "name": "BasicURLNormalizer",
            "params": {
                "removeAnchorPart": true,
                "unmangleQueryString": true,
                "checkValidURI": true,
                "removeHashes": true,
                "hostIDNtoASCII": true
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
            "name": "HostURLFilter",
            "params": {
                "ignoreOutsideHost": true,
                "ignoreOutsideDomain": true
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer",
            "name": "RegexURLNormalizer",
            "params": {
                "regexNormalizerFile": "default-regex-normalizers.xml"
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
            "name": "RegexURLFilter",
            "params": {
                "regexFilterFile": "default-regex-filters.txt"
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter",
            "name": "SelfURLFilter"
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
            "name": "SitemapFilter"
        }
    ]
}
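For context, here is an editor's sketch (not StormCrawler code) of why the query string might disappear: none of the filters configured above is meant to remove query strings, but a rule in a regex normalizer file such as default-regex-normalizers.xml could. The difference between anchor removal (harmless to the query) and a hypothetical query-stripping rule (which would collapse both URLs into one) looks like this:

```python
import re

# Hypothetical illustration, NOT StormCrawler's actual implementation.

def remove_anchor(url: str) -> str:
    # BasicURLNormalizer's removeAnchorPart drops only the '#fragment';
    # the '?query' part is untouched.
    return url.split("#", 1)[0]

def strip_query(url: str) -> str:
    # A regex normalizer rule equivalent to replacing '\?.*' with ''
    # would remove the whole query string.
    return re.sub(r"\?.*$", "", url)

base = "https://pubmed.ncbi.nlm.nih.gov/18926286/"
with_query = base + "?format=pubmed"

print(remove_anchor(with_query))  # query survives anchor removal
print(strip_query(with_query))    # a query-stripping rule yields the base URL
```

If a rule like the second one matched, the crawler would fetch the base URL regardless of which variant was injected, which matches the symptom described above.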

Upvotes: 0

Views: 80

Answers (1)

Julien Nioche

Reputation: 4864

I have just merged a PR, see https://github.com/DigitalPebble/storm-crawler/pull/1081.

This should help you test the filtering steps on the URL passed as input. It should be possible to pass a source URL as a second argument.

You could try pulling the main branch, recompiling StormCrawler, and making your topology code depend on the snapshot version.
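For reference, depending on the snapshot would mean rebuilding with Maven (`mvn clean install` in the checked-out repository) and pointing the topology's pom.xml at the snapshot version, along these lines (the version shown here is a placeholder; use whatever version the main branch's pom.xml declares):

```xml
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <!-- placeholder: match the version declared on the main branch -->
    <version>2.x-SNAPSHOT</version>
</dependency>
```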

You can test the metadata from fetching with:

storm local target/xxxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed"

(The URL is quoted so the shell does not treat `?` as a glob character.)

Just to check, what do you use for indexing the content? Is the URL incorrect there, or in the status stream (i.e. URLFrontier)?

Upvotes: 0
