Biju George

Reputation: 1

Storm Crawler to fetch urls with query string

I am new to StormCrawler. I have configured it to fetch and parse the URL "https://pubmed.ncbi.nlm.nih.gov/18926286/". However, what I actually need to crawl is https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed. When I supply this URL (I tried both the memory spout and URLFrontier), the output is the same as for https://pubmed.ncbi.nlm.nih.gov/18926286/. Is there a specific setting needed to accept query strings?

I was expecting the same output as when I open https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed in a browser.

As for urlfilters.json, I do not think I changed anything from the defaults. This is its content:

{
    "com.digitalpebble.stormcrawler.filtering.URLFilters": [
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter",
            "name": "BasicURLFilter",
            "params": {
                "maxPathRepetition": 3,
                "maxLength": 1024
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
            "name": "MaxDepthFilter",
            "params": {
                "maxDepth": 0
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer",
            "name": "BasicURLNormalizer",
            "params": {
                "removeAnchorPart": true,
                "unmangleQueryString": true,
                "checkValidURI": true,
                "removeHashes": true,
                "hostIDNtoASCII": true
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
            "name": "HostURLFilter",
            "params": {
                "ignoreOutsideHost": true,
                "ignoreOutsideDomain": true
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer",
            "name": "RegexURLNormalizer",
            "params": {
                "regexNormalizerFile": "default-regex-normalizers.xml"
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter",
            "name": "RegexURLFilter",
            "params": {
                "regexFilterFile": "default-regex-filters.txt"
            }
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter",
            "name": "SelfURLFilter"
        },
        {
            "class": "com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter",
            "name": "SitemapFilter"
        }
    ]
}
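For context, here is an editor's sketch (not StormCrawler code) of why the query string might disappear: none of the filters configured above is meant to remove query strings, but a rule in a regex normalizer file such as default-regex-normalizers.xml could. The difference between anchor removal (harmless to the query) and a hypothetical query-stripping rule (which would collapse both URLs into one) looks like this:

```python
import re

# Hypothetical illustration, NOT StormCrawler's actual implementation.

def remove_anchor(url: str) -> str:
    # BasicURLNormalizer's removeAnchorPart drops only the '#fragment';
    # the '?query' part is untouched.
    return url.split("#", 1)[0]

def strip_query(url: str) -> str:
    # A regex normalizer rule equivalent to replacing '\?.*' with ''
    # would remove the whole query string.
    return re.sub(r"\?.*$", "", url)

base = "https://pubmed.ncbi.nlm.nih.gov/18926286/"
with_query = base + "?format=pubmed"

print(remove_anchor(with_query))  # query survives anchor removal
print(strip_query(with_query))    # a query-stripping rule yields the base URL
```

If a rule like the second one matched, the crawler would fetch the base URL regardless of which variant was injected, which matches the symptom described above.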

Upvotes: 0

Views: 80

Answers (1)

Julien Nioche

Reputation: 4864

I have just merged a PR, see https://github.com/DigitalPebble/storm-crawler/pull/1081.

This should help you test the filtering steps on the URL passed as input. It should be possible to pass a source URL as a second argument.

You could try pulling the main branch, recompiling StormCrawler, and making your topology code depend on the snapshot version.
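For reference, depending on the snapshot would mean rebuilding with Maven (`mvn clean install` in the checked-out repository) and pointing the topology's pom.xml at the snapshot version, along these lines (the version shown here is a placeholder; use whatever version the main branch's pom.xml declares):

```xml
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-core</artifactId>
    <!-- placeholder: match the version declared on the main branch -->
    <version>2.x-SNAPSHOT</version>
</dependency>
```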

You can test the metadata from fetching with:

storm local target/xxxx-1.0-SNAPSHOT.jar com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol "https://pubmed.ncbi.nlm.nih.gov/18926286/?format=pubmed"

(The URL is quoted so the shell does not treat `?` as a glob character.)

Just to check, what do you use for indexing the content? Is the URL incorrect there, or in the status stream (i.e. URLFrontier)?

Upvotes: 0
