Luis Solorzano

Reputation: 1

Issues with Mapping Discovered URLs to Original URLs in Katana CLI Batch Processing

I'm using the Katana CLI for web crawling, with a Python wrapper to manage batch processing and output parsing. My goal is to map all discovered URLs back to their original URLs, but I'm facing issues where some discovered URLs don't map correctly, especially when domains are similar or subdomains are involved.

Here’s the setup I have:

import logging
import os
import subprocess
import time
from typing import Dict
from urllib.parse import urlparse


class KatanaData:
    def __init__(self, domain: str, original_url: str, id: str):
        self._original_url = original_url
        self._domain = domain
        self._id = id
        self._discovered_urls = []  # (discovered_url, status) tuples filled in by parse_srd
        self._error = None
        self._processing_time = None

def run_katana(batch: Dict, timeout=120):
    url_list = create_url_list(batch)

    tmp = 'tmp'
    output_dir = f'{tmp}/SRD_output'
    
    cmd = [
        'katana', '-u', url_list, '-headless', '-headless-options', '--disable-gpu',
        '-field-scope', 'dn', '-depth', '5', '-extension-filter', 'css',
        '-timeout', '10', '-crawl-duration', f'{timeout}s', '-srd', output_dir
    ]
    error_message = None
    start_time = time.time()
    process = None
    try:
        process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, timeout=timeout)
        processing_time = round(time.time() - start_time, 2)

        if process.returncode == 0:
            pass  # success - discovered URLs are collected from the SRD output in the finally block
        else:
            log_with_location(f'Batch for {url_list} failed in {processing_time}s')
            error_message = process.stderr.strip()
    except subprocess.TimeoutExpired:
        if process:
            process.kill()
        log_with_location(f'Batch for {url_list} timed out after {timeout} seconds')
        error_message = "Process timed out"
    finally:
        try:
            parse_srd(output_dir, batch)
        except (FileNotFoundError, ValueError) as e:
            error_message = str(e)
        kill_katana_processes()
        cleanup_temp(tmp)
    return error_message

def parse_srd(output_dir, batch: Dict):
    log_with_location(f'Starting parse_srd for {len(batch)} urls')
    output_file = f'{output_dir}/index.txt'
    if not os.path.exists(output_file):
        log_with_location(f'{output_file} not found in {output_dir}', logging.ERROR)
        raise FileNotFoundError(f'{output_file} not found')

    with open(output_file, 'r') as file:
        lines = file.readlines()
        for line in lines:
            # Each index.txt line: <stored request/response path> <discovered url> <status>
            parts = line.split()
            if len(parts) >= 3:
                file_path = parts[0]
                discovered_url = parts[1]
                status = parts[2]
                original_url = find_original_url(discovered_url, batch)
                if original_url:
                    batch[original_url]._discovered_urls.append((discovered_url, status))
                else:
                    print(f"Warning: Original URL not found for discovered URL {discovered_url}")

def find_original_url(discovered_url, batch: Dict):
    # Current heuristic: exact host match, or the discovered host is a subdomain of the original host
    discovered_netloc = urlparse(discovered_url).netloc
    for domain, katanaData in batch.items():
        original_netloc = urlparse(katanaData._original_url).netloc
        if discovered_netloc == original_netloc or discovered_netloc.endswith(f".{original_netloc}"):
            return domain
    return None

def create_url_list(batch: Dict):
    return ','.join([data._original_url for data in batch.values()])

Input: powerui.foo.com, acnmll-en.foo.com
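
For reference, batch maps each domain to its KatanaData object. It's built roughly like this (simplified; the https:// scheme and the ids here are placeholders, my real code generates them differently):

def build_batch(domains):
    # Simplified construction: key the batch by domain and assume https seeds
    batch = {}
    for i, domain in enumerate(domains):
        batch[domain] = KatanaData(domain=domain,
                                   original_url=f'https://{domain}',
                                   id=str(i))
    return batch


batch = build_batch(['powerui.foo.com', 'acnmll-en.foo.com'])
error = run_katana(batch)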

The -srd flag for Katana stores the HTTP requests/responses in a custom directory. It also creates an index.txt file with three columns: the location of the stored request/response, the discovered URL, and the status.

ex:

tmp/SRD_output/powerui.foo.com/87ef37260d0375e204e8e16d00768920fb2bc5eb.txt https://powerui.foo.com/powerui/vendor/bower-asset/masonry/dist/masonry.pkgd.min.js?v20240706122948 (OK)
tmp/SRD_output/newsroom.foo.com/85e443613511a3a41f1d20568d5ecc8506b57e43.txt https://newsroom.foo.com/scripts/scripts.js (OK)
tmp/SRD_output/investor.foo.com/78e3fd823de018fa2323b59ae7ce34b54d8e90d0.txt https://investor.foo.com/javascripts/home-banner.js?revision=d6483dcf-da80-4ab1-a5b7-a507a5dcd426 (OK)

How can I ensure that each discovered URL is correctly mapped back to its original URL, especially when dealing with similar domains? Are there better strategies for handling domain matching in this scenario? Is there a way to enhance the find_original_url function to handle these cases more accurately?

What I've tried so far:

Simple domain matching - using exact domain matching, or checking whether the discovered domain ends with the original domain.
Subdomain handling - considering subdomains, but still facing issues when domains are similar (a fallback sketch is at the end of the post).
JSONL flag - this outputs in JSONL format, where I can sometimes see source URLs. Responses can either be valid or an error for each JSONL line.
Processing one URL at a time - this is extremely slow, and I want to leverage the concurrency and parallelism that this tool has.

With the current find_original_url I get warnings like:

Warning: Original URL not found for discovered URL https://newsroom.foo.com/scripts/scripts.js
Warning: Original URL not found for discovered URL https://investor.foo.com/javascripts/home-banner.js?revision=d6483dcf-da80-4ab1-a5b7-a507a5dcd426

The issue is that some discovered URLs do not map back to their original URLs correctly, especially when domains are similar. For example, with two original URLs on powerui.foo.com and acnmll-en.foo.com, discovered URLs from other subdomains or similar domains (such as newsroom.foo.com above) are sometimes not mapped correctly, or not mapped at all.
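
One direction I've been sketching (not sure it's the right approach) is to match on the exact host first and only then fall back to the registered domain, using the third-party tldextract package. This is just a rough sketch; find_original_url_v2 is my own name, not anything from Katana:

from typing import Dict
from urllib.parse import urlparse

import tldextract  # third-party: pip install tldextract


def find_original_url_v2(discovered_url: str, batch: Dict):
    # Prefer an exact host match; otherwise fall back to the first seed whose
    # registered domain matches (ambiguous if several seeds share that domain).
    discovered_netloc = urlparse(discovered_url).netloc
    discovered_reg = tldextract.extract(discovered_url).registered_domain

    fallback = None
    for key, katana_data in batch.items():
        original_url = katana_data._original_url
        if urlparse(original_url).netloc == discovered_netloc:
            return key  # exact host match wins
        if fallback is None and tldextract.extract(original_url).registered_domain == discovered_reg:
            fallback = key  # remember the first registered-domain match
    return fallback

The problem is that the registered-domain fallback becomes ambiguous as soon as two seeds share the same registered domain (powerui.foo.com and acnmll-en.foo.com both reduce to foo.com), so it just picks the first seed, which may be wrong. That is essentially the gap I'm trying to close.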

Upvotes: 0

Views: 23

Answers (0)
