William Johnson

Reputation: 53

How to find the sitemap of each domain and subdomain using Python

I want to know how to find the sitemap of each domain and subdomain using Python. Some examples:

abcd.com/sitemap.xml
abcd.com/sitemap.html
sub.abcd.com/sitemap.xml

And so on.

What are the most probable sitemap names, locations, and extensions?

Upvotes: 1

Views: 1442

Answers (3)

Tajs

Reputation: 621

I've used a small function to find sitemaps by the most common names.

Sitemap naming stats: https://dret.typepad.com/dretblog/2009/02/sitemap-names.html

import requests

def get_sitemap_bruto_force(website):
    # Try the most common sitemap paths and return the first one that answers with HTTP 200.
    potential_sitemaps = [
        "sitemap.xml",
        "feeds/posts/default?orderby=updated",
        "sitemap.xml.gz",
        "sitemap_index.xml",
        "s2/sitemaps/profiles-sitemap.xml",
        "sitemap.php",
        "sitemap_index.xml.gz",
        "vb/sitemap_index.xml.gz",
        "sitemapindex.xml",
        "sitemap.gz"
    ]

    for sitemap in potential_sitemaps:
        try:
            sitemap_response = requests.get(f"{website}/{sitemap}")
            if sitemap_response.status_code == 200:
                return [sitemap_response.url]
        except requests.RequestException:
            continue
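
For example, calling it on a bare domain (example.com is just a placeholder here) might look like:

sitemap_url = get_sitemap_bruto_force("https://example.com")
print(sitemap_url)  # e.g. ['https://example.com/sitemap.xml'], or None if no candidate responded with 200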

Once I've retrieved the sitemap index, I send it to a recursive function that finds all links from all sitemaps.

import itertools
import re

def dig_up_all_sitemaps(website):
    sitemaps = []
    index_sitemap = get_sitemap_paths_for_domain(website)

    def recursive(sitemaps_to_crawl=index_sitemap):
        current_sitemaps = []

        # Collect the child sitemaps referenced by each sitemap in this batch.
        for sitemap in sitemaps_to_crawl:
            try:
                child_sitemap = get_sitemap_links(sitemap)
                current_sitemaps.append([x for x in child_sitemap if re.search(r"\.xml|\.xml\.gz|\.gz$", x)])
            except Exception:
                continue

        current_sitemaps = list(itertools.chain.from_iterable(current_sitemaps))
        sitemaps.extend(current_sitemaps)

        # Stop once a level yields no further sitemaps, otherwise recurse into them.
        if len(current_sitemaps) == 0:
            return sitemaps
        return recursive(current_sitemaps)

    return recursive()

get_sitemap_paths_for_domain returns a list of sitemaps
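
get_sitemap_links isn't shown in the answer; a minimal sketch of such a helper, assuming a standard sitemap XML file with <loc> entries, could look like this (the implementation below is my own, not the author's):

import requests
from xml.etree import ElementTree

def get_sitemap_links(sitemap_url):
    # Hypothetical helper: fetch a sitemap and return every <loc> URL it lists,
    # whether it's a sitemap index or a regular URL set.
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ElementTree.fromstring(response.content)
    # The sitemap protocol uses an XML namespace, so match on the tag suffix only.
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.endswith("loc") and element.text
    ]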

Upvotes: 2

Bhavesh Mandloi

Reputation: 29

You should try using the urllib robotparser:

import urllib.robotparser

robots = "https://yourdomain.com/robots.txt"  # robots.txt URL of the domain you are checking

rp = urllib.robotparser.RobotFileParser()
rp.set_url(robots)
rp.read()
sitemaps = rp.site_maps()  # list of the sitemap URLs declared in robots.txt, or None if there are none

This will give you all the sitemaps declared in the robots.txt (note that site_maps() requires Python 3.8 or newer).

Most sites have their sitemaps listed there.

Upvotes: 0

Klaus-Dieter Warzecha

Reputation: 2335

Please take a look at the robots.txt file first. That's what I usually do.

Some domains offer more than one sitemap, and there are cases with more than 200 XML files.

Please remember that, according to the FAQ on sitemaps.org, a sitemap file can be gzipped. Consequently, you might want to check for sitemap.xml.gz too!
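
As a minimal sketch of that combined approach (check robots.txt first, then fall back to sitemap.xml and sitemap.xml.gz, decompressing the gzipped variant), something like the following could work; the function name and URL handling are my own placeholders, not code from the answer:

import gzip

import requests

def find_and_read_sitemaps(domain):
    # Hypothetical sketch: read the sitemaps declared in robots.txt first,
    # then fall back to the common sitemap.xml / sitemap.xml.gz locations.
    base = f"https://{domain}"

    candidates = []
    robots = requests.get(f"{base}/robots.txt", timeout=10)
    if robots.ok:
        candidates = [
            line.split(":", 1)[1].strip()
            for line in robots.text.splitlines()
            if line.lower().startswith("sitemap:")
        ]
    if not candidates:
        candidates = [f"{base}/sitemap.xml", f"{base}/sitemap.xml.gz"]

    for url in candidates:
        response = requests.get(url, timeout=10)
        if not response.ok:
            continue
        if url.endswith(".gz"):
            # Gzipped sitemaps have to be decompressed before parsing.
            yield url, gzip.decompress(response.content).decode("utf-8")
        else:
            yield url, response.text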

Upvotes: 1
