Swissa Yaakov

Reputation: 196

Save Sitemap XML Files Limited to 1000 URLs per File

How can I save several sitemap files, each limited to 1000 URLs, like sitemap1.xml, sitemap2.xml?

Basically I want to limit the foreach per file before calling file_put_contents.

My code is:

$sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
    <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
    <url>
    <loc>" . Yii::app()->getBaseUrl(true) . "</loc>
    <priority>1</priority>
    </url>
";
foreach ($websites as $website) {
    $sitemap .= "<url>
        <loc>" . $website['domain'] . "</loc>
        <priority>0.5</priority>
        </url>
    ";
}
$sitemap .= "</urlset>";
file_put_contents("sitemap.xml", $sitemap, LOCK_EX);

Upvotes: 1

Views: 1952

Answers (2)

hakre

Reputation: 197712

Let's create that application quickly:

  1. Create some template XML to which the websites are added.
  2. Chunk the $websites with the help of a NoRewindIterator and a LimitIterator.

Let's start with the second point and build it with fake URLs and fake XML output, just to see whether this is easy to wire up:

$limit = 3;

$urls = new ArrayIterator(range(0, 9)); // 10 Fake URLs
$urls->rewind();

$it = new NoRewindIterator($urls);

First we set a limit per file (here three, to keep it low for testing) and then we set up the data source for the URLs. Here those are 10 fake URLs: just the numbers from zero to nine.

The URLs are rewound explicitly because they are then wrapped into a NoRewindIterator, which never rewinds, and we want the data source rewound once at the start (this is not necessary for all iterators, but it is for quite a few, so we do it to be correct).

Because the NoRewindIterator blocks further rewind operations, we can keep taking chunks of size $limit from it. And that is exactly what is done now:

$fileCounter = 0;
while ($it->valid()) {
    $fileCounter++;

    printf("File %d:\n", $fileCounter);

    $websites = new LimitIterator($it, 0, $limit);
    foreach($websites as $website) {
        printf(" * Website: %s\n", $website);
    }
}

As long as $it is valid (read: as long as there are URLs to output), a new file is started (counting from one) and then up to three websites are iterated over via the LimitIterator. When that inner iteration is done, the outer loop continues until all website URLs have been consumed. The output is as follows:

File 1:
 * Website: 0
 * Website: 1
 * Website: 2
File 2:
 * Website: 3
 * Website: 4
 * Website: 5
File 3:
 * Website: 6
 * Website: 7
 * Website: 8
File 4:
 * Website: 9

This so far shows how to do the chunking (sometimes also called pagination). As the example shows, only the part about creating the XML documents is missing.

For creating an XML document you could concatenate a string; however, we don't do that. We use an existing library that does all of this perfectly well: DOMDocument. Here is an example of how to create a sitemap file with two exemplary locations within the urlset:

$doc = new DOMDocument();
$doc->formatOutput = TRUE;

$nsUri    = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$urlset = $doc->appendChild($doc->createElementNS($nsUri, 'urlset'));

$url = $doc->createElementNS($nsUri, 'url');
$location = $url->appendChild($doc->createElementNS($nsUri, 'loc', 'BASEURL'));
$priority = $url->appendChild($doc->createElementNS($nsUri, 'priority', '1'));

$urlset->appendChild(clone $url);

$priority->nodeValue = '0.5';
$location->nodeValue = 'TEST';
$urlset->appendChild(clone $url);

echo $doc->saveXML();

This code example shows how to create the document and how to add the elements with their proper namespaces to it. It also shows how to create a boilerplate <url> element that can be modified and added easily by cloning it.

The output of this example then is:

<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>BASEURL</loc>
    <priority>1</priority>
  </url>
  <url>
    <loc>TEST</loc>
    <priority>0.5</priority>
  </url>
</urlset>

So now all general problems have been solved. All that is needed is to weave these two parts together and to store the result to disk. I spare the latter part for this example's sake (pass a filename to DOMDocument::save(); saveXML() only returns the string) and output the XML instead:

<?php
/**
 * Save Sitemap XML Files Limit by 1000 URLs per each File
 *
 * @link https://stackoverflow.com/q/19750485/367456
 */

$limit = 3;

$urls = new ArrayIterator(range(0, 9)); // 10 Fake URLs
$urls->rewind();

$it = new NoRewindIterator($urls);

$fileCounter = 0;

$baseDoc               = new DOMDocument();
$baseDoc->formatOutput = TRUE;

$nsUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';

while ($it->valid()) {
    $fileCounter++;

    $doc = clone $baseDoc;

    $urlset = $doc->appendChild($doc->createElementNS($nsUri, 'urlset'));
    $url    = $doc->createElementNS($nsUri, 'url');

    $location = $url->appendChild($doc->createElementNS($nsUri, 'loc', 'BASEURL'));
    $priority = $url->appendChild($doc->createElementNS($nsUri, 'priority', '1'));

    $urlset->appendChild(clone $url);
    $priority->nodeValue = '0.5';

    printf("File %d:\n", $fileCounter);

    $websites = new LimitIterator($it, 0, $limit);
    foreach ($websites as $website) {
        $location->nodeValue = $website;
        $urlset->appendChild(clone $url);
    }

    echo $doc->saveXML();
}

The output then is in XML instead of plain text:

File 1:
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>BASEURL</loc>
    <priority>1</priority>
  </url>
  <url>
    <loc>0</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>1</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>2</loc>
    <priority>0.5</priority>
  </url>
</urlset>
File 2:
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>BASEURL</loc>
    <priority>1</priority>
  </url>
  <url>
    <loc>3</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>4</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>5</loc>
    <priority>0.5</priority>
  </url>
</urlset>
File 3:
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>BASEURL</loc>
    <priority>1</priority>
  </url>
  <url>
    <loc>6</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>7</loc>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>8</loc>
    <priority>0.5</priority>
  </url>
</urlset>
File 4:
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>BASEURL</loc>
    <priority>1</priority>
  </url>
  <url>
    <loc>9</loc>
    <priority>0.5</priority>
  </url>
</urlset>

So all that is left to do is to supply the original data source as an iterator at the very beginning, raise the number of URLs (the limit) to your own value, and set the correct base URL per file (if you really need that).
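
Put together with the question's code, that wiring could look like the following; a minimal sketch, assuming $websites is the array of rows from the question, that Yii::app()->getBaseUrl(true) supplies the base URL as there, and that the current directory is writable (the sitemapN.xml naming is my choice, nothing prescribes it):

<?php
$limit = 1000;

// Wrap the original data source so it can be chunked.
$urls = new ArrayIterator($websites);
$urls->rewind();
$it = new NoRewindIterator($urls);

$nsUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$baseDoc               = new DOMDocument();
$baseDoc->formatOutput = TRUE;

$fileCounter = 0;
while ($it->valid()) {
    $fileCounter++;

    $doc    = clone $baseDoc;
    $urlset = $doc->appendChild($doc->createElementNS($nsUri, 'urlset'));

    // Boilerplate <url> element, first used for the base URL entry.
    $url      = $doc->createElementNS($nsUri, 'url');
    $location = $url->appendChild($doc->createElementNS($nsUri, 'loc', Yii::app()->getBaseUrl(true)));
    $priority = $url->appendChild($doc->createElementNS($nsUri, 'priority', '1'));

    $urlset->appendChild(clone $url);
    $priority->nodeValue = '0.5';

    foreach (new LimitIterator($it, 0, $limit) as $website) {
        $location->nodeValue = $website['domain'];
        $urlset->appendChild(clone $url);
    }

    // save() writes to disk; saveXML() only returns the string.
    $doc->save(sprintf('sitemap%d.xml', $fileCounter));
}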

As far as XML sitemaps are concerned, you can also create one index file that links the other files. The limits for that are a bit higher IIRC; compare with: Multiple Sitemap: entries in robots.txt?.
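
For reference, such an index file is just another small XML document in the same namespace; a sketch, assuming the chunk files are named sitemap1.xml to sitemap4.xml and are served from http://example.com/ (both assumptions for illustration only):

<?php
$nsUri = 'http://www.sitemaps.org/schemas/sitemap/0.9';

$doc               = new DOMDocument();
$doc->formatOutput = TRUE;

$index = $doc->appendChild($doc->createElementNS($nsUri, 'sitemapindex'));

for ($i = 1; $i <= 4; $i++) {
    // Each <sitemap> entry points to one of the generated files.
    $sitemap = $index->appendChild($doc->createElementNS($nsUri, 'sitemap'));
    $sitemap->appendChild($doc->createElementNS($nsUri, 'loc', sprintf('http://example.com/sitemap%d.xml', $i)));
}

$doc->save('sitemap.xml');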

I hope this helps you achieve what you're looking for in a well-established way.

Upvotes: 5

Andrei Stanca

Reputation: 908

You can try a for loop (for ($x = 0; $x < 1000; $x++) { ... $websites[$x] ... }) or you can exit the foreach loop with an external counter variable like so:

$i = 1;
foreach ($websites as $website) {
    if ($i > 1000) { // break after the 1000th item ($i === 1000 would stop at 999)
        break;
    }
    $i++;

    # do your thing
}
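
If you need the separate files rather than just the first 1000 entries, array_chunk() pairs naturally with this counting idea; a minimal sketch that reuses the string building from the question (the sitemapN.xml naming is my choice, and htmlspecialchars() is added so URLs containing ampersands stay valid XML):

<?php
foreach (array_chunk($websites, 1000) as $i => $chunk) {
    $sitemap = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             . "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n";

    foreach ($chunk as $website) {
        $sitemap .= "<url><loc>" . htmlspecialchars($website['domain'])
                  . "</loc><priority>0.5</priority></url>\n";
    }

    $sitemap .= "</urlset>";

    // Chunks are zero-indexed, so file numbering starts at 1.
    file_put_contents("sitemap" . ($i + 1) . ".xml", $sitemap, LOCK_EX);
}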

Upvotes: 1
