yeller

Reputation: 31

Robots.txt pattern-based matching using data-driven results

Is there a way to create a pattern-based rule in the robots.txt file so that search engines can index our pages?

Our website has millions of records that we'd like search engines to index.

The indexing should be based on data-driven results, following a simple pattern: City + Lot Number.

The page that loads shows the city lot and related info.

Unfortunately, there are too many records to simply list them in the robots.txt file (it would be over 21MB), and Google imposes a 500KB limit on robots.txt files.

Upvotes: 1

Views: 42

Answers (1)

Stephen Ostermiller

Reputation: 25524

The default permissions from robots.txt are that bots are allowed to crawl (and index) everything unless you exclude it. You shouldn't need any rules at all. You could have no robots.txt file or it could be as simple as this one that allows all crawling (disallows nothing):

User-agent: *
Disallow:

Robots.txt rules are all "starts with" rules. So if you did want to disallow a specific city, you could do it like this:

User-agent: *
Disallow: /atlanta

Which would disallow all the following URLs:

  • /atlanta-100
  • /atlanta-101
  • /atlanta-102

But allow crawling for all other cities, including New York.
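Google and Bing also support two wildcard characters in robots.txt rules (now standardized in RFC 9309): * matches any sequence of characters, and $ anchors a rule to the end of the URL. That is the closest robots.txt gets to true pattern-based matching. As a sketch, assuming your URLs follow the /city-lotnumber pattern from your question, this would block lot 100 in every city:

User-agent: *
Disallow: /*-100$

Without the trailing $, the rule would match by prefix and also block /atlanta-1000, /atlanta-10050, and so on.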


As an aside, it is a big ask for search engines to index millions of pages from a site. Search engines will only do so if the content is high quality (lots of text, unique, well written), your site has plenty of reputation (links from lots of other sites), and your site has good information architecture (several usable navigation links to and from each page). Your next question is likely to be "Why aren't search engines indexing my content?"

You probably want to create XML sitemaps with all of your URLs. Unlike robots.txt, a sitemap can list each of your URLs individually to tell search engines about them. A sitemap's power is limited, however. Just listing a URL in a sitemap is almost never enough to get it to rank well, or even to get it indexed at all. At best, sitemaps can get search engine bots to crawl your whole site, give you extra information in webmaster tools, and tell search engines about your preferred URLs. See The Sitemap Paradox for more information.
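With millions of URLs you will also need more than one sitemap file: the sitemap protocol caps each file at 50,000 URLs (and 50MB uncompressed). The usual approach is to generate many sitemap files and reference them all from a sitemap index. A minimal sketch, using hypothetical example.com URLs:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>

Each referenced file then lists up to 50,000 lot pages:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/atlanta-100</loc>
  </url>
  <url>
    <loc>https://example.com/atlanta-101</loc>
  </url>
</urlset>

Assuming the index is served at https://example.com/sitemap-index.xml, you can point crawlers at it by adding a Sitemap: https://example.com/sitemap-index.xml line to robots.txt, or by submitting it in each search engine's webmaster tools.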

Upvotes: 1
