This document outlines the facilities provided by the "robots.txt" file and the "robots" meta tag for excluding search engine indexing of web documents.
These mechanisms are purely voluntary, relying on search engines to recognise the exclusions and act accordingly, so they are best suited to situations where it is desirable but not absolutely essential to prevent indexing by search engines.
The University's site-wide search engine (and many other search engines) will take note of exclusions specified through robots.txt or the robots meta tag, but bear in mind that as designed these mechanisms are inherently "fail unsafe". If an exclusion is recognised, the search engine will act upon it, but the slightest mistake or other problem may mean that the exclusion is not recognised as matching the intended documents, and they will then be indexed as if there had been no attempt at exclusion. See What can go wrong?.
Comments and questions about the topics covered in this document can be sent to search-support@uis.cam.ac.uk
The robots.txt file
This section is an introduction to using a robots.txt file to exclude particular web resources from indexing by search engines. The Standard for Robot Exclusion is the definitive specification (though not a formal standard). The facilities offered by robots.txt are limited and inflexible, and with hindsight, poorly designed. However, attempts at improvement have met with little support. [See also The Web Robots Pages for further information.]
When a search engine's indexing "robot" visits a Web site, say http://www.inst.cam.ac.uk/, it starts by trying to fetch http://www.inst.cam.ac.uk/robots.txt - note that the URL path is specified in lowercase, so the filename should be lowercase. Such checks are the reason why requests for /robots.txt may appear in a web server's log files even though no such file has ever existed. The frequency with which an indexing robot will check for changes to robots.txt while actively indexing a server is not specified by the Robot Exclusion Standard.
The only place that search engines will look for robots.txt is the server's "root" document directory, with URL path /robots.txt. Either the webmaster must coordinate exclusions with the content providers and update robots.txt as required, or some other approach must be used. One option (though less flexible) would be the "robots" meta tag described below; another would be to set up a means of collating requirements and updating robots.txt automatically from details maintained by the content providers.
The exclusions in robots.txt must be specified in a rigidly defined format, and any errors are likely to result in the definitions being ignored and the corresponding web documents being indexed, which may go unnoticed except through web server log entries or the documents' unexpected appearance in search results.
Comments (e.g. notes to people editing the file in future) can be added to lines in the file after a "#" character - that character and the rest of the line will be ignored. Apart from comments, the file must contain only a sequence of one or more groups of lines, each comprising one or more User-agent lines, one or more Disallow lines, and, when another group follows, one or more blank lines. The User-agent and Disallow labels are followed by ":" and then a user agent name or excluded URL path. Spaces are allowed after the ":" and at the end of the line. The User-agent and Disallow labels are case-insensitive.
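As an illustration of the layout (using "ExampleBot" as a hypothetical user agent name - the name required for the University's search engine is given later), a file with two groups might look like:

# Keep robots out of work-in-progress areas
User-agent: ExampleBot
Disallow: /drafts/

# All other robots
User-agent: *
Disallow: /drafts/
Disallow: /private/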
The Robot Exclusion Standard does not specify whether the user agent names and URL paths should be treated as case-sensitive when matching, leaving the choice to the search engine. For the User-agent, simply specifying it exactly as seen e.g. in log file entries should avoid any problems. For the URL path in Disallow exclusions, there is no obviously "right" choice (since URLs are defined to be case-sensitive, but in reality some servers are case-insensitive). Pragmatically, if the use of uppercase and lowercase in exclusions matches the file and directory names as they are in the web server's filesystem, that should work in all situations where the exclusions could work reliably. The Ultraseek server used by the University's site-wide search engine uses case-insensitive matching for Disallow lines.
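For example, if a directory appears in URLs (and in the web server's filesystem) as /Reports/, the Disallow line within a group is best written with matching case:

# Directory is /Reports/ on the server, so match that case
Disallow: /Reports/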
A group of exclusions can apply either to a specific search engine (or group of related search engines), or to all search engines not named in other definitions. This is specified by the User-agent lines, which must be the first lines in a group. The user agent must either be "*" (meaning all except explicitly-named user agents) or else the user agent name (or a distinctive part of it) that the search engine uses in the HTTP User-agent: header in its web requests (and often included in the corresponding web server log file entries). The definitions for "*" are completely ignored by search engines that find a matching named User-agent line (which may require some exclusions to be repeated, for each named User-agent).
For the University's site-wide search engine, the required User-agent line is
User-agent: University of Cambridge search (search-support@ucs.cam.ac.uk)
There must be at least one Disallow line following a User-agent line, and each Disallow line normally specifies one URL path prefix - if a document's URL starts with the path from any Disallow line for the matched User-agent, the document will not be indexed. The URL path for directory exclusions should end with "/" in order to avoid any possibility of false matches. For example, an exclusion for /help would match /helpful.html as well as everything in a directory /help, but using /help/ as the exclusion would match only the directory and its contents. As a special case, a User-agent line may be followed by a single Disallow line without a URL path, meaning that all URLs may be indexed by relevant search engines. That could be used to allow a named User-agent to index the whole of a server in spite of default exclusions applying to all other search engines.
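For example, the following sketch (again using the hypothetical user agent name "ExampleBot") allows that one robot to index the whole server while excluding all other robots entirely:

User-agent: ExampleBot
Disallow:

User-agent: *
Disallow: /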
NB In the following examples, the contents of the robots.txt would start in the first column of each line - any indentation by the web browser should not be copied.
To discourage indexing of the whole web server by all search engines, robots.txt need only contain
User-agent: *
Disallow: /
since all URL paths start with "/".
To exclude a selection of files and directories from indexing by all search engines, a single User-agent line followed by the required Disallow lines defining exclusions would be needed, for example:
# A comment line just to show what one looks like
User-agent: *
Disallow: /cgi/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /junk/
Note that you can specify the complete URL path for specific files if necessary, and that "/~" will match any "user pages" accessed using URL paths such as /~username/thing.html. Ending the Disallow line for a directory with "/" avoids false matches against similar but unrelated URLs.
Providing access controls can be trusted, most University sites should only need to use a robots.txt file, similar to the one above, to discourage all search engines from indexing unhelpful content. However for a belt-and-braces approach, webmasters might choose to provide separate configurations for the University's two indexing engines, and for all other search engines.
For example, the following robots.txt file discourages all search engines from indexing assorted unhelpful content, discourages everything except the engine maintaining the index used by University users from indexing /internal-only/, and discourages everything other than the engines maintaining both University indexes from accessing /local-only/. Note that some duplication is required because the 'User-agent: *' block does not apply at all to the explicitly-listed engines.
# local search engine
User-agent: University of Cambridge search (search-support@ucs.cam.ac.uk)
Disallow: /cgi/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /junk

# Everything else
User-agent: *
Disallow: /cgi/
Disallow: /tmp/
Disallow: /~
Disallow: /stats/
Disallow: /junk
Disallow: /internal-only/
Disallow: /local-only/
To discourage indexing by all search engines other than the University's, replace the final block with:
# Everything else
User-agent: *
Disallow: /
Adding a robots.txt file to a Falcon site
By default a Falcon site has a (nearly) blank robots.txt file – to modify it if you need to, do the following:
- Go to ‘Site setup’ > Zope Management Interface > portal_skins > plone_templates and click on robots.txt
- Click on the ‘Customize’ button
- You can now customize that file. As you see at the top of the page, the location for the editable file is /portal_skins/custom/robots.txt
- Remember to save changes. To get back to site pages, either edit the URL or click on the first ‘/’ of the path at the top of the page and select the ‘View’ tab.
The robots meta tag
If the information providers can neither update the robots.txt file nor request changes to it, they can use the robots meta tag to specify within an HTML page whether indexing robots may index the contents of the document and/or follow links from it to other documents. This is of limited use, since it can only be used in HTML documents, but it does not require changes to any robots.txt file. Additionally, each HTML document has to be fetched by a search engine before the meta tags can be seen and acted on - by contrast, documents excluded by a robots.txt file don't even need to be fetched. If there is also a robots.txt file, the exclusions there are processed first. The robots meta tag is much less widely supported by search engines than robots.txt.
All meta tags must be placed within the <head> section of the HTML. The name attribute must be "robots", and the content attribute contains a comma-separated list of directives to control indexing, chosen from
- INDEX or NOINDEX - allow or exclude indexing of the containing HTML page.
- FOLLOW or NOFOLLOW - allow or exclude following links from the containing HTML page.
- ALL - allow all indexing (same as INDEX,FOLLOW)
- NONE - no indexing allowed (same as NOINDEX,NOFOLLOW)
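For example, a minimal HTML page that asks robots not to index it but allows them to follow its links would carry the tag in its <head> section (the title and body content here are purely illustrative):

<!DOCTYPE html>
<html>
<head>
<title>Example page</title>
<meta name="robots" content="noindex,follow">
</head>
<body>
...
</body>
</html>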
In Falcon, robots meta tags can be added to pages and folders using the 'SEO Properties' tab in the editing toolbar.
The values of the name and content attributes are case-insensitive. Repeated or contradictory values should be avoided. The defaults are INDEX,FOLLOW, i.e. all indexing is allowed. Note that INDEX and/or FOLLOW cannot override exclusions specified in a robots.txt file, since an excluded document would not be fetched and the tag would not be seen. Also, the NOFOLLOW exclusion applies only to access through links on the page containing the tag - the target documents may still be indexed if the search engine finds links to them elsewhere.
Ignoring the "shorthand" ALL and NONE variants, the following examples show all the possible combinations:
<meta name="robots" content="index,follow"> <meta name="robots" content="noindex,follow"> <meta name="robots" content="index,nofollow"> <meta name="robots" content="noindex,nofollow">
Last updated: April 2021