Introduction
This document discusses how documents from web servers operated by the University of Cambridge are selected for indexing by the University's site-wide search engine and how the indexing "robots" (or "spiders") collect the information. Much of this will apply, to varying extents, in relation to other search engines.
While most of the eligible servers will be located and included in the search index automatically, action by a server's webmaster may be desirable or essential - e.g. to improve the effectiveness of searching, or to exclude documents which for one reason or another should not be included in the search engine's databases. See What webmasters could or should do.
Comments and questions about the topics covered in this document can be sent to search-support@uis.cam.ac.uk
Which servers are indexed?
The site-wide search engine is intended, as far as possible, to index and provide search facilities covering all the "official" web servers operated by the University - each institution's main web server, and any subsidiary servers that may exist (e.g. run by individual research groups or other sub-divisions of the institution).
The search engine's indexing "robots" (or "spiders") will examine and index suitable documents from:
- web servers that are added explicitly to its configuration, and
- other local web servers that it discovers through links from servers that it is already indexing and which have hostnames (as seen in the URLs) in recognised cam.ac.uk domains (excluding servers with *.private.cam.ac.uk hostnames).
Documents from normal web servers (http: URL prefix; whether running on the default TCP port (80) or not) and SSL secure web servers (https: prefix) may be indexed (unless that is prevented in some way).
Servers with names in the private.cam.ac.uk domain, and which therefore have CUDN-wide private addresses, are not eligible for indexing. Such servers are not accessible from outside the University and so including their content in an externally-visible index would be misleading and confusing for many people, with browsers asserting correctly that those sites do not exist in the DNS (which is true, they are visible only from within the CUDN).
Sites that use hostnames in domains other than recognised cam.ac.uk domains (including new domains) can be added to the index but will always need explicit configuration before they can be indexed (by request to search-support - see email address above), since their names will not match any of the standard patterns. Such servers may be located on, or could be moved to, networks other than the CUDN (for example they could be provided by a web site hosting company). Before they can be included in the index it must be obvious that such sites were created and are maintained by or on behalf of the University only (or a joint venture in which the University is the major party).
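The automatic discovery rule described above can be sketched as a simple hostname test. This is only an illustration: the function name is hypothetical, the hostnames in the usage example are made up, and treating every name under cam.ac.uk as a "recognised" domain is an assumption (the actual list of patterns is held in the search engine's configuration).

```python
from urllib.parse import urlsplit


def is_auto_discoverable(url: str) -> bool:
    """Return True if a linked URL could be picked up without explicit configuration.

    Assumption: any hostname under cam.ac.uk counts as "recognised",
    except names in the private.cam.ac.uk domain.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # urlsplit lower-cases the hostname and strips any port number.
    host = (parts.hostname or "").rstrip(".")
    if host == "private.cam.ac.uk" or host.endswith(".private.cam.ac.uk"):
        return False
    return host == "cam.ac.uk" or host.endswith(".cam.ac.uk")
```

For example, `is_auto_discoverable("http://www.example.cam.ac.uk:8080/")` is true (a non-default port does not matter), whereas a site under any other domain returns false and would need to be added by request.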
Which documents will be indexed?
Documents are automatically candidates for indexing if
- they are on an eligible web server, and
- the search engine is allowed access by the web server (without needing a password other than standard Raven authentication), and
- the search engine is not discouraged from indexing the documents via the robot exclusion mechanisms (robots.txt file and robots meta tag), and
- their content is of a type that the search engine is able to handle, and
- they can be found via links from other pages that are being indexed from eligible servers (with explicitly configured web site URLs as the ultimate starting point, e.g. if the search database has to be rebuilt from scratch), and
- if authentication (user login) is required to access a document, it must currently be Raven login (though it is expected that at some point in the future, Shibboleth login will also be supported), and access must either be allowed to all authenticated users or be granted to the search engine's special user "spider99". The user was previously "spider"; it was renamed to fit the CRSid format for Shibboleth.
NB Documents that require authentication are only fetched and included in the authenticated index, not included in the public index (hence are visible only to people who have Raven accounts and have chosen to log in to search.cam.ac.uk).
The search engine can index a fairly wide range of document types, not just HTML and plain text. Whether a document is suitable for indexing is determined by the Content-Type specified in the HTTP response headers and/or the filename suffix. It may be (or become) possible to index other document types, so it is worth asking about any that are significant on a site and are not listed below.
HTTP Content-Type value | Conventional file name suffix(es) | Handled as |
---|---|---|
text/html | .htm, .html | HTML |
text/plain | | plain text |
text/xml | .xml | XML |
application/msword | .doc, .dot, .docx, .docm | Microsoft Word |
application/vnd.ms-excel | .xls, .xlsx | Microsoft Excel |
application/vnd.ms-powerpoint | .ppt, .pptx | Microsoft PowerPoint |
application/vnd.visio | .vsd | Microsoft Visio |
application/pdf | | Adobe Acrobat |
application/vnd.oasis.opendocument.text | .odt | OpenOffice/LibreOffice Writer |
application/vnd.oasis.opendocument.spreadsheet | .ods | OpenOffice/LibreOffice Calc |
application/vnd.oasis.opendocument.presentation | .odp | OpenOffice/LibreOffice Impress |
image/gif | .gif | GIF images |
image/jpeg | .jpg, .jpeg | JPEG images |
image/png | .png | PNG images |
image/tiff | .tif, .tiff | TIFF images |
application/epub+zip | .epub | EPUB-format e-books |
application/vnd.ms-outlook | .msg | Microsoft Outlook mail message |
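As a rough illustration of how a Content-Type value and a filename suffix might be combined to choose a handler, here is a minimal sketch using a few rows of the table. The function name, the order of precedence (header first, then suffix) and the lower-casing of the suffix are all assumptions; the search engine's actual logic is not documented here.

```python
import posixpath
from urllib.parse import urlsplit

# A subset of the table above, keyed by media type and by suffix.
HANDLERS_BY_TYPE = {
    "text/html": "HTML",
    "text/plain": "plain text",
    "application/pdf": "Adobe Acrobat",
    "application/msword": "Microsoft Word",
    "image/png": "PNG images",
}
HANDLERS_BY_SUFFIX = {
    ".htm": "HTML",
    ".html": "HTML",
    ".doc": "Microsoft Word",
    ".docx": "Microsoft Word",
    ".png": "PNG images",
}


def pick_handler(url, content_type=None):
    """Return a handler name for the document, or None if no rule matches."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in HANDLERS_BY_TYPE:
            return HANDLERS_BY_TYPE[media_type]
    # Assumption: fall back to the suffix of the URL path, ignoring case.
    suffix = posixpath.splitext(urlsplit(url).path)[1].lower()
    return HANDLERS_BY_SUFFIX.get(suffix)
```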
What gets indexed from a document?
The text of each document is extracted and included in the search database, organised for efficient searching and ignoring the majority of "extraneous" material such as HTML tags. Punctuation is ignored completely, but the order of words is recorded so that searching for particular phrases is possible. For English text, the search engine applies knowledge of the language structure to allow varying forms of words to be matched automatically by search terms (e.g. singular/plural and other variations in word endings).
The search engine may find and attempt to follow relative or absolute URLs found in JavaScript scripting within a web page, but will not actually execute the JavaScript and therefore may see and attempt to use only fragments of URLs that the JavaScript would combine into a full URL if run in a browser. This may result in pages (or parts of a web site) that are not found, or bizarre requests (for incomplete or wildly implausible URLs) in the web server log files.
Better page design is the best solution to such problems, since those pages would be equally unhelpful to people using browsers lacking JavaScript (or with it turned off, e.g. for security reasons). For example, a <noscript> section in a page using JavaScript extensively could help both normal browsers and search engines. It may be possible to use meta tags (see below) to help improve access through search engines when page redesign is not an option.
In document types which distinguish a title (e.g. HTML <title>), that is used as the heading for the section in search results describing the document. The summary of the document for display in search results is derived by default from the initial text of the document, but ignoring components that are recognisably "structural" and unlikely to help in distinguishing documents. Alternatively, matching search terms may be shown with context from the containing document.
Note that non-HTML document types (e.g. Microsoft Word documents or PDF files) may have values for meta-data such as author and creation date that will be extracted and used by the search engine, which may be inherited from an unrelated (and possibly even confidential) document that was used as a basis for writing the current document. It is a good idea to check that settings in document metadata are not wildly misleading (e.g. a title copied across from something unrelated), or embarrassing, or leaking confidential details. Of course, such problems are not restricted to web search engines - anyone could look at such metadata in a document that is distributed in an electronic form.
The University search engine supports a special form of HTML comment to identify parts of a document that should be ignored when indexing. Some other search engines follow the same convention. HTML text following a <!--stopindex--> comment will be ignored until the next <!--startindex--> comment. This also has the effect of omitting that text from an automatically generated summary shown in search results. Links in sections marked in this way will still be followed. Note that the special comments must be exactly as shown - using e.g. <!-- stopindex --> with spaces between the comment delimiters and the stopindex or startindex will cause it to be ignored.
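The effect of the stopindex and startindex comments can be approximated with a short sketch. The function name is hypothetical and this is not the search engine's own code; it simply removes text from each exact <!--stopindex--> comment up to the next exact <!--startindex--> comment (or to the end of the document if there is none).

```python
import re

# Matches only the exact comment forms; variants with extra spaces are not matched.
_EXCLUDED = re.compile(
    r"<!--stopindex-->.*?(?:<!--startindex-->|\Z)",
    re.DOTALL,
)


def indexable_html(html: str) -> str:
    """Return the HTML with the stopindex/startindex sections removed."""
    return _EXCLUDED.sub("", html)
```

Given `"A<!--stopindex-->menu<!--startindex-->B"` this returns `"AB"`, while the same input written with `<!-- stopindex -->` (spaces inside the comment) is returned unchanged, mirroring the exact-match rule described above.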
The "keywords" and "description" HTML meta tags have special significance to the University's site-wide search engine and some others. The former provides a list of keywords that are likely to be used in searches for which the document would be an especially relevant match, and the latter provides a summary of the document which may be shown in search results, though normally extracts with words matching the search terms will be shown. See Using <meta> tags to enhance effectiveness for details.
The text of alt= attributes for <img> tags is included in the index, and the names and values of any <meta> tags are also recorded (not just meta tags with specific meaning to search engines).
Words found in the document title and the keywords or description meta tags are given extra weight (i.e. make the document a better match) compared with words in the document text, as they will or should be especially descriptive of the document.
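To see what a page offers in these fields, a webmaster could extract them with a few lines of code. The sketch below uses Python's standard html.parser module; the class name is hypothetical, and it only collects the title, the meta tag names and values, and the alt text of images. It makes no attempt to reproduce the search engine's weighting.

```python
from html.parser import HTMLParser


class IndexableFields(HTMLParser):
    """Collect the title, <meta> name/content pairs and <img> alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.alt_texts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Meta names are stored in lower case for easy lookup.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
```

Typical use is to create an instance, call `feed()` with the page source, call `close()`, and then inspect `title`, `meta["keywords"]`, `meta["description"]` and `alt_texts`.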
When are documents (re-)indexed?
When a new link is found while processing an HTML page, it is added to a queue for crawling (and subsequent indexing). When a new server is being indexed for the first time, it may take some hours to index completely if it has a very large number of documents (and that time may be spread over several days, interspersed with fetching pages from other sites to avoid excessive load on individual sites).
Once a document has been fetched and indexed for the first time, it will be re-checked periodically. If the document has been deleted (or access is no longer allowed) it will be dropped from the index, though a record will be maintained that it once existed. If it has been updated, it will be reindexed. Changes to server access controls or exclusions in a robots.txt file will not have any effect until a particular document is next re-checked.
The initial and subsequent requests to a particular web server will be spaced out in time (rather than one request immediately following another) and may also be alternated with requests to other servers, spreading them out further. For documents which are served with Last-Modified HTTP headers, the frequency of re-checking is varied according to how recently the currently-indexed copy was modified (newer documents checked sooner), and the indexer will use conditional GET HTTP requests so that only documents that have changed are actually fetched. Overall, this results in routine indexing being a low-intensity "background" activity except when a large collection of new or updated documents is discovered.
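A conditional GET is an ordinary GET request carrying an If-Modified-Since header. The sketch below only builds such a request with Python's standard library so that the header format can be seen; the function name and the example URL are hypothetical, and it is not the indexer's own code.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def conditional_request(url: str, last_modified: datetime) -> Request:
    """Build a GET request that asks for the document only if it has changed.

    last_modified must be a timezone-aware datetime (the Last-Modified
    value recorded when the document was previously fetched).
    """
    # HTTP dates are always expressed in GMT.
    stamp = format_datetime(last_modified.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

A server that honours the header answers with status 304 and no body when the document is unchanged, which is why such requests appear in log files with a 304 status.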
As currently configured, all documents should normally be re-checked (for changes to content, accessibility, etc.) daily with checks spread out through the day. Documents served without a meaningful modification timestamp (e.g. giving a "now" timestamp in all responses) will be subject to "best guess" handling of how often they should be re-checked.
What can go wrong?
Potential concerns fall into a number of categories:
- documents not being fetched for indexing, or not found by searches even though web server logs show them being fetched by the search engine.
- documents being fetched (or rejected attempts at fetching) in spite of attempts at excluding them from indexing.
- "overenthusiastic" or otherwise "odd" indexing behaviour
Web server log files may provide useful evidence of what indexing activity has been happening. In particular, the HTTP status code for requests can be very helpful (e.g. HTTP status 200 for normal, complete fetches; 304 for documents unchanged since last fetched; 4xx for failed requests, etc.).
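As an illustration of reading status codes from a log, the following sketch tallies requests by outcome. It assumes log lines in the common Apache access log layout, where the three-digit status follows the quoted request line; the function name and category labels are hypothetical, and selecting only the search engine's requests (e.g. by client address) is left to the reader.

```python
import re
from collections import Counter

# In the common log layout the status code follows the quoted request line.
_STATUS = re.compile(r'" (\d{3}) ')


def status_summary(log_lines):
    """Count log entries as fetched (200), unchanged (304), failed (4xx) or other."""
    counts = Counter()
    for line in log_lines:
        match = _STATUS.search(line)
        if not match:
            continue  # not a recognisable access log entry
        status = match.group(1)
        if status == "200":
            counts["fetched"] += 1
        elif status == "304":
            counts["unchanged"] += 1
        elif status.startswith("4"):
            counts["failed"] += 1
        else:
            counts["other"] += 1
    return counts
```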
Queries about problems or oddities with how the University's site-wide search engine is indexing local web servers can be sent to search-support@uis.cam.ac.uk
Documents not indexed, or not found by searches
Possible explanations for documents not being indexed by the search engine include:
- The web-site's hostname (as used in links to it) does not match any of the patterns allowed for automatic discovery, and explicit addition to the search engine configuration has not been requested. No requests would be seen in the web server log files.
- For a new web site or newly added/updated documents, the search engine has not yet found or visited them (but may do so later). No requests would be seen in web server log files.
- The web site refused, or was unable, to provide a copy of the document. In web server log entries, the type of HTTP status code e.g. 400 (bad request), 403 (forbidden) or 404 (not found) could help distinguish the type of problem. Likely causes include web server configuration errors, missing documents, or invalid links, among others.
- If user authentication (login) is required for access to a page, our search engine can only use Raven (and at some point in the future, Shibboleth) authentication, and access must be allowed either to all authenticated users or to the special user "spider99".
- Documents may be fetched by the search engine (as shown by web server logs), but are then found to be duplicates of documents already seen elsewhere (on the same server or a different one). In that case, only one copy is included in the search index, and other copies will not be shown in search results. This may result in searches that are limited to a specific server or part of a server not finding a document even though it exists there, if the sole indexed copy is elsewhere.
- Correctly or not, the search engine decided that a robots.txt file or robots meta tag indicated it should not index the documents even though web server access controls would allow fetching. For documents excluded by a robots.txt file, there would be no requests in the web server logs; for exclusions by robots meta tags, the documents would be fetched but omitted from the search database.
- Indexing large documents (especially e.g. PDF or Microsoft Word) can be very resource-intensive, and (generous) limits are imposed on the size of document that will be fetched and the amount of CPU time that can be used for indexing a single document. This may lead to incomplete indexing (or prevent indexing totally).
- HTML <link rel="canonical" > elements in a page's <head> section may cause confusion (to humans, not computers!), with pages being fetched but not then indexed. Such an element overrides the URL that will be stored by search engines for the containing page (notionally pointing to a preferred or definitive page for anyone looking for the current page's content, or similar). Also, sites may use it to declare an index page to be canonical for all the pages linked to from that index page - which is fine until the site is reorganised and the old index page is left in existence but no longer used or maintained, so that it becomes irrelevant to the pages for which it is the declared master resource (such that the pages themselves cannot be found).
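For the last cause in the list above, a quick check is to see whether a page declares a canonical URL other than its own. The sketch below uses Python's standard html.parser; the class and function names are hypothetical and the example URLs in the comments are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if (tag == "link" and "canonical" in rels
                and attrs.get("href") and self.canonical is None):
            self.canonical = attrs["href"]


def canonical_mismatch(page_url: str, html: str):
    """Return the declared canonical URL if it differs from page_url, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    finder.close()
    if finder.canonical is None:
        return None
    # A relative href is resolved against the page's own URL.
    target = urljoin(page_url, finder.canonical)
    return target if target != page_url else None
```

A non-None result means search engines may store the returned URL in place of the page's own, which is worth checking if the page is fetched but never appears in results.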
Note that web pages are fetched independently for the "internal" and "external" indexes (and potentially also by the test/development search server), and the fact that a document has been fetched for one index does not mean it will become available at the same time in both indexes. A newly-added or updated document may not appear in search results until the day after it is fetched from the web site, as the indexes are updated only after all the relevant pages (from all web sites contributing to that index) have been fetched. Documents that have been deleted (or rendered inaccessible to the search engine) will not disappear from search results until the index is next rebuilt.
Documents being indexed (or fetches attempted) unexpectedly
The search engine will attempt to fetch for indexing any documents on eligible web servers to which it finds links. This may lead to periodic attempts to fetch documents that the search engine is not allowed to fetch (and possibly also pages that have been deleted, based on links seen the last time the site was examined). If a page has been successfully fetched previously, it is likely to be fetched again the next time it is due for re-checking, without reference to any links on the web site itself. Consequently, removing all links to a page but leaving the page itself accessible on the site will not prevent it continuing to be found and indexed by search engines. Either remove it, or block access to it.
If documents are being successfully fetched and indexed (as deduced from web server log files or search results) in spite of server access controls (by hostname/IP address, password, etc.) that are expected to deny the search engine access, it usually implies the access controls are not set up correctly.
There is one situation where robots.txt etc. is specifically ignored by the Funnelback search software: the explicitly configured start URLs for web-crawling will always be tried, even if robots.txt would exclude them. The reasoning is that fetching the start URLs is more like a user telling a browser "fetch this page" than a web spider finding the URL itself and deciding independently whether to fetch the page, taking robots.txt exclusions into consideration.
The search engine has no special, privileged means to bypass a web server's access controls, though if amended access controls were only imposed very recently, documents could remain in the search database until re-checked, when access would be refused and they would be dropped from the index.
If documents are being included in the search database and found by searches in spite of entries in a robots.txt file or use of robots meta tags, the most likely explanation is that there are errors in the file or tags so that they are not being recognised by the search engine. NB Many ad hoc extensions to the robots.txt file format are not widely supported, and may be ignored.
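One way to test whether a set of rules says what was intended is to run it through an independent parser. The sketch below uses Python's standard urllib.robotparser; the function name and example URLs are hypothetical. Note the assumption involved: this shows how Python's parser reads the rules, which is a useful sanity check but not a guarantee of how any particular search engine will read them.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules exclude url for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

With the rules `User-agent: *` and `Disallow: /drafts/`, a URL under /drafts/ is reported as excluded. If the second line is misspelt (e.g. `Dissallow:`), the parser silently ignores it and nothing is excluded, which is the kind of error described above.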
Alternatively, the robots.txt file may be in the wrong place (it must be in the "root" document directory, with URL path /robots.txt; it will not be recognised elsewhere in a web site), or access to robots.txt may be denied by file permissions or server access controls (check the request status in the web server access and error logs).
Another possibility (seen with some local servers in the past) is that the web server may do something strange when responding to a request for robots.txt (for example muddling the characters used to mark line boundaries or treating the 'plain text' URL exclusion rules as HTML) such that the file contents appear garbled.
A further problem is that the final line of a robots.txt file may be ignored if it does not include a line end marker before the actual end of the file (many text editors prevent that happening, but others need great care to avoid it). This is difficult to detect, and one way to ensure that this can't have a significant effect would be to include a blank line, or a final comment, as the last actual line in the file.
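The delivery problems described above (garbled line ends, rules served as HTML, a final line with no line-end marker) can be checked mechanically on the bytes actually returned for /robots.txt. The following is a minimal sketch; the function name and the wording of the warnings are hypothetical, and the checks are deliberately simple.

```python
def robots_txt_warnings(raw: bytes) -> list[str]:
    """Return a list of likely problems in the raw bytes of a robots.txt response."""
    warnings = []
    if raw and not raw.endswith((b"\n", b"\r")):
        warnings.append("final line has no line-end marker")
    if raw.lstrip()[:1] == b"<":
        warnings.append("content looks like HTML rather than plain text")
    # Count each line-end convention separately: CRLF, lone CR, lone LF.
    crlf = raw.count(b"\r\n")
    lone_cr = raw.count(b"\r") - crlf
    lone_lf = raw.count(b"\n") - crlf
    if sum(1 for n in (crlf, lone_cr, lone_lf) if n) > 1:
        warnings.append("mixed line-end conventions")
    return warnings
```

An empty list does not prove the file is correct; it only means none of these particular symptoms was found.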
One problem specific to the "robots" meta tag is that the "nofollow" option will only have any effect in relation to the links in that HTML page. Links from other pages to the same target pages may be followed as normal; if there's any route that isn't marked as nofollow, the target page will be found (and indexed, unless it is marked noindex). The nofollow option is therefore not a very reliable way of discouraging search engines. For content providers who cannot influence the server's robots.txt file contents, it may be the only option for non-HTML documents (which cannot themselves supply robots META tags with the "noindex" option). Also, since pages have to be fetched before any robots meta-tags within them can be seen, the web site logs will inherently report the pages as fetched, even if they specify that they should not be indexed (whether or not that works).
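The weakness of "nofollow" can be illustrated with a toy model of a crawl. In this sketch a site is a dictionary mapping each page to its outgoing links and a flag saying whether the page carries a robots meta tag with "nofollow"; the function name and the page names in the usage example are hypothetical.

```python
from collections import deque


def reachable(start, pages):
    """Return the set of pages a crawler would find starting from start.

    pages maps each URL to a (links, nofollow) pair. Links on a page
    marked nofollow are not followed, but the page itself is still found.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        links, nofollow = pages.get(url, ((), False))
        if nofollow:
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

If /a.html (marked nofollow) and /b.html (not marked) both link to /secret.html, then /secret.html is still in the result: a single unmarked route is enough for the target to be found.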
The fundamental point to remember about robots.txt and the robots meta tags is that they were ad hoc inventions to deal with problems caused by early search engines roaming uncontrolled around the early (much slower!) web servers. They were not intended as, and never could be, reliable access controls since compliance is voluntary on the part of the search engines. In short, if leakage of information really matters, proper access controls are needed. robots.txt and the meta tag are more appropriate where exclusions are purely to avoid inappropriate clutter in search results or overloading the web site with unnecessary requests.
On a related point, absence of links to a document will not prevent it remaining in the search database once it has been found, if the search engine is allowed to fetch and index it (not prevented by server access controls or the robot exclusion mechanisms). When checking for changes to a document, the search engine simply uses the URL by which it was fetched originally. This can be a particular problem for sites, or sections of sites, or individual pages, that have been replaced but which remain accessible under their old URLs for archive or reference purposes. The search engine will retain the old pages in its index even if there are no direct links to them, and may return references to them in search results. This can be very misleading if the pages become out-of-date or inaccurate and steps should be taken to exclude such material from indexing - see What should be excluded from indexing?.
"Over-enthusiastic" indexing
There are several situations in which the search engine may appear to be indexing "over-enthusiastically", either fetching the same documents repeatedly or fetching a seemingly endless series of bizarre variant URLs. In the cases seen, the explanation has been due in one way or another to the same document being accessible through multiple distinct URLs.
A fundamental point to bear in mind is that except for some limited assumptions that can be made about "/" dividing an HTTP URL path into distinct components, the URL - as defined in the HTTP standards - is a case-sensitive "black box". Only the web server can know what it really means, and the components in the URL path may not correspond to real directories and files in a filesystem. To a web client (browser, search engine, whatever), two URLs can be treated as equivalent only if they are textually identical (after insertion of implicit defaults such as ":80" for the TCP port, and canonicalisation of any /../ or /./ components).
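The comparison rule in the previous paragraph can be sketched as follows: make the default port explicit, resolve any /./ and /../ components, and then compare the results as plain text, with no case folding of the path. The function names are hypothetical and the sketch assumes absolute http: or https: URLs.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." components of a URL path."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if len(out) > 1:
                out.pop()  # never climb above the root
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def canonical_url(url: str) -> str:
    """Return a form of url suitable for textual comparison."""
    parts = urlsplit(url)
    # urlsplit lower-cases the scheme and hostname; the path is left as-is
    # because only the web server knows whether case matters there.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = f"{parts.hostname}:{port}" if port else parts.hostname
    path = remove_dot_segments(parts.path) or "/"
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```

Under this rule http://host/a/../b/ and http://host:80/b/ are the same URL, but http://host/Docs/ and http://host/docs/ are not, even if the server happens to return the same page for both.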
For a search engine, all textually distinct links must be followed (subject to the usual controls on what is allowed to be fetched and indexed). While the University's site-wide search engine recognises duplicate documents and omits them from the search database, relative links from a document found through a variant URL will be interpreted in the context of that URL and may themselves appear to be distinct new links that should be followed. Overall, this may have the appearance of a very large or infinite number of URLs to fetch and index the same pages repeatedly.
The typical causes seen for "repeated indexing" problems are:
- UNIX symbolic links (or equivalent) providing multiple routes to the same directories.
- Web servers which have case-insensitive filesystems and carry that through to treating URLs as case-insensitive are potentially a big problem, since links on such servers are rarely consistent about their use of upper/lowercase (since they are interchangeable - on that site), leading to the appearance of multiple distinct routes throughout the server.
- Web server hostnames providing multiple routes to the same directories. Less likely to be used extensively, but may exacerbate problems due to other causes if alias hostnames provide multiple routes to directories suffering from the other problems.
The subtly distinct "endless indexing" problems seen so far have been due to a related cause. Some web servers support the inclusion of extra "dummy" URL path components after the part identifying a particular web page, CGI script, etc. The server matches the URL to what is in the filesystem, and the part beyond the "real" path is passed to whatever is generating the document and may be used to decide what content will be shown.
The examples seen all involved more-or-less "dynamic" web pages, generated by "active" components such as CGI scripts, or by the web server (e.g. pages with "server-side include" (SSI) processing). Adding path components to the URL of a static document (corresponding directly to a file) is likely to be rejected by the web server. The Apache web server allows extra path components for dynamic pages, including SSI pages as well as CGI scripts etc., and passes the extra components as environment variable PATH_INFO. Authors of active web components need to ensure unwanted PATH_INFO data is handled as an error - though achieving that could be difficult for SSI pages.
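As an illustration of treating unwanted PATH_INFO data as an error, here is a minimal sketch written as a Python WSGI application, where the extra path components arrive under the same PATH_INFO name. It assumes the application is mounted by the web server at the page's own URL, so that anything in PATH_INFO is by definition surplus; the function name and page content are hypothetical.

```python
def page_app(environ, start_response):
    """Serve one dynamic page, rejecting any extra path components."""
    extra = environ.get("PATH_INFO", "")
    if extra:
        # Unexpected trailing components: refuse rather than serve the
        # same page under yet another URL.
        body = b"Not found\n"
        start_response("404 Not Found", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    body = b'<html><body><a href="sub/other.html">Other</a></body></html>\n'
    start_response("200 OK", [
        ("Content-Type", "text/html"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

A CGI script can apply the same test to the PATH_INFO environment variable before producing any output.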
Serious problems can arise if a search engine finds a mangled link somewhere, with "junk" already appended to the valid URL of a dynamic HTML document. Unless requests with unexpected and irrelevant PATH_INFO data (or whatever the particular web server calls it) are rejected, the search engine will interpret any relative links in the resulting HTML as relative to the full URL, including the junk. It has no way to know what is meaningful in the URL path and what should be omitted; it can only follow the standard rules for interpreting relative URLs. This results in an infinite series of requests with ever longer URLs, as "yet another" duplicate document is found - the same one as for all the earlier requests - and the same relative links are followed but interpreted relative to the newly-extended and never-before-seen URL.
In reality, that would probably not continue forever - due to search engine limits on either the URL length or the number of components in a URL path. It can easily result in tens or hundreds of thousands of pointless requests before that happens, though.
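The growth of URLs described above follows directly from the standard rules for resolving relative links, and can be reproduced with Python's urljoin. In this sketch the server is assumed to return the same page (containing the same relative link) for every URL in the series; the function name, the example URLs and the max_length limit are hypothetical stand-ins for whatever limit a real search engine applies.

```python
from urllib.parse import urljoin


def runaway_urls(start_url: str, relative_link: str, max_length: int = 120):
    """Return the series of URLs a crawler would request, starting from start_url.

    Each step resolves the same relative link against the previous URL.
    The series stops when a URL repeats or exceeds max_length characters.
    """
    urls = [start_url]
    while True:
        nxt = urljoin(urls[-1], relative_link)
        if nxt in urls or len(nxt) > max_length:
            break
        urls.append(nxt)
    return urls
```

Starting from http://www.example.cam.ac.uk/page.shtml/junk with the relative link sub/other.shtml, the series runs .../page.shtml/sub/other.shtml, then .../page.shtml/sub/sub/other.shtml, and so on until the length limit is reached. A relative link with no "/" in it (e.g. other.shtml) does not grow in this way, which is why only some pages trigger the problem.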
There may be no obvious, simple solution to "repeated or endless indexing" problems. It may be best to seek advice as soon as the problem is noticed (not least to get the indexing stopped and allow time for thought!).
Last updated: April 2021