This document discusses what action may be needed by local webmasters to ensure that the search facilities provided by the University's site-wide web search engine work well for the documents on their server, while also ensuring the exclusion of documents which (for one reason or another) should not be found by searches. Many of the issues apply, to varying extents, in relation to other search engines.
While most of the eligible servers (see [will need fixing]) will be located and included in the search index automatically, action by a server's webmaster may be desirable or essential - e.g. to improve the effectiveness of searching, or to exclude documents which for one reason or another should not be included in the search engine's database. The issues that need to be considered include:
- Ensuring the server will get indexed without being dependent on discovery through links from other servers, and also that it will get a meaningful description in search results for site-specific searching.
- Ensuring the exclusion of documents which should not be indexed for a variety of reasons.
- Ensuring that any major reorganisation of part or all of the server content is reflected quickly by the search engine, rather than leaving it with results containing outdated and invalid link URLs.
That assumes the server should be indexed, but there may be situations where a server should not be indexed (e.g. the server is "under construction", or is an institutional "intranet" server that rejects all requests from outside the institution, making indexing attempts a waste of time). In such cases the site-wide search engine can be configured to exclude the server from indexing.
Ensure your pages show a 'Last modified' date
To show a 'last modification' date in the search results, a page needs either a 'last updated' date in a metadata field in the <head> part of the page, or to be served with a Last-Modified HTTP header. The latter is typically added automatically by your web server for static content, but may be missing for content created dynamically (including much content served from content management systems). Either of the following date metadata fields is recognised:
<meta name="date" content="2013-09-01T12:00:00Z">
<meta name="dc.date" content="2013-09-01T12:00:00Z">
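One way to confirm that a page carries a usable date is to check for both sources described above. The following is a minimal sketch, not the search engine's own logic: the function name `find_modification_date` is hypothetical, and checking the meta tag before the header is an arbitrary choice, since the order the search engine actually uses is not stated here.

```python
from html.parser import HTMLParser

# Meta names taken from the two examples above; compared case-insensitively.
DATE_META_NAMES = {"date", "dc.date"}


class DateMetaParser(HTMLParser):
    """Records the content of the first recognised date meta tag."""

    def __init__(self):
        super().__init__()
        self.date = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.date is not None:
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in DATE_META_NAMES and attr.get("content"):
            self.date = attr["content"]


def find_modification_date(headers, html):
    """Return (source, value) for the first date found, or None.

    `headers` is a mapping of HTTP response header names to values and
    `html` is the page source. The meta-before-header order is an
    assumption made for this sketch only.
    """
    parser = DateMetaParser()
    parser.feed(html)
    if parser.date:
        return ("meta", parser.date)
    for key, value in headers.items():
        if key.lower() == "last-modified":
            return ("header", value)
    return None
```

A result of `None` for a dynamically generated page suggests that neither source is present, so no date would be available for that page in search results.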
We strongly recommend contacting us when setting up a new web server that will be indexed by the search engine. We would then add it to both the search engine's list of starting points for indexing (so it will not depend on links being found from other servers), and to the Search individual servers page providing searches limited to particular servers. The latter ensures meaningful descriptions in search results using search links there or equivalent "packaged" searches using search links or forms from other servers (see ).
If the server is still "under construction", we can exclude it from indexing so it will not be discovered and then publicised accidentally in search results, though it may be preferable to use server access controls to limit access. In such cases, remember to tell us when to remove the exclusion and set it up for indexing!
When a server (or part of a server) is reorganised so that many documents are moved or deleted, it is helpful if the server is re-indexed immediately rather than waiting for routine re-checking to discover the changes. See Adding or deleting documents explicitly for details of how to achieve this.
NB: Search engines vary in how they apply "weighting" to parts of documents, and how (or if) they use information from meta tags. The comments here apply specifically to the University's site-wide search engine, but to varying extents will also help indexing by other search engines.
Words in the title of HTML pages (and other document types with distinct titles) are given extra "weight" (importance) when matching search terms to documents in the search database, and assigning relevance rankings ("scores") to the matches. The title and document summary also provide the only information in the search results to help users identify which documents are relevant and worth looking at more closely. Therefore, the title should be worded both to identify the scope of the document and to help distinguish it from other documents. For the distinctive words in the title, it would be best to use "obvious" words (likely to be used as search terms for which the document would be a good match) rather than more obscure synonyms, though perhaps avoiding words where variant spellings (e.g. UK versus US English) could result in searches failing to find the document.
The search engine can use information from two HTML meta tags ("keywords" and "description") to provide additional keywords and to provide a document summary to be used instead of one derived from the start of the document text. The default handling is sufficient in many cases (i.e. document summary derived from the start of the document, and searches simply matching against the words of the document), and there is little point adding meta tags that simply duplicate what is already there. In other cases, a replacement summary may help with deciding which results to look at more closely (e.g. if the start of the document has little coherent text), and adding extra keywords (e.g. synonyms or near synonyms for technical terms relating to the document's subject matter) may help people to find documents that are relevant in spite of not mentioning their search terms in the text. The content of these meta tags is also given extra weight when matching, as it is expected to be especially descriptive of the document's content.
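To review what a page currently offers in its title and in the "keywords" and "description" meta tags, something like the following sketch could be run over the page source. The function name `indexing_hints` and the sample page are hypothetical, and the sketch only reports what is present; it does not reproduce how the search engine weights these fields.

```python
from html.parser import HTMLParser


class IndexingHintsParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def indexing_hints(html):
    """Return the title, keywords and description found in `html`.

    Missing meta tags are reported as None so that gaps are easy to spot.
    """
    parser = IndexingHintsParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords"),
        "description": parser.meta.get("description"),
    }


# Hypothetical page: the keywords add a synonym not used in the title.
SAMPLE_PAGE = (
    "<html><head><title>Booking annual leave</title>"
    '<meta name="keywords" content="holiday, vacation">'
    '<meta name="description" content="How staff book annual leave.">'
    "</head><body><p>Text of the page.</p></body></html>"
)

if __name__ == "__main__":
    print(indexing_hints(SAMPLE_PAGE))
```

An empty title, or a description that merely repeats the opening text of the page, would be the cases worth attention according to the guidance above.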
There are several issues to consider:
- There may be documents which would contribute little or nothing useful when matched by searches, serving only to clutter the results.
- Indexing old, superseded or archive pages may give a misleading impression if such pages are found 'out of context' as a result of a search. If the status of all such pages is not clearly indicated then it might be best to exclude them from indexing. Note that the search engine may continue to index pages that it has previously found even if there are no longer any links to those pages.
- Indexing URLs that invoke CGI scripts etc. may be pointless and/or wasteful of resources, though that depends on the circumstances and scripts. Dynamically generated pages with links to further dynamic pages would often be particularly important candidates for exclusion.
There may be documents which should not be included in the search database for "policy" reasons (e.g. sensitive documents not intended for Internet-wide access). This is by far the most important consideration since, unlike the others, it may be difficult or impossible to undo the harm caused by a mistake. However, this consideration should almost always be addressed by controlling access to the document rather than by simply manipulating indexing. Provided access controls have been configured correctly, the search engine will automatically end up accessing only appropriate information.
It is important to keep these points in mind when updating the content or configuration of a web server, and not forget about them after setting it up initially. Do not, for example, reorganise the server content without amending access and/or indexing controls to match!
The robots.txt file and robots meta tag (see ) provide the most straightforward way of excluding particular documents from indexing while allowing normal, non-password access by web browsers. However, they are voluntary mechanisms, relying on the search engines to comply with the exclusions. There are a variety of pitfalls which can render robots.txt and/or robots meta tags ineffective (see [link needs to be fixed]).
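Before relying on a robots.txt file, it can help to test its rules against a few sample URLs. The sketch below uses Python's standard `urllib.robotparser` module; the rules, the host name `www.example.org` and the helper name `is_indexable` are hypothetical, and a compliant search engine's own interpretation of the file may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content excluding an archive area and CGI scripts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /cgi-bin/
"""


def is_indexable(robots_txt, url, user_agent="*"):
    """Return True if the given rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.org/index.html",
        "http://www.example.org/archive/2001/report.html",
        "http://www.example.org/cgi-bin/search?q=test",
    ):
        status = "allowed" if is_indexable(ROBOTS_TXT, url) else "excluded"
        print(url, status)
```

This only shows what the rules say. Because the mechanism is voluntary, it does not prevent access by a robot that ignores robots.txt, which is why sensitive documents need access controls instead.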
Once you have arranged to exclude documents from indexing, there are several ways to check that the exclusion has worked, initially or to spot-check subsequently (in case something has changed and the exclusion is no longer working). Bear in mind that if the documents have already been indexed, any changes to exclusions or access control will not take effect until they are next reindexed (routinely or in response to a request for immediate re-indexing).
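As one possible spot-check for pages excluded with a robots meta tag, the page source can be scanned for a `noindex` directive. This is a sketch under stated assumptions: the helper name `has_noindex` is hypothetical, it looks only at `<meta name="robots">`, and it says nothing about whether the page is still present in the search database from an earlier indexing run.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and compared case-insensitively.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def has_noindex(html):
    """Return True if the page source asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Running such a check after each reorganisation of the server content gives an early warning if a template change has silently dropped the tag.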
Last updated: August 2015