Webmasters, and others maintaining information within the University, need to understand more about how our search service is provided than other users do. The search service is provided by one or other of a pair of servers running the Funnelback search software, accessed via the web user interface on search.cam.ac.uk. Another server is used for test and development work; it does not provide any public facilities, though it may be seen in web server logs making requests to web sites.
In contrast to the search servers underlying our previous-generation, Ultraseek-based search service (prior to February 2014), all the web-crawling (fetching pages from web sites) and index-building is done by one or other of the pair of live servers. One server normally does all of that work and also handles search requests received via search.cam.ac.uk; the other holds copies of the latest search index files so that it can temporarily take over handling search requests if required.
The choice of which Funnelback server to use is made by search.cam.ac.uk, depending on their availability. The test/development server does similar web-crawling and indexing, though on a much smaller scale.
The User-Agent string used by the Funnelback servers in their HTTP(S) requests to web sites is:
University of Cambridge search (firstname.lastname@example.org)
In contrast to the previous-generation (Ultraseek-based) search servers, the same User-Agent string is used for both "internal" requests (from within the CUDN) and "external" requests.
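Because the same User-Agent string is used for all requests, it can be used to pick out search-crawler traffic in web server logs. The following is a minimal sketch only. It assumes the logs are in the common "combined" format, where the User-Agent is the last double-quoted field on each line, and the names `CRAWLER_UA_PREFIX` and `is_search_crawler` are illustrative rather than part of any University tooling. Only the leading text of the string is matched, not the contact address in parentheses.

```python
import re

# Leading text of the User-Agent string quoted above. The contact address
# in parentheses is deliberately not matched.
CRAWLER_UA_PREFIX = "University of Cambridge search"

# Assumption: "combined" log format, in which the User-Agent is the last
# double-quoted field on the line.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def is_search_crawler(log_line: str) -> bool:
    """Return True if the log line's User-Agent looks like the search crawler."""
    match = _LAST_QUOTED_FIELD.search(log_line)
    return bool(match) and match.group(1).startswith(CRAWLER_UA_PREFIX)
```

A User-Agent string can be set by any client, so a match of this kind is best treated as a hint for log analysis rather than as a basis for access control.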
IP addresses and access controls
Web-crawling (fetching web pages for indexing) is normally done by the primary member of the pair of servers providing the live search service, and on a much smaller scale by the test/dev server. The hostnames and IP addresses may change over time, but as of July 2016 are:
- live "internal" web-crawling (from within the CUDN) - usually the primary server fb1.search.cam.ac.uk = 184.108.40.206 (or, very rarely, the backup server fb2.search.cam.ac.uk = 220.127.116.11). Such traffic might also be seen as coming from fb.search.cam.ac.uk and its address 18.104.22.168.
- test/development "internal" web-crawling - fbdev.search.cam.ac.uk = 22.214.171.124
For "external" web-crawling (to find web pages that are accessible to the general public from outside the CUDN, and without authentication), an alternative pair of addresses is used: 126.96.36.199 and 188.8.131.52, sharing a dummy place-holder name (ext-proxy.web-search.invalid in DNS reverse lookup). Those "external" addresses are associated with the University, but are within an address block used for infrastructure and are not advertised as being part of the CUDN. They should not be treated as part of the CUDN (or as *.cam.ac.uk hosts) for access control purposes.
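The distinction between the two groups of addresses can be sketched as a small lookup, for example when analysing logs or deciding which access rules apply to a request. This is an illustration only: the addresses are those listed above as of July 2016 and may change, and the names `INTERNAL_CRAWLERS`, `EXTERNAL_CRAWLERS` and `classify_crawler` are hypothetical.

```python
import ipaddress

# Addresses as listed above (July 2016); they may change over time.
INTERNAL_CRAWLERS = {
    "184.108.40.206": "fb1.search.cam.ac.uk",
    "220.127.116.11": "fb2.search.cam.ac.uk",
    "18.104.22.168": "fb.search.cam.ac.uk",
    "22.214.171.124": "fbdev.search.cam.ac.uk",
}

# The "external" crawl addresses. These are NOT to be treated as part of
# the CUDN for access control purposes.
EXTERNAL_CRAWLERS = {"126.96.36.199", "188.8.131.52"}


def classify_crawler(address: str) -> str:
    """Return 'internal', 'external' or 'other' for a client IP address."""
    # Parsing normalises the text form and raises ValueError on bad input.
    ip = str(ipaddress.ip_address(address.strip()))
    if ip in INTERNAL_CRAWLERS:
        return "internal"
    if ip in EXTERNAL_CRAWLERS:
        return "external"
    return "other"
```

Keeping the "external" addresses in a separate set makes it harder to grant them CUDN-only access by accident, since they have to be handled as an explicit case.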
Note that there are some documents that the search engines will not be able to index. There can be many reasons for this, but in particular it will be the case for documents that can only be accessed from particular department or college networks, or which are password protected and limit access to specific people. Such documents will not appear in either index and as a result will be difficult for University users to find. Wherever possible and appropriate, there are significant advantages to allowing at least the "internal" index to include restricted material. Note, however, that extracts of the document text may be included in search results, to allow its relevance to the search to be assessed.
Last updated: July 2016