Introduction
In this educational mystery you will learn about search engines and how search engines use inverted indexes. You will retrieve all the information you need to do that on this very page with no hidden content. Any links given are just there for the benefit of the interested puzzle solver.
This mystery cache comes from the joint effort of Abelmann and Palindromus.
Uses of a search engine
A search engine is well known today to all users of the Internet. But Palindromus still remembers holding his first Internet courses in 1994/1995, explaining the wonders of Gopher to people. Ah, those were the days :-)
Examples of search engines
Common web search engines used today are Google and Bing. When using these web sites, you can type in one or more keywords and retrieve hits in a number of web sites all over the Internet that contain these search words. One early web search engine worthy of mention is AltaVista, which started up in 1995. AltaVista quickly took over as the most popular way to search from Gopher (1991).
Other web sites that heavily use search engines are Amazon and Best Buy. These are typical search-based commerce sites where the company owning the web page is selling something and making those products available to consumers mainly through search. A good search experience can make a huge impact on the commercial success of these companies and they have invested heavily in search technology over the last ten years.
A third type of search engines are search engines used in yellow pages solutions such as "Gule Sider". They allow for efficient search in telephone directories, usually organized by categories.
Finally we have intranet search engines (aka enterprise search engines). Any medium or large corporation will have tens or hundreds of different backend computer systems (CRM systems, billing, order, warehouses, employee databases, HR, finance, etc). Having a centralized solution that provides search results across all these can provide significant productivity improvements to the employees.
Search engines in Norway
Fast Search & Transfer (FAST) was founded in Trondheim, Norway, in 1997. FAST created a successful competitor to AltaVista and Google called AllTheWeb. This site and accompanying technology was sold to a company called Overture in 2003, which later was acquired by Yahoo. From 2003 to 2008, FAST focused on enterprise search solutions, and in April 2008 FAST was aqcuired by Microsoft.
Kvasir is another Norwegian search engine. It was founded in 1995 in Oslo, and has grown significantly over the years. Today it is owned by the Swedish Eniro AB.
Intellisearch is yet another Norwegian company in the search business, and focuses on enterprise search. It was founded in 2003 and has customers such as Norconsult and Expert.
Overview of a search engine
A search engine typically consists of the main components shown in the below diagram.
A crawler gathers documents and binary files from various sources such as web servers, file servers, database and similar. The output from crawling goes into a document processing component. The document processing component extracts all useful text and metadata from the documents (e.g. a 1 MB .DOC file might be turned into a 50 kilobyte text chunk instead). This raw text is then send to an indexing component. The indexing component takes that raw text and generates inverted indexes, which is the data structure required to respond to searches.
Whenever a user (real person or other computer system) types or sends in a query to a search engine, a query processing component first examines the user input and rewrites the query. This may include expanding synonyms into multiple terms, such that if you type in "lorry" you can also get a hit for "truck", or adding plural forms so that a query for "car" also can find documents containing "cars". A rewritten form of the query is then send to a searching component. The search component uses the inverted index produced by the indexing component earlier to find all documents in the index that match the given query. It then sorts these documents with the presumed best ones first (this process is known as ranking) and returns a result set with hits that are presented to the user.
Inverted index
The inverted index is a structure used during search evaluation. It contains a mapping from words to documents. This is where the name inverted comes from. Using this structure it is possible to perform quick full text searches, i.e. finding all documents that contain certain words.
The basic inverted index is built as a sorted list of words, each with a list of the documents that contain that word.
Consider the following documents:
Document 1 = "A quick brown fox"
Document 2 = "A dark horse"
Document 3 = "A quick horse"
The inverted index for this set of documents could be
Word | Documents
=================
a | 1,2,3
brown | 1
dark | 2
fox | 1
horse | 2,3
quick | 1,3
By using the above inverted index it is possible to find documents that contain both of the words "dark" and "horse". This would be calculated by first finding the list for "dark", i.e. {2} and intersecting this with the list for "horse", i.e. {2,3}, yielding the resulting list {2}.
However, the above inverted index cannot handle phrase searches such as "dark horse" where those words need to be adjacent. To deal with such queries it is necessary to extend the inverted index to not only contain the document ids, but also the word positions of all word occurrences in those documents. Such an inverted index for our example would look like this:
Word | Documents and positions
===============================
a | (1,1),(2,1),(3,1)
brown | (1,3)
dark | (2,2)
fox | (1,4)
horse | (2,3),(3,3)
quick | (1,2),(3,2)
It will usually be more efficient to store the words of the inverted index in a separate structure called dictionary and just use the word numbers in the inverted index itself. This allows for loading the dictionary into memory as a binary tree and have pointers from there into the list of words.
Side note on performance of inverted indexes: The downside of an inverted index is that adding, deleting or updating one or more documents is an expensive operation, much more expensive than doing the same operations in a typical relational database. Inverted indexes are usually created as read-only structures, so to add a document one would instead create an additional smaller inverted index containing this document (hopefully a bunch of documents) and later merge the indexes. Deletes may be handled by an additional list of deleted documents accompanying each index, and updates may be handled as a combination of add and delete.
The cache - your inheritance
You recently inherited an old PC from your eccentric uncle. He used to work for Norsk Data and later Kvasir.

At first you are rather disappointed with the old crappy hardware, and plan to recycle it. But, then the following note glued to the cabinet catches your attention and you decide to check it out anyway:

The machine won't start up, but you mount the unencrypted harddrive into another machine and examine the files. Inside a folder called "mytrea~1" you find some interesting files that resemble the inverted index file structures you just have read up on.
dict.txt
No | Word
=================
01 | .
02 | ?
03 | and
04 | at
05 | be
06 | cache
07 | can
08 | count
09 | do
10 | eight
11 | five
12 | for
13 | four
14 | fun
15 | happy
16 | have
17 | is
18 | nine
19 | not
20 | one
21 | remember
22 | searching
23 | seven
24 | six
25 | ten
26 | the
27 | three
28 | to
29 | two
30 | worry
31 | you
32 | zero
inverse.txt
No | Occurrences
=================
01 | (1,4),(1,7),(3,21)
02 | (2,4)
03 | (3,12)
04 | (3,4)
05 | (1,5)
06 | (1,13),(3,2)
07 | (2,1)
08 | (2,3)
09 | (1,1)
10 | (2,12)
11 | (2,9),(3,5),(3,7),(3,16),(3,20)
12 | (1,11)
13 | (2,8)
14 | (1,9),(3,25)
15 | (1,6)
16 | (1,8),(3,24)
17 | (3,3)
18 | (2,13),(3,6),(3,19)
19 | (1,2)
20 | (2,5),(3,14),(3,17),(3,18)
21 | (3,22)
22 | (1,10)
23 | (2,11)
24 | (2,10)
25 | (2,14)
26 | (1,12),(3,1)
27 | (2,7),(3,8),(3,9),(3,11)
28 | (3,23)
29 | (2,6)
30 | (1,3)
31 | (2,2)
32 | (3,10),(3,13),(3,15)