This description is best viewed in a web browser on a laptop or desktop.
In this puzzle, we will explore my favorite data structure, the Bloom filter.
Basics
A set is a collection of objects. The objects in a set are called members or elements. Examples of sets include:
- the jellybeans in a jar
- the vehicles in a garage
- the numbers between 0 and 10
There are many things you may want to do with a set. You may want to add an element to a set (put a jellybean in the jar). Once you've added an element, you may want to remove it (take a jellybean out of the jar and eat it). You may also want to count all of the elements in a set (how many jelly beans do I have?).
Sets in computing
Sets are common not only in real life, but in the field of computer science as well. In addition to the above operations, sets are useful for keeping track of elements you have encountered, in order to determine whether a new element has been seen before. When your browser loaded this page, it checked whether the URL was in its cache (a set of web pages) and then decided to either retrieve it from the cache or download it over the network.
An obvious way to implement a set is to just store all of the elements in the set, perhaps in an array. This mostly works fine: insertion takes constant time. Querying whether an element is in the array takes time proportional to the number of elements in the set; this can be improved using hash-based data structures. Enumerating the elements is straight-forward. Deleting an element is possible, but requires care so that it doesn't take time proportional to the number of elements in the set.
What happens when we want a computer to store a very, very large set? There may be more elements than would fit in memory. Burton Howard Bloom, an employee of the Computer Usage Company (sometimes credited as being the world's first computer software company), came up with a novel way to represent sets. The specific problem he addressed was space-efficient set membership: can we design a set that allows you to determine if an element has been seen before, without having to store all of the elements? In 1970, he published Space/Time Trade-offs in Hash Coding with Allowable Errors, describing the eponymous data structure.
Bloom filters
A Bloom filter is an array of m bits along with k hash functions. Each hash function takes an element as input and outputs an integer from 0 to m-1.
To add an element to a Bloom filter, hash that element with each of the k hash functions. Take the resulting k integers and treat them as indices into the array. Set the bits at each of those indices to 1.
To check if an element is in a Bloom filter, again hash that element with each if the k hash functions and treat the resulting k integers as indices into the array. If the bit stored at any of those indices is 0, the element is not in the set. If all of the bits are 1, the element is probably in the set.
Bloom filters have several interesting properties. They have a constant size, regardless of how many elements they contain. The time needed to insert an element into a Bloom filter or check whether an element is in the set is also constant. The union of two sets represented as Bloom filters is the logical OR of the bit arrays. Similarly, the intersection of two sets is the logical AND.
The space efficiency of Bloom filters comes with several downsides:
- They can have false positives.
- Elements cannot be removed from a Bloom filter. The naive approach of removing an element by setting all of its bits to 0 has the risk of removing other elements as well. There is an extension of Bloom filters called counting Bloom filters that can support removal in some circumstances.
- The elements of a Bloom filter cannot be listed/enumerated.
Bloom filters in action
In this example, m is 16 and k is 3. The elements to be inserted into the set are ASCII strings. Characters in this style are in hex and are either byte strings or integers. The specific hash functions we will use are:
- Prepend 00 to the bytes of the element and take the SHA-256 hash of this byte string. Treat the first two bytes of the hash as a 16-bit integer (the reason for two bytes will be apparent later). This integer mod m is the value of the first index.
- Prepend 01 to the bytes of the element and repeat the process above.
- Prepend 02 to the bytes of the element and repeat the process above.
In pseudocode:
| Indices(m, k, e) |
| indices = [] |
| for i in range k |
| h = SHA-256(i || e) |
| n = h[0] << 8 | h[1] |
| indices[i] = n % m |
| return indices |
The empty Bloom filter looks like this:

To insert the element "geo", construct the byte string 0067656f. The SHA-256 hash of this string is 51edb06e4fd67031460e516f12f628dfea0f6e770827b66b7cee8de8f4434d8d. 51ed mod 16 = 13. Likewise, the 16-bit integer from hashing 0167656f is 226f and the index is 15. The 16-bit integer from hashing 0267656f is 6bfb and the index is 11. After inserting "geo" in the Bloom filter, the bit array looks like this:

The indices for the element "cache" are 12, 15, and 0. Inserting "cache" into the Bloom filter results in a bit array that looks like this:

What happens when we query whether the set contains "geocache"? The indices are 14, 4, and 2. The bit at index 14 is 0, so we know "geocache" is not in the set.
What about the word "tricked"? The indices are 15, 11, and 15 again. The bits at each of the indices are 1, so this is a false positive.
The puzzle
To solve this puzzle, implement a Bloom filter as described above, with m = 1024 and k = 4. Perform the following operations on your Bloom filter:
insert board
insert the
insert eye
insert plant
insert mind
insert management
insert break
insert patient
insert station
insert however
insert society
insert ground
insert analysis
insert group
insert cost
insert manage
insert many
insert although
insert some
insert top
insert quite
insert hospital
insert dead
insert identify
insert entire
insert cover
insert data
insert lead
insert process
insert finish
insert any
insert accept
insert receive
insert treatment
insert yet
insert increase
insert scientist
insert pick
insert can
insert here
insert blue
insert system
insert summer
insert have
insert card
insert politics
insert political
insert minute
insert item
insert herself
insert past
insert tend
insert about
insert wear
insert few
insert capital
insert above
insert despite
insert north
insert school
insert land
insert study
insert computer
insert two
insert its
insert network
insert leader
insert social
insert hang
insert cancer
insert picture
insert site
insert local
insert ok
insert risk
insert treat
insert crime
insert evening
insert begin
insert economic
insert strong
insert major
insert trouble
insert sometimes
insert every
insert final
insert left
insert camera
insert fight
insert rise
insert news
insert offer
insert tree
insert light
insert customer
insert author
insert leg
insert bit
insert imagine
insert television
insert often
insert size
insert until
insert find
insert oh
insert kid
insert right
insert strategy
insert discover
insert project
insert throughout
insert finger
insert able
insert forward
insert interesting
insert voice
insert weapon
insert both
insert style
insert generation
insert indeed
query nor
insert oil
query born
insert physical
query mean
insert much
query century
insert measure
query amount
insert share
query care
insert act
query protect
insert goal
query military
insert check
query simply
insert expect
query language
insert bill
query support
insert should
query call
insert understand
query exactly
insert rate
query happen
insert wife
query more
insert space
query tax
insert bed
query hand
insert better
query huge
insert dark
query whole
insert by
query security
insert think
query choose
insert violence
query water
insert four
query six
insert move
query indicate
insert keep
query collection
insert community
query important
insert type
query ahead
insert each
query with
insert late
query man
insert against
query avoid
insert letter
query training
insert what
query total
insert particular
query drop
insert from
query go
insert you
query theory
insert turn
query significant
insert long
query sell
insert half
query give
insert could
query answer
insert through
query those
insert food
query appear
insert shot
query loss
insert thus
query send
insert southern
query certainly
insert action
query home
insert growth
query instead
insert various
query today
insert yeah
query tell
insert response
query stock
insert private
query hope
insert program
query save
insert contain
query out
insert death
query a
insert sign
query art
insert trip
query democratic
insert base
query memory
insert who
query during
insert problem
query week
insert inside
query rule
insert beyond
query money
|
Each query operation will return true or false, 1 or 0. Concatenate the results of four consecutive query operations together to form a four-bit number. Repeat for every group of four query operations.
You can validate your puzzle solution with
certitude.