Implementing a Lightweight Google-Like Search Using .Net
One of the requirements for my current project was to implement searching capabilities that were “Google-like”. When I heard this, I immediately thought about solr, which some colleagues have recommended as an enterprise search engine. And it does look good. However, after reading the description, I realized that this wasn’t a good fit for my application. The main strength of solr is its textual indexing. It is meant to index text documents. The application that I am developing is an image catalog. Each image gets tagged with predefined attributes. Implemented in a relational database, the only connection between the image and the attributes is a linking table. There is no text to index. I also quickly figured out that trying to shoehorn solr for my application would be complex and error prone.
So I decided to do research into search engines to see if I could adapt searching techniques for the needs of my application. I used Google to look for this information, and I found an academic paper called The Anatomy of a Large-Scale Hypertextual Web Search Engine, which is the paper that describes the original architecture of the Google search engine. I scanned over their relevance algorithm, web crawler, and forward index, and then found the section on their inverted index.
An inverted index is the computer data structure version of a book index. As you recall, in a book index there is a list of terms followed by the pages where the term appears. Part of the speed in retrieval while searching Google comes from using this data structure. The Wikipedia article confirms the popularity of this data structure for searching.
I saw how I could use this structure for my application, and that .Net makes the implementation of an object supporting the inverted index to be extremely easy. The names of the predefined attributes would be the keys, and the record ids of the image records would be the values. The index can easily be implemented in C# using a generic Dictionary object.
Dictionary<string, List<int>> index;
Now that I had my index created, I decided to create an object that would have useful methods to work with the index. I created the following interface:
public interface IInvertedIndex
{
void Add(string key, int record_id);
int IndexSize { get; }
void Load(string path);
void Remove(string key);
void Remove(int record_id);
void Save(string path);
List<int> Search(string search_term);
}
The most important methods are Add(), Remove(), and Search(). There are two Remove() methods, one for removing a key and the second one to delete occurrences of a record id. Load() and Save() are utility methods used to save the index and retrieve previously made ones. Finally, I added the property IndexSize to make testing easier.
Implementing these methods is easy. Here is the implementation for adding a record.
public void Add(string key, int record_id)
{
if(index.ContainsKey(key))
{
index[key].Add(record_id);
}
else
{
index.Add(key, new List<int>() { record_id });
}
We must check to see if the key exists. If it does, then we add the record_id to the List<int> list. If the key doesn’t exist, then we create a new entry with the key and a new list as the value of the key.
Below are the implementations of the Remove() methods. Removing a key is a direct action, but we must search every key to check whether a key contains a record_id.
public void Remove(string key)
{
index.Remove(key);
}
public void Remove(int record_id)
{
foreach (string key in index.Keys)
{
if (index[key].Contains(record_id))
{
index[key].Remove(record_id);
}
}
}
Most readers will notice that as the total number of records increase in the index, removing record ids from the index will take longer. Reducing deletion time of record ids, once it become an issue, shouldn’t be hard. A possible good solution to this is to keep a forward index where the key are the record ids and instead of a List<int> we have a List<string> holding the search terms. This wouldn’t be hard to implement at all; in fact, it is almost the same implementation of an Inverted index. However, adhering to the principle of avoiding premature optimizations, I decided to wait until it was a noticeable problem in the application. As it worked out, the users practically never delete images, and after six month when I delete images while developing the app I haven’t noticed a long time for deleting images, which seems to say that a potential optimization is still in the future.
Finally, we get to the best part, the search implementation.
public List<int> Search(string query)
{
var terms = prepare(query);
var result = executeSearch(terms);
return result;
}
private List<string> prepare(string query)
{
var keywords = query.Split(' ').ToList();
if (keywords.Count > 1)
keywords.Add(query);
return keywords;
}
private List<int>
executeSearch(List<string> keywords)
{
var result = new List<int>();
var subResult = new List<int>();
foreach (string term in keywords)
{
var termResult = searchTerm(term);
subResult.AddRange(termResult);
}
result = subResult.Distinct().ToList();
return result;
}
private List<int> searchTerm(string term)
{
var result = new List<int>();
if (index.ContainsKey(term))
{
result = index[term];
}
return result;
}
Here we have the public method Search doing two actions: preparing the query for the search, and then executing the search. The preparation of the query is done by prepare(), and it consists in breaking down the string query into each term by splitting the string by space character. This produces a List<string> object. We add to this list object the full query in case that the key is a multiword term.
executeSearch() takes the list of terms and searches for them in the index. As it gets results, it adds the results together to the subresult string. Before returning, it omits redundant data. Notice how it executes a logical OR search by default. When I first implemented this object, I defaulted to a logical AND searches. And it was accurate if you are looking for exact searches, but it is less useful if you are looking for a range of possible answers for your query, so I changed it.
And that is it. This small object is all what I needed to begin providing google-like searching for my application . It is very lightweight, it took me less time to implement than it would have taken me to configure solr , and it was a great starting point to further evolve the search functionality as the client asks for specific search features. It feels fast and accurate. I found that it from this object it has been simple to add improvements such as allowing for quoted query search, special option characters, allowing for logical AND searches, or returning each match with a score.
The source code (MIT license) for this post is available at github.
This post was written by:
Hugo
Associate
For more information on this post, the technologies discussed here, or Zekiah’s system integration and software development services, please e-mail us at contact@zekiah.com