Login


Easy Full-Text Search Queries

By Jonathan Wood on 6/22/2011
Language: C#
Technology: .NETSQL
Platform: Windows
License: CPOL
Views: 2,385
General Programming » SQL & Other Databases » Full-Text Search » Easy Full-Text Search Queries

Test Project Screenshot

Download Source Code Download Source Code

Introduction

SQL Server's Full-Text Search offers powerful and efficient ways to search your data. However, the query syntax can be rather cryptic.

It's bad enough for SQL programmers, but what if you want to allow your users to enter queries? Unless all your users are database experts, they won't know or understand syntax like FORMSOF(INFLECTIONAL, dogs) AND FORMSOF(THESAURUS, cats).

Some time ago, I ran into the excellent article A Google-like Full Text Search by Michael Coles. This article presented code that converted a Google-like query syntax into Full-Text Search syntax.

I really liked the idea. However, I decided against using the code presented in the article for two reasons. First, it relied on the Irony language implementation kit. While I'm sure this is a very fine bit of code, it's a large project capable of implementing custom languages and just seemed like overkill for my needs. And second, the code presented was kind of picky about the syntax and would halt with a syntax error if it encountered something it didn't like. When processing a user's search query, I would prefer to interpret it as best I could without throwing back cryptic error messages. Afterall, when was the last time you got a query error while searching Google?

So I decided to implement my own code that was more forgiving and didn't rely on large third-party libraries.

A Simple Search Syntax

My code takes a simple search query and translates it into a Full-Text Search query for SQL Server. The syntax for the input query is similar to that supported by search engines like Google. I also added a few enhancements, some lifted from Coles' article.

If the input query contains syntax errors, such as missing the closing quote, my code will simply translate the query as best it can and produce a valid output.

Table 1 shows a number of examples of the translations my code will perform. Note that parentheses can be used to control operator precedence, as demonstrated in the last example.

Table 1: Sample Conversions from Simple Search Syntax to Full-Text Search Syntax

Input Output
abc FORMSOF(INFLECTIONAL, abc)
~abc FORMSOF(THESAURUS, abc)
+abc "abc"
"abc" "abc"
-abc NOT FORMSOF(INFLECTIONAL, abc)
abc def (FORMSOF(INFLECTIONAL, abc) AND FORMSOF(INFLECTIONAL, def))
abc or def (FORMSOF(INFLECTIONAL, abc) OR FORMSOF(INFLECTIONAL, def))
<abc def> (FORMSOF(INFLECTIONAL, abc) NEAR FORMSOF(INFLECTIONAL, def))
abc and (def or ghi) (FORMSOF(INFLECTIONAL, abc) AND (FORMSOF(INFLECTIONAL, def) OR FORMSOF(INFLECTIONAL, ghi)))

The EasyFts Class

Listing 1 shows my EasyFts class. Note that it makes use of my handy TextParser class to simplify parsing the input text.

The ToFtsQuery method takes a simple search query and converts in into a Full-Text search query.

The code builds a very simple expression tree consisting of terminal nodes (search terms) and internal nodes (conjunctions). Whenever a search term is added as a terminal node, an internal node is added to control how the search term is joined to the rest of the query. This makes it easy to implement operator precedence with parentheses. It also makes it easy to form the final result once the expression tree has been built.

Listing 1: The EasyFts Class

class EasyFts
{
    /// <summary>
    /// Query term forms.
    /// </summary>
    protected enum TermForms
    {
        Inflectional,
        Thesaurus,
        Literal,
    }

    /// <summary>
    /// Term conjunction types.
    /// </summary>
    protected enum ConjunctionTypes
    {
        And,
        Or,
        Near,
    }

    /// <summary>
    /// Expression node abstract base class
    /// </summary>
    protected abstract class NodeBase
    {
        public abstract bool IsTerminal();
    }

    /// <summary>
    /// Terminal (leaf) expression node class.
    /// </summary>
    private class TerminalNode : NodeBase
    {
        public string Term { get; set; }
        public TermForms TermForm { get; set; }
        public bool Exclude { get; set; }

        public override bool IsTerminal()
        {
            return true;
        }

        // Convert node to string
        public override string ToString()
        {
            string fmt = null;
            if (TermForm == TermForms.Inflectional)
                fmt = "{0}FORMSOF(INFLECTIONAL, {1})";
            else if (TermForm == TermForms.Thesaurus)
                fmt = "{0}FORMSOF(THESAURUS, {1})";
            else if (TermForm == TermForms.Literal)
                fmt = "{0}\"{1}\"";
            return String.Format(fmt,
                Exclude ? "NOT " : String.Empty,
                Term);
        }
    }

    /// <summary>
    /// Internal (non-leaf) expression node class
    /// </summary>
    private class InternalNode : NodeBase
    {
        public NodeBase Child1 { get; set; }
        public NodeBase Child2 { get; set; }
        public ConjunctionTypes Conjunction { get; set; }

        public override bool IsTerminal()
        {
            return false;
        }

        // Convert node to string
        public override string  ToString()
        {
            return String.Format("({0} {1} {2})",
                Child1.ToString(),
                Conjunction.ToString().ToUpper(),
                Child2.ToString());
        }
    }

    // Characters not allowed in unquoted search terms
    protected const string Punctuation = "~\"`!@#$%^&*()-+=[]{}\\|;:,.<>?/";

    /// <summary>
    /// Collection of stop words. These words will not
    /// be included in the resulting query unless quoted.
    /// </summary>
    public HashSet<string> StopWords { get; set; }

    // Class constructor
    public EasyFts()
    {
        StopWords = new HashSet<string>();
    }

    /// <summary>
    /// Converts an "easy" search term to a full-text search term.
    /// </summary>
    /// <param name="query">Search term to convert</param>
    /// <returns>A valid full-text search query</returns>
    public string ToFtsQuery(string query)
    {
        NodeBase node = ParseNode(query, ConjunctionTypes.And);
        return (node != null) ? node.ToString() : String.Empty;
    }

    /// <summary>
    /// Parses a query segment and converts it to an expression
    /// tree.
    /// </summary>
    /// <param name="query">Query segment to convert</param>
    /// <param name="defaultConjunction">Implicit conjunction type</param>
    /// <returns>Root node of expression tree</returns>
    private NodeBase ParseNode(string query, ConjunctionTypes defaultConjunction)
    {
        TermForms termForm = TermForms.Inflectional;
        bool termExclude = false;
        ConjunctionTypes conjunction = defaultConjunction;
        bool resetState = true;
        NodeBase root = null;
        NodeBase node;
        string term;

        TextParser parser = new TextParser(query);
        while (!parser.EndOfText)
        {
            if (resetState)
            {
                // Reset modifiers
                termForm = TermForms.Inflectional;
                termExclude = false;
                conjunction = defaultConjunction;
                resetState = false;
            }

            parser.MovePastWhitespace();
            if (!parser.EndOfText &&
                !Punctuation.Contains(parser.Peek()))
            {
                // Extract query term
                int start = parser.Position;
                parser.MoveAhead();
                while (!parser.EndOfText &&
                    !Punctuation.Contains(parser.Peek()) &&
                    !Char.IsWhiteSpace(parser.Peek()))
                    parser.MoveAhead();

                // Allow trailing wildcard
                if (parser.Peek() == '*')
                {
                    parser.MoveAhead();
                    termForm = TermForms.Literal;
                }
                term = parser.Extract(start, parser.Position);

                if (termForm != TermForms.Literal &&
                    String.Compare(term, "AND", true) == 0)
                    conjunction = ConjunctionTypes.And;
                else if (termForm != TermForms.Literal &&
                    String.Compare(term, "OR", true) == 0)
                    conjunction = ConjunctionTypes.Or;
                else if (termForm != TermForms.Literal &&
                    String.Compare(term, "NEAR", true) == 0)
                    conjunction = ConjunctionTypes.Near;
                else if (termForm != TermForms.Literal &&
                    String.Compare(term, "NOT", true) == 0)
                    termExclude = true;
                else
                {
                    root = AddNode(root, term, termForm, termExclude, conjunction);
                    resetState = true;
                }
                continue;
            }
            else if (parser.Peek() == '"')
            {
                // Match next term exactly
                termForm = TermForms.Literal;
                // Extract quoted term
                term = ExtractQuote(parser);
                root = AddNode(root, term.Trim(), termForm, termExclude, conjunction);
                resetState = true;
            }
            else if (parser.Peek() == '(')
            {
                // Parse parentheses block
                term = ExtractBlock(parser, '(', ')');
                node = ParseNode(term, defaultConjunction);
                root = AddNode(root, node, conjunction);
                resetState = true;
            }
            else if (parser.Peek() == '<')
            {
                // Parse angle brackets block
                term = ExtractBlock(parser, '<', '>');
                node = ParseNode(term, ConjunctionTypes.Near);
                root = AddNode(root, node, conjunction);
                resetState = true;
            }
            else if (parser.Peek() == '-')
            {
                // Match when next term is not present
                termExclude = true;
            }
            else if (parser.Peek() == '+')
            {
                // Match next term exactly
                termForm = TermForms.Literal;
            }
            else if (parser.Peek() == '~')
            {
                // Match synonyms of next term
                termForm = TermForms.Thesaurus;
            }
            // Advance to next character
            parser.MoveAhead();
        }
        return root;
    }

    /// <summary>
    /// Creates an expression node and adds it to the
    /// give tree.
    /// </summary>
    /// <param name="root">Root node of expression tree</param>
    /// <param name="term">Term for this node</param>
    /// <param name="termForm">Indicates form of this term</param>
    /// <param name="termExclude">Indicates if this is an excluded term</param>
    /// <param name="conjunction">Conjunction used to join with other nodes</param>
    /// <returns>The new root node</returns>
    protected NodeBase AddNode(NodeBase root, string term, TermForms termForm,
        bool termExclude, ConjunctionTypes conjunction)
    {
        if (term.Length > 0 && !IsStopWord(term))
        {
            NodeBase node = new TerminalNode()
            {
                Term = term,
                TermForm = termForm,
                Exclude = termExclude
            };
            root = AddNode(root, node, conjunction);
        }
        return root;
    }

    /// <summary>
    /// Adds an expression node to the given tree.
    /// </summary>
    /// <param name="root">Root node of expression tree</param>
    /// <param name="node">Node to add</param>
    /// <param name="conjunction">Conjunction used to join with other nodes</param>
    /// <returns>The new root node</returns>
    protected NodeBase AddNode(NodeBase root, NodeBase node, ConjunctionTypes conjunction)
    {
        if (node != null)
        {
            if (root != null)
            {
                root = new InternalNode()
                {
                    Child1 = root,
                    Child2 = node,
                    Conjunction = conjunction
                };
            }
            else
            {
                root = node;
            }
        }
        return root;
    }

    /// <summary>
    /// Extracts a block of text delimited by the specified open and close
    /// characters. It is assumed the parser is positioned at an
    /// occurrence of the open character. The open and closing characters
    /// are not included in the returned string. On return, the parser is
    /// positioned at the closing character or at the end of the text if
    /// the closing character was not found.
    /// </summary>
    /// <param name="parser">TextParser object</param>
    /// <param name="openChar">Start-of-block delimiter</param>
    /// <param name="closeChar">End-of-block delimiter</param>
    /// <returns>The extracted text</returns>
    protected string ExtractBlock(TextParser parser, char openChar, char closeChar)
    {
        // Track delimiter depth
        int depth = 1;

        // Extract characters between delimiters
        parser.MoveAhead();
        int start = parser.Position;
        while (!parser.EndOfText)
        {
            if (parser.Peek() == openChar)
            {
                // Increase block depth
                depth++;
            }
            else if (parser.Peek() == closeChar)
            {
                // Decrease block depth
                depth--;
                // Test for end of block
                if (depth == 0)
                    break;
            }
            else if (parser.Peek() == '"')
            {
                // Don't count delimiters within quoted text
                ExtractQuote(parser);
            }
            // Move to next character
            parser.MoveAhead();
        }
        return parser.Extract(start, parser.Position);
    }

    /// <summary>
    /// Extracts a block of text delimited by double quotes. It is
    /// assumed the parser is positioned at the first quote. The
    /// quotes are not included in the returned string. On return,
    /// the parser is positioned at the closing quote or at the end of
    /// the text if the closing quote was not found.
    /// </summary>
    /// <param name="parser">TextParser object</param>
    /// <returns>The extracted text</returns>
    protected string ExtractQuote(TextParser parser)
    {
        // Extract contents of quote
        parser.MoveAhead();
        int start = parser.Position;
        while (!parser.EndOfText && parser.Peek() != '"')
            parser.MoveAhead();
        return parser.Extract(start, parser.Position);
    }

    /// <summary>
    /// Determines if the given word has been identified as
    /// a stop word.
    /// </summary>
    /// <param name="word">Word to check</param>
    protected bool IsStopWord(string word)
    {
        return StopWords.Any(s => String.Compare(s, word, true) == 0);
    }
}

Problem with Stop Words

One thing that has been a thorn in my side is SQL Server 2008's handling of stop words. Stop words are words such as a, and, and the that are not included in the Full-Text index. Having stop words makes sense because these words are very common and don't really add to the context of the search. Since these words are not in the index, SQL Server cannot include these words in the search.

The problem is that SQL Server considers a stop word to have no results. So, for example, if you searched for black and white, SQL Server would never return a single result even if every row in the database contained all three words! This is because and is a stop word a stop words have no matches.

Since this behavior is completely counter-intuitive, my EasyFts class allows you to specify a list of stop words via the StopWords property. Unlike SQL Server, my code will simply not include these words in the search and so their presence will not block all results.

I know SQL Server has a number of options for modifying or turning off stop words, but doesn't it make sense for SQL Server to allow an option such as the one my class provides that will prevent these words from blocking all results? I've posted a request for this feature on Microsoft Connect. If you'd like to see this feature added, please vote on my request.

Using the Code

To use the EasyFts class, simply declare an instance of the class and call the ToFtsQuery() method. If you'd like the class to avoid passing along certain stop words to SQL Server, simply add those stop words to the StopWords collection property.

Note that the text returned is just a portion of the actual SQL Server query. You should incorporate the text returned, preferrably as a SQL Parameter, into your full query, which might look something like this:

SELECT * FROM Articles
WHERE CONTAINS(Articles.*, @FtsQuery)

The attached download includes the full source code for this article and demonstrates using the EasyFts class.

Conclusion

In my view, Full-Text Search is a great addition to SQL Server. It provides an easy and efficient way to search for words in your database. But some of the advantages are lost if the required syntax is too cryptic for your users.

Being able to convert from a more friendly syntax that your users will understand is necessary to gain full advantage offered by Full-Text Search.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software and website developer working out of the greater Salt Lake City area of Utah. I've developed many websites including Black Belt Coder.

I hike each week with my dog Suki. You can see my hiking blog at Hiking Salt Lake.