A Text Parsing Helper Class

Screenshot of Test Project

Introduction

It seems like, lately, I've been writing a lot of code to parse text.

C# and the .NET platform offer a pretty rich collection of classes and methods that I can use to help with this task.

Beyond the general-purpose methods of the String class, the String.Split() method and the regular expression classes can be applied specifically to parsing text.

However, I've found that to have full control and to make my parsing code as robust as possible, I need to write code that scans the input text character by character.

TextParser Class

As I set out writing classes to parse text, I noticed that a lot of time was spent on common tasks that seemed a distraction from what I was really trying to accomplish. So I decided to write a generalized helper class, which could be used by itself or as a base class for a more sophisticated parsing class.

The class I developed is shown in Listing 1. The TextParser class maintains a string and a position within that string. It is designed primarily to handle some of the more mundane tasks involved in processing a string of text from start to finish.

Listing 1: The TextParser Class

using System;

namespace SoftCircuits
{
    public class TextParser
    {
        private string _text;
        private int _pos;

        public string Text { get { return _text; } }
        public int Position { get { return _pos; } }
        public int Remaining { get { return _text.Length - _pos; } }
        // Declared const so callers cannot reassign the end-of-text sentinel
        public const char NullChar = (char)0;

        public TextParser()
        {
            Reset(null);
        }

        public TextParser(string text)
        {
            Reset(text);
        }

        /// <summary>
        /// Resets the current position to the start of the current document
        /// </summary>
        public void Reset()
        {
            _pos = 0;
        }

        /// <summary>
        /// Sets the current document and resets the current position to the start of it
        /// </summary>
        /// <param name="text">The text to parse; null is treated as an empty string</param>
        public void Reset(string text)
        {
            _text = (text != null) ? text : String.Empty;
            _pos = 0;
        }

        /// <summary>
        /// Indicates if the current position is at the end of the current document
        /// </summary>
        public bool EndOfText
        {
            get { return (_pos >= _text.Length); }
        }

        /// <summary>
        /// Returns the character at the current position, or a null character if we're
        /// at the end of the document
        /// </summary>
        /// <returns>The character at the current position</returns>
        public char Peek()
        {
            return Peek(0);
        }

        /// <summary>
        /// Returns the character at the specified number of characters beyond the current
        /// position, or a null character if the specified position is outside the
        /// document
        /// </summary>
        /// <param name="ahead">The number of characters beyond the current position</param>
        /// <returns>The character at the specified position</returns>
        public char Peek(int ahead)
        {
            int pos = (_pos + ahead);
            // Check both bounds so a negative offset cannot index before the start
            if (pos >= 0 && pos < _text.Length)
                return _text[pos];
            return NullChar;
        }

        /// <summary>
        /// Extracts a substring from the specified position to the end of the text
        /// </summary>
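        /// <param name="start">Zero-based index of the first character to extract</param>
        /// <returns>The text from the start index to the end of the document</returns>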
        public string Extract(int start)
        {
            return Extract(start, _text.Length);
        }

        /// <summary>
        /// Extracts a substring from the specified range of the current text
        /// </summary>
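        /// <param name="start">Zero-based index of the first character to extract</param>
        /// <param name="end">Zero-based index just past the last character to extract</param>
        /// <returns>The text between the start index and the end index</returns>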
        public string Extract(int start, int end)
        {
            return _text.Substring(start, end - start);
        }

        /// <summary>
        /// Moves the current position ahead one character
        /// </summary>
        public void MoveAhead()
        {
            MoveAhead(1);
        }

        /// <summary>
        /// Moves the current position ahead the specified number of characters
        /// </summary>
        /// <param name="ahead">The number of characters to move ahead</param>
        public void MoveAhead(int ahead)
        {
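            // Clamp so the position always stays within [0, _text.Length],
            // even if a negative value is passed
            _pos = Math.Max(0, Math.Min(_pos + ahead, _text.Length));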
        }

        /// <summary>
        /// Moves to the next occurrence of the specified string
        /// </summary>
        /// <param name="s">String to find</param>
        /// <param name="ignoreCase">Indicates if case-insensitive comparisons
        /// are used</param>
        public void MoveTo(string s, bool ignoreCase = false)
        {
            _pos = _text.IndexOf(s, _pos, ignoreCase ?
                StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal);
            if (_pos < 0)
                _pos = _text.Length;
        }

        /// <summary>
        /// Moves to the next occurrence of the specified character
        /// </summary>
        /// <param name="c">Character to find</param>
        public void MoveTo(char c)
        {
            _pos = _text.IndexOf(c, _pos);
            if (_pos < 0)
                _pos = _text.Length;
        }

        /// <summary>
        /// Moves to the next occurrence of any one of the specified
        /// characters
        /// </summary>
        /// <param name="chars">Array of characters to find</param>
        public void MoveTo(char[] chars)
        {
            _pos = _text.IndexOfAny(chars, _pos);
            if (_pos < 0)
                _pos = _text.Length;
        }

        /// <summary>
        /// Moves to the next occurrence of any character that is not one
        /// of the specified characters
        /// </summary>
        /// <param name="chars">Array of characters to move past</param>
        public void MovePast(char[] chars)
        {
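            // Test EndOfText first: at the end Peek() returns NullChar, and if chars
            // contained NullChar this loop would otherwise never terminate
            while (!EndOfText && IsInArray(Peek(), chars))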
                MoveAhead();
        }

        /// <summary>
        /// Determines if the specified character exists in the specified
        /// character array.
        /// </summary>
        /// <param name="c">Character to find</param>
        /// <param name="chars">Character array to search</param>
        /// <returns>True if the character is found in the array</returns>
        protected bool IsInArray(char c, char[] chars)
        {
            foreach (char ch in chars)
            {
                if (c == ch)
                    return true;
            }
            return false;
        }

        /// <summary>
        /// Moves the current position to the first character that is part of a newline,
        /// or to the end of the document if no newline follows
        /// </summary>
        public void MoveToEndOfLine()
        {
            char c = Peek();
            while (c != '\r' && c != '\n' && !EndOfText)
            {
                MoveAhead();
                c = Peek();
            }
        }

        /// <summary>
        /// Moves the current position to the next character that is not whitespace
        /// </summary>
        public void MovePastWhitespace()
        {
            while (Char.IsWhiteSpace(Peek()))
                MoveAhead();
        }
    }
}

Exploring the Class

The class has two constructors: one takes the string you'll be working with, and the other takes no arguments and starts with an empty string. In addition, you can call the Reset() method to set the string (or call the version with no argument to return to the beginning of the current string). A null string is treated as an empty one.

The EndOfText property returns true when the current position has reached the end of the current string. Note that none of the methods will move beyond the end-of-text position. They all work to ensure the current position remains valid.

For example, the MoveTo() methods all set the current position to the end of the string if the characters they are searching for cannot be found. Contrast this with methods like String.IndexOf(), which return a position of -1 if the search was unsuccessful.
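
The difference can be seen in a small sketch. The TextParser below is a trimmed stand-in for Listing 1 that contains only the members used here, and EndOfTextDemo with its method name is a hypothetical helper added purely for illustration.

```csharp
using System;

// Trimmed stand-in for the TextParser in Listing 1 (only the members used below).
public class TextParser
{
    private readonly string _text;
    private int _pos;

    public TextParser(string text) { _text = text ?? String.Empty; }
    public int Position { get { return _pos; } }
    public bool EndOfText { get { return _pos >= _text.Length; } }
    public void MoveAhead(int ahead) { _pos = Math.Min(_pos + ahead, _text.Length); }

    public void MoveTo(char c)
    {
        _pos = _text.IndexOf(c, _pos);
        if (_pos < 0)
            _pos = _text.Length;
    }
}

// Hypothetical demo class, not part of the original article.
public static class EndOfTextDemo
{
    // Returns three values: the String.IndexOf() result, the parser position
    // after a failed MoveTo(), and the position after a further MoveAhead().
    public static int[] SearchForMissingChar()
    {
        const string text = "key=value";    // 9 characters, contains no ';'
        var parser = new TextParser(text);

        parser.MoveTo(';');                 // not found: position becomes the text length
        int afterMoveTo = parser.Position;

        parser.MoveAhead(100);              // already at the end: position stays put
        return new[] { text.IndexOf(';'), afterMoveTo, parser.Position };
    }
}
```

Under these assumptions, SearchForMissingChar() returns -1, 9 and 9. The caller checks EndOfText after a search instead of testing for -1, and the position can always be passed safely to the other methods.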

Other methods to notice are the MoveAhead() methods, which move the current position ahead one or more characters, and the Peek() methods, which return the character at the current position or at a specified number of characters beyond it. Note that Peek() returns (char)0, available as TextParser.NullChar, when the requested position is at or beyond the end of the string, so you don't need additional tests to prevent exceptions caused by trying to read past the end of the string.
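
As a minimal sketch of why that matters, the loop below counts the digits at the start of a string without ever comparing the position to the length. The TextParser shown is again a trimmed stand-in for Listing 1, and PeekDemo.CountLeadingDigits is a hypothetical name used only for this example.

```csharp
using System;

// Trimmed stand-in for the TextParser in Listing 1 (only the members used below).
public class TextParser
{
    private readonly string _text;
    private int _pos;

    public TextParser(string text) { _text = text ?? String.Empty; }
    public int Position { get { return _pos; } }
    public char Peek() { return _pos < _text.Length ? _text[_pos] : (char)0; }
    public void MoveAhead() { _pos = Math.Min(_pos + 1, _text.Length); }
}

// Hypothetical demo class, not part of the original article.
public static class PeekDemo
{
    // Counts the digits at the start of the text. No explicit end-of-text test
    // is needed: at the end Peek() returns (char)0, which is not a digit,
    // so the loop stops on its own.
    public static int CountLeadingDigits(string text)
    {
        var parser = new TextParser(text);
        while (Char.IsDigit(parser.Peek()))
            parser.MoveAhead();
        return parser.Position;
    }
}
```

With this sketch, CountLeadingDigits("42abc") returns 2, CountLeadingDigits("12345") returns 5 and runs right up to the end of the string, and an empty string gives 0.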

Making Use of the Class

As mentioned earlier, this class can be used "as-is" to perform simple text parsing. However, I've used it several times as the base class for more sophisticated parsing classes, including several that I've described in articles posted on the Black Belt Coder website.
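
To give a rough idea of what a derived parser might look like, here is a sketch that reads "name = value" pairs separated by semicolons. The SettingParser class, its TryReadSetting method and the input format are all invented for this illustration and are not taken from any of the articles mentioned above. The TextParser base class is a trimmed stand-in for Listing 1.

```csharp
using System;

// Trimmed stand-in for the TextParser in Listing 1 (only the members used below).
public class TextParser
{
    private readonly string _text;
    private int _pos;

    public TextParser(string text) { _text = text ?? String.Empty; }
    public int Position { get { return _pos; } }
    public bool EndOfText { get { return _pos >= _text.Length; } }
    public char Peek() { return _pos < _text.Length ? _text[_pos] : (char)0; }
    public void MoveAhead() { _pos = Math.Min(_pos + 1, _text.Length); }
    public string Extract(int start, int end) { return _text.Substring(start, end - start); }

    public void MovePastWhitespace()
    {
        while (Char.IsWhiteSpace(Peek()))
            MoveAhead();
    }
}

// Hypothetical derived parser: reads "name = value" pairs separated by ';'.
public class SettingParser : TextParser
{
    public SettingParser(string text) : base(text) { }

    // Reads the next pair. Returns false if no well-formed pair starts here.
    public bool TryReadSetting(out string name, out string value)
    {
        name = value = String.Empty;

        // Skip whitespace and separators in front of the name
        while (Char.IsWhiteSpace(Peek()) || Peek() == ';')
            MoveAhead();

        // The name is a run of letters and digits
        int start = Position;
        while (Char.IsLetterOrDigit(Peek()))
            MoveAhead();
        if (Position == start)
            return false;
        string candidate = Extract(start, Position);

        // An '=' must separate the name from the value
        MovePastWhitespace();
        if (Peek() != '=')
            return false;
        MoveAhead();
        MovePastWhitespace();

        // The value runs to the next ';' or to the end of the text
        start = Position;
        while (!EndOfText && Peek() != ';')
            MoveAhead();

        name = candidate;
        value = Extract(start, Position).TrimEnd();
        return true;
    }
}
```

Calling TryReadSetting() repeatedly on "width = 10; title = Hello World ;" yields the pairs width/10 and title/Hello World, and the third call returns false. The derived class contains only the grammar. All the bounds handling stays in the base class.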

The downloadable project associated with this article simply demonstrates using the class to parse a string into words. Listing 2 shows the code that performs this task. The task is simple, but the code shows how the TextParser class can simplify the logic needed to process text, which really comes in handy for more complex parsing jobs.

Listing 2: Code that Parses Words from a String

private void btnParse_Click(object sender, EventArgs e)
{
    TextParser parser = new TextParser(txtTextToParse.Text);

    lstResults.Items.Clear();
    while (!parser.EndOfText)
    {
        while (!parser.EndOfText && !Char.IsLetterOrDigit(parser.Peek()))
            parser.MoveAhead();

        int start = parser.Position;

        while (Char.IsLetterOrDigit(parser.Peek()))
            parser.MoveAhead();

        if (parser.Position > start)
            lstResults.Items.Add(parser.Extract(start, parser.Position));
    }
}
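
The same loop can also be lifted out of the button handler so it can be called and tested without a form. The following is one possible sketch. WordSplitter.ParseWords is a hypothetical name, and the TextParser included here is a trimmed stand-in for Listing 1 that contains only the members the loop needs.

```csharp
using System;
using System.Collections.Generic;

// Trimmed stand-in for the TextParser in Listing 1 (only the members used below).
public class TextParser
{
    private readonly string _text;
    private int _pos;

    public TextParser(string text) { _text = text ?? String.Empty; }
    public int Position { get { return _pos; } }
    public bool EndOfText { get { return _pos >= _text.Length; } }
    public char Peek() { return _pos < _text.Length ? _text[_pos] : (char)0; }
    public void MoveAhead() { _pos = Math.Min(_pos + 1, _text.Length); }
    public string Extract(int start, int end) { return _text.Substring(start, end - start); }
}

// Hypothetical helper, not part of the original article.
public static class WordSplitter
{
    // Returns each run of letters and digits in the text as a separate word.
    public static List<string> ParseWords(string text)
    {
        var words = new List<string>();
        var parser = new TextParser(text);

        while (!parser.EndOfText)
        {
            // Skip anything that is not part of a word
            while (!parser.EndOfText && !Char.IsLetterOrDigit(parser.Peek()))
                parser.MoveAhead();

            // Remember where the word starts, then move to its end
            int start = parser.Position;
            while (Char.IsLetterOrDigit(parser.Peek()))
                parser.MoveAhead();

            if (parser.Position > start)
                words.Add(parser.Extract(start, parser.Position));
        }
        return words;
    }
}
```

With this sketch, ParseWords("Hello, world! 42") returns the three words Hello, world and 42. Because only letters and digits count as word characters, a word containing an apostrophe such as "don't" is split in two.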

Conclusion

The main difficulty I had in creating this class was making it generic, that is, determining which methods would be most useful in the largest number of cases. After working on several projects that use this class, I think I've found a pretty good mix. If you need to write code to parse text, you may want to give this class a try.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.

Language:	C#
Technology:	WinForms
Platform:	Windows
License:	CPOL