Parsing HTML Tags in C#

Introduction

The .NET framework provides a plethora of tools for generating HTML markup, and for generating and parsing XML markup. However, it provides very little in the way of support for parsing HTML markup.

I had some old code (written in classic Visual Basic) for spidering websites and I had ported it over to C#. Spidering generally involves parsing out all the links on a particular web page and then following those links and doing the same for those pages. Spidering is how companies like Google scour the Internet.

My ported code worked pretty well, but it wasn't very forgiving. For example, I had a website that offered users a free promotion in return for a link to our site: the user entered the URL of a page containing that link, and the code scanned the page for the backlink. However, it would sometimes report that there was no backlink when there really was one.

The failures occurred when the user's web page contained syntax errors, such as an attribute value with no closing quote. My code would skip ahead past large amounts of markup, looking for that quote. Having lost track of what was inside and outside of quotes, it then missed links that it believed were part of a quoted value.

So I rewrote the code to be more forgiving, as most browsers are. When an attribute value is missing its closing quote, my code now assumes the value ends at the next line break. I made other changes as well, primarily to make the code simpler and more robust.
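
As a rough illustration of the line-break rule, here is a minimal standalone sketch. It is not the code from the listings in this article: the class QuotedValueSketch and the method ReadQuotedValue are hypothetical names, and the sketch assumes the position passed in is the opening quote.

```csharp
using System;

public static class QuotedValueSketch
{
    // Reads a quoted attribute value. text[pos] must be the opening quote.
    // The value ends at the matching quote or at a line break, whichever
    // comes first. On return, pos is just past the value (and past the
    // closing quote, if one was found).
    public static string ReadQuotedValue(string text, ref int pos)
    {
        char quote = text[pos];
        int start = pos + 1;

        // Look for the closing quote, but give up at the end of the line
        int end = text.IndexOfAny(new[] { quote, '\r', '\n' }, start);
        if (end < 0)
            end = text.Length;

        // Only consume the terminator if it really is the closing quote
        pos = (end < text.Length && text[end] == quote) ? end + 1 : end;
        return text.Substring(start, end - start);
    }

    public static void Main()
    {
        // The href value is missing its closing quote
        string html = "<a href=\"page.html\n   title=\"Home\">";
        int pos = html.IndexOf('"');
        Console.WriteLine(ReadQuotedValue(html, ref pos)); // prints "page.html"
    }
}
```

Without the line-break rule, the value would run on to the next quote character and swallow the start of the title attribute.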

HTML Tag Parser

Listing 1 shows my HtmlTag and HtmlParser classes. The ParseNext() method finds the next occurrence of a tag and returns an HtmlTag object that describes it. The caller indicates the type of tag it wants returned (or "*" if it wants all tags returned).

Note that HtmlTag does not contain any information about text within the tag, which is sometimes called inner text. Parsing the inner text gets a little more complex because inner text can contain nested tags. My code doesn't delve into anything too deep. It only parses the tags and attributes, which is perfect for tasks like spidering the links in a page.

Listing 1: The HtmlTag and HtmlParser classes

using System;
using System.Collections.Generic;

public class HtmlTag
{
    /// <summary>
    /// Name of this tag
    /// </summary>
    public string Name { get; set; }

    /// <summary>
    /// Collection of attribute names and values for this tag
    /// </summary>
    public Dictionary<string, string> Attributes { get; set; }

    /// <summary>
    /// True if this tag contained a trailing forward slash
    /// </summary>
    public bool TrailingSlash { get; set; }

    /// <summary>
    /// Indicates if this tag contains the specified attribute. Note that
    /// true is returned when this tag contains the attribute even when the
    /// attribute has no value
    /// </summary>
    /// <param name="name">Name of attribute to check</param>
    /// <returns>True if tag contains attribute or false otherwise</returns>
    public bool HasAttribute(string name)
    {
        return Attributes.ContainsKey(name);
    }
}

public class HtmlParser : TextParser
{
    public HtmlParser()
    {
    }

    public HtmlParser(string html) : base(html)
    {
    }

    /// <summary>
    /// Parses the next tag that matches the specified tag name
    /// </summary>
    /// <param name="name">Name of the tags to parse ("*" = parse all tags)</param>
    /// <param name="tag">Returns information on the next occurrence of the
    /// specified tag or null if none found</param>
    /// <returns>True if a tag was parsed or false if the end of the document was reached</returns>
    public bool ParseNext(string name, out HtmlTag tag)
    {
        // Must always set out parameter
        tag = null;

        // Nothing to do if no tag specified
        if (String.IsNullOrEmpty(name))
            return false;

        // Loop until match is found or no more tags
        MoveTo('<');
        while (!EndOfText)
        {
            // Skip over opening '<'
            MoveAhead();

            // Examine first tag character
            char c = Peek();
            if (c == '!' && Peek(1) == '-' && Peek(2) == '-')
            {
                // Skip over comments
                const string endComment = "-->";
                MoveTo(endComment);
                MoveAhead(endComment.Length);
            }
            else if (c == '/')
            {
                // Skip over closing tags
                MoveTo('>');
                MoveAhead();
            }
            else
            {
                bool result, inScript;

                // Parse tag
                result = ParseTag(name, ref tag, out inScript);
                // Because scripts may contain tag characters, we have special
                // handling to skip over script contents
                if (inScript)
                    MovePastScript();
                // Return true if requested tag was found
                if (result)
                    return true;
            }
            // Find next tag
            MoveTo('<');
        }
        // No more matching tags found
        return false;
    }

    /// <summary>
    /// Parses the contents of an HTML tag. The current position should be at the first
    /// character following the tag's opening less-than character.
    /// 
    /// Note: We parse to the end of the tag even if this tag was not requested by the
    /// caller. This ensures subsequent parsing takes place after this tag
    /// </summary>
    /// <param name="reqName">Name of the tag the caller is requesting, or "*" if caller
    /// is requesting all tags</param>
    /// <param name="tag">Returns information on this tag if it's one the caller is
    /// requesting</param>
    /// <param name="inScript">Returns true if the tag began, and did not end, a script
    /// block</param>
    /// <returns>True if data is being returned for a tag requested by the caller
    /// or false otherwise</returns>
    protected bool ParseTag(string reqName, ref HtmlTag tag, out bool inScript)
    {
        bool doctype, requested;
        doctype = inScript = requested = false;

        // Get name of this tag
        string name = ParseTagName();

        // Special handling
        if (String.Compare(name, "!DOCTYPE", true) == 0)
            doctype = true;
        else if (String.Compare(name, "script", true) == 0)
            inScript = true;

        // Is this a tag requested by caller?
        if (reqName == "*" || String.Compare(name, reqName, true) == 0)
        {
            // Yes
            requested = true;
            // Create new tag object
            tag = new HtmlTag();
            tag.Name = name;
            tag.Attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        }

        // Parse attributes
        MovePastWhitespace();
        while (Peek() != '>' && Peek() != NullChar)
        {
            if (Peek() == '/')
            {
                // Handle trailing forward slash
                if (requested)
                    tag.TrailingSlash = true;
                MoveAhead();
                MovePastWhitespace();
                // If this is a script tag, it was closed
                inScript = false;
            }
            else
            {
                // Parse attribute name
                name = (!doctype) ? ParseAttributeName() : ParseAttributeValue();
                MovePastWhitespace();
                // Parse attribute value
                string value = String.Empty;
                if (Peek() == '=')
                {
                    MoveAhead();
                    MovePastWhitespace();
                    value = ParseAttributeValue();
                    MovePastWhitespace();
                }
                // Add attribute to collection if requested tag
                if (requested)
                {
                    // This attribute replaces any existing attribute with the same name
                    tag.Attributes[name] = value;
                }
            }
        }
        // Skip over closing '>'
        MoveAhead();

        return requested;
    }

    /// <summary>
    /// Parses a tag name. The current position should be the first character of the name
    /// </summary>
    /// <returns>Returns the parsed name string</returns>
    protected string ParseTagName()
    {
        int start = Position;
        // Also stop at '/' so that <br/> yields "br" rather than "br/"
        while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '/')
            MoveAhead();
        return Substring(start, Position);
    }

    /// <summary>
    /// Parses an attribute name. The current position should be the first character
    /// of the name
    /// </summary>
    /// <returns>Returns the parsed name string</returns>
    protected string ParseAttributeName()
    {
        int start = Position;
        // Also stop at '/' so that a trailing slash is not read as part of the name
        while (!EndOfText && !Char.IsWhiteSpace(Peek()) && Peek() != '>' && Peek() != '='
            && Peek() != '/')
            MoveAhead();
        return Substring(start, Position);
    }

    /// <summary>
    /// Parses an attribute value. The current position should be the first non-whitespace
    /// character following the equal sign.
    /// 
    /// Note: We terminate the name or value if we encounter a new line. This seems to
    /// be the best way of handling errors such as values missing closing quotes, etc.
    /// </summary>
    /// <returns>Returns the parsed value string</returns>
    protected string ParseAttributeValue()
    {
        int start, end;
        char c = Peek();
        if (c == '"' || c == '\'')
        {
            // Move past opening quote
            MoveAhead();
            // Parse quoted value
            start = Position;
            MoveTo(new char[] { c, '\r', '\n' });
            end = Position;
            // Move past closing quote
            if (Peek() == c)
                MoveAhead();
        }
        else
        {
            // Parse unquoted value
            start = Position;
            while (!EndOfText && !Char.IsWhiteSpace(c) && c != '>')
            {
                MoveAhead();
                c = Peek();
            }
            end = Position;
        }
        return Substring(start, end);
    }

    /// <summary>
    /// Locates the end of the current script and moves past the closing tag
    /// </summary>
    protected void MovePastScript()
    {
        const string endScript = "</script";

        while (!EndOfText)
        {
            MoveTo(endScript, true);
            MoveAhead(endScript.Length);
            if (Peek() == '>' || Char.IsWhiteSpace(Peek()))
            {
                MoveTo('>');
                MoveAhead();
                break;
            }
        }
    }
}
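
To illustrate the "*" wildcard, the following short sketch lists every opening tag in a small fragment. It assumes the HtmlTag, HtmlParser and TextParser classes from Listings 1 and 2 are compiled into the same project. The TagDump class and the sample markup are made up for this example.

```csharp
using System;

public static class TagDump
{
    public static void Main()
    {
        // Sample markup: an unquoted value, a trailing slash, and an
        // attribute ("target") that has no value at all
        string html = "<p class=intro>Hi <br /> <a href=\"/x\" target>link</a></p>";

        HtmlTag tag;
        HtmlParser parser = new HtmlParser(html);
        while (parser.ParseNext("*", out tag))
        {
            Console.WriteLine("{0}: {1} attribute(s), trailing slash: {2}",
                tag.Name, tag.Attributes.Count, tag.TrailingSlash);
        }
        // Output:
        // p: 1 attribute(s), trailing slash: False
        // br: 0 attribute(s), trailing slash: True
        // a: 2 attribute(s), trailing slash: False
    }
}
```

Closing tags such as </a> never appear in the output because ParseNext() skips over them. The valueless target attribute is stored with an empty string as its value, so HasAttribute("target") returns true for the anchor tag.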

Generic Text Parser Base Class

HtmlParser derives from the base class TextParser, which is my generic text parsing class. By organizing my code this way, my HTML parser code is cleaner and easier to understand. In addition, the TextParser class could be used for other applications that require text parsing. Listing 2 shows the TextParser base class.

Listing 2: The generic TextParser class

using System;

public class TextParser
{
    private string _text;
    private int _pos;

    public string Text { get { return _text; } }
    public int Position { get { return _pos; } }
    public int Remaining { get { return _text.Length - _pos; } }
    public const char NullChar = (char)0;

    public TextParser()
    {
        Reset(null);
    }

    public TextParser(string text)
    {
        Reset(text);
    }

    /// <summary>
    /// Resets the current position to the start of the current document
    /// </summary>
    public void Reset()
    {
        _pos = 0;
    }

    /// <summary>
    /// Sets the current document and resets the current position to the start of it
    /// </summary>
    /// <param name="text">The new document text (null is treated as an empty string)</param>
    public void Reset(string text)
    {
        _text = (text != null) ? text : String.Empty;
        _pos = 0;
    }

    /// <summary>
    /// Indicates if the current position is at the end of the current document
    /// </summary>
    public bool EndOfText
    {
        get { return (_pos >= _text.Length); }
    }

    /// <summary>
    /// Returns the character at the current position, or a null character if we're
    /// at the end of the document
    /// </summary>
    /// <returns>The character at the current position</returns>
    public char Peek()
    {
        return Peek(0);
    }

    /// <summary>
    /// Returns the character at the specified number of characters beyond the current
    /// position, or a null character if the specified position is at the end of the
    /// document
    /// </summary>
    /// <param name="ahead">The number of characters beyond the current position</param>
    /// <returns>The character at the specified position</returns>
    public char Peek(int ahead)
    {
        int pos = (_pos + ahead);
        if (pos < _text.Length)
            return _text[pos];
        return NullChar;
    }

    /// <summary>
    /// Extracts a substring from the specified position to the end of the text
    /// </summary>
    /// <param name="start">Zero-based position of the first character to extract</param>
    /// <returns>The text from start to the end of the document</returns>
    public string Substring(int start)
    {
        return Substring(start, _text.Length);
    }

    /// <summary>
    /// Extracts a substring from the specified range of the current text
    /// </summary>
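    /// <param name="start">Zero-based position of the first character to extract</param>
    /// <param name="end">Zero-based position just past the last character to extract</param>
    /// <returns>The text from start up to, but not including, end</returns>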
    public string Substring(int start, int end)
    {
        return _text.Substring(start, end - start);
    }

    /// <summary>
    /// Moves the current position ahead one character
    /// </summary>
    public void MoveAhead()
    {
        MoveAhead(1);
    }

    /// <summary>
    /// Moves the current position ahead the specified number of characters
    /// </summary>
    /// <param name="ahead">The number of characters to move ahead</param>
    public void MoveAhead(int ahead)
    {
        _pos = Math.Min(_pos + ahead, _text.Length);
    }

    /// <summary>
    /// Moves to the next occurrence of the specified string
    /// </summary>
    /// <param name="s">String to find</param>
    /// <param name="ignoreCase">Indicates if case-insensitive comparisons are used</param>
    public void MoveTo(string s, bool ignoreCase = false)
    {
        _pos = _text.IndexOf(s, _pos,
            ignoreCase ? StringComparison.OrdinalIgnoreCase : StringComparison.Ordinal);
        if (_pos < 0)
            _pos = _text.Length;
    }

    /// <summary>
    /// Moves to the next occurrence of the specified character
    /// </summary>
    /// <param name="c">Character to find</param>
    public void MoveTo(char c)
    {
        _pos = _text.IndexOf(c, _pos);
        if (_pos < 0)
            _pos = _text.Length;
    }

    /// <summary>
    /// Moves to the next occurrence of any one of the specified
    /// characters
    /// </summary>
    /// <param name="carr">Array of characters to find</param>
    public void MoveTo(char[] carr)
    {
        _pos = _text.IndexOfAny(carr, _pos);
        if (_pos < 0)
            _pos = _text.Length;
    }

    /// <summary>
    /// Moves the current position to the first character that is part of a newline
    /// </summary>
    public void MoveToEndOfLine()
    {
        char c = Peek();
        while (c != '\r' && c != '\n' && !EndOfText)
        {
            MoveAhead();
            c = Peek();
        }
    }

    /// <summary>
    /// Moves the current position to the next character that is not whitespace
    /// </summary>
    public void MovePastWhitespace()
    {
        while (Char.IsWhiteSpace(Peek()))
            MoveAhead();
    }
}
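
As an example of using TextParser for something other than HTML, here is a small sketch that parses a string of "key = value" pairs separated by semicolons. It assumes the TextParser class from Listing 2 is available in the same project. The SettingsSketch class and the input format are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SettingsSketch
{
    // Parses text such as "width = 640; height=480" into a dictionary
    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var parser = new TextParser(text);

        while (!parser.EndOfText)
        {
            // The key runs up to the next '=' or ';'
            parser.MovePastWhitespace();
            int start = parser.Position;
            parser.MoveTo(new[] { '=', ';' });
            string key = parser.Substring(start, parser.Position).Trim();

            // The value, if any, runs up to the next ';'
            string value = String.Empty;
            if (parser.Peek() == '=')
            {
                parser.MoveAhead();
                start = parser.Position;
                parser.MoveTo(';');
                value = parser.Substring(start, parser.Position).Trim();
            }

            if (key.Length > 0)
                result[key] = value;

            // Skip the ';' separator (has no effect at the end of the text)
            parser.MoveAhead();
        }
        return result;
    }
}
```

Because every MoveTo() overload stops at the end of the text when nothing is found, and MoveAhead() never moves past the end, the loop needs no separate bounds checks.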

Using the Code

Using these classes is very easy. Listing 3 shows sample code that scans a web page for all the HREF values in A (anchor) tags. It downloads a URL and loads the contents into an instance of the HtmlParser class. It then calls ParseNext() with a request to return information about all A tags.

When ParseNext() returns true, tag is set to an instance of the HtmlTag class with information about the tag that was found. This class includes a collection of attribute values, which my code uses to locate the value of the HREF attribute.

When ParseNext() returns false, the end of the document has been reached.

Listing 3: Code that demonstrates using the HtmlParser class

  protected void ScanLinks(string url)
  {
    // Download page (WebClient is in the System.Net namespace)
    string html;
    using (WebClient client = new WebClient())
    {
      html = client.DownloadString(url);
    }

    // Scan links on this page
    HtmlTag tag;
    HtmlParser parse = new HtmlParser(html);
    while (parse.ParseNext("a", out tag))
    {
      // See if this anchor links to us
      string value;

      if (tag.Attributes.TryGetValue("href", out value))
      {
        // value contains URL referenced by this link
      }
    }
  }
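
Listing 3 leaves open what to do with each HREF value. For a backlink check, one possible approach is sketched below. It resolves the value against the URL of the page it came from, then compares host names. The BacklinkSketch class, the LinksToHost method and its parameter names are all hypothetical. Comparing only the host is an assumption, and a real check might also compare the path.

```csharp
using System;

public static class BacklinkSketch
{
    // Returns true if href, resolved against the page it appeared on,
    // is an http or https link to the given host
    public static bool LinksToHost(string pageUrl, string href, string host)
    {
        Uri page, target;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out page))
            return false;

        // Resolves relative links such as "../about.html" against the page URL
        if (!Uri.TryCreate(page, href, out target))
            return false;

        // Ignore mailto:, javascript: and other non-web links
        if (target.Scheme != Uri.UriSchemeHttp && target.Scheme != Uri.UriSchemeHttps)
            return false;

        return String.Equals(target.Host, host, StringComparison.OrdinalIgnoreCase);
    }
}
```

Inside the loop in Listing 3, this could be called as LinksToHost(url, value, "www.example.com"), where the host name is a placeholder for your own site.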

Conclusion

Parsing HTML tags is fairly simple. As I mentioned, much of my time was spent making the code handle markup errors intelligently.

There were a few other special considerations as well. For example, if the code finds a <script> tag, it automatically scans ahead to the closing </script> tag, if any. Scripts can contain HTML markup characters that would confuse the parser, so I simply jump over them. I take similar action with HTML comments, and have special handling for !DOCTYPE tags as well.
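
The effect of skipping scripts and comments can be seen in a short sketch like the one below. It assumes the classes from Listings 1 and 2 are compiled into the same project. The SkipDemo class and the sample markup are made up for this example.

```csharp
using System;

public static class SkipDemo
{
    public static void Main()
    {
        // Only the last anchor is real markup. The first is inside a
        // comment and the second is inside a script string.
        string html =
            "<!-- <a href=\"in-comment.html\"> -->\n" +
            "<script>document.write('<a href=\"in-script.html\">');</script>\n" +
            "<a href=\"real.html\">Real link</a>";

        HtmlTag tag;
        HtmlParser parser = new HtmlParser(html);
        while (parser.ParseNext("a", out tag))
        {
            string href;
            if (tag.Attributes.TryGetValue("href", out href))
                Console.WriteLine(href); // prints "real.html" only
        }
    }
}
```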

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.

Language:	C#
Technology:	.NET
Platform:	Windows
License:	CPOL