Quick and Easy Method to Remove HTML Tags

Introduction

Recently, I had to make some changes to a website that uses SQL Server Full-Text search. Originally, the columns being searched contained plain text, but I now needed these same columns to contain HTML markup to give users the ability to format the text.

Full-Text search doesn't care how the text being searched is used. I could've simply changed the column to store HTML markup and be done with it. But then I'd also be searching the HTML tags within that text. So I decided to create two columns: one is HtmlDescription, which contains the HTML markup, and the other is Description, which contains a plain text version of the first field. So then the question became how I'd create the two versions of the same text.

I was using a text editor control that supports HTML. So I already had the HTML version. What was left was to create a plain text version from the HTML markup.

Stripping HTML Tags from Text

I've already published the article Convert HTML to Text. That article describes a way to get a reasonably formatted plain text document from HTML markup.

However, in this case, I didn't care how the text was formatted. After all, no one would really see it. It was simply a column to contain text that SQL Server Full-Text search could index and search. So I created a minimal class that does nothing but strip out the HTML tags.

The RemoveHtmlTags() Method

Listing 1 shows my TextHelper class. It contains the RemoveHtmlTags() method, which will strip HTML tags from HTML markup as discussed above.

Listing 1: TextHelper Class

using System.Net;   // WebUtility
using System.Text;  // StringBuilder

public static class TextHelper
{
    /// <summary>
    /// Returns a copy of this string with any HTML markup tags removed. Also converts any
    /// encoded characters to their unencoded version.
    /// </summary>
    /// <remarks>
    /// The resulting string may contain extra spacing and may not be suitably formatted
    /// for display. It is designed primarily for removing markup so, for example, a
    /// string can be correctly full-text indexed.
    /// </remarks>
    public static string RemoveHtmlTags(this string html)
    {
        int pos = 0;
        StringBuilder builder = new StringBuilder();
        while (pos < html.Length)
        {
            if (html[pos] == '<')
            {
                pos = SkipTag(html, pos);
                builder.Append(' ');
            }
            else builder.Append(html[pos++]);
        }
        return WebUtility.HtmlDecode(builder.ToString());
    }

    // Skips over an HTML tag
    private static int SkipTag(string html, int pos)
    {
        pos++;
        while (pos < html.Length)
        {
            if (html[pos] == '"' || html[pos] == '\'')
                pos = SkipString(html, pos);
            else if (html[pos++] == '>')
                break;
        }
        return pos;
    }

    // Skips over a quoted string
    private static int SkipString(string html, int pos)
    {
        char quote = html[pos++];
        while (pos < html.Length)
        {
            if (html[pos++] == quote)
                break;
        }
        return pos;
    }
}

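In this implementation, the methods are static so they can easily be called without creating an instance of the class. Of the three methods, only the RemoveHtmlTags() method is public, so we can simply call this method to strip HTML tags from a string. Because its parameter is declared with the this modifier, it is also an extension method and can be called directly on a string, as in html.RemoveHtmlTags().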

Listing 2: Using the RemoveHtmlTags() Method

string plainText = TextHelper.RemoveHtmlTags(html);

As you can see, the code is quite simple. When a tag is found, it is skipped over using the SkipTag() method. Similarly, SkipTag() skips over quoted values by calling SkipString(). This prevents a closing tag character (>) embedded within a quoted attribute value from being interpreted as the end of the tag. Both single and double quotes are supported. My code does not handle nested tags, that is, a tag that starts inside another tag outside of any quoted value, but that's invalid markup anyway.

Note that, when a tag is detected and skipped, the code inserts a space at the current location. This is to prevent words on both sides of a tag from being joined into one word. That would affect which words were stored in the full-text index. The result is that the output may contain extra spaces. But, as mentioned above, formatting of the plain text was not a concern here.
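
If the extra spaces do matter for some other use, such as displaying a short preview, they can be collapsed after the tags are removed. The following is only a minimal sketch: CollapseSpaces() is a hypothetical helper that is not part of the TextHelper class, and the regular expression is used purely to match whitespace, not to parse markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpacingDemo
{
    // Hypothetical helper (not part of TextHelper): replaces each run of
    // whitespace with a single space and trims both ends.
    public static string CollapseSpaces(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Simulates the kind of output produced when tags are replaced by spaces.
        string stripped = " Hello   world  ";
        Console.WriteLine("[" + CollapseSpaces(stripped) + "]"); // prints "[Hello world]"
    }
}
```

For full-text indexing this step is optional, since the extra spaces do not change which words are found.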

I should also point out that my code leaves text between tags intact. For example, if I have the markup <head><title>This is my title</title></head>, the text within the title tag would not be stripped out. In my case, I was only working with HTML snippets and not entire documents, so this was not an issue. If this is a problem for you, you might review my other article mentioned above or try one of the other suggestions described below.

Conclusion

I've seen other code do similar things with regular expressions. However, I generally find regular expressions to be unsatisfactory for many parsing tasks. The versions I've seen were all confused when a closing tag character was embedded within quoted attribute values. It may be possible to create a reliable approach using regular expressions, but I know the code above works and it seems very simple.
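
To illustrate the problem with quoted attribute values, here is a small sketch of the kind of naive pattern often used for this task. The pattern <[^>]*>, the sample markup, and the names NaiveRegexDemo and StripWithNaiveRegex() are all assumptions chosen for illustration; they are not taken from any particular library or article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveRegexDemo
{
    // Naive approach (illustrative assumption): treat everything from '<'
    // up to the next '>' as a tag, ignoring quoted attribute values.
    public static string StripWithNaiveRegex(string html)
    {
        return Regex.Replace(html, "<[^>]*>", " ");
    }

    public static void Main()
    {
        // The '>' inside the title attribute ends the match too early.
        string html = "<a title=\"x > y\">link</a>";
        string result = StripWithNaiveRegex(html);

        // Part of the attribute value leaks into the output.
        Console.WriteLine(result.Contains("y\">")); // prints "True"
    }
}
```

The RemoveHtmlTags() method in Listing 1 avoids this because SkipTag() passes quoted values to SkipString() before it looks for the closing > character.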

I should also point out that SQL Server supports the concept of Full-Text search filters. This allows you to create, for example, a column of type varbinary(max) and then tell SQL Server that the column represents an HTML document. For me, this seemed like overkill. But depending on your needs, that may be another option worth exploring.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.

Language:	C#
Technology:	.NET HTML
Platform:	Windows
License:	CPOL