Colorizing Source Code

By Jonathan Wood on 12/20/2010
Language: C#
Technology: WinForms
Platform: Windows
License: CPOL

Screenshot of Demo Program

Screenshot of Colorized Code in a Browser

Download Project Source Code

Introduction

Most modern source code editors provide some sort of color coding where different language components are displayed using different colors. For example, language keywords may appear in one color while string literals appear in another.

This can make the code easier to read. So it follows that it also makes sense to colorize your source code when you display it on a web page.

It's very tedious, to say the least, to go through and manually edit the HTML for each word in your source code so that it appears in the appropriate color. More than likely, you'll want some software that can take your source code as input and produce colorized HTML as output.

My Approach to Coloring Source Code

When I started developing the Black Belt Coder website, I knew I'd need to support coloring source code. In my case, my priority was to easily support many different programming languages. And I wasn't as concerned about getting each and every nuance of each language exactly correct.

So my approach was to create a tokenizer that could tokenize the input in a very generic way according to a set of language rules. Moreover, I decided to store the language rules in a plain-text file where I could easily add and modify the languages that my code would support. The tokenizer would read the rules for the current language, and then process the source code using those rules.

What I came up with wasn't terribly complex, but there are several classes involved. It may be a little ambitious to present all the code within a single article.

I will take a top-down approach by showing how you would call some of the classes. I'll also discuss some of the higher-level classes, but I'll gloss over the lower-level routines. All my source code is included in the downloadable project associated with this article, so you can examine the lower-level routines in as much depth as you like.

Looking at the Code

I'll start by describing my test application. This is a WinForms application with a large, multi-line TextBox control. To use the program, paste your source code into this TextBox and click the Colorize button.

The program will read the LanguageRules.txt file. I've placed this file in the bin\Debug directory so that it will be in the same directory as the executable file when you run the program in debug mode. If, for whatever reason, this file is not found in the same directory as the executable, the program displays an appropriate error message. If you change this configuration, you'll need to modify the code in Form1.cs so that it is able to locate this file.
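One way to make that lookup explicit is to build the path from the application's base directory. The following is a minimal sketch, not the code from Form1.cs; the RulesFileLocator class and its GetPathInAppDirectory() method are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

public static class RulesFileLocator
{
    // Hypothetical helper: builds the full path of a file that is expected
    // to sit in the same directory as the running executable.
    public static string GetPathInAppDirectory(string fileName)
    {
        // BaseDirectory is the directory the application was loaded from
        // (bin\Debug when running a debug build from Visual Studio).
        return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, fileName);
    }
}
```

With a helper like this, the value checked by File.Exists() could come from RulesFileLocator.GetPathInAppDirectory("LanguageRules.txt"), which would keep working regardless of the current working directory.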

The LanguageRules.txt file I'm including only supports one language, C#, which is titled "cs" in the file. You can easily modify support for C# or add additional languages by editing this file. The comments at the top of this file describe the format of the data it contains.

When you run the program, it will write an HTML file to disk and then open it with your browser. This is done in the btnColorize_Click() event handler in Form1.cs (see Listing 1). Edit this file to change what is done with the colorized code.

Listing 1: The btnColorize_Click() Event Handler in Form1.cs

private void btnColorize_Click(object sender, EventArgs e)
{
    // Display message if language rules file is not
    // found in application directory
    if (!File.Exists(_languageRulesFile))
    {
        MessageBox.Show(
            String.Format("The language rules file, \"{0}\" must exist in the application directory.",
            _languageRulesFile));
        return;
    }

    try
    {
        // Create Colorizer instance
        Colorizer colorizer = new Colorizer(_languageRulesFile);

        // Set token class names
        colorizer.CssClassKeyword = "code_key";
        colorizer.CssClassSymbol = "code_sym";
        colorizer.CssClassString = "code_str";
        colorizer.CssClassOperator = "code_op";
        colorizer.CssClassComment = "code_com";

        // For better formatting control, convert tabs to spaces
        string html = txtSourceCode.Text;
        html = html.Replace("\t", "    ");

        // Write results to HTML file
        using (StreamWriter writer = File.CreateText(_htmlOutputFile))
        {
            writer.Write("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"");
            writer.WriteLine(" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">");
            writer.WriteLine();
            writer.WriteLine("<html>");
            writer.WriteLine("<head>");
            writer.WriteLine("<title>Code Colorizer Demo</title>");
            writer.WriteLine("<style type=\"text/css\">");
            writer.Write("body { font-size:small;font-family:sans-serif;width: 800px;");
            writer.WriteLine(" margin-left:auto;margin-right:auto;}");
            writer.WriteLine(".code { border:2px solid black;padding: 12px 12px 12px 12px; }");
            writer.WriteLine(".code_key { color:Blue; }");        // Keyword
            writer.WriteLine(".code_sym { color:Teal;font-weight:bold; }");    // Symbol
            writer.WriteLine(".code_str { color:DarkRed; }");    // String constant
            writer.WriteLine(".code_op { color:Black;background-color:Yellow; }");    // Operator
            writer.WriteLine(".code_com { color:Green; }");    // Comment
            writer.WriteLine("</style>");
            writer.WriteLine("</head>");
            writer.WriteLine();
            writer.WriteLine("<body>");
            writer.WriteLine("<p>Formatted code:</p>");

            // Write formatted code
            writer.WriteLine("<pre class=\"code\">");
            writer.Write(colorizer.ColorizeCode(html, cboLanguage.Text));
            writer.WriteLine("</pre>");

            writer.WriteLine("</body>");
            writer.WriteLine("</html>");
            // No explicit Close() needed: the using block disposes the writer
        }

        // Open up file in browser
        System.Diagnostics.Process.Start(_htmlOutputFile);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Stop);
    }
}

After verifying that _languageRulesFile exists, the code creates an instance of the Colorizer class and sets the CssClass properties. It converts tabs in the source text to spaces, then starts writing an HTML file, including the HTML headers and the CSS rules that control how the various CSS classes appear. Next, it calls Colorizer.ColorizeCode() to colorize the source code and writes the output to the HTML file. Once the file is complete, it attempts to open it in the default browser.
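Listing 1 targets the .NET Framework, where passing a file path to Process.Start() hands the file to the Windows shell. On .NET Core and later, UseShellExecute defaults to false, so the same call would try to execute the HTML file directly and fail. The sketch below shows one way to request shell behavior explicitly. BrowserLauncher and its members are hypothetical names and are not part of the downloadable project.

```csharp
using System.Diagnostics;

public static class BrowserLauncher
{
    // Hypothetical helper: describes how to open a document with whatever
    // application the shell associates with it (the default browser for .html).
    public static ProcessStartInfo CreateStartInfo(string htmlPath)
    {
        return new ProcessStartInfo(htmlPath)
        {
            // Required on .NET Core and later; already the default on .NET Framework
            UseShellExecute = true
        };
    }

    // Opens the file using the start info built above
    public static void Open(string htmlPath)
    {
        Process.Start(CreateStartInfo(htmlPath));
    }
}
```

Under these assumptions, the last statement of the try block could become BrowserLauncher.Open(_htmlOutputFile).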

The Colorizer Class

The Colorizer class is shown in Listing 2. Near the top of the class, we can see the CssClass properties, which must be set by the caller. These are the CSS class names applied to each type of source code token. Using CSS classes is the recommended way to control HTML formatting. Normally, your CSS would be in a separate file, so you could restyle all your HTML files by modifying that one file. And because CSS is used, you can change more than just the color. You could modify the font, font weight, text decoration and so forth.

The Colorizer constructor attempts to load the language rules file that was passed as an argument. The LanguageRulesCollection class (code not shown) holds data for all the languages in the language rules file. They are all loaded by calling the LoadFromFile() method.

The ColorizeCode method does the actual colorizing. It starts by trying to load the LanguageRules (code not shown) for the language specified in the arguments. It then creates an instance of the LanguageTokenizer class (code not shown), passing the language rules to the constructor. Using those rules, it then starts to process the input.

Here's where the method enters the main loop. Each time through the loop, it parses the next token from the source code by calling LanguageTokenizer.ParseNext(). The loop ends when this method returns a token whose Class is TokenClass.Null.

If the token is one that needs to be colorized, it wraps the token in <span> tags with the appropriate CSS class. Otherwise, no <span> tags are added. Either way, it HTML-encodes the token text before appending it to the results.

Listing 2: The Colorizer Class

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using SoftCircuits;

namespace CodeColorizer
{
    public class Colorizer
    {
        // Language rules collection
        LanguageRulesCollection _rulesCollection;

        // Token class names
        public string CssClassKeyword { get; set; }
        public string CssClassSymbol { get; set; }
        public string CssClassString { get; set; }
        public string CssClassOperator { get; set; }
        public string CssClassComment { get; set; }

        public Colorizer(string languageRulesFile)
        {
            // Load language rules
            _rulesCollection = new LanguageRulesCollection();
            _rulesCollection.LoadFromFile(languageRulesFile);
        }

        /// <summary>
        /// Color codes a block of source code using the specified language
        /// </summary>
        /// <param name="code">Source code to format</param>
        /// <param name="language">Language to use for formatting</param>
        /// <returns></returns>
        public string ColorizeCode(string code, string language)
        {
            StringBuilder builder = new StringBuilder();

            // Load rules for the specified language
            LanguageRules rules = _rulesCollection.LookupLanguage(language);
            if (rules == null)
                throw new Exception(String.Format("Language \"{0}\" undefined in rules file", language));

            // Now prepare to tokenize source code according to specified language rules
            LanguageTokenizer tokenizer = new LanguageTokenizer(rules, code);
            Token token = tokenizer.ParseNext();
            while (token.Class != TokenClass.Null)
            {
                string style = String.Empty;

                switch (token.Class)
                {
                    case TokenClass.Keyword:
                        if (!String.IsNullOrEmpty(CssClassKeyword))
                            style = CssClassKeyword;
                        break;
                    case TokenClass.Symbol:
                        if (!String.IsNullOrEmpty(CssClassSymbol))
                            style = CssClassSymbol;
                        break;
                    case TokenClass.String:
                        if (!String.IsNullOrEmpty(CssClassString))
                            style = CssClassString;
                        break;
                    case TokenClass.Operator:
                        if (!String.IsNullOrEmpty(CssClassOperator))
                            style = CssClassOperator;
                        break;
                    case TokenClass.Comment:
                        if (!String.IsNullOrEmpty(CssClassComment))
                            style = CssClassComment;
                        break;
                }

                if (style.Length > 0)
                    builder.AppendFormat("<span class=\"{0}\">", style);
                builder.Append(HtmlEncode(token.Value));
                if (style.Length > 0)
                    builder.Append("</span>");

                token = tokenizer.ParseNext();
            }
            return builder.ToString();
        }

        /// <summary>
        /// HTML-encodes the given string so that it would appear as expected
        /// on an HTML page.
        /// 
        /// Calls to this method may be replaced with calls to HttpUtility.HtmlEncode().
        /// Since HttpUtility is not included in a WinForms application by default, I
        /// just thought it might be easier for some users if we wrote our own.
        /// </summary>
        /// <param name="s">String to encode</param>
        /// <returns></returns>
        public static string HtmlEncode(string s)
        {
            StringBuilder builder = new StringBuilder(s.Length);

            foreach (char c in s)
            {
                if (c <= '>')
                {
                    if (c == '&')
                        builder.Append("&amp;");
                    else if (c == '\'')
                        builder.Append("&#39;");
                    else if (c == '"')
                        builder.Append("&quot;");
                    else if (c == '<')
                        builder.Append("&lt;");
                    else if (c == '>')
                        builder.Append("&gt;");
                    else
                        builder.Append(c);
                }
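                // Encode characters 160-255 as numeric entities. The upper bound
                // must be '\x0100'; a bound of 'A' would make this branch unreachable.
                else if ((c >= '\x00a0') && (c < '\x0100'))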
                {
                    builder.AppendFormat("&#{0};", (int)c);
                }
                else
                {
                    builder.Append(c);
                }
            }
            return builder.ToString();
        }
    }
}

That's pretty much all the Colorizer class does. To delve deeper into the support classes, you'll need to download the source code. Among other classes, the code relies on my TextParser class (code not shown). I use this class as a base class for most of my code that performs parsing of one kind or another.
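Because LanguageTokenizer is not shown here, the following much-simplified sketch may help illustrate the general idea of a rule-driven tokenize-and-wrap loop. It is not the code from the downloadable project. SketchRules and SketchColorizer are hypothetical names, the rules are reduced to a keyword list and one line-comment delimiter, string literals and operators are ignored, and System.Net.WebUtility.HtmlEncode() (available since .NET Framework 4) stands in for the HtmlEncode() method in Listing 2.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical, much-simplified stand-in for LanguageRules: a keyword list
// and a single line-comment delimiter.
public class SketchRules
{
    public HashSet<string> Keywords = new HashSet<string>();
    public string LineComment = "//";
}

public static class SketchColorizer
{
    public static string Colorize(string code, SketchRules rules)
    {
        StringBuilder builder = new StringBuilder();
        int pos = 0;

        while (pos < code.Length)
        {
            int start = pos;

            if (StartsWithAt(code, pos, rules.LineComment))
            {
                // Comment: consume everything up to the end of the line
                while (pos < code.Length && code[pos] != '\r' && code[pos] != '\n')
                    pos++;
                AppendSpan(builder, "code_com", code.Substring(start, pos - start));
            }
            else if (Char.IsLetter(code[pos]) || code[pos] == '_')
            {
                // Word: consume identifier characters, then check the keyword list
                while (pos < code.Length && (Char.IsLetterOrDigit(code[pos]) || code[pos] == '_'))
                    pos++;
                string word = code.Substring(start, pos - start);
                if (rules.Keywords.Contains(word))
                    AppendSpan(builder, "code_key", word);
                else
                    builder.Append(WebUtility.HtmlEncode(word));
            }
            else
            {
                // Anything else is copied through one character at a time
                builder.Append(WebUtility.HtmlEncode(code[pos].ToString()));
                pos++;
            }
        }
        return builder.ToString();
    }

    // True if delimiter is non-empty and occurs in text at position pos
    static bool StartsWithAt(string text, int pos, string delimiter)
    {
        return !String.IsNullOrEmpty(delimiter) &&
            String.CompareOrdinal(text, pos, delimiter, 0, delimiter.Length) == 0;
    }

    // Wraps HTML-encoded text in a span with the given CSS class
    static void AppendSpan(StringBuilder builder, string cssClass, string text)
    {
        builder.AppendFormat("<span class=\"{0}\">{1}</span>", cssClass, WebUtility.HtmlEncode(text));
    }
}
```

With "int" added to Keywords, passing the text int x; // a<b to Colorize() would wrap int in a code_key span, wrap the comment in a code_com span with the < encoded as &lt;, and copy everything else through unchanged.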

Symbol Names

Note that, in addition to language keywords, the language rules file also supports symbols. Symbols are the names of known classes, constants, or variables. However, the list of symbols in a modern framework such as .NET could grow endlessly. In fact, each type declared in the source code should probably also be a symbol.

For this reason, you may decide to leave the symbol word list in the language rules file empty. I'll include a single word in the Symbols section so you can see how it looks in the output. But unless you want to manage hundreds or thousands of symbol names, you'll probably be satisfied to have symbol names appear using the default color.

Limitations

As I mentioned before, the approach I've taken can easily handle any number of different languages. However, it may not get all the nuances of every language exactly correct. As a result, there are some limitations.

For example, there would be no way to support classic VB's REM statement for comments, because the language rules require that comment delimiters not use symbol characters and REM consists entirely of them. Also, as written, there is no way to colorize numeric constants. I just haven't come up with a generic way to detect numbers in all possible languages due to different number bases, prefixes, suffixes, etc.
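To illustrate why numbers resist a generic rule, here is a minimal sketch of detection for C#-style literals only. The pattern is an assumption chosen for illustration and covers just hexadecimal and decimal forms with common suffixes. Another language would need different prefixes and suffixes, which is exactly the difficulty described above. NumberSketch and MatchNumber() are hypothetical names and are not part of the downloadable project.

```csharp
using System.Text.RegularExpressions;

public static class NumberSketch
{
    // Assumed pattern for C#-style literals only: hex (0x1F), or decimal with
    // an optional fraction, exponent and type suffix (42, 3.14f, 1e-9, 100UL).
    // \G anchors the match at the position where the search starts.
    static readonly Regex NumberPattern = new Regex(
        @"\G(0[xX][0-9a-fA-F]+[uUlL]*|[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?[fFdDmMuUlL]*)",
        RegexOptions.Compiled);

    // Returns the length of the numeric literal starting at pos, or 0 if none
    public static int MatchNumber(string code, int pos)
    {
        Match m = NumberPattern.Match(code, pos);
        return m.Success ? m.Length : 0;
    }
}
```

A tokenizer could call something like MatchNumber() before its other checks and emit a number token when the result is greater than zero.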

In addition, it would be hard to handle HTML and XML colorizing where tag names were a different color from the attribute names without making substantial changes to the code.

Conclusion

That's about it. Again, there are additional classes in the source code that I haven't published here. They provide lower-level support for the code I've presented.

Colorizing source code is potentially a very useful task. I hope you can benefit from this code.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software and website developer working out of the greater Salt Lake City area of Utah. I've developed many websites including Black Belt Coder, Trail Calendar, and others.

I hike each week with my dogs Suki and Sasha. You can see my hiking blog at Hiking Salt Lake.