Colorizing Source Code

Screenshot of Demo Program

Screenshot of Colorized Code in a Browser

Introduction

Most modern source code editors provide some sort of color coding (syntax highlighting) where different language components are displayed using different colors. For example, language keywords may appear in one color while string literals appear in another.

This can make the code easier to read. So it follows that it also makes sense to colorize your source code when you display it on a web page.

It's very tedious, to say the least, to go through and manually edit the HTML for each word in your source code so that it appears in the appropriate color. More than likely, you'll want some software that can take your source code as input, and produce colorized HTML as output.

My Approach to Coloring Source Code

When I started developing the Black Belt Coder website, I knew I'd need to support coloring source code. In my case, my priority was to easily support many different programming languages. And I wasn't as concerned about getting each and every nuance of each language exactly correct.

So my approach was to create a data-driven tokenizer that could tokenize the input in a very generic way according to a set of language rules. The tokenizer would read the rules for the current language, and then process the source code using those rules. As a result, the colorizer will handle the most common cases for any language you can write the rules for, making the code ideal for situations that need to support a number of different languages.
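To make the data-driven idea concrete, here is a minimal sketch. It is not the code from the downloadable project: MiniRules and MiniTokenizer are hypothetical names invented for this illustration, and the sketch only handles words and keywords. The point it tries to show is that the same tokenizing code can serve different languages when everything language-specific lives in a rules object.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for one language's rules.
// (Assumption: this is NOT the article's LanguageRules class.)
class MiniRules
{
    public string SymbolChars;        // characters allowed inside a word
    public HashSet<string> Keywords;  // keyword list, compared per the case rule

    public MiniRules(bool caseSensitive, string symbolChars, IEnumerable<string> keywords)
    {
        SymbolChars = symbolChars;
        Keywords = new HashSet<string>(keywords,
            caseSensitive ? StringComparer.Ordinal : StringComparer.OrdinalIgnoreCase);
    }
}

static class MiniTokenizer
{
    // Splits text into words and single characters. Words found in the
    // keyword list are tagged with a "key:" prefix; nothing else is tagged.
    public static List<string> Tokenize(string text, MiniRules rules)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int start = pos;
            if (rules.SymbolChars.IndexOf(text[pos]) >= 0)
            {
                // Consume a run of word characters
                while (pos < text.Length && rules.SymbolChars.IndexOf(text[pos]) >= 0)
                    pos++;
                string word = text.Substring(start, pos - start);
                tokens.Add(rules.Keywords.Contains(word) ? "key:" + word : word);
            }
            else
            {
                // Any other character becomes its own token
                pos++;
                tokens.Add(text.Substring(start, 1));
            }
        }
        return tokens;
    }
}

static class Program
{
    static void Main()
    {
        const string letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_";
        var csharpLike = new MiniRules(true, letters, new[] { "if", "return" });
        var sqlLike = new MiniRules(false, letters, new[] { "select", "from" });

        // Same tokenizer, two different rule sets
        Console.WriteLine(string.Join("|", MiniTokenizer.Tokenize("if x", csharpLike)));
        // prints: key:if| |x
        Console.WriteLine(string.Join("|", MiniTokenizer.Tokenize("SELECT a FROM t", sqlLike)));
        // prints: key:SELECT| |a| |key:FROM| |t
    }
}
```

Adding support for another language in this sketch means constructing another MiniRules object; the Tokenize method itself does not change.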

I will take a top-down approach by showing how you would call some of the classes. I'll also discuss some of the higher-level classes. But I'll sort of gloss over some of the lower level routines. All my source code is included in the downloadable project associated with this article, so you can examine the lower-level routines at any depth you like.

Looking at the Code

I'll start by describing my test application. This is a WinForms application with a large, multi-line TextBox control. To use the program, paste your source code into this TextBox, select the language from the drop-down list (you may need to edit the language file if your language isn't listed) and then click the Colorize button.

The program will read the LanguageRules.xml file. I've placed this file in the bin\Debug directory so that it will be in the same directory as the executable file when you run the program in debug mode. If, for whatever reason, this file is not found in the same directory as the executable, the program displays an appropriate error message. If you change this configuration, you'll need to modify the code in Form1.cs so that it is able to locate this file.

When you run the program, it will write an HTML file to disk and then open it with your browser. This is done in the btnColorize_Click() event handler in Form1.cs (see Listing 1). Edit this file to change what is done with the colorized code.

Listing 1: The btnColorize_Click() Event Handler in Form1.cs

private void btnColorize_Click(object sender, EventArgs e)
{
    try
    {
        // Colorize code
        Colorizer colorizer = new Colorizer(LanguagesFile);
        colorizer.CssClassKeyword = "code_key";
        colorizer.CssClassSymbol = "code_sym";
        colorizer.CssClassString = "code_str";
        colorizer.CssClassOperator = "code_op";
        colorizer.CssClassComment = "code_com";

        // Write results to HTML file
        using (StreamWriter writer = File.CreateText(HtmlOutputFile))
        {
            writer.WriteLine("<!DOCTYPE html>");
            writer.WriteLine("<html>");
            writer.WriteLine("<head>");
            writer.WriteLine("<title>Code Colorizer Demo</title>");
            writer.WriteLine("<style type=\"text/css\">");
            writer.WriteLine("body{font-size:small;font-family:sans-serif;");
            writer.WriteLine("width:800px;margin-left:auto;margin-right:auto;}");
            writer.WriteLine(".code{border:1px dashed silver;background-color:#f2f8fd;");
            writer.WriteLine("padding:12px 12px 12px 12px;}");
            writer.WriteLine(".code_key{color:Blue;}");                          // Keyword
            writer.WriteLine(".code_sym{color:Teal;font-weight:bold;}");         // Symbol
            writer.WriteLine(".code_str{color:DarkRed;}");                       // String constant
            writer.WriteLine(".code_op{color:Black;background-color:Yellow;}");  // Operator
            writer.WriteLine(".code_com{color:Green;}");                         // Comment
            writer.WriteLine("</style>");
            writer.WriteLine("</head>");
            writer.WriteLine("<body>");
            writer.WriteLine("<p>Formatted code:</p>");

            // Write formatted code
            writer.WriteLine("<pre class=\"code\">");
            string code = txtSourceCode.Text.Replace("\t", "    ");
            writer.Write(colorizer.ColorizeCode(code, cboLanguage.Text));
            writer.WriteLine("</pre>");

            writer.WriteLine("</body>");
            writer.WriteLine("</html>");
            writer.Close();
        }

        // Open up file in browser
        System.Diagnostics.Process.Start(HtmlOutputFile);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

The code starts by creating an instance of the Colorizer class and setting the CSS class properties to the classes defined in the markup for each type of token. This is how the colorizer knows what markup to insert around those tokens. If any of these properties are not set, the corresponding tokens will appear without additional markup.
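The effect of setting, or not setting, one of these properties can be illustrated in isolation. The following is a small stand-alone sketch; SpanDemo and Wrap are hypothetical names used only here. It applies the same kind of check that ColorizeCode() performs in Listing 2: a token gets a <span> only when a CSS class name is available for it, and its text is HTML-encoded either way.

```csharp
using System;
using System.Net;

static class SpanDemo
{
    // Wraps HTML-encoded token text in a <span> only when a CSS class was supplied.
    public static string Wrap(string cssClass, string tokenText)
    {
        string encoded = WebUtility.HtmlEncode(tokenText);
        if (String.IsNullOrWhiteSpace(cssClass))
            return encoded;
        return String.Format("<span class=\"{0}\">{1}</span>", cssClass, encoded);
    }

    static void Main()
    {
        // Class name set: the token is wrapped
        Console.WriteLine(Wrap("code_key", "if"));
        // prints: <span class="code_key">if</span>

        // Class name not set: the token is encoded but left unwrapped
        Console.WriteLine(Wrap(null, "a < b"));
        // prints: a &lt; b
    }
}
```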

Next, the code creates an HTML file and writes markup to that file, including CSS styling that defines how each of the CSS classes that were assigned to the colorizer will appear. It then writes a <pre> tag to hold the colorized source code. It is not necessary to use a <pre> tag to display colorized code, but code is generally much easier to read when a monospace font is used and all whitespace characters are retained. The code also preprocesses the source code by changing each tab to four spaces. This last step makes the spacing more predictable.

Next, the code calls the ColorizeCode() method on the instance of the Colorizer class created earlier. It passes the source code and the language selected in the drop-down list. The ColorizeCode() method will parse the source code according to the rules for the specified language, and insert the markup needed to assign CSS classes to tokens within the source code.

Once the colorized code has been written to the file, the code above finishes off the markup, closes the HTML file, and attempts to load the new file in the user's browser.

The Colorizer Class

My Colorizer class is shown in Listing 2. Near the top of the class, we can see the CssClass properties, which should be set by the caller. These are the CSS class names applied to each type of source code token. Normally, your CSS would be in a separate file. This way, you could easily modify the different styles for all your HTML files by simply modifying the CSS file. And because CSS is used, you can change more than just the color. You could modify the font, font weight, text decoration and so forth.

The Colorizer constructor attempts to load the language rules file that was passed as an argument. The LanguageRulesCollection class (code not shown) holds data for all the languages in the language rules file. That class will load the specified file if it can.

The ColorizeCode method does the actual colorizing. It starts by trying to load the LanguageRules (code not shown) for the language specified in the arguments. It then creates an instance of the LanguageTokenizer class (code not shown), passing the language rules to the constructor. Using those rules, it then starts to process the input.

Here's where the method enters the main loop. Each time through the loop, it parses the next token from the source code by calling the ParseNext() method of the LanguageTokenizer instance. The loop ends when this method returns a token with token.Class set to TokenClass.Null.

If the token is one that needs to be colorized, it wraps the token in <span> tags with the appropriate CSS class. Otherwise, no <span> tags are added. Either way, it HTML encodes the original input before appending it to the results.

Listing 2: The Colorizer Class

/// <summary>
/// Class to colorize source code by inserting HTML markup around language tokens.
/// </summary>
public class Colorizer
{
    // Language rules collection
    public LanguageRulesCollection Languages { get; set; }

    // Token class names
    public string CssClassKeyword { get; set; }
    public string CssClassSymbol { get; set; }
    public string CssClassString { get; set; }
    public string CssClassOperator { get; set; }
    public string CssClassComment { get; set; }

    /// <summary>
    /// Colorizer construction
    /// </summary>
    /// <param name="languageRulesFile">Name of XML file that contains all language rules</param>
    public Colorizer(string languageRulesFile)
    {
        // Load language rules
        Languages = new LanguageRulesCollection(languageRulesFile);
    }

    /// <summary>
    /// Color codes a block of source code using the specified language
    /// </summary>
    /// <param name="code">Source code to format</param>
    /// <param name="language">Language to use for formatting</param>
    /// <returns>HTML-encoded source code with span elements around the colorized tokens</returns>
    public string ColorizeCode(string code, string language)
    {
        // Load rules for the specified language
        LanguageRules rules = Languages.GetLanguageRules(language);
        if (rules == null)
            throw new Exception(String.Format("Undefined language \"{0}\" was specified", language));

        // CSS class lookup table
        Dictionary<TokenClass, string> cssClasses = new Dictionary<TokenClass, string>()
        {
            { TokenClass.Keyword, CssClassKeyword },
            { TokenClass.Symbol, CssClassSymbol },
            { TokenClass.String, CssClassString },
            { TokenClass.Operator, CssClassOperator },
            { TokenClass.Comment, CssClassComment },
        };

        // Tokenize source code according to specified language rules
        StringBuilder builder = new StringBuilder();
        LanguageTokenizer tokenizer = new LanguageTokenizer(rules, code);
        for (Token token = tokenizer.ParseNext(); token.Class != TokenClass.Null;
            token = tokenizer.ParseNext())
        {
            token.Value = WebUtility.HtmlEncode(token.Value);
            string style;
            if (cssClasses.TryGetValue(token.Class, out style) && !String.IsNullOrWhiteSpace(style))
                builder.AppendFormat("<span class=\"{0}\">{1}</span>", style, token.Value);
            else
                builder.Append(token.Value);
        }

        return builder.ToString();
    }
}

That's pretty much all the Colorizer class does. To delve deeper into the support classes, you'll need to download the source code. Among other classes, the code relies on my TextParser class (code not shown). I use this class as a base class for most of my code that performs parsing of one kind or another.

Symbol Names

Note that, in addition to language keywords, the language rules file also supports symbols. Symbols are the names of known classes, constants, or variables. However, the list of symbols in a modern framework such as .NET could grow endlessly. In fact, each type declared in the source code should probably also be a symbol.

For this reason, you may decide to leave the symbol word list in the language rules file empty. I'll include just one word in the symbols section so you can see how it looks in the output. But unless you want to manage hundreds or thousands of symbol names, you'll probably be satisfied to have symbol names appear using the default color.

Language Rules File

As described previously, the colorizer is data-driven. So the key to making it work is in how you define the rules for the languages you need to support.

The language rules file is an XML file that can define any number of languages. The sample project includes a sample language rules file that defines several languages. You can use this file as is, modify it, or create a new language rules file from scratch. (Note that you could also create one from code by writing code to add all the rules and then using LanguageRulesCollection.Save() to save those rules to a file.)

For each language, the following rules can be defined. (Please see the sample language rules file for exact formatting requirements.)

name	The name of this language.
caseSensitive	Determines if this language is case-sensitive (boolean).
symbolChars	Characters that make up language keywords and symbol names.
symbolFirstChars	Characters that can appear as the first character in language keywords and symbol names.
operatorChars	Characters that can appear within language operators. Must include all characters used to signify comments.
quotes	Single character used to denote string literals. Also supports an optional escape character. (If a string contains the escape character followed by the quote, that quote is assumed to be part of the string and not the terminator.) If the language supports more than one quote type (such as " and '), you can include multiple quotes rules.
blockComments	Defines strings to delimit block comments. If the language supports multiple block comment delimiters, you can include multiple blockComment rules.
lineComments	String that starts a line comment (characters to the end of the line are assumed to be a comment). If the language supports multiple line comment operators, you can include multiple lineComment rules.
keywords	Lists all the keywords supported by this language.
symbols	Lists all the symbol names supported by this language. For example, the names of custom types (those not defined by the language itself) could be included in this list. Because this list could be incredibly long and require frequent updates, it is often not used.
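The quotes rule is the one whose behavior is easiest to misread, so here is a small sketch of how a quote character and an optional escape character could be applied when scanning a string literal. This is an illustration under stated assumptions, not the tokenizer from the download: QuoteScanner and ScanString are hypothetical names, and the logic follows the description in the table literally (an escape character only matters when it is immediately followed by the quote).

```csharp
using System;

static class QuoteScanner
{
    // Returns the index just past the closing quote of the literal that
    // starts at 'start'. Returns text.Length if the literal is never closed.
    public static int ScanString(string text, int start, char quote, char? escape)
    {
        int pos = start + 1; // skip the opening quote
        while (pos < text.Length)
        {
            char c = text[pos];
            if (escape.HasValue && c == escape.Value &&
                pos + 1 < text.Length && text[pos + 1] == quote)
                pos += 2;          // escaped quote: part of the string
            else if (c == quote)
                return pos + 1;    // unescaped quote: end of the string
            else
                pos++;
        }
        return text.Length;
    }

    static void Main()
    {
        // The text being scanned is: s = "a\"b" + 1;
        string code = "s = \"a\\\"b\" + 1;";
        int start = code.IndexOf('"');
        int end = ScanString(code, start, '"', '\\');
        Console.WriteLine(code.Substring(start, end - start));
        // prints: "a\"b"
    }
}
```

Because the check is "escape followed by quote", the same function also copes with languages that escape a quote by doubling it (pass the quote character as the escape character). A literal that ends with an escaped escape character, such as "\\" in C-style languages, would need an extra rule that this sketch does not attempt.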

Limitations

As I mentioned before, the approach I've taken can easily handle any number of different languages. However, it may not get all the nuances of every language exactly correct. As a result, there are some limitations.

For example, the code doesn't currently support the ability to make tags a different color from attributes within HTML markup because I didn't find a generalized way to write such a rule (which is not to say it couldn't be done). Also, as written, there is no way to colorize numeric constants. I just haven't come up with a generic way to detect numbers in all possible languages due to different number bases, prefixes, suffixes, etc.

For my purposes, the approach I've taken is perfect because this code can do a decent job with pretty much any language. If you only need to work with one language and need to capture every nuance of that language, this code may or may not do everything you need.

Conclusion

That's about it. Again, there are additional classes in the source code that I haven't published here. They provide lower-level support for the code I've presented.

Colorizing source code is potentially a very useful task. I hope you can benefit from this code.

History

10/17/2013: Completely rewrote all of the code. Moved the colorizer code into its own project. While more symbols related to this code are now internal to that project, new functionality has been added publicly, such as the ability to define new language rules from code. The languages file is now stored as an XML file and the LanguageRulesCollection class has methods for reading and writing this file. Added many additional new features and fixed a number of issues.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.

Language:	C#
Technology:	WinForms
Platform:	Windows
License:	CPOL