Resolving Partial URLs

Introduction

I recently had to write some code that spidered the Internet. Basically, spidering is the process of extracting all the links from the page at a URL, and then recursively doing the same for each link that was extracted. This is the process that search engines such as Google employ to crawl the web.
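To make the idea concrete, here is a minimal sketch of that loop. It is not the code from my project: the CrawlSketch class, the getLinks callback (assumed to return the absolute URLs found on a page), and the maxPages limit are all hypothetical names chosen for illustration, and a queue stands in for literal recursion.

```csharp
using System;
using System.Collections.Generic;

static class CrawlSketch
{
    // getLinks is a hypothetical callback that returns the absolute URLs
    // found on the page at the given URL.
    public static List<string> Crawl(string startUrl,
        Func<string, IEnumerable<string>> getLinks, int maxPages)
    {
        var visited = new HashSet<string>(StringComparer.Ordinal);
        var pending = new Queue<string>();
        var order = new List<string>();
        pending.Enqueue(startUrl);

        while (pending.Count > 0 && order.Count < maxPages)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url))
                continue;   // Already processed this URL

            order.Add(url);
            foreach (string link in getLinks(url))
            {
                if (!visited.Contains(link))
                    pending.Enqueue(link);
            }
        }
        return order;   // URLs in the order they were processed
    }
}
```

The visited set is what keeps the crawl from looping forever when two pages link to each other.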

As you might expect, most of the work in writing such code involves parsing the web page in order to extract all the links. But I also had to develop code that would resolve those links when they were relative.

Resolving Relative Links

Relative links are relative to a base URL. For example, the link Page.htm would refer to Page.htm within the same directory as the base URL. And the link ../Page.htm would refer to Page.htm within the parent directory of the base URL.
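As a quick way to see what these rules produce, the sketch below uses the .NET Uri class rather than the routine developed in this article. The RelativeLinkDemo class and the /Articles/Index.htm path in the usage note are assumptions made up for the example.

```csharp
using System;

static class RelativeLinkDemo
{
    // Resolves a link against a base URL using the framework's Uri class.
    public static string Resolve(string baseUrl, string link)
    {
        return new Uri(new Uri(baseUrl), link).AbsoluteUri;
    }
}
```

With a base URL of http://www.blackbeltcoder.com/Articles/Index.htm, the link Page.htm resolves to http://www.blackbeltcoder.com/Articles/Page.htm, and ../Page.htm resolves to http://www.blackbeltcoder.com/Page.htm.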

In most cases, the base URL is simply the directory that contains the page from which the links are extracted. However, pages can also contain a <base> tag, which specifies the base URL for all relative links on that page. For example, if the page contains <base href="http://www.blackbeltcoder.com" />, then all relative links on the page would be relative to http://www.blackbeltcoder.com.

So the code I wrote started by downloading the current URL. It then searched for a <base> tag. If a <base> tag was found, that value would be used as the base URL; otherwise, the base URL was set to the directory for the page being examined. Then, as I found links within the page, I would resolve any relative links against the base URL in order to produce a final list of absolute URLs.
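The base URL selection step might look something like the following sketch. It is a simplification and not my production code: the BaseTagSketch class and ChooseBaseUrl method are hypothetical names, and the regular expression only handles a quoted href attribute. When no <base> tag is found, the page's own URL is returned, on the assumption that the resolver ignores the filename portion of the base URL (as the routine presented later in this article does).

```csharp
using System;
using System.Text.RegularExpressions;

static class BaseTagSketch
{
    // Matches <base ... href="..."> or href='...'.
    // Simplification: unquoted attribute values are not handled.
    private static readonly Regex BaseHref = new Regex(
        @"<base\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // Returns the href of the first <base> tag, or pageUrl if there is none.
    public static string ChooseBaseUrl(string html, string pageUrl)
    {
        Match m = BaseHref.Match(html);
        if (m.Success)
        {
            string href = m.Groups[1].Value.Trim();
            if (href.Length > 0)
                return href;
        }
        return pageUrl;
    }
}
```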

My ResolveUrl() Method

Listing 1 shows the code I came up with to accomplish this last task. The ResolveUrl() method takes a relative URL argument and a base URL argument and combines them into an absolute URL.

The routine is smart enough to handle most anomalies. If the relative URL is already an absolute URL, then the method simply returns it unchanged. If the relative URL refers to parent directories using the ".." syntax, my routine resolves them, and it discards any ".." that would climb above the root of the site.

Note that, if the base URL contains a filename (e.g. http://www.blackbeltcoder.com/Page.htm), that filename is not incorporated into the absolute URL. Remember, we are only interested in the directory portion of the base URL. The code determines that a given URL contains a filename when it does not end with "/". I found this to be a reliable approach. However, the routine would not produce correct results if you manually passed http://www.blackbeltcoder.com/SomeDirectory as the base URL, because it would assume SomeDirectory is a filename. Pass http://www.blackbeltcoder.com/SomeDirectory/ (with the trailing slash) instead.
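The trailing-slash rule can be shown in isolation with a small sketch. The BaseUrlSketch class and DirectoryOf method are hypothetical and exist only to illustrate the heuristic; they are not part of Listing 1, and the sketch assumes forward slashes only.

```csharp
using System;

static class BaseUrlSketch
{
    // Returns the directory portion of a base URL, always ending in "/".
    // Anything after the last "/" in the path is assumed to be a filename.
    public static string DirectoryOf(string baseUrl)
    {
        int schemeEnd = baseUrl.IndexOf("://", StringComparison.Ordinal);
        int pathStart = schemeEnd < 0 ? 0 : schemeEnd + 3;
        int lastSlash = baseUrl.LastIndexOf('/');

        // No slash after the host name: the URL is just a site root
        if (lastSlash < pathStart)
            return baseUrl + "/";

        return baseUrl.Substring(0, lastSlash + 1);
    }
}
```

Here, http://www.blackbeltcoder.com/Page.htm and http://www.blackbeltcoder.com/SomeDirectory both yield http://www.blackbeltcoder.com/, which demonstrates the pitfall, while http://www.blackbeltcoder.com/SomeDirectory/ is returned unchanged.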

Listing 1: The UrlHelper Class and ResolveUrl Method

using System;

class UrlHelper
{
    // Both slash styles are treated as path separators
    private static readonly char[] _slashes = { '\\', '/' };

    /// <summary>
    /// Creates an absolute URL by combining a relative URL with a base URL.
    /// </summary>
    /// <param name="relativeUrl">Relative URL to resolve</param>
    /// <param name="baseUrl">Base URL that first URL is relative to</param>
    /// <returns>The resolved absolute URL</returns>
    public static string ResolveUrl(string relativeUrl, string baseUrl)
    {
        const string defaultProtocol = "http://";
        int i, j;

        // Assume URL is already absolute if it includes a protocol
        if (GetProtocolLength(relativeUrl) > 0)
            return relativeUrl;

        // Ensure base href has a protocol
        int protocolLen = GetProtocolLength(baseUrl);
        if (protocolLen == 0)
        {
            // Insert default protocol
            baseUrl = baseUrl.Insert(0, defaultProtocol);
            protocolLen = defaultProtocol.Length;
        }

        // Combine relative URL with base URL
        if (relativeUrl.StartsWith("#", StringComparison.Ordinal))
        {
            // Relative URL specifies bookmark
            relativeUrl = baseUrl + relativeUrl;
        }
        else
        {
            if (relativeUrl.StartsWith("/", StringComparison.Ordinal) ||
                relativeUrl.StartsWith("\\", StringComparison.Ordinal))
            {
                // Relative URL starts at root directory
                i = baseUrl.IndexOfAny(_slashes, protocolLen);
            }
            else
            {
                // Append relative directory to base directory
                i = baseUrl.LastIndexOfAny(_slashes);
            }
            if (i < protocolLen)
                i = baseUrl.Length;

            // Append base and relative URL with exactly 1 '/' character between
            baseUrl = baseUrl.Substring(0, i).TrimEnd(_slashes);
            relativeUrl = relativeUrl.TrimStart(_slashes);
            relativeUrl = String.Format("{0}/{1}", baseUrl, relativeUrl);
        }

        // Remove unlikely "/./" separators
        relativeUrl = relativeUrl.Replace("/./", "/");

        // Resolve parent directory references
        i = relativeUrl.IndexOf("/..", StringComparison.Ordinal);
        while (i >= 0)
        {
            // See if url has parent directory
            j = relativeUrl.LastIndexOfAny(_slashes, i - 1);
            if (j >= protocolLen)
            {
                // Get parent by removing subdirectory
                relativeUrl = relativeUrl.Substring(0, j) + relativeUrl.Substring(i + 3);
            }
            else
            {
                // No parent directory--just remove "/.." as IE does
                relativeUrl = relativeUrl.Substring(0, i) + relativeUrl.Substring(i + 3);
            }
            i = relativeUrl.IndexOf("/..", StringComparison.Ordinal);
        }
        return relativeUrl;
    }

    /// <summary>
    /// Returns the length of the protocol of the given URL. Includes any protocol
    /// punctuation (e.g. "://")
    /// </summary>
    /// <param name="url">URL to examine</param>
    /// <returns>Length of the protocol prefix, or 0 if the URL has none</returns>
    public static int GetProtocolLength(string url)
    {
        char[] punc = { '.', '=', '?', '&', '#', '@', '/', '\\' };

        int i = url.IndexOf(':');
        if (i > 0)
        {
            // We're probably not looking at the protocol if any of these
            // characters came before the colon (checks positions 0 to i - 1)
            if (url.IndexOfAny(punc, 0, i) == -1)
            {
                // Include colon
                i++;
                // Include any slashes
                while (i < url.Length && url[i] == '/')
                    i++;
                return i;
            }
        }
        return 0;
    }
}

Using the Code

Using the code is very simple. Just call the ResolveUrl() method with the appropriate arguments. Since the method is static, you do not need to create an instance of the class.

Listing 2: Calling the ResolveUrl() Method

string absUrl = UrlHelper.ResolveUrl(relativeUrl, baseUrl);

Conclusion

That pretty much wraps it up. I can't say that crawling the web is all that common a task. But if you ever need to do it, you might find this routine very useful.

End-User License

Use of this article and any related source code or other files is governed by the terms and conditions of The Code Project Open License.

Author Information

Jonathan Wood

I'm a software/website developer working out of the greater Salt Lake City area in Utah. I've developed many websites including Black Belt Coder, Insider Articles, and others.

Language:	C#
Technology:	.NET
Platform:	Windows
License:	CPOL