Parsing code with Sprache - Part 1

February 13th, 2021 in Development

😎 This the first article of a multiple part series on how to parse code with Sprache. You can read the second part here.

You probably know that sometimes (many times), a developer’s job is much more research or analysis than programming! In the past, there have been many instances where I was on a project where my goal was to migrate or rewrite old software, and during those engagements, I ended up creating tools to help me. Many of those tools were specialized code parsers that would automatically perform the analysis I required or even spit out newer code that I could reuse.

In all those cases, I needed to parse some structured code, and there’s a simple way of doing this: by using Sprache, a parsing framework for dotnet. In this article, I’m going to teach you how to do that.

Parsing Java code

You can write all sorts of parsers with Sprache. As long as the language is formal, it’s easy to build the necessary code, and you do that leveraging all of C# syntax greatness. I will propose an example: a parser to a partial Java ☕ grammar.

I’m choosing Java because it’s complex enough to let us see how Sprache can make your life easier; because Java is close to C#, so readers will quickly understand most of it; and because Java’s formal grammar is structured in such a way that one can extrapolate its grammar without much research.

Of course, writing a full parser would take too long; we will have to constrain ourselves to a partial analyzer. The way we do that is to get the code we want to parse and then specify the grammar for just that (I call this approach best fit, an important agile principle). So I decided on a victim for this endeavor: Google Authenticator is the source of an Android app that generates OTP’s. The app can be found in the Play Store.

😎 I’ve used this source code to reproduce the OTP algorithm, so I’m familiar with it.

To have a tangible goal, we will output a graph of class dependencies; This should help you get how to do more than just the parsing but also how to use the parsed code to get insight or produce other artifacts.

Java BNF

The starting point for our code is the formal grammar for Java. The eBNF - Extended Backus-Naur Form or notation - is a description of a formal grammar. For instance, a for block in Java can be:

Example:

for (int i = 0; i < 100; i++) { System.out.println(i); }

Grammar:

FORBLOCK = "for" "(" INITIALIZER ";" CONDITION ";" INCREMENT) BLOCK

INITIALIZER = ASSIGNMENT

ASSIGNMENT = TYPE VARIABLE "=" LITERAL 

CONDITION = BOOLEAN_EXPRESSION

INCREMENT = VARIABLE "++" | "++" VARIABLE

BLOCK = STATEMENT | "{" STATEMENT_LIST "}"

❗ This grammar is incomplete and not necessarily correct; This is a practical guide on using Sprache, not on BNF and formal languages.

So we begin by looking at our target code and creating a small grammar. To keep things simple, we will be writing parsers for every source in java/com/google/android/apps/authenticator/ and ignore non-java files.

For instance, take a look at AuthenticatorActivity.java, the first file in scope. You will see that the first few lines are comments, and then we have a package instruction, followed by several import statements, and finally a class definition. Excluding the comments (we will touch on that later), the grammar would be:

JAVA_FILE = PACKAGE, IMPORT_LIST, CLASS_DECLARATIONS; 

PACKAGE = "package", PACKAGE_NAME, ";";

PACKAGE_NAME = WORD, { "." WORD };

WORD = CHARACTER, { CHARACTER };

CHARACTER = LETTER | DIGIT;

IMPORT_LIST = IMPORT, { IMPORT };

IMPORT = "import", PACKAGE_NAME, ";";

CLASS_DECLARATION = "public", "class", WORD, "{" (* ommited *) "}";

Take a look at that language, compare it to the file, but check it against your knowledge of Java (or C#, just by realizing that those two languages are very similar). On one hand, you’ll probably see that the grammar above could be used to produce the source code; on the other, you’ll quickly realize that it doesn’t account for many constructs that are legal in Java: for instance, a source could have no imports whatsoever. That’s what I meant by best fit - we are looking for a way to parse only the code in scope.

Setting up the project

Once we have a small grammar, we can start to code. Create a new dotnet library project, add Sprache to it via Nuget, create an additional test project with xUnit, and download Google Authenticator’s source to a directory.

We will define data structures to store each element of our grammar, and a test to check if the parse worked as expected.

// BNF classes
public class JavaFile
{
    public Package Package { get; private set; }

    public List<Import> ImportList { get; private set; }

    public ClassDefinition ClassDefinition { get; private set; }
}

public class Package
{
    public PackageName PackageName { get; set; }
}

public class PackageName
{
	public PackageName() {}

	public PackageName(List<string> identifers) 
	{ 
		Identifiers = identifers;
	}

	public List<string> Identifiers { get; } = new List<string>();
}

public class Import
{
    public PackageName PackageName { get; set; }
}

public class ClassDefinition
{
    // OMMITED
}

A few important things to note on the structure above:

I didn’t define ImportList or any list structures - I simply used the List generic from C#.
I didn’t define a WORD construct yet - I will be storing those simply as strings for the time being since we’re not interested in it.

Now, how does a parser looks like? We need something that can get the data from the source - text, after all - and produce the structures above:

interface IParser<T>
{
    T Parse(string code);
}

This is a basic parser for a certain definition T - it’s just a single method that reads a string and returns the desired structure. We’ll begin by declaring one of the basics parsers, the Package parser. We’ll revisit this structure in time.

public class PackageNameParser : IParser<Package>
{
    public Package Parse(string code)
    {
        throw new NotImplementedException();
    }
}

This will be our starting point for the parser.

With those classes defined, we can move into setting up the tests.

The first few tests

😎 Always write tests first.

I’ll be using xUnit, which is probably the best testing framework for C#. You can adapt the code below to anything you want, though.

The first test case should be a warm-up, so we test what happens when code is null.

public class PackageNameParserTests
{
    [Fact]
    public void Parse_WhenCodeNull_Throws()
    {
        var sut = new PackageNameParser();

        Assert.Throws<ArgumentNullException>(() => sut.Parse(null));
    }
}

Run the test - it will fail, as usual, complying with a red-green-refactor approach. The next step is to turn it green:

public class PackageNameParser : IParser<PackageName>
{
    public PackageName Parse(string code)
    {
        if (code == null) throw new ArgumentNullException(nameof(code));

        throw new NotImplementedException();
    }
}

Run it again and the test will pass. The warm-up is done, our next step is to create a happy path. We want to test that we can correctly capture the package name:

[Fact]
public void Parse_WhenValidPackage_ReturnsName()
{
    var packageExpresison = "package tinyJavaParser;";

    var sut = new PackageNameParser();
    
    var actual = sut.Parse(packageExpresison);

    Assert.Equal("tinyJavaParser", actual.Indentifiers[0]);
}

Run it and the test will fail. Let’s implement our first Sprache parser!

Sprache parsers

If you take a look at Sprache’s README, you’ll quickly grasp how to work with it. All begins with the static Parse object, which call be called to parse multiple types of texts. You build a complex parser by combining those basic parsers using LINQ.

For instance, our package name is a string made up of chars, so we can use Sprache.Parse.Letter to parse it, but first we must account for the package keyword and whitespace that comes before:

Parser<PackageName> parser =
    from packageKeyword in Sprache.Parse.String("package").Once()
    from space in Sprache.Parse.WhiteSpace.Many()
    from packageName in Sprache.Parse.Letter.Many().Text()
    from ending in Sprache.Parse.Char(';')
    select new PackageName { Indentifiers = new List<string> { packageName } };

Let’s look at it line by line. First, we account for the package keyword, stating clearly that it must appear only once:

from packageKeyword in Sprache.Parse.String("package").Once()

Then we know that there’ll be some spaces between the keyword and the identifier:

    from space1 in Sprache.Parse.WhiteSpace.Many()

After that we extract the package name, which is a sequence of letters:

    from packageName in Sprache.Parse.Letter.Many().Text()

And finally, we make sure the line ends with a ;:

    from ending in Sprache.Parse.Char(';')

Once we did all that, we can produce the parsed structure, PackageName, using the data we gathered before:

    select new PackageName { Indentifiers = new() { packageName } };

Let’s use this knowledge and implement the method:

public PackageName Parse(string code)
{
    if (code == null) throw new ArgumentNullException(nameof(code));

    Parser<PackageName> parser =
        from packageKeyword in Sprache.Parse.String("package").Once()
        from space in Sprache.Parse.WhiteSpace.Many()
        from packageName in Sprache.Parse.Letter.Many().Text()
        from ending in Sprache.Parse.Char(';')
        select new PackageName(new() { packageName }) };

    return parser.Parse(code);
}

Run the tests, you’ll get 🟢. Now, before we go on to refactor, I want to improve the test we just made, and make it more generic.

The test is only accounting for a package name with a single word. This is not enough, and maybe you already guessed that composed identifiers will fail. We’ll add some other examples, and the best source for those is - you’ve guessed it! - the Authenticator source code.

You may realize that we are also not accounting for possible identifiers that those that have letters or symbols such as underscore. But in keeping with our goal, we don’t need a parser that will read every legal Java code ever written, instead, just something that will parse the source we’re dealing with. This might not be the case on other projects where you need a parser - for instance, maybe you don’t have full access to the code before running the parser, and in that case, you’ll need to take a deep dive into the language definition. Likewise, our parser isn’t trying to validate the code that it’s consuming - whatever we find on those files we accept that it’s valid Java. Again, you may want to take a different approach, for instance if this parser is going to be fed code written just for it, since then you’ll have people making mistakes and you’ll need to tell them what those are.

To find the examples of packages, I just used VSCode to open the directory where the code is, java/com/google/android/apps/authenticator/, and went to search:

And now we turn the test from Fact to Theory using some of those names:

I eyeballed the list of packages and picked those that I found to be representative of the domain of the name. We’ll, afterward, parse everything but it’s important that before that we already have some good testing in place.

[Theory]
[InlineData("tinyJavaParser")]
[InlineData("com.google.android.apps.authenticator")]
[InlineData("com.google.android.apps.authenticator.enroll2sv.wizard")]
public void Parse_WhenValidPackage_ReturnsName(string packageName)
{
    var packageExpresison = $"package {packageName};";

    var sut = new PackageNameParser();

    var actual = sut.Parse(packageExpresison);

    Assert.Equal(packageName, string.Join('.', actual.Identifiers));
}

Take your time understanding the changes we’ve made to the test. Remember that the refactor step applies also to tests so, while we could keep the test as it was and just add a new theory, this is cleaner, i.e., easier to maintain.

Now we run the tests and, of course, 🔴. Time to correct the code.

First we try to fix it for com.google.android.apps.authenticator, by accounting for multiple identifiers:

public PackageName Parse(string code)
{
    if (code == null) throw new ArgumentNullException(nameof(code));

    Parser<string> identifierParser = Sprache.Parse.Letter.Many().Text();

    Parser<PackageName> parser =
        from packageKeyword in Sprache.Parse.String("package").Once()
        from space in Sprache.Parse.WhiteSpace.Many()
        from packageHead in identifierParser
        from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
                                from identifier in identifierParser
                                select identifier).Many()
        from terminator in Sprache.Parse.Char(';')
        select new PackageName { Identifiers = (new[] { packageHead }).Concat(packageTail).ToList() };

    return parser.Parse(code);
}

Let’s once again dissect the code.

We begin by creating a “sub parser”, so to speak, that reads an identifier. It should be very straightforward since it’s the same code that we had before - get as many letters in sequence as possible, and convert it to “text”, i.e., a string.

Parser<string> identifierParser = Sprache.Parse.Letter.Many().Text();

Now we increment the old parsing rules to account for repetitions of identifiers. We do this by getting at least the first identifier and then, optionally, many others. packageTail also uses a sub parser that I didn’t declare, so it’s inlined because it only makes sense inside this parser.

from packageHead in identifierParser
from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
                        from identifier in identifierParser
                        select identifier).Many()

Finally, we need to create a list from the single identifier packageHead and the multiples in packageTail, so I cheated a bit:

select new PackageName { Identifiers = (new[] { packageHead }).Concat(packageTail).ToList() };

I create an array with just the packageHead (new[] { packageHead }) and concatenate it with packageTail. After that, I just call ToList() to turn the resulting array into a list.

This will test green 🟢 for "com.google.android.apps.authenticator", but still fails for "com.google.android.apps.authenticator.enroll2sv.wizard". Let’s take a look at the error:

Message: 
    Sprache.ParseException : Parsing failure: unexpected '2'; expected ; (Line 1, Column 53); recently consumed: tor.enroll

Sprache is telling us there’s a 2 it didn’t expect in the stream, and that it just had consumed "tor.enroll". If we look at the input string, it’s easy to see why it fails:

                         Just read from here
                                  ↓
com.google.android.apps.authenticator.enroll2sv.wizard
                                            ↑
                                      Failed here

So, as expected, the number 2 isn’t allowed. This is just a matter of expanding the identifierParser definition to take that into account:

Parser<string> identifierParser = Sprache.Parse.LetterOrDigit.Many().Text();

Run tests and appreciate your 🟢!

Refactor!

Now it’s time we refactor the code. We will look at how to make our code simpler to maintain and use.

There two things that we should take notice of:

In PackageNameParser.Parse we create two parser instances, identifierParser and parser. They are then used only once, right at the return. Multiple calls to this method result in recreating those instances, but they can be reused. This is made evident by Sprache docs as well.
We defined an IParser<T> interface, and Sprache has a Parser<out T> delegate, which is very close to ours. Sprache can define parsers using its delegate, and we can combine those to produce new, more complex, parsers.

With this knowledge, we should refactor our code to define once and use the parsers multiple times; We should also ditch our parsing interface and use Sprache’s delegate. Sprache way of doing things is defining those delegates as static members of a class, so we could create:

public static class JavaGrammar
{
    public static readonly Parser<string> Identifier = Sprache.Parse.LetterOrDigit.Many().Text();

	public static readonly Parser<PackageName> PackageName =
		from packageKeyword in Sprache.Parse.String("package").Once()
		from space in Sprache.Parse.WhiteSpace.Many()
		from packageHead in Identifier
		from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
								from identifier in Identifier
								select identifier).Many()
		from terminator in Sprache.Parse.Char(';')
		select new PackageName(new[] { packageHead }.Concat(packageTail).ToList());
}

And changing the tests to reflect that:

public class PackageNameParserTests
{
    [Fact]
    public void Parse_WhenCodeNull_Throws()
    {
        Assert.Throws<ArgumentNullException>(() => JavaGrammar.PackageName.Parse(null));
    }

    [Theory]
    [InlineData("tinyJavaParser")]
    [InlineData("com.google.android.apps.authenticator")]
    [InlineData("com.google.android.apps.authenticator.enroll2sv.wizard")]
    public void Parse_WhenValidPackage_ReturnsName(string packageName)
    {
        var packageExpresison = $"package {packageName};";

        var actual = JavaGrammar.PackageName.Parse(packageExpresison);

        Assert.Equal(packageName, string.Join('.', actual.Identifiers));
    }
}

❗ If you’re getting an error that the delegate doesn’t have a Parse method, fret not! You just have to add a using Sprache; directive at the top of the file.

You can remove the IParser interface that we created to help us define how the code should look like.

Review

Ufs, that was long! 😩We’ve seen a lot! Let’s review all that for a moment, shall we?

You’ve learned that language syntaxes are governed by grammars and that those can be expressed with a notation known as BNF or EBNF.
You’ve created a simplified EBNF for Java, taking into account only what was in scope. You know we call this best fit, and an agile principle that is similar to YAGNI - build only what you’re going to use.
You’ve created a bunch of test cases, mostly derived from the problem domain itself - the source from Google’s Authenticator.
You’ve learned how to combine Sprache parsers to create another, more complex parser.

This has laid the foundation for parsing the more complex program structures. In the next article, we’ll finish our grammar by creating more parsers that can deal with the whole source. I will also show you how to deal with comments and how to “skip” code that you’re not interested in. Following that article, as a bonus, we’ll finish building our tool and outputting the graph that will show users the relationship between classes.

All code produced thus far has been stored at Github. You’re welcome to fork it and use it in whichever way you want. To get the exact version of this code, use this tag.