Parsing code with Sprache - Part 2

March 13th, 2021 in Development

😎 This is the second article of a multi-part series on how to parse code with Sprache. You can read the first part here.

In the previous post, we saw how to parse some text (in particular, Java code) using Sprache, a mighty library for C#. We saw how to use an incremental approach, and how to use unit tests to drive the development with this tool. So far we have written parsers for Identifier and PackageName (check it here).

Now we are going to move forward a bit faster. Remember that we are targeting the Java/Android source project Google Authenticator and that our final goal is to output a graph of class dependencies for this project. In this article, we’re going to try and parse all top-level elements of the current file that we’re working on, AuthenticatorActivity.java.

❗ In all the code in this post I’ve elided the doc comments for clarity. Check the repo to see the comments.

Refactoring our way to success

Before extending the current code, I identified a small adjustment that will help us in the next step. You see, the PackageName parser is actually parsing a package statement. Package names are used in other places, so we will increase reuse by extracting a PackageName parser from the current one.

First, rename the parser JavaGrammar.PackageName to JavaGrammar.PackageStatement, the correct name for what it’s parsing; use your IDE’s refactoring tools for that. You also need to rename the existing test class from PackageNameParserTests to PackageStatementParserTests to keep things coherent.

Next, let’s extract the parsing of a PackageName from JavaGrammar.PackageStatement. See the lines below:

public static readonly Parser<PackageName> PackageStatement =
	from packageKeyword in Sprache.Parse.String("package").Once()
	from space in Sprache.Parse.WhiteSpace.Many()
	// ↓↓↓↓↓↓↓ Parsing of Package Name ↓↓↓↓↓↓↓
	from packageHead in Identifier
	from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
							from identifier in Identifier
							select identifier).Many()
	// ↑↑↑↑↑↑↑ Parsing of Package Name ↑↑↑↑↑↑↑
	from terminator in Sprache.Parse.Char(';')
	// ↓↓↓↓↓↓↓ And this is how the result is built ↓↓↓↓↓↓↓
	select new PackageName(new[] { packageHead }.Concat(packageTail).ToList());
	// ↑↑↑↑↑↑↑ And this is how the result is built ↑↑↑↑↑↑↑

We will extract that into an isolated parser, this one actually called PackageName:

public static readonly Parser<PackageName> PackageName =
	from packageHead in Identifier
	from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
						 from identifier in Identifier
						 select identifier).Many()
	select new PackageName(new[] { packageHead }.Concat(packageTail).ToList());

We need to create a structure to represent the package statement:

public class PackageStatement
{
	public PackageStatement(PackageName packageName)
	{
		PackageName = packageName;
	}

	public PackageName PackageName { get; }
}

Let’s update the PackageStatement parser to return the new structure, using the PackageName parser:

public static readonly Parser<PackageStatement> PackageStatement =
	from packageKeyword in Sprache.Parse.String("package").Once()
	from space in Sprache.Parse.WhiteSpace.Many()
	from packageName in PackageName
	from terminator in Sprache.Parse.Char(';')
	// Wrap the parsed name in the PackageStatement structure.
	select new PackageStatement(packageName);

Run all tests, and you will find out that everything is working as it should. This refactor has been committed under the tag Refactor_PackageName.

Parsing more structures

Let’s go back to creating new parsers. The next natural structure is an import statement which has the following BNF:

Import Statement

IMPORT = "import", PACKAGE_NAME, ";";

Let’s begin by updating our previous Import class. Change its name from Import to ImportStatement, which is more exact. Use your IDE refactor tool to rename the class. I’ve also created a constructor for it that initializes the PackageName:

public class ImportStatement
{
	public ImportStatement(PackageName packageName)
	{
		PackageName = packageName;
	}

	public PackageName PackageName { get; }
}

Now, let’s create a test for it:

😎 I’m creating a test file for each parser, even though they are (currently) all part of the same class. This isn’t standard, as the usual practice is to have one unit test class per class, but writing it this way makes it easier to find the tests and keeps them better organized.

public class ImportStatementParserTests
{
	[Theory]
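	// The second argument is the package name we expect the parser to extract.
	[InlineData("import android.annotation.TargetApi;", "android.annotation.TargetApi")]
	[InlineData("import android.app.Activity;", "android.app.Activity")]
	public void Parse_WhenValidPackageName_ReturnsStructureWithCorrectName(string importStatement, string packageName)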
	{
		var actual = JavaGrammar.ImportStatement.Parse(importStatement);

		Assert.Equal(packageName, string.Join('.', actual.PackageName.Identifiers));
	}
}

We used the first few imports from AuthenticatorActivity, as you can see. We are also joining the identifiers as we did last time, which is becoming a bit tedious; we will improve this in the refactor step.

Running this test will lead to 🔴 since the code hasn’t been implemented yet. Let’s write the parser now:

public static readonly Parser<ImportStatement> ImportStatement =
	from importKeyword in Parse.String("import").Token()
	from packageName in PackageName.Token()
	from delimiter in Parse.Char(';').Token()
	select new ImportStatement(packageName);

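Attentive readers will notice the previously unseen method Token() in the parser above. This is one of the most useful methods in Sprache: it skips any whitespace before and after the element it wraps, without requiring that whitespace to be there. This means that, for instance, Parse.String("import").Token().Parse(" import ") and Parse.String("import").Token().Parse("import") both succeed and yield the same result. Note that Token() does not enforce a word boundary: Parse.String("import").Token().Parse(" importasd ") also succeeds, matching import and leaving asd unconsumed, because Parse() does not require the whole input to be used.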
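To get a feel for Token(), here is a minimal, self-contained sketch. The TokenDemo class and the ImportKeyword parser are hypothetical names used only for this illustration, and Text() is there just to turn the matched characters into a string so the results are easy to compare.

```csharp
using System;
using Sprache;

public static class TokenDemo
{
    // Hypothetical parser for illustration: the keyword "import",
    // with any surrounding whitespace skipped by Token().
    public static readonly Parser<string> ImportKeyword =
        Parse.String("import").Text().Token();

    public static void Main()
    {
        // Surrounding whitespace is skipped and is not part of the result.
        Console.WriteLine(ImportKeyword.Parse("  import  ") == "import");

        // The whitespace is optional: the bare keyword parses as well.
        Console.WriteLine(ImportKeyword.Parse("import") == "import");
    }
}
```

Both lines print True, which is why the import statement parser can accept differently spaced input without any extra work.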

The rest should be very readable: look for the import keyword, then a PackageName structure, then the terminating semicolon, and return an import statement. Run all tests. You should get a 🟢.

Moving on, let’s do a bit of refactoring. Remember that I said we’d do something about string.Join('.', actual.PackageName.Identifiers)? Now is the time!

The string representation of a package name, say, “android.content.ActivityNotFoundException” is none other than “android.content.ActivityNotFoundException”. So, what do you say we override PackageName.ToString to comply with that behavior?

public class PackageName
{
	// ...

	public override string ToString()
	{
		return string.Join('.', Identifiers);
	}
}

😎 TIP: When overriding methods and properties, use the /// <inheritdoc/> XML doc tag.

Now, following the same idea, we can add a ToString override to ImportStatement:

public class ImportStatement
{
	// ...

	public override string ToString()
	{
		return $"import {PackageName};";
	}
}

Now we can update our unit test to reflect this refactor:

[Theory]
[InlineData("import android.annotation.TargetApi;")]
[InlineData("import android.app.Activity;")]
public void Parse_WhenValidPackageName_ReturnsStructureWithCorrectName(string importStatement)
{
	var actual = JavaGrammar.ImportStatement.Parse(importStatement);

	Assert.Equal(importStatement, actual.ToString());
}

⚠ Warning: in this case the input is formatted exactly as the output of ImportStatement.ToString. However, if you parse something like "import   name.surname  ;", the parse will work, but ToString will return "import name.surname;", without the extra spaces. This means that the Token() behaviour isn’t being tested in our suite, because that kind of input doesn’t appear in our scope. The tests are therefore a bit brittle and should be improved in most circumstances. I won’t do this during these exercises, but readers should definitely do it.

Now, we need a way to read a block of imports in our file. We have called it IMPORT_LIST in the eBNF. The data structure for it need be no more than List<ImportStatement>, but we need a parser for it. Begin with a test:

public class ImportListParserTests
{
	public static IEnumerable<object[]> ImportLists()
	{
		yield return new string[]
		{
			@"
import com.google.android.apps.authenticator.util.EmptySpaceClickableDragSortListView;
import com.google.android.apps.authenticator.util.annotations.FixWhenMinSdkVersion;
import com.google.android.apps.authenticator2.R;
import com.google.common.annotations.VisibleForTesting;
			".Trim(),
		};

		yield return new string[]
		{
			@"
import android.support.v7.widget.Toolbar;
import android.text.Html;
import android.util.Log;
import android.view.ActionMode;
import android.view.ContextMenu;
			".Trim(),
		};
	}

	[Theory]
	[MemberData(nameof(ImportLists))]
	public void Parse_WhenValidPackageName_ReturnsStructureWithCorrectName(string importList)
	{
		var expected = importList.Split(Environment.NewLine).ToList();
		var actual = JavaGrammar.ImportList.Parse(importList);

		Assert.Equal(expected, actual.Select(_ => _.ToString()));
	}
}

The implementation for the list is simple:

public static readonly Parser<List<ImportStatement>> ImportList =
	from statements in ImportStatement.Many().Token()
	select statements.ToList();

Now that imports are dealt with, let’s move to the next code structure in AuthenticatorActivity.java, an annotation.

Annotations

The next code structure in AuthenticatorActivity is the class declaration. It contains a piece of code that we haven’t talked about before, an annotation.

Java annotations are analogous to C# attributes, and look like this:

@FixWhenMinSdkVersion(11)

Just like in C#, those structures can only appear before declarations, and in this case, it’s a class declaration. We need to create the data structure to accommodate this, and the parser to produce it.

As mentioned previously, we’re building the eBNF iteratively. This keeps our parser simpler: we only write the code necessary for the structures that are present in the source. That’s why the annotation wasn’t mentioned before. Let’s update it:

ANNOTATION = "@", IDENTIFIER, "(", ARGUMENT_LIST, ")";

ARGUMENT_LIST = ARGUMENT, { ",", ARGUMENT };

ARGUMENT = LITERAL;

LITERAL = INTEGER_LITERAL;

The above eBNF is partial; for instance, the argument list for the annotation is more complex, allowing for other types. But thus far we only have the integer parameter, so we will keep ourselves to it.

The first thing is to define the data structure for an annotation. Looking at it, you can imagine that it has an identifier as its name, and then the argument list.

public interface ILiteral
{
	object Value { get; }
}

public class IntegerLiteral  : ILiteral
{
	public IntegerLiteral(int value) 
	{
		Value = value;
	}

	public int Value { get; } 

	object ILiteral.Value => Value;

	public override string ToString()
	{
		return Value.ToString();
	}
}

public class Annotation
{
	public Annotation(string name, List<ILiteral> arguments)
	{
		Name = name;
		Arguments = arguments;
	}

	public string Name { get; }

	public List<ILiteral> Arguments { get; }

	public override string ToString()
	{
		return $"@{Name}({string.Join(", ", Arguments)})";
	}
}

We have introduced a bit of abstraction that might save us some work later: we’ve made it clear that a literal can be many things, not only integers. Casting it to the correct structure will enable users to get the typed value; otherwise, for now, we box the int and return it as an object.

We need to create a parser for this new structure, an integer literal.

Tests:

public class IntegerLiteralParserTests
{
	[Theory]
	[InlineData(11)]
	public void Parse_WhenValidInteger_ReturnsValue(int value)
	{
		var actual = JavaGrammar.IntegerLiteral.Parse(value.ToString());

		Assert.Equal(value, actual.Value);
	}
}

And the parser:

public static readonly Parser<IntegerLiteral> IntegerLiteral =
	from digits in Parse.Digit.AtLeastOnce()
	let number = string.Concat(digits)
	let value = int.Parse(number)
	select new IntegerLiteral(value);

One thing in the code above that might make you wonder is the AtLeastOnce(). This is very close to Many(), with the difference that it will fail parsing when there isn’t at least a single digit. If we used Many() here instead, the parser would also succeed on zero digits, and int.Parse would then throw on the empty string instead of Sprache reporting a parse failure.
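The difference between the two combinators is easy to see in isolation. The sketch below is only an illustration: DigitsDemo, AtLeastOneDigit and ZeroOrMoreDigits are hypothetical names, and TryParse is used so that a failure can be inspected instead of thrown.

```csharp
using System;
using Sprache;

public static class DigitsDemo
{
    // Requires one or more digits.
    public static readonly Parser<string> AtLeastOneDigit =
        Parse.Digit.AtLeastOnce().Text();

    // Accepts zero or more digits, so it also succeeds on empty input.
    public static readonly Parser<string> ZeroOrMoreDigits =
        Parse.Digit.Many().Text();

    public static void Main()
    {
        // Many() succeeds on empty input and yields an empty string.
        Console.WriteLine(ZeroOrMoreDigits.TryParse("").WasSuccessful);

        // AtLeastOnce() reports a parse failure on the same input.
        Console.WriteLine(AtLeastOneDigit.TryParse("").WasSuccessful);

        // With digits present, both behave the same way.
        Console.WriteLine(AtLeastOneDigit.Parse("11"));
    }
}
```

This prints True, False and 11. With AtLeastOnce(), the string handed to int.Parse always contains at least one digit.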

Now we move to the actual Annotation.

public class AnnotationParserTests
{
	[Theory]
	[InlineData("@Number(11)", new object[] { 11 })]
	public void Parse_WhenAnnotationHasParameters_CorrectParameters(string annotation, object[] parameters)
	{
		var actual = JavaGrammar.Annotation.Parse(annotation);

		// Compare the literal values, not the ILiteral wrappers themselves.
		Assert.Equal(parameters, actual.Arguments.Select(argument => argument.Value).ToArray());
	}
}

❗ These tests are very basic; in most production scenarios, I suggest writing tests that cover more conditions; for instance, we could test cases like @ SomeAnnotation, an annotation split across two lines, etc. Look at the code being tested to find gaps or risks, and create corresponding tests. Finally, with TDD we should write one test at a time and evolve iteratively.

The reason we can get away with such simple tests is that we know in advance all the code that needs to be parsed, so we can test against a real case and check if bugs arise. Even then, I would be more careful were this not just a blog post.

The implementation will leverage the IntegerLiteral parser we just coded:

public static readonly Parser<Annotation> Annotation =
	from at in Parse.Char('@').Once()
	from identifier in Identifier.Token()
	from startList in Parse.Char('(').Token()
	from literal in IntegerLiteral.Token().Optional()
	from endList in Parse.Char(')').Token()
	let arguments = literal.IsDefined
		? new List<ILiteral> { literal.Get() }
		: new List<ILiteral>()
	select new Annotation(identifier, arguments);

Here we can see Optional() being used: it lets the annotation carry either a single integer argument or none at all. To get the actual list, we need to do some LINQ gymnastics: either return a list with a single literal or an empty list. In the future, we’ll probably create an ArgumentList parser that should replace this, but for now, there’s no need.
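For the curious, here is one possible shape for such an ArgumentList parser. This is only a sketch under my own assumptions, not the code from the repo: ArgumentListSketch and its members are hypothetical names, the Integer parser is a simplified stand-in that yields a plain int so the example stays self-contained, and the list is built with Sprache’s DelimitedBy combinator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Sprache;

public static class ArgumentListSketch
{
    // Stand-in for the article's IntegerLiteral parser (assumption:
    // a plain int is enough for this illustration).
    public static readonly Parser<int> Integer =
        Parse.Digit.AtLeastOnce().Text().Select(digits => int.Parse(digits));

    // Hypothetical parser for: "(" [ integer { "," integer } ] ")".
    public static readonly Parser<List<int>> ArgumentList =
        from open in Parse.Char('(').Token()
        from arguments in Integer.Token()
            .DelimitedBy(Parse.Char(',').Token())
            .Optional()
        from close in Parse.Char(')').Token()
        select arguments.IsDefined
            ? arguments.Get().ToList()
            : new List<int>();

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ArgumentList.Parse("(11)")));
        Console.WriteLine(string.Join(", ", ArgumentList.Parse("( 1, 2 ,3 )")));
        Console.WriteLine(ArgumentList.Parse("()").Count);
    }
}
```

This prints 11, then 1, 2, 3, then 0. The Optional() call plays the same role as in the Annotation parser: it is what allows the empty parentheses.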

Tests should be 🟢.

Class Declaration

The next thing in the file is the actual class definition. Let’s break it down:

@FixWhenMinSdkVersion(11)
public class AuthenticatorActivity extends TestableActivity {
// ...

This is a good example because, right off the bat, we will deal with extends, a common but not trivial case. The annotation has already been taken care of, so let’s break down the class declaration:

visibility
   │
   │          identifier, class          identifier, base class
   │                   │                          │
┌──┴─┐       ┌─────────┴─────────┐         ┌──────┴───────┐
public class AuthenticatorActivity extends TestableActivity 
       └─┬─┘                       └──┬──┘
         │                class inheritance keyword
         │
class declaration keyword

So we need to expand the EBNF:

CLASS_DECLARATION = VISIBILITY, "class", IDENTIFIER, [ "extends", IDENTIFIER ], "{", (* omitted *), "}";

VISIBILITY = "public";

❗ We’re not dealing with other visibilities just for the moment, in order to reflect our evolving approach. As we deal with more cases, we expand on the definition.

This should actually be simple, but we need to update the class structure to reflect it:

public enum Visibility
{
	Public,
}

public class ClassDefinition
{
	public ClassDefinition(Visibility visibility, string name, string? baseClass = null, Annotation? annotation = null)
	{
		Visibility = visibility;
		Name = name;
		BaseClass = baseClass;
		Annotation = annotation;
	}

	public Visibility Visibility { get; }

	public string Name { get; }

	public string? BaseClass { get; }

	public Annotation? Annotation { get; }
}

A few things to note: we’re strictly adhering to the code excerpt that is being processed, so, even though there’s a Visibility member, its only possible value is Public; although Java allows for multiple annotations, we are just considering a single annotation, etc. The reason I’m following this approach is that we don’t know yet if those cases will arise within the code. If they do, we’ll rewrite the code above.

Now let’s create our tests:

public class ClassDefinitionParserTests
{
	[Fact]
	public void Parse_AnnotatedClassWithExtends_CorrectParameters()
	{
		var code = @"
@FixWhenMinSdkVersion(11)
public class AuthenticatorActivity extends TestableActivity
".Trim();

		var actual = JavaGrammar.ClassDefinition.Parse(code);

		Assert.Equal("FixWhenMinSdkVersion", actual.Annotation.Name);
		Assert.Equal(11, actual.Annotation.Arguments[0].Value);
		Assert.Equal(Visibility.Public, actual.Visibility);
		Assert.Equal("AuthenticatorActivity", actual.Name);
		Assert.Equal("TestableActivity", actual.BaseClass);
	}
}

Again, the tests deal only with what we’ve seen so far. I’ve used a Fact instead of a Theory because we are only dealing with a single case. Once we have more cases to test, I’ll convert it to Theory.

Ah! I’m testing the result of the annotation parsing, which is actually repeating the tests already done in the AnnotationParserTests. We could do this a bit differently and just make sure that the correct parser was called, but for now, let’s keep this simple and repeat the test.

Now the implementation. Albeit long, it is simple: just parse each fragment we saw in the breakdown above:

public static readonly Parser<ClassDefinition> ClassDefinition =
	from annotation in Annotation.Token()
	from visibility in Parse.String("public").Token()
	from classKeyword in Parse.String("class").Token()
	from className in Identifier.Token()
	from extendsKeyword in Parse.String("extends").Token()
	from baseClassName in Identifier.Token()
	select new ClassDefinition(Visibility.Public, className, baseClassName, annotation);

I cut some corners here, like parsing visibility directly. We will also deal with this once we need to.
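When that time comes, the change could look roughly like the sketch below. It is a self-contained illustration under my own assumptions, not the code from the repo: ClassHeader, ClassHeaderSketch, Name and VisibilityKeyword are hypothetical names, Name is a simplified stand-in for the Identifier parser, and the annotation is left out to keep it short. It uses Return() to map the keyword to the enum value and Optional() to make the extends clause optional.

```csharp
#nullable enable
using System;
using Sprache;

public enum Visibility { Public }

// Hypothetical, trimmed-down result type for this sketch only.
public sealed class ClassHeader
{
    public ClassHeader(Visibility visibility, string name, string? baseClass)
    {
        Visibility = visibility;
        Name = name;
        BaseClass = baseClass;
    }

    public Visibility Visibility { get; }
    public string Name { get; }
    public string? BaseClass { get; }
}

public static class ClassHeaderSketch
{
    // Simplified stand-in for the article's Identifier parser.
    public static readonly Parser<string> Name =
        Parse.Letter.AtLeastOnce().Text();

    // Maps the keyword to the enum instead of hard-coding the value.
    public static readonly Parser<Visibility> VisibilityKeyword =
        Parse.String("public").Token().Return(Visibility.Public);

    public static readonly Parser<ClassHeader> Header =
        from visibility in VisibilityKeyword
        from classKeyword in Parse.String("class").Token()
        from name in Name.Token()
        // The whole "extends X" clause is optional.
        from baseClass in (from extendsKeyword in Parse.String("extends").Token()
                           from baseName in Name.Token()
                           select baseName).Optional()
        select new ClassHeader(visibility, name, baseClass.GetOrDefault());

    public static void Main()
    {
        var derived = Header.Parse("public class AuthenticatorActivity extends TestableActivity");
        Console.WriteLine(derived.BaseClass);

        var standalone = Header.Parse("public class Standalone {");
        Console.WriteLine(standalone.BaseClass is null);
    }
}
```

Supporting more visibilities would then mostly be a matter of adding enum values and combining more keyword parsers with Or().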

Run your tests and appreciate your 🟢.

Summary

With this article, we’ve completed the first step: we are able to parse all top-level elements of a Java source file. You’ve learned a bit more about how to use Sprache and combine parsers, and probably picked up some techniques on how to create code that is fit for purpose, using a test-driven approach, and evolving previous code as we advance in our understanding of the problem domain, a process that is called discovery.

All code produced thus far has been stored on GitHub. You’re welcome to fork it and use it in whichever way you want. To get the exact version of this code, use this tag.

In the next article, we’ll start parsing class elements like constructors, fields and methods. We will be even more focused, dealing only with the code structures that appear in code, and hopefully we can finish parsing our first class.

See you next time!