Parsing code with Sprache - Part 1
š This the first article of a multiple part series on how to parse code with Sprache. You can read the second part here.
You probably know that sometimes (many times), a developerās job is much more research or analysis than programming! In the past, there have been many instances where I was on a project where my goal was to migrate or rewrite old software, and during those engagements, I ended up creating tools to help me. Many of those tools were specialized code parsers that would automatically perform the analysis I required or even spit out newer code that I could reuse.
In all those cases, I needed to parse some structured code, and thereās a simple way of doing this: by using Sprache, a parsing framework for dotnet. In this article, Iām going to teach you how to do that.
Parsing Java code
You can write all sorts of parsers with Sprache. As long as the language is formal, itās easy to build the necessary code, and you do that leveraging all of C# syntax greatness. I will propose an example: a parser to a partial Java ā grammar.
Iām choosing Java because itās complex enough to let us see how Sprache can make your life easier; because Java is close to C#, so readers will quickly understand most of it; and because Javaās formal grammar is structured in such a way that one can extrapolate its grammar without much research.
Of course, writing a full parser would take too long; we will have to constrain ourselves to a partial analyzer. The way we do that is to get the code we want to parse and then specify the grammar for just that (I call this approach best fit, an important agile principle). So I decided on a victim for this endeavor: Google Authenticator is the source of an Android app that generates OTPās. The app can be found in the Play Store.
š Iāve used this source code to reproduce the OTP algorithm, so Iām familiar with it.
To have a tangible goal, we will output a graph of class dependencies; This should help you get how to do more than just the parsing but also how to use the parsed code to get insight or produce other artifacts.
Java BNF
The starting point for our code is the formal grammar for Java. The eBNF - Extended Backus-Naur Form or notation - is a description of a formal grammar. For instance, a for
block in Java can be:
Example:
for (int i = 0; i < 100; i++) { System.out.println(i); }
Grammar:
FORBLOCK = "for" "(" INITIALIZER ";" CONDITION ";" INCREMENT) BLOCK
INITIALIZER = ASSIGNMENT
ASSIGNMENT = TYPE VARIABLE "=" LITERAL
CONDITION = BOOLEAN_EXPRESSION
INCREMENT = VARIABLE "++" | "++" VARIABLE
BLOCK = STATEMENT | "{" STATEMENT_LIST "}"
ā This grammar is incomplete and not necessarily correct; This is a practical guide on using Sprache, not on BNF and formal languages.
So we begin by looking at our target code and creating a small grammar. To keep things simple, we will be writing parsers for every source in java/com/google/android/apps/authenticator/
and ignore non-java files.
For instance, take a look at AuthenticatorActivity.java, the first file in scope. You will see that the first few lines are comments, and then we have a package
instruction, followed by several import
statements, and finally a class
definition. Excluding the comments (we will touch on that later), the grammar would be:
JAVA_FILE = PACKAGE, IMPORT_LIST, CLASS_DECLARATIONS;
PACKAGE = "package", PACKAGE_NAME, ";";
PACKAGE_NAME = WORD, { "." WORD };
WORD = CHARACTER, { CHARACTER };
CHARACTER = LETTER | DIGIT;
IMPORT_LIST = IMPORT, { IMPORT };
IMPORT = "import", PACKAGE_NAME, ";";
CLASS_DECLARATION = "public", "class", WORD, "{" (* ommited *) "}";
Take a look at that language, compare it to the file, but check it against your knowledge of Java (or C#, just by realizing that those two languages are very similar). On one hand, youāll probably see that the grammar above could be used to produce the source code; on the other, youāll quickly realize that it doesnāt account for many constructs that are legal in Java: for instance, a source could have no import
s whatsoever. Thatās what I meant by best fit - we are looking for a way to parse only the code in scope.
Setting up the project
Once we have a small grammar, we can start to code. Create a new dotnet library project, add Sprache to it via Nuget, create an additional test project with xUnit, and download Google Authenticatorās source to a directory.
We will define data structures to store each element of our grammar, and a test to check if the parse worked as expected.
// BNF classes
public class JavaFile
{
public Package Package { get; private set; }
public List<Import> ImportList { get; private set; }
public ClassDefinition ClassDefinition { get; private set; }
}
public class Package
{
public PackageName PackageName { get; set; }
}
public class PackageName
{
public PackageName() {}
public PackageName(List<string> identifers)
{
Identifiers = identifers;
}
public List<string> Identifiers { get; } = new List<string>();
}
public class Import
{
public PackageName PackageName { get; set; }
}
public class ClassDefinition
{
// OMMITED
}
A few important things to note on the structure above:
- I didnāt define ImportList or any list structures - I simply used the
List
generic from C#. - I didnāt define a WORD construct yet - I will be storing those simply as strings for the time being since weāre not interested in it.
Now, how does a parser looks like? We need something that can get the data from the source - text, after all - and produce the structures above:
interface IParser<T>
{
T Parse(string code);
}
This is a basic parser for a certain definition T
- itās just a single method that reads a string and returns the desired structure. Weāll begin by declaring one of the basics parsers, the Package
parser. Weāll revisit this structure in time.
public class PackageNameParser : IParser<Package>
{
public Package Parse(string code)
{
throw new NotImplementedException();
}
}
This will be our starting point for the parser.
With those classes defined, we can move into setting up the tests.
The first few tests
š Always write tests first.
Iāll be using xUnit, which is probably the best testing framework for C#. You can adapt the code below to anything you want, though.
The first test case should be a warm-up, so we test what happens when code is null
.
public class PackageNameParserTests
{
[Fact]
public void Parse_WhenCodeNull_Throws()
{
var sut = new PackageNameParser();
Assert.Throws<ArgumentNullException>(() => sut.Parse(null));
}
}
Run the test - it will fail, as usual, complying with a red-green-refactor approach. The next step is to turn it green:
public class PackageNameParser : IParser<PackageName>
{
public PackageName Parse(string code)
{
if (code == null) throw new ArgumentNullException(nameof(code));
throw new NotImplementedException();
}
}
Run it again and the test will pass. The warm-up is done, our next step is to create a happy path. We want to test that we can correctly capture the package name:
[Fact]
public void Parse_WhenValidPackage_ReturnsName()
{
var packageExpresison = "package tinyJavaParser;";
var sut = new PackageNameParser();
var actual = sut.Parse(packageExpresison);
Assert.Equal("tinyJavaParser", actual.Indentifiers[0]);
}
Run it and the test will fail. Letās implement our first Sprache parser!
Sprache parsers
If you take a look at Spracheās README, youāll quickly grasp how to work with it. All begins with the static Parse
object, which call be called to parse multiple types of texts. You build a complex parser by combining those basic parsers using LINQ.
For instance, our package name is a string made up of chars, so we can use Sprache.Parse.Letter
to parse it, but first we must account for the package
keyword and whitespace that comes before:
Parser<PackageName> parser =
from packageKeyword in Sprache.Parse.String("package").Once()
from space in Sprache.Parse.WhiteSpace.Many()
from packageName in Sprache.Parse.Letter.Many().Text()
from ending in Sprache.Parse.Char(';')
select new PackageName { Indentifiers = new List<string> { packageName } };
Letās look at it line by line. First, we account for the package keyword, stating clearly that it must appear only once:
from packageKeyword in Sprache.Parse.String("package").Once()
Then we know that thereāll be some spaces between the keyword and the identifier:
from space1 in Sprache.Parse.WhiteSpace.Many()
After that we extract the package name, which is a sequence of letters:
from packageName in Sprache.Parse.Letter.Many().Text()
And finally, we make sure the line ends with a ;
:
from ending in Sprache.Parse.Char(';')
Once we did all that, we can produce the parsed structure, PackageName
, using the data we gathered before:
select new PackageName { Indentifiers = new() { packageName } };
Letās use this knowledge and implement the method:
public PackageName Parse(string code)
{
if (code == null) throw new ArgumentNullException(nameof(code));
Parser<PackageName> parser =
from packageKeyword in Sprache.Parse.String("package").Once()
from space in Sprache.Parse.WhiteSpace.Many()
from packageName in Sprache.Parse.Letter.Many().Text()
from ending in Sprache.Parse.Char(';')
select new PackageName(new() { packageName }) };
return parser.Parse(code);
}
Run the tests, youāll get š¢. Now, before we go on to refactor, I want to improve the test we just made, and make it more generic.
The test is only accounting for a package name with a single word. This is not enough, and maybe you already guessed that composed identifiers will fail. Weāll add some other examples, and the best source for those is - youāve guessed it! - the Authenticator source code.
You may realize that we are also not accounting for possible identifiers that those that have letters or symbols such as underscore. But in keeping with our goal, we donāt need a parser that will read every legal Java code ever written, instead, just something that will parse the source weāre dealing with. This might not be the case on other projects where you need a parser - for instance, maybe you donāt have full access to the code before running the parser, and in that case, youāll need to take a deep dive into the language definition. Likewise, our parser isnāt trying to validate the code that itās consuming - whatever we find on those files we accept that itās valid Java. Again, you may want to take a different approach, for instance if this parser is going to be fed code written just for it, since then youāll have people making mistakes and youāll need to tell them what those are.
To find the examples of packages, I just used VSCode to open the directory where the code is, java/com/google/android/apps/authenticator/
, and went to search:
And now we turn the test from Fact
to Theory
using some of those names:
I eyeballed the list of packages and picked those that I found to be representative of the domain of the name. Weāll, afterward, parse everything but itās important that before that we already have some good testing in place.
[Theory]
[InlineData("tinyJavaParser")]
[InlineData("com.google.android.apps.authenticator")]
[InlineData("com.google.android.apps.authenticator.enroll2sv.wizard")]
public void Parse_WhenValidPackage_ReturnsName(string packageName)
{
var packageExpresison = $"package {packageName};";
var sut = new PackageNameParser();
var actual = sut.Parse(packageExpresison);
Assert.Equal(packageName, string.Join('.', actual.Identifiers));
}
Take your time understanding the changes weāve made to the test. Remember that the refactor step applies also to tests so, while we could keep the test as it was and just add a new theory, this is cleaner, i.e., easier to maintain.
Now we run the tests and, of course, š“. Time to correct the code.
First we try to fix it for com.google.android.apps.authenticator, by accounting for multiple identifiers:
public PackageName Parse(string code)
{
if (code == null) throw new ArgumentNullException(nameof(code));
Parser<string> identifierParser = Sprache.Parse.Letter.Many().Text();
Parser<PackageName> parser =
from packageKeyword in Sprache.Parse.String("package").Once()
from space in Sprache.Parse.WhiteSpace.Many()
from packageHead in identifierParser
from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
from identifier in identifierParser
select identifier).Many()
from terminator in Sprache.Parse.Char(';')
select new PackageName { Identifiers = (new[] { packageHead }).Concat(packageTail).ToList() };
return parser.Parse(code);
}
Letās once again dissect the code.
We begin by creating a āsub parserā, so to speak, that reads an identifier. It should be very straightforward since itās the same code that we had before - get as many letters in sequence as possible, and convert it to ātextā, i.e., a string
.
Parser<string> identifierParser = Sprache.Parse.Letter.Many().Text();
Now we increment the old parsing rules to account for repetitions of identifiers. We do this by getting at least the first identifier and then, optionally, many others. packageTail
also uses a sub parser that I didnāt declare, so itās inlined because it only makes sense inside this parser.
from packageHead in identifierParser
from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
from identifier in identifierParser
select identifier).Many()
Finally, we need to create a list from the single identifier packageHead
and the multiples in packageTail
, so I cheated a bit:
select new PackageName { Identifiers = (new[] { packageHead }).Concat(packageTail).ToList() };
I create an array with just the packageHead (new[] { packageHead }
) and concatenate it with packageTail
. After that, I just call ToList()
to turn the resulting array into a list.
This will test green š¢ for "com.google.android.apps.authenticator"
, but still fails for "com.google.android.apps.authenticator.enroll2sv.wizard"
. Letās take a look at the error:
Message:
Sprache.ParseException : Parsing failure: unexpected '2'; expected ; (Line 1, Column 53); recently consumed: tor.enroll
Sprache is telling us thereās a 2
it didnāt expect in the stream, and that it just had consumed "tor.enroll"
. If we look at the input string, itās easy to see why it fails:
Just read from here
ā
com.google.android.apps.authenticator.enroll2sv.wizard
ā
Failed here
So, as expected, the number 2 isnāt allowed. This is just a matter of expanding the identifierParser
definition to take that into account:
Parser<string> identifierParser = Sprache.Parse.LetterOrDigit.Many().Text();
Run tests and appreciate your š¢!
Refactor!
Now itās time we refactor the code. We will look at how to make our code simpler to maintain and use.
There two things that we should take notice of:
-
In
PackageNameParser.Parse
we create two parser instances,identifierParser
andparser
. They are then used only once, right at the return. Multiple calls to this method result in recreating those instances, but they can be reused. This is made evident by Sprache docs as well. -
We defined an
IParser<T>
interface, and Sprache has aParser<out T>
delegate, which is very close to ours. Sprache can define parsers using its delegate, and we can combine those to produce new, more complex, parsers.
With this knowledge, we should refactor our code to define once and use the parsers multiple times; We should also ditch our parsing interface and use Spracheās delegate. Sprache way of doing things is defining those delegates as static members of a class, so we could create:
public static class JavaGrammar
{
public static readonly Parser<string> Identifier = Sprache.Parse.LetterOrDigit.Many().Text();
public static readonly Parser<PackageName> PackageName =
from packageKeyword in Sprache.Parse.String("package").Once()
from space in Sprache.Parse.WhiteSpace.Many()
from packageHead in Identifier
from packageTail in (from delimiter in Sprache.Parse.Char('.').Once()
from identifier in Identifier
select identifier).Many()
from terminator in Sprache.Parse.Char(';')
select new PackageName(new[] { packageHead }.Concat(packageTail).ToList());
}
And changing the tests to reflect that:
public class PackageNameParserTests
{
[Fact]
public void Parse_WhenCodeNull_Throws()
{
Assert.Throws<ArgumentNullException>(() => JavaGrammar.PackageName.Parse(null));
}
[Theory]
[InlineData("tinyJavaParser")]
[InlineData("com.google.android.apps.authenticator")]
[InlineData("com.google.android.apps.authenticator.enroll2sv.wizard")]
public void Parse_WhenValidPackage_ReturnsName(string packageName)
{
var packageExpresison = $"package {packageName};";
var actual = JavaGrammar.PackageName.Parse(packageExpresison);
Assert.Equal(packageName, string.Join('.', actual.Identifiers));
}
}
ā If youāre getting an error that the delegate doesnāt have a Parse method, fret not! You just have to add a
using Sprache;
directive at the top of the file.
You can remove the IParser interface that we created to help us define how the code should look like.
Review
Ufs, that was long! š©Weāve seen a lot! Letās review all that for a moment, shall we?
- Youāve learned that language syntaxes are governed by grammars and that those can be expressed with a notation known as BNF or EBNF.
- Youāve created a simplified EBNF for Java, taking into account only what was in scope. You know we call this best fit, and an agile principle that is similar to YAGNI - build only what youāre going to use.
- Youāve created a bunch of test cases, mostly derived from the problem domain itself - the source from Googleās Authenticator.
- Youāve learned how to combine Sprache parsers to create another, more complex parser.
This has laid the foundation for parsing the more complex program structures. In the next article, weāll finish our grammar by creating more parsers that can deal with the whole source. I will also show you how to deal with comments and how to āskipā code that youāre not interested in. Following that article, as a bonus, weāll finish building our tool and outputting the graph that will show users the relationship between classes.
All code produced thus far has been stored at Github. Youāre welcome to fork it and use it in whichever way you want. To get the exact version of this code, use this tag.