Embedded DSL Parser


#1

I have already had the chance to write about the project I'm working on, DeveelDB (http://github.com/deveel/deveeldb), which is based on the SQL model. As such, one of its core components is an SQL language compiler.
In the current/old version (1.0b) I used a parser generator I ported from JavaCC (http://github.com/deveel/csharpcc): it was convenient for the scope of the project, since it required no references to external libraries, and the BSD license of the generated files allowed me to include them in the project (which is licensed under APL 2.0).

Although it works fine technically (despite the ugly generated code), maintainability was really low, partly because of the Java origins of the project.
For this reason I decided to recreate the parser (in the new version 2.0, under development) using a modified version of Irony (http://irony.codeplex.com), which I could embed into the core library thanks to its MIT license.
The logic behind the new parser is totally different and requires additional code to handle the tree visitor, which is sometimes a pain to implement (because of the string checks on non-terminal names used to verify the correct allocation of nodes in the tree).
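
To give a concrete idea, here is a minimal sketch of an Irony grammar and the string-based tree walking (grammar and rule names are illustrative, not the actual DeveelDB grammar):

    using Irony.Parsing;

    // A runtime-declarative Irony grammar: non-terminals are plain objects
    // wired together with rule expressions.
    public class TinySqlGrammar : Grammar {
        public TinySqlGrammar() : base(caseSensitive: false) {
            var identifier = new IdentifierTerminal("identifier");
            var columnList = new NonTerminal("column_list");
            var selectStmt = new NonTerminal("select_stmt");

            columnList.Rule = MakePlusRule(columnList, ToTerm(","), identifier);
            selectStmt.Rule = ToTerm("SELECT") + columnList + "FROM" + identifier;
            Root = selectStmt;
        }
    }

    public static class TreeWalker {
        // Walking the parse tree means matching non-terminals by name, which
        // is the fragile part: renaming a rule silently breaks the visitor.
        // Usage: Visit(new Parser(new TinySqlGrammar()).Parse("SELECT a, b FROM t").Root);
        public static void Visit(ParseTreeNode node) {
            if (node.Term.Name == "select_stmt") {
                // map this node to an AST SELECT object here
            }
            foreach (ParseTreeNode child in node.ChildNodes)
                Visit(child);
        }
    }
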
Anyway, I'm happier with this solution than with the old one.

I'm wondering if anyone has better suggestions than Irony for this, taking into consideration 1) an open license (letting me embed it in the main project), 2) performance (in this context, query parsing can happen millions of times per minute) and 3) the size of the library (to reduce the total footprint of the final solution).


#2

I have been using Irony to parse/analyze a language with a fairly complex grammar (somewhere between C and C++), and I had to make some improvements to Irony on the scanner side to get better performance, as Irony does not use a DFA-driven tokenizer but a hand-written one that is not really efficient (mainly by allowing a different token scanner to be plugged in instead of the default one). Also, due to its runtime declarative grammar, Irony is slow to initialize, as it has to compute all the state transition tables (for the grammar I worked on, that took around 500-600ms).

As for the size of the library, I don't remember Irony's size being that annoying, though I did remove lots of things from it (such as the custom tokenizer), and you can always ILMerge it, etc. One thing I did like in Irony was the ability to use a grammar as a base class, so you can derive from a core grammar and add specific parsing/AST-creation nodes in a derived grammar, but this may not be your use case; a rough sketch of the pattern follows.
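
Something like this, where the class and rule names are hypothetical and only show the shape of the pattern:

    using Irony.Parsing;

    // The core grammar exposes its non-terminals so derived grammars can extend them.
    public class CoreGrammar : Grammar {
        protected readonly NonTerminal Statement = new NonTerminal("statement");
        protected readonly NonTerminal Expression = new NonTerminal("expression");

        public CoreGrammar() : base(caseSensitive: false) {
            Expression.Rule = new NumberLiteral("number");
            Statement.Rule = Expression + ToTerm(";");
            Root = Statement;
        }
    }

    // A dialect grammar just adds alternatives to the inherited rules.
    public class DialectGrammar : CoreGrammar {
        public DialectGrammar() {
            Expression.Rule |= new StringLiteral("string", "'");
        }
    }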

If I had to redo this whole parsing work today, I would try to go with ANTLR4:

  1. The ANTLR4 C# target license is BSD.
  2. Performance-wise, I guess it should be faster than Irony (they use a DFA table for token parsing), and the rules are code-generated (you still get an init cost with ANTLR, because it has to JIT all the code, but it may be faster with NGen/.NET Native). It is also far more usable than previous versions. The EBNF grammar description is quite concise. Operator precedence is handled much better: writing all the operator rules is just a matter of listing them in a single rule, instead of creating nested rules by precedence level as you would with a conventional parser (see the sketch after this list). The ANTLR4 book is probably worth buying.
  3. It is less than 200KB (and again, IL-mergeable/internalizable).
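
For what it's worth, driving a generated ANTLR4 parser from C# takes only a few lines. The lexer/parser classes below are what the tool would generate from a hypothetical SqlExpr.g4 grammar whose start rule is "expression":

    using Antlr4.Runtime;

    // SqlExprLexer/SqlExprParser are generated by the ANTLR4 tool from the
    // (hypothetical) SqlExpr.g4 grammar; operator precedence falls out of the
    // order of alternatives in its left-recursive "expression" rule.
    var input = new AntlrInputStream("a + b * 2");
    var lexer = new SqlExprLexer(input);
    var tokens = new CommonTokenStream(lexer);
    var parser = new SqlExprParser(tokens);
    var tree = parser.expression(); // parse tree for the whole expression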

#3

Thank you very much for the detailed and analytical response.

To clarify one point: I'm actually merging Irony into the project (using ILRepack rather than ILMerge: same thing).
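
For reference, the merge step is a single command along these lines (assembly names and paths are illustrative):

    ILRepack.exe /internalize /out:merged\deveeldb.dll deveeldb.dll Irony.dll

Here /internalize makes the merged Irony types internal, so they don't leak into the public API of the kernel library.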

I must be honest: I never gave ANTLR a chance, because I considered it bloated and heavy, partly due to its Java origins, so I wasn't aware of an embeddable runtime that I could merge in: it's definitely an avenue to explore. I will give your suggestions serious consideration.
As said in my previous post, I'm pleased with the Irony performance, but maintainability is hard: an ANTLR grammar could improve that side, and since you say the performance is comparable (or better), this is worth doing.
Also, since ANTLR generates the DFA (while I have a hand-written tokenizer for the Irony parser), this would help a lot too.

Thanks again for the input.


#4

Actually, I realized that you were more than correct.
In fact, I produced a full PL/SQL parser (based on SQL-2003), and the in-memory size of the Irony grammar turned out to be huge (25-30MB), blowing past all my size targets. Execution also slowed down noticeably, and I realized this was not the right decision.
I started writing a new parser based on ANTLR4, which appears to be very fast and light on memory (less than 5MB). It's still too early for a final verdict, as I've only written the expression parsing (which is, anyway, the most difficult part).

What worries me the most is the final size of the library: the expression parser alone is 378KB so far, while the ANTLR4 runtime library is little more than 200KB. I foresee that the final SQL parser library will reach 800KB, which would bring the SQL parsing domain alone up to 1MB.

My intention is to embed everything, through merging, into the kernel library (deveeldb.dll), which would add that 1MB of SQL parsing to the remaining code (which already embeds two other small libraries: one for IoC and another for astronomical mathematics). That makes it about 2MB… not nice in an embedded context (despite the performance).

I'm considering an alternative edition (to be named "micro") that doesn't include the SQL parsing (and other features, like security), to produce a version under 1MB.

