Wednesday, August 26, 2009

The ANTLR Parser Generator
These web page discuss my experience with a parser generator called ANTLR. ANTLR is a second generation parser generator. The first generation was PCCTS. ANTLR is implemented in Java, but will generate parsers in both Java and C++.
I have found that the learning curve for ANTLR is very steep. However, this is not unique to ANTLR. YACC/Bison also takes some effort to learn. The one advantage of YACC/Bison is that there are a number of good books on their use. ANTLR does not have as wide a user base yet. The Web site www.antlr.org includes HTML documentation for ANTLR. However, ANTLR is complicated and I have found that this documentation provides only a starting point. I have spent many hours working with ANTLR and working on small examples to learn the tool. In the spirit of free software I am publishing these Web pages in the hope that they will help other ANTLR users. As my use of ANTLR progresses I will update these Web pages, so this is definitely a work in progress.
Since ANTLR is an open source project (under the gentle guidance of Terence Parr), there are a number of active contributors. One of these is Ric Klaren. Ric publishes an ANTLR development snapshot. (http://wwwhome.cs.utwente.nl/~klaren/antlr/antlr-devel.tar.gz).
The main discussion group for ANTLR is hosted by Yahoo Groups and can be found at http://groups.yahoo.com/group/antlr-interest. The archives and current messages can be viewed, since this is a public forum. To post you need to join the group. The Yahoo user interface leaves a bit to be desired. The number of messages posted per month is shown at the bottom of the page. Click on the current month and you will see the latest discussion.
A great deal of credit should be given to anyone who puts the effort, time and creativity into building something. Many people will come along and criticize the works of others, but most of these critics have never created anything significant themselves. The critics have never struggled with the issues that the creator of the work did and in most cases could not create a similar work themselves.
ANTLR is a significant piece of software and a great deal of credit is due to Terence Parr, Peter Wells and the other developers and maintainers of ANTLR. In these Web pages I discuss some of the pros and cons of ANTLR, as I see them. But I recognize the efforts of the authors and maintainers of ANTLR. They stepped up and created the software.
Why Use ANTLR?
Running ANTLR under Microsoft Java
Building a C++ ANTLR Parser on Windows NT (this should also help UNIX users).
ANTLR Examples developed while I learned to use ANTLR. This page is devoted to grammar and AST generation issues with ANTLR.
Miscellaneous ANTLR Examples. Various useful ANTLR examples that don't involve grammar or AST generation





Why Use ANTLR?
Introduction
This Web page discussions parser generators and the ANTLR parser generator in particular. See Compiler Front End and Infrastructure Software for compiler construction software suites.
If you just stumbled on this Web page you may be wondering what a parser generator is. I should warn you that this is probably not the best place to learn the answer. There are a number of excellent books that discuss parser generators and compiler construction. These include the all time classic compiler text (the so called "red dragon book") Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman, Addison Wesley, 1986 and Advanced Compiler Design and Implementation by Muchnick, Morgan Kaufmann, 1997. Briefly, a parser generator is a program that reads a grammar that describes a language and outputs source code (in language like C, C++, or Java) that will recognize the language described by the grammar. The most commonly used parser generator is the UNIX program YACC (the Free Software Foundation program Bison is a GNU version of YACC). Both UNIX and Windows NT versions of Bison are available from Cygnus.
A low key, largely non-technical introduction to compiler constructon has been written by Jack Crenshaw, who currently writes an outstanding column for Embedded Systems.
Another longer and more academic Web based reference is the book Compilers and Compiler Generators: an introduction with C++ by P.D. Terry, Rhodes University, 1996. This web pages publishes P.D. Terry's book in a variety of formats, including postscript and Adobe pdf format.
ANTLR is a second generation parser generator. The first generation was PCCTS. Both parsers were designed and implemented by Terence Parr. Various people have helped over the years. See the ANTLR and PCCTS source for the complete credits.
Although ANTLR is implemented in Java, it will generate either a Java or a C++ parser. I sometimes get e-mail asking which parser should be used: ANTLR or PCCTS. I don't have enough experience with PCCTS to comment on this. My view is that ANTLR represents the current main stream of development. Also, I would assume that the lessons Terence learned implemented PCCTS were incorporated into ANTLR. But there is an active PCCTS user base, so I am sure there are other, probably better informed view on this.
Although there is some brief introductory material, this Web page is written for people who already have some knowledge of compiler and language processor construction.
The ANTLR documentation and software can be found at www.antlr.org. ANTLR is a available in source form, with only the restriction being that the authors are credited. ANTLR evolved from PCCTS (the Purdue Compiler Construction Tool Set), which was developed by Terance Parr as part of his PhD thesis work. ANTLR is a rewrite of PCCTS. A history of ANTLR can be found here.
Automaticly Generated Parsers vs. Hand Crafted Parsers
The first compiler I worked on was the UCSD Pascal compiler. This compiler was based on Niklaus Wirth's Pascal compiler written at ETH in Switzerland. At the time, parser generators were in their infancy. Parsers were written by hand. The Pascal parser was a recursive decent parser, whose structure mirrored the Pascal syntax.
The technique for creating this kind of recursive decent parser was to carefully develop the BNF syntax for the language to be parsed and then write the parser so that it mirrored this syntax. Any change in the grammar had to be mirrored with a change in the software. This could be time consumming and any mistakes in the parser software would result in incorrect parsing.
Parser generators allow the programmer to work at a higher level of abstraction. Creating the language parser involves writing the grammar, instead of directly writing the software that parses the grammar. Changes that might take a few hours to implement in a hand crafted recursive decent parser can be made in a few minutes when a parser generator is used. A few hundred lines of grammar is also much easier to understand than a few thousand lines of hand crafted parser. Two arguments have classicly been used against automaticly created parsers:
Automaticly generated parsers are slow and use lots of memory.
Error reporting is poor in automaticly generated parsers.
Most language processors spend a fraction of their time parsing. For example, most compilers spend more time in optimization and code generation than they do in language parsing. On a modern computer system the time taken by the parser will probably be dominated by disk access, not parser execution. Nor is the issue of memory usage an argument against automaticly generated parsers. On a computer system with 128 Mb of RAM and perhaps 256 Mb of Virtual memory, the amount of memory used by the parser will usually be insignificant.
Error reporting is a problem, especially for YACC/Bison. ANTLR generates a recursive decent parser from the grammar and has fairly good error reporting (this is discussed at greater length below). Of course "good" is a matter of opinion.
Although I started my software career writing recursive decent parsers and have written hand crafted parsers for a variety of lanuages, I now greatly prefer automaticly generated parsers.
Since we're on the topic of YACC, a minor digression:
Another problem with YACC are the obscure shift-reduce and reduce-reduce conflicts. Terance Parr writes that PCCTS, the predecessor to ANTLR, was created in response to his horror of having to debug YACC grammars. If you do have to debug YACC grammars, Reiner Pesch has written a tool that might help. I have not tried it out and have included this link primarily because it might be of help in the future, since I am sometimes forced to use YACC on projects at work. Reiner writes, in comp.compilers (February 12, 2001):
Due to my diploma thesis in computer science I have implemented a tool which intends to assist developers using LALR parser generators like yacc with the analysis of shift/reduce- and reduce/reduce-conflicts. (The tool currently works with yacc, byacc, bison and happy).
So if you're a user of any LALR parser generator mentioned above, you could do me (and hopefully yourself) a favour by taking a look at ... http://www-users.rwth-aachen.de/reiner.pesch/yca/yca.html.
The yca.html Web page publishes a tool called YCA (YACC Conflict Analyser[sic]). YACC/Bison grammar conflicts can take a lot of time to understand so any tool that helps is worth looking into.
Parser Generator Options
Lookahead
The term lookahead refers to the number of lexical tokens that a parser looks at, beyond the current one, to determine the matching grammar production. Parser generators have a long history and they originally generated parsers to run on machines that, by modern standards, had very little memory (e.g., 129K bytes to 4Mb). Since token lookahead tends to increase parser memory use and make the parser slower, most parser generators created parsers that supported only one token lookahead. This is denoted by a number next to the grammar type (for example, LALR(1) is an LALR grammar with one token lookahead). Several more recent parser generators (ANTLR and JavaCC) can create parsers that can selectively lookahead more than one token (e.g., LL(k)).
Creating a grammar for a complex language is a lot of work. It is easy to accidently create a grammar with ambiguities, where the parser cannot tell two or more productions apart. Parser generators will report errors or warnings when ambiguous grammars are encountered. However, with every parser generator I've ever used, these messages are difficult to interpret. You must pour over the grammar considering how the grammar productions collided to create ambiguity. At first glance, the grammar writer may think that an LL(k) grammar will allow sloppier grammars. This is not actually the case. A grammar that contains ambiguity can actually be more ambiguous with k lookahead. Having k lookahead does allow some productions to be more clearly recognized, however. The most common example of this is the else clause in C, C++ and Java which can be clearly specified in a grammar production with selective k lookahead. This allows the grammar to be somewhat smaller. A parser with k lookahead can also recognize a wider range of languages than a parser with only one token lookahead. But these languages must still be unambiguously specified. So k lookahead provides power and flexibility, but I have not found that it greatly decreased the work necessary to create a grammar.
Open Source or Free (no license fee) Parser Generators
Until YACC became part of the UNIX software tool set, parser generators existed only in academia and at a few companies that kept the tools to themselves. Yacc is now a quarter century old. The original paper on yacc was written in July, 1975. An HTML version can be found here.
There are now a remarkable variety of parser generators available in addition to ANTLR. These parser generators include open source software, free binaries and commercial products. Originally this list came from a google search. I have updated it with information from comp.compilers. This list is getting long enough that I should probably break it out into its own web page.
Java Compiler Compiler, a parser generator written in Java that generates a Java parser. JavaCC was written at Sun Microsystems and is available free.
Metamata Parse is a follow-on tool to Sun's JavaCC. It is also available free. Metamata Parse claims to be an improved Java parser generator with a number of advantages over JavaCC. Metamata seems to have disappeared and I have not found another source for their version of JavaCC.
CUP: LALR Parser Generator for Java. This parser generator takes a YACC style grammar and creates a Java parser. Instead of YACC's $-number syntax, CUP uses labels that remind me of ANTLR. The parser created by CUP is table driven. CUP appears to use YACC style error recovery as well. The CUP documentation looks pretty good so if you only need a simple language recognizer in Java CUP might be a good choice. CUP is free software and is provided as part of at least some Linux distributions.
IBM's Jikes Parser Generator is part of the IBM open source Jikes Java compiler release. Like the Jikes Java compiler, the parser generator is available in source form. The Jikes software is not covered by the Free Software Foundation's copyleft, but can be used more liberally (e.g, as far as I can tell, if you use IBM's source in a product you don't have to disclose all of the source of the product).
Jikes Parser Generator is a parser generator that accepts as input an annotated description for a language grammar and produces text files suitable for inclusion in a parser for that language. It is similar in function and use to the widely-available parser generators Yacc and Bison.
It also provides support for automatic diagnosis and recovery from syntactic errors. It is the parser generator used by the Jikes Compiler. Jikes Parser Generator can generate parsers for LALR(k) grammars and produce output suitable for use with parsers written in Java, C, or C++.
Lemon Parser generator from a small software company in North Carolina named Hipp, Wyrick & Company, Inc., or "Hwaci" for short. This is an LALR(1) parser generator that is available in "Open Source" form.
btyacc. Gary Funck sent me a pointer to btyacc which is a YACC inspired parser generator. The btyacc parser generator is described as:
This is original Berkeley Yacc modified by Chris Dodd chrisd@collins.com and then further enhanced by Vadim Maslov vadik@siber.com of Siber Systems to better fit a production environment (and to have less bugs, too).
The btyacc parser generator creates a C++ table driven parser from a YACC style grammar. The authors write that btyacc has "industrial strength error processing [reporting?] and recovery", which would be a big improvement over stock YACC. They have also added a number of useful ideas, including a standard variable for source position, which can be used to annotate the abstract syntax trees created during parsing. One of the central features of btyacc is its support for syntactic backtracking. Chris Dodd writes:
BTYACC is a modified version of yacc that supports automatic backtracking and semantic disambiguation to parse ambiguous grammars, as well as syntactic sugar for inherited attributes (which tend to introduce conflicts). Whenever a btyacc generated parser runs into a shift-reduce or reduce-reduce error in the parse table, it remembers the current parse point (yacc stack and input stream state), and goes into trial parse mode. It then continues parsing, ignoring most rule actions. If it runs into an error (either through the parse table or through an action calling YYERROR), it backtracks to the most recent conflict point and tries a different alternative. If it finds a successful parse (reaches the end of the input or an action calls YYVALID), it backtracks to the point where it first entered trial parse mode, and continues with a full parse (executing all actions), following the path of the successful trial.
Btyacc is available in source form.
Ckit from Bell Labs
I have not looked into this parser generator. Ckit was mentioned in the USENET group comp.compilers by Torben Mogensen. Torben writes:
There is a tool called ckit which is basically a parser for C written in SML. The output is a representation of the abstract syntac tree. If I was to write a C compiler today, I would start from this.
SML has several features that makes it very well suited for writing compilers:
Recursive datatypes with pattern matching.
Automatic memory management (GC).
A good module system.
A type system that catches many errors at compile-time that would not be caught at all by C.
Exceptions
A wide range of compilers, so you can develop on a fast compiler and run your final code through an optimizing compiler.
You can see Andrew Appel's book Modern Compiler Implementation in ML for examples of compiler-code written in SML. You can even compare with how the same things are done in C or Java by also looking at Modern Compiler Implementation in C and Modern Compiler Implementation in Java by the same author. [Prof. Appel's web page on ML and compiling can be found here]
As for maintenance, the module system and the type system are really aids for maintenance. Most ML implementations have some kind of intelligent recompilation scheme.
In all, the major problem may be social: C/C++ programmers may be reluctant to learn a new (and radically different) language. The language itself is fairly easy to learn (we teach it as the first language at DIKU) but it takes a while for a programmer who thinks in objects and state to wrap his mind around a different way of expressing things.
Apparently there is also a compiler projects page for Bell Labs. Dave MacQueen of Bell Labs writes:
ML (Standard ML in particular) has a good track record in compiler applications. Compiling is basically a symbolic process, and SML is an excellent symbolic processing language. [...]
SML has many features that make it attractive from a software engineering point of view: safety, sound static typing, polymorphism, algebraic types and pattern matching, a powerful module system with parametric modules, exceptions, concurrency, garbage collection, etc. SML/NJ has a very helpful compilation manager, CM, that helps minimize development turn-around time. [...]
ckit 1.0 was released 31 March 2000. It has been used as the front end for various C static analysis tools in an industrial context where millions of lines of code are being analyzed, and it has been used in another industrial application as the front end for a C-derived domain specific language.
The MLRISC code generation framework is used in the SML/NJ compiler and in several other compiler projects. Back ends are currently provided for x86, sparc, alpha, powerpc, hppa.
The Eli System from the University of Colorado at Boulder.
According to the Eli Web page:
Eli provides modern compiler construction facilities to users with a wide range of sophistication. It offers complete solutions for commonly-encountered language implementation subtasks and contains libraries of reusable specifications, making possible the production of high-quality implementations from simple problem descriptions.
The system has been in the field since 1989, and has been used in a number of projects worldwide. It generates programs whose performance is comparable to that of a good hand-coded implementation. Development time for a processor using Eli is generally about one third of that for comparable hand code, and maintenance is significantly easier because specifications rather than implementations are being maintained.
Eli includes support for scanner, parser and AST generation. It also seems to provide support for generating code for the semantic analysis phase. For more details see the Eli Documentation. Although the Eli web site claims that Eli is easy to learn and that you don't have to have compiler experience, this looks like another compiler tool with a steep learning curve.
Commercial parser generators
The Visual Parse++ from Sandstone. This looks like a great tool. It generates LALR(1) parsers in several languages (C, C++ and Java). It comes with a complete graphical interface that shows the parser reductions, which should make debugging easier.
The YACC++ parser generator seems to have been originally developed at Johns Hopkins University. A Web page describing the features of YACC++ can still be found at Johns Hopkins.
Apparently YACC++ is now a commercial product from a software company named Compiler Resources Inc. However, it is the most expensive parser generator that I am aware of. YACC++ generates a C++ parser and claims to have much better error recovery than YACC or Bison.
AnaGram from Parsifal Software
To digress for a moment: Parsifal is an interesting choice of names for a software company. In the Arthurian Legend Parsifal was only knight among the Knights of the Round Table who found the holy grail. The grail itself is loaded with symbolism. In compiler design and implementation, the the holy grail is a tool (or tool set) that will create a compiler, from parser to code generator, from a set of formal descriptions (a grammar and instruction set description, for example). When this grail is found, it is said, compiler developers will all be out of work and cost of creating a new compiler will be low. Like Arthur's knights, many have searched for this grail, but no one has found it, at least for production quality compilers.
The AnaGram is an LALR(1) parser generator is supported on Win32. A happy AnaGram user sent me e-mail about this product.
Just a note for your "commercial parser generator" products web page, we have been using a great commercial product for years called "AnaGram" from Parsifal Software. Our company has been writing compilers for decades and have used several products over that history. First was YACC, then Bison, then LALR from a company now defunct. The last product used for our mainstream products has remained AnaGram because it works as advertised, we've never had an error in the code it generated, the documentation was good, and the support was exellent. It works under DOS and now Win/32, and generates C[++] code. It is accepts LALR(1) grammars and was lightyears ahead of the LALR YACC and Bison we had previously used.
The Lark LALR(2) and Ell (LL(1)) parser generator
I have not looked into this parser generator. It was mentioned in a comp.compilers posting. Apparently an older version is available free, but the newer version is a commercial compiler construction toolbox called Cocktail. The software is available from CoCoLab. Josef Grosch writes, in comp.compilers:
The parser generators Lark (LALR(2)) and Ell (LL(1)) of the Cocktail Toolbox provide automatic error reporting, recovery and repair. There is no need to add extra grammar rules using an error token. Everything is generated automatically. Example: if (a = b] write (a);
The above source line would result in the following message: 3, 13: Error found/expected : ]/) = + - <> <= >= < > IN OR ...
The location of the error is reported (line 3, column 13). The message is classified as Error. It is a syntax error indicated by "found/expected". The token ] is causing the syntax error. The tokens after the slash are the set of expected tokens. This set is truncated which is indicated by the three dots.
Error recovery also happens automatically. A minimal sequence of tokens is skipped so that parsing can safely continue. Moreover a minimal sequence of tokens is inserted (virtually) in order to repair the error. This has the advantage that the output of the parser is always consistent in the sense that it belongs to a syntactically correct input. Therefore phases following after parsing do not have to take care about syntax errors.
References:
J. Grosch, `Efficient and Comfortable Error Recovery in Recursive Descent Parsers', Structured Programming, 11, 129-140 (1990).
http://www.cocolab.de - look for ell.ps oder ell.pdf
Older versions of Cocktail are available for free on the Internet. Newer versions are commercial products.
Tree Parsers
Related to parser generators are tree parsers. Tree parsers read in a grammar that describes an abstract syntax tree that is generated by a compiler front end or compiler middle pass and generates a tree walker that processes the trees. ANTLR can be used to generate tree parsers. Another tree parser generator is iburg, developed by C. W. Fraser, D. R. Hanson, and T. A. Proebsting at Princeton and used to construct the code generator for Fraser and Hanson's lcc ANSI C compiler.
Kimwitu. This link is supposed to be a "persistent" URL. At the time this was written, Kimwitu resided at http://fmt.cs.utwente.nl/kimwitu/
Kinwitu is a software construction tool that takes trees as inputs and associates actions with tree "terms". Kimwitu was developed by Axel Belinfante and his colleagues Peter van Eijk, Henk Eertink and Henk Alblas at the University of Twente, Enschede, in the Netherlands. The strange name apparently comes from Swahili and deals with forests or trees.
Kimwitu is open source software covered under the GNU copyleft. However, the tree processor generated by Kimwitu can be used without restriction (e.g., you do not have to place your software under the GNU copyleft license if it uses a component generated with Kimwitu). See the Kimwitu license for more details.
Kimwitu is specificly targeted at compiler applications. A parser generates an abstract syntax tree (AST). Semantic analysis and tree transformation can then be performed with later passes written with the aid of Kimwitu.
The arguments for developing AST passes with a tool like Kimwitu is the same as the argument for using a parser generator. By developing at a higher level the code is easier to understand, easier to modify and faster to develop. In fact, the authors of Kimwitu state: "Experience has demonstrated that Kimwitu drastically speeds up development time, facilitates tool integration and generates production quality programs." [emphasis in the original]
Sadly there is no free lunch. The cost of more powerful tools is the learning curve to master the tool. Also, there is the issue of robustness. ANTLR has a large active user group so the classic open source argument applies to bug fixes, at least in theory (e.g., since a group of people uses the tool, you're not alone in fixing bugs). Of course in practice bug fixes may be really slow since those who maintain the open source tool are trying to make a living as well.
Kimwitu++
The Kimwitu tree parser outputs C. Kimwitu++ is a variant that outputs C++. As the name also suggests, Kimwitu++ is the "successor" to Kimwitu. One of the problems inherent in open source software is fragmentation and this seems to be an example of this. It is unclear how Kimwitu++ tracks the work being done on Kimwitu.
State Machine Generators
State machine generators are closely related to the deterministic finite automata (DFA) generated by scanner generators like the one built into ANTLR. However, scanner generators are aimed at language applications and are not well suited for state machine generation outside this area. The Libero state machine generator from iMatix supports state machine generation for a variety of applications.
The input to Libero is a state machine diagram expressed in a textural language. The output can be in a variety of languages ranging from Java and C++ to COBOL. The Libero tool is provided as open source software under the Free Software foundation GPL. The code generated by Libero can be used without restriction.
Other Lists of Parser Generators
The www.nullstone.com web pages have a number of links to a variety of topics involving compilers. The Nullstone test suite tests compiler optimization and is an invaluable tool for compiler testing and regression. Nullstone is run by Christopher Glaeser, who is an old friend and someone I have boundless respect for.
The Idiom.com Catalog of Free Compilers, Interpreters and Langauge Tools.. I don't have anything to do with the construction and maintenance of this Web page. However, Idiom.com is the ISP that hosts this Web page, my domain, bearcave.com and my shell account. Idiom is a great ISP. If you are trying to escape a large ISP that does not return your e-mail and keeps you waiting on the phone, try Idiom.
Why did I become interested in ANTLR?
I have used YACC (which is more or less interchangeable with Bison) for many years. YACC has several notable problems however:
Parsers created with YACC tend to have poor error reporting. Anyone who has used the original portable C compiler under UNIX will be familiar with the dreaded "syntax error line XX" error messages. With work, YACC can be made to do a better job by introducing error productions, but its still an uphill battle.
Errors in YACC grammars are difficult to understand. YACC will report a "shift/reduce" or "reduce/reduce" error associated with a given grammar rule. In many cases to understand why this error occured, the user must pour over the YACC debug trace, which can be obscure.
The parser generated by YACC is table driven (that is, the parser state transitions result from values in a table). This makes the parser difficult to impossible to debug.
Language grammars take a lot of work to develop. I was hoping that by using a parser generator with more than one token lookahead I would not have to work as hard on the grammar. As a friend of mine pointed out to me:
LR(k) for any k, most grammars that are not LR(1) are ambigious so the additional lookahead doesn't really do what most people think.
ANTLR's syntactic predicates allow arbitrary lookahead which can be a powerful way to resolve difficult parts of a syntax. But sadly they don't make grammars that much easier to develop. ANTLR grammars still take a lot of work and good grammar development is an art.
I need a parser generator that generates C++ (or at least C), since I am planning on implementing my compiler in C++.
I am planning to use the parser generator for a production quality compiler that I hope may be a commercial product one day where I might license the compiler source. So I had some other requirements:
Cost. For me, $500 is a lot of money for a software tool. So while tools like Visual Parse++ have attractive features, the cost is too high.
Longevity. As a compiler writer I know that selling compilers is not an easy business. Selling parser generators is an even more difficult business. So I don't want to spend a couple of years developing a compiler based on a product from company that may disappear.
Support. If I license my compiler to someone they will expect me to support all phases of the compiler. If a parser generator bug is found in the front end, I will be expected to fix it.
Given the issues listed above, ANTLR is the most attractive parser generator I have looked at:
ANTLR generates recursive decent parsers and has good error reporting "right out of the box". It appears that with a little work even better error reporting is possible.
The parser generated by ANTLR is more or less readable. This helps in debugging.
ANTLR is free.
ANTLR is available as "open source" so I don't have to worry about the vendor going out of business. Also, I can support it myself if necessary. ANTLR is not Linux, but there are a number of ANTLR users world wide, so there is a reasonable chance that bugs will be identified and corrected.
ANTLR will generate C++ parsers. Java may have some advantages as a compiler implementation language, but the performance of Java on compiler like applications is terrible. It is hard enough to make an optimization phase run with acceptable performance. If the compiler developer is saddled with Java's interpretive overhead as well, the compiler may be very slow. Unless native compiled Java is used, C++ is the only choice available for production quality compiler implementation.
Although ANTLR has some compelling advantages, it has some disadvantages as well:
ANTLR has a steep learning curve. There is a fair amount of HTML documention on ANTLR, but the tool is complex and I have found that the documentation can only be understood by working through many examples.
ANTLR grammars are not easy to debug. An error in an ANTLR grammar can result in an error report in a distant production. For example, I had an error in the grammar production for a C style for loop. ANTLR reported a non-determinism warning on the grammar production for unary minus. It was not until I was adding trees to the grammar and I corrected the problem with the for loop production that I understood that this was the problem. But it's worth mentioning that I have yet to use a parser generator where grammar errors were easy to find.
YACC follows the UNIX tool philosophy of doing one thing and providing a fairly simple interface to other tools and software. YACC is just a parser generator. If you want a scanner you use Lex or write one yourself. In the case of ANTLR, you get much more. ANTLR has a very nice lexical scanner description language, it can build trees, it can parse trees. ANTLR is sort of the Emacs of parser generators. This is great if you want to get a language processor up and running quickly. The ANTLR scanner even handles file input and line numbering. But if you don't want to use the stock ANTLR components things get more complicated. For example, if you don't want to use ANTLR's reference counted tree nodes but you still want to use the tree building syntax, its unclear to me what you need to do. Similarly there is only sketchy documentation on how to interface a non-ANTLR scanner with ANTLR.
Afterward
I did use ANTLR to generate a Java parser based on the John Mitchell, Terence Parr, John Lilley and Scott Stanchfield grammar. This is a very nice grammar and represents a lot of work on the author's part. I would like to publicly thank them for putting this grammar in the public domain. I am very happy with ANTLR. The parser is now complete and consists of about 5K lines of grammar, C++ actions and comments.
I do not use ANTLR tree generation. In part the reason for this is that I generate a C++ parser and C++ does not have an "interface" construct, like Java has. The Java parser defines the tree via a Java interface, which allows the Java parser user to define the actual implementation of the tree in any number of ways. For the C++ user however, the tree base class must be used and this had more features (like reference count) than I wanted. So I generate my own light weight trees using C++ actions.
Tree node definitions aside, there is another reason to generate AST through C++ or Java actions. The AST generated by the parser represents the core information for the statement or expression. As a result, tree shape is very important and a lot of work can go into generating trees that will make later processing easier. If trees are generated directly from ANTLR, tree shape tends to be controlled by modifying the grammar. This can fragment rules into several sub-productions. A grammar for a language like Java is already large. Fragmenting rules may make it more difficult to maintain. Direct generation of trees gives allows more freedom than generating trees using ANTLR. The drawback is that it is also more work.

No comments:

Post a Comment