Compiler Front End and Infrastructure Software
Compiler infrastructure software usually includes one or more language front ends, software for abstract syntax tree (AST) generation and AST modification. Optimization and code generation phases may also be included. These software suites are specific for compiler creation and in this domain are more powerful than parser generators, as long as the user is willing to live with the pre-built infrastructure. This software is also less general than parser generators which can be used for a range of language processing tasks (e.g., HTML or XML parsing, for example).
Although compiler design and development seems to have fallen out of favor as a research topic, it has a long and rich history. A great deal of work has been done on compiler infrastructure software. It is a gross understatement to say that this is not a complete bibliography. The OSUIF paper, referenced below, includes a more complete list of infrastructure tools.
Obviously these tool suites are not limited to or specific for Java. However, when considering the architecture for a Java compiler it is reasonable to consider whether some of this software can be used to simply the huge development task.
See Why Use ANTLR for a related Web page that discusses the ANTLR parser generator and lists other parser generator options (including tools like state machine generators and tree walker generators).
There is no free lunch
Software reuse has been one of the mantras of software development for the last twenty years (remember Ada?) The concept is both appealing and impossible to argue with: by reusing tested, carefully designed software, more reliable software can be created in less time. In practice software reuse has proven difficult and there are few wide spread successes. In my opinion there are, in fact, two: UNIX (or POSIX) and Java. The C++ Standard Template Library (STL) may be a more modest third.
There are two barriers to software reuse: design and complexity. Creating reusable software requires a deep understanding of object creation and reuse. This is a rare skill and one that software engineers spend their careers developing. Many object packages that are created for reuse are difficult to use because of design flaws. In addition to design issues, application specific packages, like compiler infrastructures (I know you were wondering where I was going with this), are difficult to use because they have a huge learning curve. These infrastructures consist of tens of thousands of lines of code. They can be rapidly reused by their creators, but reuse is more difficult in the hands of someone else.
A compiler infrastructure could be created entirely out of objects which could be plugged in and out to create new compiler software. Examples of these objects include memory allocation, parsers, flow graph creation and optimization. However, it is difficult to create an efficient set of compiler objects that are truly interchangeable. For example, if the AST changes, will flow graph creation still work? As a result, many infrastructures are interdependent and it is difficult to reuse one piece without them all.
Finally, many of these infrastructures are aimed at research applications and trade flexibility for performance and size. This makes them impractical for production quality compilers.
Compiler Infrastructures
Low Level Virtual Machine (LLVM) Compiler Infrastructure
The name of the LLVM Compiler Infrastructure is misleading. To me the name implies something like a virtual instruction set and virtual machine. This compiler infrastructure does include an intermediate based on Static Single Assignment. The infrastructure also includes optimization components, including interprocedural optimization and code generator support.
Open64 Project
The Open64 project is based on the Silicon Graphics SGI Pro64 compiler components. Open64 is a compiler suite supports C, C++ and Fortran 90. This is an open source release of the SGI compilers targeted at the Linux operating system and the Intel IA-64 processor. This compiler suite uses front ends based on the GNU gcc and g++ compiler front ends. The Fortran 90 front end is, presumably, SGI's.
Unlike the GNU compiler, this compiler suite uses a modern tree structured intermediate that is successively lowered. The SGI Web site includes documentation on the symbol table and intermediate format. The SGI compiler targets the Intel IA-64 architecture and includes an advanced optimization phase that goes far beyond anything that is likely to ever appear in the GNU compiler. Some features, like C++ exceptions, have not been implemented at the time of this writing. Although this compiler software seems to be based on SGI's existing compiler stream, the IA-64 incarnation is relatively new and immature. For example, at least two of the SPEC2000 benchmarks do not execute correctly when compiled with this compiler.
The SGI Pro64 compiler suite is unquestionably the most advanced open source compiler currently available. There is a huge learning curve for a large software base like this, but for many advanced compiler applications it seems likely that this compiler suite will displace the GNU compiler.
The license under which this software is released was not finalized at the time this note was written. SGI states that they plan to follow the Free Software Foundation's "copyleft" agreement, so any software that uses the SGI compiler software must be made available as open source as well.
SGI has fallen on hard times recently and has refashioned itself as a Linux supplier. Whether they can be profitable at this remains to be seen. It is remarkable to see the release of this compiler suite as open source. The only obvious motivation for releasing software representing millions of dollars worth of development effort and many years of work into the public domain is the culture of the Linux platform, where all software is open source. I have deep reservations about the long term viability of this model. From a financial perspective perhaps SGI views the development of this software as already paid for by the "old" SGI and written off at a loss. As a result, they don't seek any direct return on this software. On the other hand, there is an on-going cost in maintaining and enhancing these compilers. Since this technology can also be picked up by, say, Red Hat/Cygnus, it is not clear SGI will pay for these on-going costs.
Whether SGI ever makes money with this technology or not, this is a significant and very impressive software release. In particular, it appears that SGI has some deep insights into SSA optimization techniques.
OSUIF at UC Santa Barbara.
SUIF is an acronym for the Stanford University Intermediate Form. This is a common high level compiler intermediate aimed at compiler research. The "O" is for Object-oriented. The UC Santa Barbara project, lead by Professor Urs Hoelzle, has developed a Java .class file to SUIF front end.
Their paper j2s: A SUIF Java Compiler provides an interesting discussion of the issues involved with generating a flow graph from Java (or at least Java in byte code form). This compiler generates C output rather than native code. However, there are, at least in theory, SUIF backends that will generate native code.
The SUIF intermediate is a high level intermediate. It is very flexible and allows various compiler passes to be added. But part of the cost of this flexibility is performance. So SUIF does not appear to be appropriate for an industrial production compiler.
Zephyr: Tools for a National Compiler Infrastructure
The Zephyr project collaboration between Princeton University and University of Virginia. The Zypher site is hosted by the University of Virginia. The Zyphyr project has defined the Zephyr Abstract Syntax Description Language (ASDL) and a machine independent optimizer that works on ASDL intermediates. A set of code generators is also available.
The Stanford University SUIF Compiler System and the Zephyr project are funded by the national compiler infrastructure project. The Stanford SUIF project also defines the Stanford University Intermediate Format (SUIF). I am more familiar with the SUIF project at Stanford (since they are in my "backyard" and I knew some of the Stanford people when they were graduate students). From looking at the Zypher and SUIF web pages it is unclear to me what connection and collaboration between these two projects exists. SUIF also provides a set of optimizers and code generators. Apparently it is possible to define a SUIF intermediate using Zephyr. Zephyr provides some level of support for Java.
The national compiler infrastructure project is aimed at supporting compiler research. The software tends to be designed for flexibility and modularity. But is not designed for industrial use. Although the industrial compiler writer may learn from these projects, it seems unlikely that this software could be directly adapted for production quality compilers.
Impact Research Group at the University of Illinois
The University of Illionois has a long history of research work on high performance computing and advanced compilation techniques. The Impact group is working on compiler technology and evaluation tools for very wide instruction word (VLIW) instruction sets and other instruction level parallelism. The Impact group seems to have a very wide range of research topics, including Java runtime architecture and (my own favorite) native compilation from Java byte code. Given the range of topics and the size of the group it is not clear that the group can make equal progress in all areas.
Most of the compiler infrastructure packages concentrate on compiler front end and middle (AST generation) software. Some include target independent optimization. The Impact software seems to concentrate on backend and architecture dependent optimization.
Eventually the Impact group intends to make a compiler infrastructure and related documentation public. They write:
The IMPACT compiler, with its advanced features such as predicated compilation, instruction-level parallelism optimizations, compiler engineered speculation, profile-based optimizations, advanced machine description facilities, scheduling frameworks for resource sensitive code optimizations, and pointer-based dependence analysis and tracking facility, has become a premier compiler technology base for major U.S. companies as well as academic researchers. This project's software and documentation will be released in several phases over two years and the end product will be a high-quality and readily-available compiler environment that supports a wide variety of advanced instruction-level parallel processing research.
Currently this software is distributed via http://www.trimaran.org/.
Trimaran is a compiler infrastructure for supporting state of the art research in compiling for Instruction Level Parallel (ILP) architectures. The system is currently oriented towards EPIC (Explicitly Parallel Instruction Computing) architectures, and supports compiler research in what is typically considered to be "back end" techniques such as instruction scheduling, register allocation, and machine-dependent optimizations.
Trimaran is available for non-commercial applications. Presumably commercial applications are negotiated separately
Commercial Compiler Front End Products
Compiler front ends, consisting of at least a parser and in many cases symbol table and semantic analysis support are a lot of work to implement (the EDG front end, below, consists of almost 300K lines of source code and comments).
Over time there have been a number of companies that offer front end software. One of the most widely used is the Edison Design Group's C++ front end. This front end reads C++ and outputs ANSI C.
The Edison Design Group (EDG) has been in business for quite a while. However, many front end companies have come and gone. For example, Compass Design Automation sold a VHDL front end that has been used by several EDA companies. Compass was purchased by Avant!, which as far as I know has stopped selling the front end. Another VHDL front end was available from Leda (a French software company). However Leda has been purchased by Synopsys. Synopsys no longer seems to be selling the Leda VHDL front end (this is not a surprise, since Synopsys would be supplying potential competitors with an important component that could be used to build tools that compete with those sold by Synopsys. Listed below is a partial list of front end companies (other than EDG, mentioned above):
The Java Front End from Bear Products International
This front end is based on a parser generated with ANTLR. It reads Java source and class files and does full semantic analysis. The output of the front end is a decorated abstract syntax tree (AST) and a hierarchical symbol table (referenced from the AST). This is written by the author of these web pages.
C/C++ Front End and Parser from Semantic Designs Inc.
In a posting to comp.compilers, Ira Baxter describes this software as
Semantic Designs offers the DMS Reengineering Toolkit, with a robust C/C++ front end parser that handles preprocessor directives, automatically builds an AST retaining comments, literal formats, etc. can prettyprint the AST back to source form reproducing the comments/ literals, and can carry out pattern-directed transformations on the tree. (DMS also handles many other languages.)
Compiler creation software
The ANTLR parser generator has been discussed in detail elsewhere on this site. Parser generators are relatively general tools, since they can be used to build parsers for languages like HTML and scripting languages. Compiler creation software is specialized software for translating a language into native code. In addition to the parser generator, compiler creation software will usually include a standard intermediate that is shared by the language front ends, various optimization passes and code generators. At one time MetaWare sold compiler creation software. They don't seem to do this currently. Compass, a Boston area compiler and consulting company that worked on compilers for Thinking Machines and MasPar used to sell compiler creation software, but they are no longer in business. So the list included below is rather short.
Elegant compiler generation tool set from Philips Research.
To quote the Philips Research posting in comp.compiler, the Elegant compiler generator tool set is
an industrial compiler generator, used for many years within Philips. After development of Elegant was stopped, it was decided to make it available in the public domain under the GNU license. This includes all sources, so that you will be able to extend the system if you feel inspired by it.
Elegant is a compiler generator based on attribute grammars with the following features:
It accepts LL(1) and LALR(1) grammars.
It can generate a back-tracking parser for other grammars.
Any non-cyclic attribute dependency is allowed (!).
Comes with
front-end generator
scanner generator
postscript syntax diagram generator
Elegant programing language
Integrated automatic error recovery
According to Philips Research,
Compilers written in Elegant are very fast (several thousand lines of source text per second) and the attribute grammars offer a very expressive formalism on a high level of abstraction, without sacrificing performance.
The tool Front is a front-end generator that comes with Elegant. It accepts a mix of an EBNF context free grammar and a set of type definitions that describe the abstract syntax graph of the language. From this, Front generates an Elegant attribute grammar that maps an input string onto this data structure, including all scoping and symbol-table handling. Front allows the easy and fast construction of compiler front-ends, which makes it particularly suited for language design.
The Elegant programming language is smoothlessly integrated with the attribute grammar specification language (in fact, the latter is a subset of the former). The language is strongly inspired by functional programming languages, especially their type systems, yet, it is an imperative language that does not discourage side-effects. The language features:
Strong typing.
Subtyping (linear inheritance).
Polymorphic types.
Polymorphic functions.
Several different lazy types.
Overloading.
List comprehensions (they are overloaded, so they work for many other types as well, including your own).
Automatic and user definable coercions.
Pattern matching (= sub-type analysis).
Function (lambda) expressions.
Module system.
The whole system offers:
Lots of compile time and run-type checks.
Garbage collection.
Compiles onto ANSI C.
Interface to C.
Available on SunOS 5, HP_UX 11, Linux, IRIX.
Portable to anything with a decent ANSI C compiler.
Self generating, i.e. written in Elegant (of course!).
All in all a very professional and complete system, free for you to use!
Whether Elegant is really a "compiler generation tool set" depends on how one defines a compiler. There does not appear to be any optimization and code generations support (the compiler outputs C code). So Elegant is really software for the creation of compiler front ends and middles passes (where the middle pass follows the front end, but exists before control flow graph generation). On the other hand, this might be a great tool set to implement a language like Python.
Associated Compiler Engineers (ACE) CoSy Compilation System
Given the shortage of compiler creation software (compared to parser generators, for example) it would seem that there is not a huge market clamoring for these tools. This leaves one to puzzle over the deeper meaning of the quote below, from Associated Compiler Engineers:
Processors and hardware architectures are emerging more rapidly than ever before, have a shorter life-cycle, and are becoming increasingly complex. Compiler developers who still use the classical sequential operation approach are experiencing setbacks due to the complexity and in-chip parallelism of recent processors. The cost of hand-crafting optimal compilers is spiraling, and the classical approach therefore becomes less and less cost effective, and in the long run, impossible to bear.
While it is true that optimizing compilers are expensive to build, it is not clear that lots of people want to build them.
The CoSy Compilation System is ACE's compiler creation software toolset. This include front ends for a variety of languages, including Java. A full set of optimization passes are also available.
Saturday, September 5, 2009
Monday, August 31, 2009
Using Microsoft Java From the Command Line
Microsoft's language tools (e.g., Visual C++ anv Visual J++) are organized around the Microsoft Developer Studio visual "shell" interface. Many users will never have to venture beyond this interface. But if you want to develop source code that can be built on both Windows and UNIX platforms you will probably have to use make/nmake which will reference the compiler from the "command line". Also, if you use software that originally ran on UNIX you may have to use the command line interface. In the case of Visual C++ once you discover that the C++ compiler is named cl things are pretty straight forward. For Microsoft Java I found command line use more obscure. This note discusses how to use Microsoft's Java from the command line.
The Microsoft Java Compiler: jvc.exe
The Microsoft Java compiler converts Java source into Java byte code. The convention is that Java source files end with the .java suffix. The Microsoft Java compiler reads a .java file and generates a .class file. The .class file can only be executed by the Java interpreter.
The Microsoft Java Interpreter (a.k.a. Java Virtual Machine): jview.exe
The jview program reads in a .class file for a console application and executes it. Currently I'm not sure what it does if it's given a GUI applet or application. From Microsoft's documentation it appears that these should be run on Wjview.exe.
The CLASSPATH environment variable
The CLASSPATH environment variable tells the Microsoft Java interpreter where to find libraries of classes. This variable can be set in the system dialog, which is under the control pannel.
A CLASSPATH example
Like VHDL (and I believe Ada), Java stores class libraries in a directory. I have been using a tool named ANTLR, which is a parser generator written in Java. The ANTLR tool, documentation etc... is installed in E:\antlr
The Java class library for ANTLR is in a subdirectory named antlr (e.g., E:\antlr\antlr). So I set CLASSPATH to E:\antlr. The ANTLR parser generator classis Tool. So to run the parser generator I use the following command line command:
jview antlr.Tool grammar.g
The CLASSPATH variable tells the jview Java interpreter to look in the directory E:\antlr. The argument antlr.Tool tells jview to look in the antlr subdirectory for the Tool.class file (which is Java byte code).
Microsoft's language tools (e.g., Visual C++ anv Visual J++) are organized around the Microsoft Developer Studio visual "shell" interface. Many users will never have to venture beyond this interface. But if you want to develop source code that can be built on both Windows and UNIX platforms you will probably have to use make/nmake which will reference the compiler from the "command line". Also, if you use software that originally ran on UNIX you may have to use the command line interface. In the case of Visual C++ once you discover that the C++ compiler is named cl things are pretty straight forward. For Microsoft Java I found command line use more obscure. This note discusses how to use Microsoft's Java from the command line.
The Microsoft Java Compiler: jvc.exe
The Microsoft Java compiler converts Java source into Java byte code. The convention is that Java source files end with the .java suffix. The Microsoft Java compiler reads a .java file and generates a .class file. The .class file can only be executed by the Java interpreter.
The Microsoft Java Interpreter (a.k.a. Java Virtual Machine): jview.exe
The jview program reads in a .class file for a console application and executes it. Currently I'm not sure what it does if it's given a GUI applet or application. From Microsoft's documentation it appears that these should be run on Wjview.exe.
The CLASSPATH environment variable
The CLASSPATH environment variable tells the Microsoft Java interpreter where to find libraries of classes. This variable can be set in the system dialog, which is under the control pannel.
A CLASSPATH example
Like VHDL (and I believe Ada), Java stores class libraries in a directory. I have been using a tool named ANTLR, which is a parser generator written in Java. The ANTLR tool, documentation etc... is installed in E:\antlr
The Java class library for ANTLR is in a subdirectory named antlr (e.g., E:\antlr\antlr). So I set CLASSPATH to E:\antlr. The ANTLR parser generator classis Tool. So to run the parser generator I use the following command line command:
jview antlr.Tool grammar.g
The CLASSPATH variable tells the jview Java interpreter to look in the directory E:\antlr. The argument antlr.Tool tells jview to look in the antlr subdirectory for the Tool.class file (which is Java byte code).
Friday, August 28, 2009
Using Microsoft Java From the Command Line
Microsoft's language tools (e.g., Visual C++ anv Visual J++) are organized around the Microsoft Developer Studio visual "shell" interface. Many users will never have to venture beyond this interface. But if you want to develop source code that can be built on both Windows and UNIX platforms you will probably have to use make/nmake which will reference the compiler from the "command line". Also, if you use software that originally ran on UNIX you may have to use the command line interface. In the case of Visual C++ once you discover that the C++ compiler is named cl things are pretty straight forward. For Microsoft Java I found command line use more obscure. This note discusses how to use Microsoft's Java from the command line.
The Microsoft Java Compiler: jvc.exe
The Microsoft Java compiler converts Java source into Java byte code. The convention is that Java source files end with the .java suffix. The Microsoft Java compiler reads a .java file and generates a .class file. The .class file can only be executed by the Java interpreter.
The Microsoft Java Interpreter (a.k.a. Java Virtual Machine): jview.exe
The jview program reads in a .class file for a console application and executes it. Currently I'm not sure what it does if it's given a GUI applet or application. From Microsoft's documentation it appears that these should be run on Wjview.exe.
The CLASSPATH environment variable
The CLASSPATH environment variable tells the Microsoft Java interpreter where to find libraries of classes. This variable can be set in the system dialog, which is under the control pannel.
A CLASSPATH example
Like VHDL (and I believe Ada), Java stores class libraries in a directory. I have been using a tool named ANTLR, which is a parser generator written in Java. The ANTLR tool, documentation etc... is installed in E:\antlr
The Java class library for ANTLR is in a subdirectory named antlr (e.g., E:\antlr\antlr). So I set CLASSPATH to E:\antlr. The ANTLR parser generator classis Tool. So to run the parser generator I use the following command line command:
jview antlr.Tool grammar.g
The CLASSPATH variable tells the jview Java interpreter to look in the directory E:\antlr. The argument antlr.Tool tells jview to look in the antlr subdirectory for the Tool.class file (which is Java byte code).
Microsoft's language tools (e.g., Visual C++ anv Visual J++) are organized around the Microsoft Developer Studio visual "shell" interface. Many users will never have to venture beyond this interface. But if you want to develop source code that can be built on both Windows and UNIX platforms you will probably have to use make/nmake which will reference the compiler from the "command line". Also, if you use software that originally ran on UNIX you may have to use the command line interface. In the case of Visual C++ once you discover that the C++ compiler is named cl things are pretty straight forward. For Microsoft Java I found command line use more obscure. This note discusses how to use Microsoft's Java from the command line.
The Microsoft Java Compiler: jvc.exe
The Microsoft Java compiler converts Java source into Java byte code. The convention is that Java source files end with the .java suffix. The Microsoft Java compiler reads a .java file and generates a .class file. The .class file can only be executed by the Java interpreter.
The Microsoft Java Interpreter (a.k.a. Java Virtual Machine): jview.exe
The jview program reads in a .class file for a console application and executes it. Currently I'm not sure what it does if it's given a GUI applet or application. From Microsoft's documentation it appears that these should be run on Wjview.exe.
The CLASSPATH environment variable
The CLASSPATH environment variable tells the Microsoft Java interpreter where to find libraries of classes. This variable can be set in the system dialog, which is under the control pannel.
A CLASSPATH example
Like VHDL (and I believe Ada), Java stores class libraries in a directory. I have been using a tool named ANTLR, which is a parser generator written in Java. The ANTLR tool, documentation etc... is installed in E:\antlr
The Java class library for ANTLR is in a subdirectory named antlr (e.g., E:\antlr\antlr). So I set CLASSPATH to E:\antlr. The ANTLR parser generator classis Tool. So to run the parser generator I use the following command line command:
jview antlr.Tool grammar.g
The CLASSPATH variable tells the jview Java interpreter to look in the directory E:\antlr. The argument antlr.Tool tells jview to look in the antlr subdirectory for the Tool.class file (which is Java byte code).
Wednesday, August 26, 2009
The ANTLR Parser Generator
These web page discuss my experience with a parser generator called ANTLR. ANTLR is a second generation parser generator. The first generation was PCCTS. ANTLR is implemented in Java, but will generate parsers in both Java and C++.
I have found that the learning curve for ANTLR is very steep. However, this is not unique to ANTLR. YACC/Bison also takes some effort to learn. The one advantage of YACC/Bison is that there are a number of good books on their use. ANTLR does not have as wide a user base yet. The Web site www.antlr.org includes HTML documentation for ANTLR. However, ANTLR is complicated and I have found that this documentation provides only a starting point. I have spent many hours working with ANTLR and working on small examples to learn the tool. In the spirit of free software I am publishing these Web pages in the hope that they will help other ANTLR users. As my use of ANTLR progresses I will update these Web pages, so this is definitely a work in progress.
Since ANTLR is an open source project (under the gentle guidance of Terence Parr), there are a number of active contributors. One of these is Ric Klaren. Ric publishes an ANTLR development snapshot. (http://wwwhome.cs.utwente.nl/~klaren/antlr/antlr-devel.tar.gz).
The main discussion group for ANTLR is hosted by Yahoo Groups and can be found at http://groups.yahoo.com/group/antlr-interest. The archives and current messages can be viewed, since this is a public forum. To post you need to join the group. The Yahoo user interface leaves a bit to be desired. The number of messages posted per month is shown at the bottom of the page. Click on the current month and you will see the latest discussion.
A great deal of credit should be given to anyone who puts the effort, time and creativity into building something. Many people will come along and criticize the works of others, but most of these critics have never created anything significant themselves. The critics have never struggled with the issues that the creator of the work did and in most cases could not create a similar work themselves.
ANTLR is a significant piece of software and a great deal of credit is due to Terence Parr, Peter Wells and the other developers and maintainers of ANTLR. In these Web pages I discuss some of the pros and cons of ANTLR, as I see them. But I recognize the efforts of the authors and maintainers of ANTLR. They stepped up and created the software.
Why Use ANTLR?
Running ANTLR under Microsoft Java
Building a C++ ANTLR Parser on Windows NT (this should also help UNIX users).
ANTLR Examples developed while I learned to use ANTLR. This page is devoted to grammar and AST generation issues with ANTLR.
Miscellaneous ANTLR Examples. Various useful ANTLR examples that don't involve grammar or AST generation
Why Use ANTLR?
Introduction
This Web page discussions parser generators and the ANTLR parser generator in particular. See Compiler Front End and Infrastructure Software for compiler construction software suites.
If you just stumbled on this Web page you may be wondering what a parser generator is. I should warn you that this is probably not the best place to learn the answer. There are a number of excellent books that discuss parser generators and compiler construction. These include the all time classic compiler text (the so called "red dragon book") Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman, Addison Wesley, 1986 and Advanced Compiler Design and Implementation by Muchnick, Morgan Kaufmann, 1997. Briefly, a parser generator is a program that reads a grammar that describes a language and outputs source code (in language like C, C++, or Java) that will recognize the language described by the grammar. The most commonly used parser generator is the UNIX program YACC (the Free Software Foundation program Bison is a GNU version of YACC). Both UNIX and Windows NT versions of Bison are available from Cygnus.
A low key, largely non-technical introduction to compiler constructon has been written by Jack Crenshaw, who currently writes an outstanding column for Embedded Systems.
Another longer and more academic Web based reference is the book Compilers and Compiler Generators: an introduction with C++ by P.D. Terry, Rhodes University, 1996. This web pages publishes P.D. Terry's book in a variety of formats, including postscript and Adobe pdf format.
ANTLR is a second generation parser generator. The first generation was PCCTS. Both parsers were designed and implemented by Terence Parr. Various people have helped over the years. See the ANTLR and PCCTS source for the complete credits.
Although ANTLR is implemented in Java, it will generate either a Java or a C++ parser. I sometimes get e-mail asking which parser should be used: ANTLR or PCCTS. I don't have enough experience with PCCTS to comment on this. My view is that ANTLR represents the current main stream of development. Also, I would assume that the lessons Terence learned implemented PCCTS were incorporated into ANTLR. But there is an active PCCTS user base, so I am sure there are other, probably better informed view on this.
Although there is some brief introductory material, this Web page is written for people who already have some knowledge of compiler and language processor construction.
The ANTLR documentation and software can be found at www.antlr.org. ANTLR is a available in source form, with only the restriction being that the authors are credited. ANTLR evolved from PCCTS (the Purdue Compiler Construction Tool Set), which was developed by Terance Parr as part of his PhD thesis work. ANTLR is a rewrite of PCCTS. A history of ANTLR can be found here.
Automaticly Generated Parsers vs. Hand Crafted Parsers
The first compiler I worked on was the UCSD Pascal compiler. This compiler was based on Niklaus Wirth's Pascal compiler written at ETH in Switzerland. At the time, parser generators were in their infancy. Parsers were written by hand. The Pascal parser was a recursive decent parser, whose structure mirrored the Pascal syntax.
The technique for creating this kind of recursive decent parser was to carefully develop the BNF syntax for the language to be parsed and then write the parser so that it mirrored this syntax. Any change in the grammar had to be mirrored with a change in the software. This could be time consumming and any mistakes in the parser software would result in incorrect parsing.
Parser generators allow the programmer to work at a higher level of abstraction. Creating the language parser involves writing the grammar, instead of directly writing the software that parses the grammar. Changes that might take a few hours to implement in a hand crafted recursive decent parser can be made in a few minutes when a parser generator is used. A few hundred lines of grammar is also much easier to understand than a few thousand lines of hand crafted parser. Two arguments have classicly been used against automaticly created parsers:
Automaticly generated parsers are slow and use lots of memory.
Error reporting is poor in automaticly generated parsers.
Most language processors spend a fraction of their time parsing. For example, most compilers spend more time in optimization and code generation than they do in language parsing. On a modern computer system the time taken by the parser will probably be dominated by disk access, not parser execution. Nor is the issue of memory usage an argument against automaticly generated parsers. On a computer system with 128 Mb of RAM and perhaps 256 Mb of Virtual memory, the amount of memory used by the parser will usually be insignificant.
Error reporting is a problem, especially for YACC/Bison. ANTLR generates a recursive decent parser from the grammar and has fairly good error reporting (this is discussed at greater length below). Of course "good" is a matter of opinion.
Although I started my software career writing recursive decent parsers and have written hand crafted parsers for a variety of lanuages, I now greatly prefer automaticly generated parsers.
Since we're on the topic of YACC, a minor digression:
Another problem with YACC are the obscure shift-reduce and reduce-reduce conflicts. Terance Parr writes that PCCTS, the predecessor to ANTLR, was created in response to his horror of having to debug YACC grammars. If you do have to debug YACC grammars, Reiner Pesch has written a tool that might help. I have not tried it out and have included this link primarily because it might be of help in the future, since I am sometimes forced to use YACC on projects at work. Reiner writes, in comp.compilers (February 12, 2001):
Due to my diploma thesis in computer science I have implemented a tool which intends to assist developers using LALR parser generators like yacc with the analysis of shift/reduce- and reduce/reduce-conflicts. (The tool currently works with yacc, byacc, bison and happy).
So if you're a user of any LALR parser generator mentioned above, you could do me (and hopefully yourself) a favour by taking a look at ... http://www-users.rwth-aachen.de/reiner.pesch/yca/yca.html.
The yca.html Web page publishes a tool called YCA (YACC Conflict Analyser[sic]). YACC/Bison grammar conflicts can take a lot of time to understand so any tool that helps is worth looking into.
Parser Generator Options
Lookahead
The term lookahead refers to the number of lexical tokens that a parser looks at, beyond the current one, to determine the matching grammar production. Parser generators have a long history and they originally generated parsers to run on machines that, by modern standards, had very little memory (e.g., 129K bytes to 4Mb). Since token lookahead tends to increase parser memory use and make the parser slower, most parser generators created parsers that supported only one token lookahead. This is denoted by a number next to the grammar type (for example, LALR(1) is an LALR grammar with one token lookahead). Several more recent parser generators (ANTLR and JavaCC) can create parsers that can selectively lookahead more than one token (e.g., LL(k)).
Creating a grammar for a complex language is a lot of work. It is easy to accidently create a grammar with ambiguities, where the parser cannot tell two or more productions apart. Parser generators will report errors or warnings when ambiguous grammars are encountered. However, with every parser generator I've ever used, these messages are difficult to interpret. You must pour over the grammar considering how the grammar productions collided to create ambiguity. At first glance, the grammar writer may think that an LL(k) grammar will allow sloppier grammars. This is not actually the case. A grammar that contains ambiguity can actually be more ambiguous with k lookahead. Having k lookahead does allow some productions to be more clearly recognized, however. The most common example of this is the else clause in C, C++ and Java which can be clearly specified in a grammar production with selective k lookahead. This allows the grammar to be somewhat smaller. A parser with k lookahead can also recognize a wider range of languages than a parser with only one token lookahead. But these languages must still be unambiguously specified. So k lookahead provides power and flexibility, but I have not found that it greatly decreased the work necessary to create a grammar.
Open Source or Free (no license fee) Parser Generators
Until YACC became part of the UNIX software tool set, parser generators existed only in academia and at a few companies that kept the tools to themselves. Yacc is now a quarter century old. The original paper on yacc was written in July, 1975. An HTML version can be found here.
There are now a remarkable variety of parser generators available in addition to ANTLR. These parser generators include open source software, free binaries and commercial products. Originally this list came from a google search. I have updated it with information from comp.compilers. This list is getting long enough that I should probably break it out into its own web page.
Java Compiler Compiler, a parser generator written in Java that generates a Java parser. JavaCC was written at Sun Microsystems and is available free.
Metamata Parse is a follow-on tool to Sun's JavaCC. It is also available free. Metamata Parse claims to be an improved Java parser generator with a number of advantages over JavaCC. Metamata seems to have disappeared and I have not found another source for their version of JavaCC.
CUP: LALR Parser Generator for Java. This parser generator takes a YACC style grammar and creates a Java parser. Instead of YACC's $-number syntax, CUP uses labels that remind me of ANTLR. The parser created by CUP is table driven. CUP appears to use YACC style error recovery as well. The CUP documentation looks pretty good so if you only need a simple language recognizer in Java CUP might be a good choice. CUP is free software and is provided as part of at least some Linux distributions.
IBM's Jikes Parser Generator is part of the IBM open source Jikes Java compiler release. Like the Jikes Java compiler, the parser generator is available in source form. The Jikes software is not covered by the Free Software Foundation's copyleft, but can be used more liberally (e.g, as far as I can tell, if you use IBM's source in a product you don't have to disclose all of the source of the product).
Jikes Parser Generator is a parser generator that accepts as input an annotated description for a language grammar and produces text files suitable for inclusion in a parser for that language. It is similar in function and use to the widely-available parser generators Yacc and Bison.
It also provides support for automatic diagnosis and recovery from syntactic errors. It is the parser generator used by the Jikes Compiler. Jikes Parser Generator can generate parsers for LALR(k) grammars and produce output suitable for use with parsers written in Java, C, or C++.
Lemon Parser generator from a small software company in North Carolina named Hipp, Wyrick & Company, Inc., or "Hwaci" for short. This is an LALR(1) parser generator that is available in "Open Source" form.
btyacc. Gary Funck sent me a pointer to btyacc which is a YACC inspired parser generator. The btyacc parser generator is described as:
This is original Berkeley Yacc modified by Chris Dodd chrisd@collins.com and then further enhanced by Vadim Maslov vadik@siber.com of Siber Systems to better fit a production environment (and to have less bugs, too).
The btyacc parser generator creates a C++ table driven parser from a YACC style grammar. The authors write that btyacc has "industrial strength error processing [reporting?] and recovery", which would be a big improvement over stock YACC. They have also added a number of useful ideas, including a standard variable for source position, which can be used to annotate the abstract syntax trees created during parsing. One of the central features of btyacc is its support for syntactic backtracking. Chris Dodd writes:
BTYACC is a modified version of yacc that supports automatic backtracking and semantic disambiguation to parse ambiguous grammars, as well as syntactic sugar for inherited attributes (which tend to introduce conflicts). Whenever a btyacc generated parser runs into a shift-reduce or reduce-reduce error in the parse table, it remembers the current parse point (yacc stack and input stream state), and goes into trial parse mode. It then continues parsing, ignoring most rule actions. If it runs into an error (either through the parse table or through an action calling YYERROR), it backtracks to the most recent conflict point and tries a different alternative. If it finds a successful parse (reaches the end of the input or an action calls YYVALID), it backtracks to the point where it first entered trial parse mode, and continues with a full parse (executing all actions), following the path of the successful trial.
Btyacc is available in source form.
Ckit from Bell Labs
I have not looked into this parser generator. Ckit was mentioned in the USENET group comp.compilers by Torben Mogensen. Torben writes:
There is a tool called ckit which is basically a parser for C written in SML. The output is a representation of the abstract syntac tree. If I was to write a C compiler today, I would start from this.
SML has several features that makes it very well suited for writing compilers:
Recursive datatypes with pattern matching.
Automatic memory management (GC).
A good module system.
A type system that catches many errors at compile-time that would not be caught at all by C.
Exceptions
A wide range of compilers, so you can develop on a fast compiler and run your final code through an optimizing compiler.
You can see Andrew Appel's book Modern Compiler Implementation in ML for examples of compiler-code written in SML. You can even compare with how the same things are done in C or Java by also looking at Modern Compiler Implementation in C and Modern Compiler Implementation in Java by the same author. [Prof. Appel's web page on ML and compiling can be found here]
As for maintenance, the module system and the type system are really aids for maintenance. Most ML implementations have some kind of intelligent recompilation scheme.
In all, the major problem may be social: C/C++ programmers may be reluctant to learn a new (and radically different) language. The language itself is fairly easy to learn (we teach it as the first language at DIKU) but it takes a while for a programmer who thinks in objects and state to wrap his mind around a different way of expressing things.
Apparently there is also a compiler projects page for Bell Labs. Dave MacQueen of Bell Labs writes:
ML (Standard ML in particular) has a good track record in compiler applications. Compiling is basically a symbolic process, and SML is an excellent symbolic processing language. [...]
SML has many features that make it attractive from a software engineering point of view: safety, sound static typing, polymorphism, algebraic types and pattern matching, a powerful module system with parametric modules, exceptions, concurrency, garbage collection, etc. SML/NJ has a very helpful compilation manager, CM, that helps minimize development turn-around time. [...]
ckit 1.0 was released 31 March 2000. It has been used as the front end for various C static analysis tools in an industrial context where millions of lines of code are being analyzed, and it has been used in another industrial application as the front end for a C-derived domain specific language.
The MLRISC code generation framework is used in the SML/NJ compiler and in several other compiler projects. Back ends are currently provided for x86, sparc, alpha, powerpc, hppa.
The Eli System from the University of Colorado at Boulder.
According to the Eli Web page:
Eli provides modern compiler construction facilities to users with a wide range of sophistication. It offers complete solutions for commonly-encountered language implementation subtasks and contains libraries of reusable specifications, making possible the production of high-quality implementations from simple problem descriptions.
The system has been in the field since 1989, and has been used in a number of projects worldwide. It generates programs whose performance is comparable to that of a good hand-coded implementation. Development time for a processor using Eli is generally about one third of that for comparable hand code, and maintenance is significantly easier because specifications rather than implementations are being maintained.
Eli includes support for scanner, parser and AST generation. It also seems to provide support for generating code for the semantic analysis phase. For more details see the Eli Documentation. Although the Eli web site claims that Eli is easy to learn and that you don't have to have compiler experience, this looks like another compiler tool with a steep learning curve.
Commercial parser generators
The Visual Parse++ from Sandstone. This looks like a great tool. It generates LALR(1) parsers in several languages (C, C++ and Java). It comes with a complete graphical interface that shows the parser reductions, which should make debugging easier.
The YACC++ parser generator seems to have been originally developed at Johns Hopkins University. A Web page describing the features of YACC++ can still be found at Johns Hopkins.
Apparently YACC++ is now a commercial product from a software company named Compiler Resources Inc. However, it is the most expensive parser generator that I am aware of. YACC++ generates a C++ parser and claims to have much better error recovery than YACC or Bison.
AnaGram from Parsifal Software
To digress for a moment: Parsifal is an interesting choice of names for a software company. In the Arthurian Legend Parsifal was only knight among the Knights of the Round Table who found the holy grail. The grail itself is loaded with symbolism. In compiler design and implementation, the the holy grail is a tool (or tool set) that will create a compiler, from parser to code generator, from a set of formal descriptions (a grammar and instruction set description, for example). When this grail is found, it is said, compiler developers will all be out of work and cost of creating a new compiler will be low. Like Arthur's knights, many have searched for this grail, but no one has found it, at least for production quality compilers.
The AnaGram is an LALR(1) parser generator is supported on Win32. A happy AnaGram user sent me e-mail about this product.
Just a note for your "commercial parser generator" products web page, we have been using a great commercial product for years called "AnaGram" from Parsifal Software. Our company has been writing compilers for decades and have used several products over that history. First was YACC, then Bison, then LALR from a company now defunct. The last product used for our mainstream products has remained AnaGram because it works as advertised, we've never had an error in the code it generated, the documentation was good, and the support was exellent. It works under DOS and now Win/32, and generates C[++] code. It is accepts LALR(1) grammars and was lightyears ahead of the LALR YACC and Bison we had previously used.
The Lark LALR(2) and Ell (LL(1)) parser generator
I have not looked into this parser generator. It was mentioned in a comp.compilers posting. Apparently an older version is available free, but the newer version is a commercial compiler construction toolbox called Cocktail. The software is available from CoCoLab. Josef Grosch writes, in comp.compilers:
The parser generators Lark (LALR(2)) and Ell (LL(1)) of the Cocktail Toolbox provide automatic error reporting, recovery and repair. There is no need to add extra grammar rules using an error token. Everything is generated automatically. Example: if (a = b] write (a);
The above source line would result in the following message: 3, 13: Error found/expected : ]/) = + - <> <= >= < > IN OR ...
The location of the error is reported (line 3, column 13). The message is classified as Error. It is a syntax error indicated by "found/expected". The token ] is causing the syntax error. The tokens after the slash are the set of expected tokens. This set is truncated which is indicated by the three dots.
Error recovery also happens automatically. A minimal sequence of tokens is skipped so that parsing can safely continue. Moreover a minimal sequence of tokens is inserted (virtually) in order to repair the error. This has the advantage that the output of the parser is always consistent in the sense that it belongs to a syntactically correct input. Therefore phases following after parsing do not have to take care about syntax errors.
References:
J. Grosch, `Efficient and Comfortable Error Recovery in Recursive Descent Parsers', Structured Programming, 11, 129-140 (1990).
http://www.cocolab.de - look for ell.ps oder ell.pdf
Older versions of Cocktail are available for free on the Internet. Newer versions are commercial products.
Tree Parsers
Related to parser generators are tree parsers. Tree parsers read in a grammar that describes an abstract syntax tree that is generated by a compiler front end or compiler middle pass and generates a tree walker that processes the trees. ANTLR can be used to generate tree parsers. Another tree parser generator is iburg, developed by C. W. Fraser, D. R. Hanson, and T. A. Proebsting at Princeton and used to construct the code generator for Fraser and Hanson's lcc ANSI C compiler.
Kimwitu. This link is supposed to be a "persistent" URL. At the time this was written, Kimwitu resided at http://fmt.cs.utwente.nl/kimwitu/
Kinwitu is a software construction tool that takes trees as inputs and associates actions with tree "terms". Kimwitu was developed by Axel Belinfante and his colleagues Peter van Eijk, Henk Eertink and Henk Alblas at the University of Twente, Enschede, in the Netherlands. The strange name apparently comes from Swahili and deals with forests or trees.
Kimwitu is open source software covered under the GNU copyleft. However, the tree processor generated by Kimwitu can be used without restriction (e.g., you do not have to place your software under the GNU copyleft license if it uses a component generated with Kimwitu). See the Kimwitu license for more details.
Kimwitu is specificly targeted at compiler applications. A parser generates an abstract syntax tree (AST). Semantic analysis and tree transformation can then be performed with later passes written with the aid of Kimwitu.
The arguments for developing AST passes with a tool like Kimwitu is the same as the argument for using a parser generator. By developing at a higher level the code is easier to understand, easier to modify and faster to develop. In fact, the authors of Kimwitu state: "Experience has demonstrated that Kimwitu drastically speeds up development time, facilitates tool integration and generates production quality programs." [emphasis in the original]
Sadly there is no free lunch. The cost of more powerful tools is the learning curve to master the tool. Also, there is the issue of robustness. ANTLR has a large active user group so the classic open source argument applies to bug fixes, at least in theory (e.g., since a group of people uses the tool, you're not alone in fixing bugs). Of course in practice bug fixes may be really slow since those who maintain the open source tool are trying to make a living as well.
Kimwitu++
The Kimwitu tree parser outputs C. Kimwitu++ is a variant that outputs C++. As the name also suggests, Kimwitu++ is the "successor" to Kimwitu. One of the problems inherent in open source software is fragmentation and this seems to be an example of this. It is unclear how Kimwitu++ tracks the work being done on Kimwitu.
State Machine Generators
State machine generators are closely related to the deterministic finite automata (DFA) generated by scanner generators like the one built into ANTLR. However, scanner generators are aimed at language applications and are not well suited for state machine generation outside this area. The Libero state machine generator from iMatix supports state machine generation for a variety of applications.
The input to Libero is a state machine diagram expressed in a textural language. The output can be in a variety of languages ranging from Java and C++ to COBOL. The Libero tool is provided as open source software under the Free Software foundation GPL. The code generated by Libero can be used without restriction.
Other Lists of Parser Generators
The www.nullstone.com web pages have a number of links to a variety of topics involving compilers. The Nullstone test suite tests compiler optimization and is an invaluable tool for compiler testing and regression. Nullstone is run by Christopher Glaeser, who is an old friend and someone I have boundless respect for.
The Idiom.com Catalog of Free Compilers, Interpreters and Langauge Tools.. I don't have anything to do with the construction and maintenance of this Web page. However, Idiom.com is the ISP that hosts this Web page, my domain, bearcave.com and my shell account. Idiom is a great ISP. If you are trying to escape a large ISP that does not return your e-mail and keeps you waiting on the phone, try Idiom.
Why did I become interested in ANTLR?
I have used YACC (which is more or less interchangeable with Bison) for many years. YACC has several notable problems however:
Parsers created with YACC tend to have poor error reporting. Anyone who has used the original portable C compiler under UNIX will be familiar with the dreaded "syntax error line XX" error messages. With work, YACC can be made to do a better job by introducing error productions, but its still an uphill battle.
Errors in YACC grammars are difficult to understand. YACC will report a "shift/reduce" or "reduce/reduce" error associated with a given grammar rule. In many cases to understand why this error occured, the user must pour over the YACC debug trace, which can be obscure.
The parser generated by YACC is table driven (that is, the parser state transitions result from values in a table). This makes the parser difficult to impossible to debug.
Language grammars take a lot of work to develop. I was hoping that by using a parser generator with more than one token lookahead I would not have to work as hard on the grammar. As a friend of mine pointed out to me:
LR(k) for any k, most grammars that are not LR(1) are ambigious so the additional lookahead doesn't really do what most people think.
ANTLR's syntactic predicates allow arbitrary lookahead which can be a powerful way to resolve difficult parts of a syntax. But sadly they don't make grammars that much easier to develop. ANTLR grammars still take a lot of work and good grammar development is an art.
I need a parser generator that generates C++ (or at least C), since I am planning on implementing my compiler in C++.
I am planning to use the parser generator for a production quality compiler that I hope may be a commercial product one day where I might license the compiler source. So I had some other requirements:
Cost. For me, $500 is a lot of money for a software tool. So while tools like Visual Parse++ have attractive features, the cost is too high.
Longevity. As a compiler writer I know that selling compilers is not an easy business. Selling parser generators is an even more difficult business. So I don't want to spend a couple of years developing a compiler based on a product from company that may disappear.
Support. If I license my compiler to someone they will expect me to support all phases of the compiler. If a parser generator bug is found in the front end, I will be expected to fix it.
Given the issues listed above, ANTLR is the most attractive parser generator I have looked at:
ANTLR generates recursive decent parsers and has good error reporting "right out of the box". It appears that with a little work even better error reporting is possible.
The parser generated by ANTLR is more or less readable. This helps in debugging.
ANTLR is free.
ANTLR is available as "open source" so I don't have to worry about the vendor going out of business. Also, I can support it myself if necessary. ANTLR is not Linux, but there are a number of ANTLR users world wide, so there is a reasonable chance that bugs will be identified and corrected.
ANTLR will generate C++ parsers. Java may have some advantages as a compiler implementation language, but the performance of Java on compiler like applications is terrible. It is hard enough to make an optimization phase run with acceptable performance. If the compiler developer is saddled with Java's interpretive overhead as well, the compiler may be very slow. Unless native compiled Java is used, C++ is the only choice available for production quality compiler implementation.
Although ANTLR has some compelling advantages, it has some disadvantages as well:
ANTLR has a steep learning curve. There is a fair amount of HTML documention on ANTLR, but the tool is complex and I have found that the documentation can only be understood by working through many examples.
ANTLR grammars are not easy to debug. An error in an ANTLR grammar can result in an error report in a distant production. For example, I had an error in the grammar production for a C style for loop. ANTLR reported a non-determinism warning on the grammar production for unary minus. It was not until I was adding trees to the grammar and I corrected the problem with the for loop production that I understood that this was the problem. But it's worth mentioning that I have yet to use a parser generator where grammar errors were easy to find.
YACC follows the UNIX tool philosophy of doing one thing and providing a fairly simple interface to other tools and software. YACC is just a parser generator. If you want a scanner you use Lex or write one yourself. In the case of ANTLR, you get much more. ANTLR has a very nice lexical scanner description language, it can build trees, it can parse trees. ANTLR is sort of the Emacs of parser generators. This is great if you want to get a language processor up and running quickly. The ANTLR scanner even handles file input and line numbering. But if you don't want to use the stock ANTLR components things get more complicated. For example, if you don't want to use ANTLR's reference counted tree nodes but you still want to use the tree building syntax, its unclear to me what you need to do. Similarly there is only sketchy documentation on how to interface a non-ANTLR scanner with ANTLR.
Afterward
I did use ANTLR to generate a Java parser based on the John Mitchell, Terence Parr, John Lilley and Scott Stanchfield grammar. This is a very nice grammar and represents a lot of work on the author's part. I would like to publicly thank them for putting this grammar in the public domain. I am very happy with ANTLR. The parser is now complete and consists of about 5K lines of grammar, C++ actions and comments.
I do not use ANTLR tree generation. In part the reason for this is that I generate a C++ parser and C++ does not have an "interface" construct, like Java has. The Java parser defines the tree via a Java interface, which allows the Java parser user to define the actual implementation of the tree in any number of ways. For the C++ user however, the tree base class must be used and this had more features (like reference count) than I wanted. So I generate my own light weight trees using C++ actions.
Tree node definitions aside, there is another reason to generate AST through C++ or Java actions. The AST generated by the parser represents the core information for the statement or expression. As a result, tree shape is very important and a lot of work can go into generating trees that will make later processing easier. If trees are generated directly from ANTLR, tree shape tends to be controlled by modifying the grammar. This can fragment rules into several sub-productions. A grammar for a language like Java is already large. Fragmenting rules may make it more difficult to maintain. Direct generation of trees gives allows more freedom than generating trees using ANTLR. The drawback is that it is also more work.
These web page discuss my experience with a parser generator called ANTLR. ANTLR is a second generation parser generator. The first generation was PCCTS. ANTLR is implemented in Java, but will generate parsers in both Java and C++.
I have found that the learning curve for ANTLR is very steep. However, this is not unique to ANTLR. YACC/Bison also takes some effort to learn. The one advantage of YACC/Bison is that there are a number of good books on their use. ANTLR does not have as wide a user base yet. The Web site www.antlr.org includes HTML documentation for ANTLR. However, ANTLR is complicated and I have found that this documentation provides only a starting point. I have spent many hours working with ANTLR and working on small examples to learn the tool. In the spirit of free software I am publishing these Web pages in the hope that they will help other ANTLR users. As my use of ANTLR progresses I will update these Web pages, so this is definitely a work in progress.
Since ANTLR is an open source project (under the gentle guidance of Terence Parr), there are a number of active contributors. One of these is Ric Klaren. Ric publishes an ANTLR development snapshot. (http://wwwhome.cs.utwente.nl/~klaren/antlr/antlr-devel.tar.gz).
The main discussion group for ANTLR is hosted by Yahoo Groups and can be found at http://groups.yahoo.com/group/antlr-interest. The archives and current messages can be viewed, since this is a public forum. To post you need to join the group. The Yahoo user interface leaves a bit to be desired. The number of messages posted per month is shown at the bottom of the page. Click on the current month and you will see the latest discussion.
A great deal of credit should be given to anyone who puts the effort, time and creativity into building something. Many people will come along and criticize the works of others, but most of these critics have never created anything significant themselves. The critics have never struggled with the issues that the creator of the work did and in most cases could not create a similar work themselves.
ANTLR is a significant piece of software and a great deal of credit is due to Terence Parr, Peter Wells and the other developers and maintainers of ANTLR. In these Web pages I discuss some of the pros and cons of ANTLR, as I see them. But I recognize the efforts of the authors and maintainers of ANTLR. They stepped up and created the software.
Why Use ANTLR?
Running ANTLR under Microsoft Java
Building a C++ ANTLR Parser on Windows NT (this should also help UNIX users).
ANTLR Examples developed while I learned to use ANTLR. This page is devoted to grammar and AST generation issues with ANTLR.
Miscellaneous ANTLR Examples. Various useful ANTLR examples that don't involve grammar or AST generation
Why Use ANTLR?
Introduction
This Web page discussions parser generators and the ANTLR parser generator in particular. See Compiler Front End and Infrastructure Software for compiler construction software suites.
If you just stumbled on this Web page you may be wondering what a parser generator is. I should warn you that this is probably not the best place to learn the answer. There are a number of excellent books that discuss parser generators and compiler construction. These include the all time classic compiler text (the so called "red dragon book") Compilers: Principles, Techniques and Tools by Aho, Sethi and Ullman, Addison Wesley, 1986 and Advanced Compiler Design and Implementation by Muchnick, Morgan Kaufmann, 1997. Briefly, a parser generator is a program that reads a grammar that describes a language and outputs source code (in language like C, C++, or Java) that will recognize the language described by the grammar. The most commonly used parser generator is the UNIX program YACC (the Free Software Foundation program Bison is a GNU version of YACC). Both UNIX and Windows NT versions of Bison are available from Cygnus.
A low key, largely non-technical introduction to compiler constructon has been written by Jack Crenshaw, who currently writes an outstanding column for Embedded Systems.
Another longer and more academic Web based reference is the book Compilers and Compiler Generators: an introduction with C++ by P.D. Terry, Rhodes University, 1996. This web pages publishes P.D. Terry's book in a variety of formats, including postscript and Adobe pdf format.
ANTLR is a second generation parser generator. The first generation was PCCTS. Both parsers were designed and implemented by Terence Parr. Various people have helped over the years. See the ANTLR and PCCTS source for the complete credits.
Although ANTLR is implemented in Java, it will generate either a Java or a C++ parser. I sometimes get e-mail asking which parser should be used: ANTLR or PCCTS. I don't have enough experience with PCCTS to comment on this. My view is that ANTLR represents the current main stream of development. Also, I would assume that the lessons Terence learned implemented PCCTS were incorporated into ANTLR. But there is an active PCCTS user base, so I am sure there are other, probably better informed view on this.
Although there is some brief introductory material, this Web page is written for people who already have some knowledge of compiler and language processor construction.
The ANTLR documentation and software can be found at www.antlr.org. ANTLR is a available in source form, with only the restriction being that the authors are credited. ANTLR evolved from PCCTS (the Purdue Compiler Construction Tool Set), which was developed by Terance Parr as part of his PhD thesis work. ANTLR is a rewrite of PCCTS. A history of ANTLR can be found here.
Automaticly Generated Parsers vs. Hand Crafted Parsers
The first compiler I worked on was the UCSD Pascal compiler. This compiler was based on Niklaus Wirth's Pascal compiler written at ETH in Switzerland. At the time, parser generators were in their infancy. Parsers were written by hand. The Pascal parser was a recursive decent parser, whose structure mirrored the Pascal syntax.
The technique for creating this kind of recursive decent parser was to carefully develop the BNF syntax for the language to be parsed and then write the parser so that it mirrored this syntax. Any change in the grammar had to be mirrored with a change in the software. This could be time consumming and any mistakes in the parser software would result in incorrect parsing.
Parser generators allow the programmer to work at a higher level of abstraction. Creating the language parser involves writing the grammar, instead of directly writing the software that parses the grammar. Changes that might take a few hours to implement in a hand crafted recursive decent parser can be made in a few minutes when a parser generator is used. A few hundred lines of grammar is also much easier to understand than a few thousand lines of hand crafted parser. Two arguments have classicly been used against automaticly created parsers:
Automaticly generated parsers are slow and use lots of memory.
Error reporting is poor in automaticly generated parsers.
Most language processors spend a fraction of their time parsing. For example, most compilers spend more time in optimization and code generation than they do in language parsing. On a modern computer system the time taken by the parser will probably be dominated by disk access, not parser execution. Nor is the issue of memory usage an argument against automaticly generated parsers. On a computer system with 128 Mb of RAM and perhaps 256 Mb of Virtual memory, the amount of memory used by the parser will usually be insignificant.
Error reporting is a problem, especially for YACC/Bison. ANTLR generates a recursive decent parser from the grammar and has fairly good error reporting (this is discussed at greater length below). Of course "good" is a matter of opinion.
Although I started my software career writing recursive decent parsers and have written hand crafted parsers for a variety of lanuages, I now greatly prefer automaticly generated parsers.
Since we're on the topic of YACC, a minor digression:
Another problem with YACC are the obscure shift-reduce and reduce-reduce conflicts. Terance Parr writes that PCCTS, the predecessor to ANTLR, was created in response to his horror of having to debug YACC grammars. If you do have to debug YACC grammars, Reiner Pesch has written a tool that might help. I have not tried it out and have included this link primarily because it might be of help in the future, since I am sometimes forced to use YACC on projects at work. Reiner writes, in comp.compilers (February 12, 2001):
Due to my diploma thesis in computer science I have implemented a tool which intends to assist developers using LALR parser generators like yacc with the analysis of shift/reduce- and reduce/reduce-conflicts. (The tool currently works with yacc, byacc, bison and happy).
So if you're a user of any LALR parser generator mentioned above, you could do me (and hopefully yourself) a favour by taking a look at ... http://www-users.rwth-aachen.de/reiner.pesch/yca/yca.html.
The yca.html Web page publishes a tool called YCA (YACC Conflict Analyser[sic]). YACC/Bison grammar conflicts can take a lot of time to understand so any tool that helps is worth looking into.
Parser Generator Options
Lookahead
The term lookahead refers to the number of lexical tokens that a parser looks at, beyond the current one, to determine the matching grammar production. Parser generators have a long history and they originally generated parsers to run on machines that, by modern standards, had very little memory (e.g., 129K bytes to 4Mb). Since token lookahead tends to increase parser memory use and make the parser slower, most parser generators created parsers that supported only one token lookahead. This is denoted by a number next to the grammar type (for example, LALR(1) is an LALR grammar with one token lookahead). Several more recent parser generators (ANTLR and JavaCC) can create parsers that can selectively lookahead more than one token (e.g., LL(k)).
Creating a grammar for a complex language is a lot of work. It is easy to accidently create a grammar with ambiguities, where the parser cannot tell two or more productions apart. Parser generators will report errors or warnings when ambiguous grammars are encountered. However, with every parser generator I've ever used, these messages are difficult to interpret. You must pour over the grammar considering how the grammar productions collided to create ambiguity. At first glance, the grammar writer may think that an LL(k) grammar will allow sloppier grammars. This is not actually the case. A grammar that contains ambiguity can actually be more ambiguous with k lookahead. Having k lookahead does allow some productions to be more clearly recognized, however. The most common example of this is the else clause in C, C++ and Java which can be clearly specified in a grammar production with selective k lookahead. This allows the grammar to be somewhat smaller. A parser with k lookahead can also recognize a wider range of languages than a parser with only one token lookahead. But these languages must still be unambiguously specified. So k lookahead provides power and flexibility, but I have not found that it greatly decreased the work necessary to create a grammar.
Open Source or Free (no license fee) Parser Generators
Until YACC became part of the UNIX software tool set, parser generators existed only in academia and at a few companies that kept the tools to themselves. Yacc is now a quarter century old. The original paper on yacc was written in July, 1975. An HTML version can be found here.
There are now a remarkable variety of parser generators available in addition to ANTLR. These parser generators include open source software, free binaries and commercial products. Originally this list came from a google search. I have updated it with information from comp.compilers. This list is getting long enough that I should probably break it out into its own web page.
Java Compiler Compiler, a parser generator written in Java that generates a Java parser. JavaCC was written at Sun Microsystems and is available free.
Metamata Parse is a follow-on tool to Sun's JavaCC. It is also available free. Metamata Parse claims to be an improved Java parser generator with a number of advantages over JavaCC. Metamata seems to have disappeared and I have not found another source for their version of JavaCC.
CUP: LALR Parser Generator for Java. This parser generator takes a YACC style grammar and creates a Java parser. Instead of YACC's $-number syntax, CUP uses labels that remind me of ANTLR. The parser created by CUP is table driven. CUP appears to use YACC style error recovery as well. The CUP documentation looks pretty good so if you only need a simple language recognizer in Java CUP might be a good choice. CUP is free software and is provided as part of at least some Linux distributions.
IBM's Jikes Parser Generator is part of the IBM open source Jikes Java compiler release. Like the Jikes Java compiler, the parser generator is available in source form. The Jikes software is not covered by the Free Software Foundation's copyleft, but can be used more liberally (e.g, as far as I can tell, if you use IBM's source in a product you don't have to disclose all of the source of the product).
Jikes Parser Generator is a parser generator that accepts as input an annotated description for a language grammar and produces text files suitable for inclusion in a parser for that language. It is similar in function and use to the widely-available parser generators Yacc and Bison.
It also provides support for automatic diagnosis and recovery from syntactic errors. It is the parser generator used by the Jikes Compiler. Jikes Parser Generator can generate parsers for LALR(k) grammars and produce output suitable for use with parsers written in Java, C, or C++.
Lemon Parser generator from a small software company in North Carolina named Hipp, Wyrick & Company, Inc., or "Hwaci" for short. This is an LALR(1) parser generator that is available in "Open Source" form.
btyacc. Gary Funck sent me a pointer to btyacc which is a YACC inspired parser generator. The btyacc parser generator is described as:
This is original Berkeley Yacc modified by Chris Dodd chrisd@collins.com and then further enhanced by Vadim Maslov vadik@siber.com of Siber Systems to better fit a production environment (and to have less bugs, too).
The btyacc parser generator creates a C++ table driven parser from a YACC style grammar. The authors write that btyacc has "industrial strength error processing [reporting?] and recovery", which would be a big improvement over stock YACC. They have also added a number of useful ideas, including a standard variable for source position, which can be used to annotate the abstract syntax trees created during parsing. One of the central features of btyacc is its support for syntactic backtracking. Chris Dodd writes:
BTYACC is a modified version of yacc that supports automatic backtracking and semantic disambiguation to parse ambiguous grammars, as well as syntactic sugar for inherited attributes (which tend to introduce conflicts). Whenever a btyacc generated parser runs into a shift-reduce or reduce-reduce error in the parse table, it remembers the current parse point (yacc stack and input stream state), and goes into trial parse mode. It then continues parsing, ignoring most rule actions. If it runs into an error (either through the parse table or through an action calling YYERROR), it backtracks to the most recent conflict point and tries a different alternative. If it finds a successful parse (reaches the end of the input or an action calls YYVALID), it backtracks to the point where it first entered trial parse mode, and continues with a full parse (executing all actions), following the path of the successful trial.
Btyacc is available in source form.
Ckit from Bell Labs
I have not looked into this parser generator. Ckit was mentioned in the USENET group comp.compilers by Torben Mogensen. Torben writes:
There is a tool called ckit which is basically a parser for C written in SML. The output is a representation of the abstract syntac tree. If I was to write a C compiler today, I would start from this.
SML has several features that makes it very well suited for writing compilers:
Recursive datatypes with pattern matching.
Automatic memory management (GC).
A good module system.
A type system that catches many errors at compile-time that would not be caught at all by C.
Exceptions
A wide range of compilers, so you can develop on a fast compiler and run your final code through an optimizing compiler.
You can see Andrew Appel's book Modern Compiler Implementation in ML for examples of compiler-code written in SML. You can even compare with how the same things are done in C or Java by also looking at Modern Compiler Implementation in C and Modern Compiler Implementation in Java by the same author. [Prof. Appel's web page on ML and compiling can be found here]
As for maintenance, the module system and the type system are really aids for maintenance. Most ML implementations have some kind of intelligent recompilation scheme.
In all, the major problem may be social: C/C++ programmers may be reluctant to learn a new (and radically different) language. The language itself is fairly easy to learn (we teach it as the first language at DIKU) but it takes a while for a programmer who thinks in objects and state to wrap his mind around a different way of expressing things.
Apparently there is also a compiler projects page for Bell Labs. Dave MacQueen of Bell Labs writes:
ML (Standard ML in particular) has a good track record in compiler applications. Compiling is basically a symbolic process, and SML is an excellent symbolic processing language. [...]
SML has many features that make it attractive from a software engineering point of view: safety, sound static typing, polymorphism, algebraic types and pattern matching, a powerful module system with parametric modules, exceptions, concurrency, garbage collection, etc. SML/NJ has a very helpful compilation manager, CM, that helps minimize development turn-around time. [...]
ckit 1.0 was released 31 March 2000. It has been used as the front end for various C static analysis tools in an industrial context where millions of lines of code are being analyzed, and it has been used in another industrial application as the front end for a C-derived domain specific language.
The MLRISC code generation framework is used in the SML/NJ compiler and in several other compiler projects. Back ends are currently provided for x86, sparc, alpha, powerpc, hppa.
The Eli System from the University of Colorado at Boulder.
According to the Eli Web page:
Eli provides modern compiler construction facilities to users with a wide range of sophistication. It offers complete solutions for commonly-encountered language implementation subtasks and contains libraries of reusable specifications, making possible the production of high-quality implementations from simple problem descriptions.
The system has been in the field since 1989, and has been used in a number of projects worldwide. It generates programs whose performance is comparable to that of a good hand-coded implementation. Development time for a processor using Eli is generally about one third of that for comparable hand code, and maintenance is significantly easier because specifications rather than implementations are being maintained.
Eli includes support for scanner, parser and AST generation. It also seems to provide support for generating code for the semantic analysis phase. For more details see the Eli Documentation. Although the Eli web site claims that Eli is easy to learn and that you don't have to have compiler experience, this looks like another compiler tool with a steep learning curve.
Commercial parser generators
The Visual Parse++ from Sandstone. This looks like a great tool. It generates LALR(1) parsers in several languages (C, C++ and Java). It comes with a complete graphical interface that shows the parser reductions, which should make debugging easier.
The YACC++ parser generator seems to have been originally developed at Johns Hopkins University. A Web page describing the features of YACC++ can still be found at Johns Hopkins.
Apparently YACC++ is now a commercial product from a software company named Compiler Resources Inc. However, it is the most expensive parser generator that I am aware of. YACC++ generates a C++ parser and claims to have much better error recovery than YACC or Bison.
AnaGram from Parsifal Software
To digress for a moment: Parsifal is an interesting choice of names for a software company. In the Arthurian Legend Parsifal was only knight among the Knights of the Round Table who found the holy grail. The grail itself is loaded with symbolism. In compiler design and implementation, the the holy grail is a tool (or tool set) that will create a compiler, from parser to code generator, from a set of formal descriptions (a grammar and instruction set description, for example). When this grail is found, it is said, compiler developers will all be out of work and cost of creating a new compiler will be low. Like Arthur's knights, many have searched for this grail, but no one has found it, at least for production quality compilers.
The AnaGram is an LALR(1) parser generator is supported on Win32. A happy AnaGram user sent me e-mail about this product.
Just a note for your "commercial parser generator" products web page, we have been using a great commercial product for years called "AnaGram" from Parsifal Software. Our company has been writing compilers for decades and have used several products over that history. First was YACC, then Bison, then LALR from a company now defunct. The last product used for our mainstream products has remained AnaGram because it works as advertised, we've never had an error in the code it generated, the documentation was good, and the support was exellent. It works under DOS and now Win/32, and generates C[++] code. It is accepts LALR(1) grammars and was lightyears ahead of the LALR YACC and Bison we had previously used.
The Lark LALR(2) and Ell (LL(1)) parser generator
I have not looked into this parser generator. It was mentioned in a comp.compilers posting. Apparently an older version is available free, but the newer version is a commercial compiler construction toolbox called Cocktail. The software is available from CoCoLab. Josef Grosch writes, in comp.compilers:
The parser generators Lark (LALR(2)) and Ell (LL(1)) of the Cocktail Toolbox provide automatic error reporting, recovery and repair. There is no need to add extra grammar rules using an error token. Everything is generated automatically. Example: if (a = b] write (a);
The above source line would result in the following message: 3, 13: Error found/expected : ]/) = + - <> <= >= < > IN OR ...
The location of the error is reported (line 3, column 13). The message is classified as Error. It is a syntax error indicated by "found/expected". The token ] is causing the syntax error. The tokens after the slash are the set of expected tokens. This set is truncated which is indicated by the three dots.
Error recovery also happens automatically. A minimal sequence of tokens is skipped so that parsing can safely continue. Moreover a minimal sequence of tokens is inserted (virtually) in order to repair the error. This has the advantage that the output of the parser is always consistent in the sense that it belongs to a syntactically correct input. Therefore phases following after parsing do not have to take care about syntax errors.
References:
J. Grosch, `Efficient and Comfortable Error Recovery in Recursive Descent Parsers', Structured Programming, 11, 129-140 (1990).
http://www.cocolab.de - look for ell.ps oder ell.pdf
Older versions of Cocktail are available for free on the Internet. Newer versions are commercial products.
Tree Parsers
Related to parser generators are tree parsers. Tree parsers read in a grammar that describes an abstract syntax tree that is generated by a compiler front end or compiler middle pass and generates a tree walker that processes the trees. ANTLR can be used to generate tree parsers. Another tree parser generator is iburg, developed by C. W. Fraser, D. R. Hanson, and T. A. Proebsting at Princeton and used to construct the code generator for Fraser and Hanson's lcc ANSI C compiler.
Kimwitu. This link is supposed to be a "persistent" URL. At the time this was written, Kimwitu resided at http://fmt.cs.utwente.nl/kimwitu/
Kinwitu is a software construction tool that takes trees as inputs and associates actions with tree "terms". Kimwitu was developed by Axel Belinfante and his colleagues Peter van Eijk, Henk Eertink and Henk Alblas at the University of Twente, Enschede, in the Netherlands. The strange name apparently comes from Swahili and deals with forests or trees.
Kimwitu is open source software covered under the GNU copyleft. However, the tree processor generated by Kimwitu can be used without restriction (e.g., you do not have to place your software under the GNU copyleft license if it uses a component generated with Kimwitu). See the Kimwitu license for more details.
Kimwitu is specificly targeted at compiler applications. A parser generates an abstract syntax tree (AST). Semantic analysis and tree transformation can then be performed with later passes written with the aid of Kimwitu.
The arguments for developing AST passes with a tool like Kimwitu is the same as the argument for using a parser generator. By developing at a higher level the code is easier to understand, easier to modify and faster to develop. In fact, the authors of Kimwitu state: "Experience has demonstrated that Kimwitu drastically speeds up development time, facilitates tool integration and generates production quality programs." [emphasis in the original]
Sadly there is no free lunch. The cost of more powerful tools is the learning curve to master the tool. Also, there is the issue of robustness. ANTLR has a large active user group so the classic open source argument applies to bug fixes, at least in theory (e.g., since a group of people uses the tool, you're not alone in fixing bugs). Of course in practice bug fixes may be really slow since those who maintain the open source tool are trying to make a living as well.
Kimwitu++
The Kimwitu tree parser outputs C. Kimwitu++ is a variant that outputs C++. As the name also suggests, Kimwitu++ is the "successor" to Kimwitu. One of the problems inherent in open source software is fragmentation and this seems to be an example of this. It is unclear how Kimwitu++ tracks the work being done on Kimwitu.
State Machine Generators
State machine generators are closely related to the deterministic finite automata (DFA) generated by scanner generators like the one built into ANTLR. However, scanner generators are aimed at language applications and are not well suited for state machine generation outside this area. The Libero state machine generator from iMatix supports state machine generation for a variety of applications.
The input to Libero is a state machine diagram expressed in a textural language. The output can be in a variety of languages ranging from Java and C++ to COBOL. The Libero tool is provided as open source software under the Free Software foundation GPL. The code generated by Libero can be used without restriction.
Other Lists of Parser Generators
The www.nullstone.com web pages have a number of links to a variety of topics involving compilers. The Nullstone test suite tests compiler optimization and is an invaluable tool for compiler testing and regression. Nullstone is run by Christopher Glaeser, who is an old friend and someone I have boundless respect for.
The Idiom.com Catalog of Free Compilers, Interpreters and Langauge Tools.. I don't have anything to do with the construction and maintenance of this Web page. However, Idiom.com is the ISP that hosts this Web page, my domain, bearcave.com and my shell account. Idiom is a great ISP. If you are trying to escape a large ISP that does not return your e-mail and keeps you waiting on the phone, try Idiom.
Why did I become interested in ANTLR?
I have used YACC (which is more or less interchangeable with Bison) for many years. YACC has several notable problems however:
Parsers created with YACC tend to have poor error reporting. Anyone who has used the original portable C compiler under UNIX will be familiar with the dreaded "syntax error line XX" error messages. With work, YACC can be made to do a better job by introducing error productions, but its still an uphill battle.
Errors in YACC grammars are difficult to understand. YACC will report a "shift/reduce" or "reduce/reduce" error associated with a given grammar rule. In many cases to understand why this error occured, the user must pour over the YACC debug trace, which can be obscure.
The parser generated by YACC is table driven (that is, the parser state transitions result from values in a table). This makes the parser difficult to impossible to debug.
Language grammars take a lot of work to develop. I was hoping that by using a parser generator with more than one token lookahead I would not have to work as hard on the grammar. As a friend of mine pointed out to me:
LR(k) for any k, most grammars that are not LR(1) are ambigious so the additional lookahead doesn't really do what most people think.
ANTLR's syntactic predicates allow arbitrary lookahead which can be a powerful way to resolve difficult parts of a syntax. But sadly they don't make grammars that much easier to develop. ANTLR grammars still take a lot of work and good grammar development is an art.
I need a parser generator that generates C++ (or at least C), since I am planning on implementing my compiler in C++.
I am planning to use the parser generator for a production quality compiler that I hope may be a commercial product one day where I might license the compiler source. So I had some other requirements:
Cost. For me, $500 is a lot of money for a software tool. So while tools like Visual Parse++ have attractive features, the cost is too high.
Longevity. As a compiler writer I know that selling compilers is not an easy business. Selling parser generators is an even more difficult business. So I don't want to spend a couple of years developing a compiler based on a product from company that may disappear.
Support. If I license my compiler to someone they will expect me to support all phases of the compiler. If a parser generator bug is found in the front end, I will be expected to fix it.
Given the issues listed above, ANTLR is the most attractive parser generator I have looked at:
ANTLR generates recursive decent parsers and has good error reporting "right out of the box". It appears that with a little work even better error reporting is possible.
The parser generated by ANTLR is more or less readable. This helps in debugging.
ANTLR is free.
ANTLR is available as "open source" so I don't have to worry about the vendor going out of business. Also, I can support it myself if necessary. ANTLR is not Linux, but there are a number of ANTLR users world wide, so there is a reasonable chance that bugs will be identified and corrected.
ANTLR will generate C++ parsers. Java may have some advantages as a compiler implementation language, but the performance of Java on compiler like applications is terrible. It is hard enough to make an optimization phase run with acceptable performance. If the compiler developer is saddled with Java's interpretive overhead as well, the compiler may be very slow. Unless native compiled Java is used, C++ is the only choice available for production quality compiler implementation.
Although ANTLR has some compelling advantages, it has some disadvantages as well:
ANTLR has a steep learning curve. There is a fair amount of HTML documention on ANTLR, but the tool is complex and I have found that the documentation can only be understood by working through many examples.
ANTLR grammars are not easy to debug. An error in an ANTLR grammar can result in an error report in a distant production. For example, I had an error in the grammar production for a C style for loop. ANTLR reported a non-determinism warning on the grammar production for unary minus. It was not until I was adding trees to the grammar and I corrected the problem with the for loop production that I understood that this was the problem. But it's worth mentioning that I have yet to use a parser generator where grammar errors were easy to find.
YACC follows the UNIX tool philosophy of doing one thing and providing a fairly simple interface to other tools and software. YACC is just a parser generator. If you want a scanner you use Lex or write one yourself. In the case of ANTLR, you get much more. ANTLR has a very nice lexical scanner description language, it can build trees, it can parse trees. ANTLR is sort of the Emacs of parser generators. This is great if you want to get a language processor up and running quickly. The ANTLR scanner even handles file input and line numbering. But if you don't want to use the stock ANTLR components things get more complicated. For example, if you don't want to use ANTLR's reference counted tree nodes but you still want to use the tree building syntax, its unclear to me what you need to do. Similarly there is only sketchy documentation on how to interface a non-ANTLR scanner with ANTLR.
Afterward
I did use ANTLR to generate a Java parser based on the John Mitchell, Terence Parr, John Lilley and Scott Stanchfield grammar. This is a very nice grammar and represents a lot of work on the author's part. I would like to publicly thank them for putting this grammar in the public domain. I am very happy with ANTLR. The parser is now complete and consists of about 5K lines of grammar, C++ actions and comments.
I do not use ANTLR tree generation. In part the reason for this is that I generate a C++ parser and C++ does not have an "interface" construct, like Java has. The Java parser defines the tree via a Java interface, which allows the Java parser user to define the actual implementation of the tree in any number of ways. For the C++ user however, the tree base class must be used and this had more features (like reference count) than I wanted. So I generate my own light weight trees using C++ actions.
Tree node definitions aside, there is another reason to generate AST through C++ or Java actions. The AST generated by the parser represents the core information for the statement or expression. As a result, tree shape is very important and a lot of work can go into generating trees that will make later processing easier. If trees are generated directly from ANTLR, tree shape tends to be controlled by modifying the grammar. This can fragment rules into several sub-productions. A grammar for a language like Java is already large. Fragmenting rules may make it more difficult to maintain. Direct generation of trees gives allows more freedom than generating trees using ANTLR. The drawback is that it is also more work.
Subscribe to:
Posts (Atom)