Saturday, September 5, 2009

Compiler Front End and Infrastructure Software
Compiler infrastructure software usually includes one or more language front ends, software for abstract syntax tree (AST) generation and AST modification. Optimization and code generation phases may also be included. These software suites are specific for compiler creation and in this domain are more powerful than parser generators, as long as the user is willing to live with the pre-built infrastructure. This software is also less general than parser generators which can be used for a range of language processing tasks (e.g., HTML or XML parsing, for example).
Although compiler design and development seems to have fallen out of favor as a research topic, it has a long and rich history. A great deal of work has been done on compiler infrastructure software. It is a gross understatement to say that this is not a complete bibliography. The OSUIF paper, referenced below, includes a more complete list of infrastructure tools.
Obviously these tool suites are not limited to or specific for Java. However, when considering the architecture for a Java compiler it is reasonable to consider whether some of this software can be used to simply the huge development task.
See Why Use ANTLR for a related Web page that discusses the ANTLR parser generator and lists other parser generator options (including tools like state machine generators and tree walker generators).
There is no free lunch
Software reuse has been one of the mantras of software development for the last twenty years (remember Ada?) The concept is both appealing and impossible to argue with: by reusing tested, carefully designed software, more reliable software can be created in less time. In practice software reuse has proven difficult and there are few wide spread successes. In my opinion there are, in fact, two: UNIX (or POSIX) and Java. The C++ Standard Template Library (STL) may be a more modest third.
There are two barriers to software reuse: design and complexity. Creating reusable software requires a deep understanding of object creation and reuse. This is a rare skill and one that software engineers spend their careers developing. Many object packages that are created for reuse are difficult to use because of design flaws. In addition to design issues, application specific packages, like compiler infrastructures (I know you were wondering where I was going with this), are difficult to use because they have a huge learning curve. These infrastructures consist of tens of thousands of lines of code. They can be rapidly reused by their creators, but reuse is more difficult in the hands of someone else.
A compiler infrastructure could be created entirely out of objects which could be plugged in and out to create new compiler software. Examples of these objects include memory allocation, parsers, flow graph creation and optimization. However, it is difficult to create an efficient set of compiler objects that are truly interchangeable. For example, if the AST changes, will flow graph creation still work? As a result, many infrastructures are interdependent and it is difficult to reuse one piece without them all.
Finally, many of these infrastructures are aimed at research applications and trade flexibility for performance and size. This makes them impractical for production quality compilers.
Compiler Infrastructures
Low Level Virtual Machine (LLVM) Compiler Infrastructure
The name of the LLVM Compiler Infrastructure is misleading. To me the name implies something like a virtual instruction set and virtual machine. This compiler infrastructure does include an intermediate based on Static Single Assignment. The infrastructure also includes optimization components, including interprocedural optimization and code generator support.
Open64 Project
The Open64 project is based on the Silicon Graphics SGI Pro64 compiler components. Open64 is a compiler suite supports C, C++ and Fortran 90. This is an open source release of the SGI compilers targeted at the Linux operating system and the Intel IA-64 processor. This compiler suite uses front ends based on the GNU gcc and g++ compiler front ends. The Fortran 90 front end is, presumably, SGI's.
Unlike the GNU compiler, this compiler suite uses a modern tree structured intermediate that is successively lowered. The SGI Web site includes documentation on the symbol table and intermediate format. The SGI compiler targets the Intel IA-64 architecture and includes an advanced optimization phase that goes far beyond anything that is likely to ever appear in the GNU compiler. Some features, like C++ exceptions, have not been implemented at the time of this writing. Although this compiler software seems to be based on SGI's existing compiler stream, the IA-64 incarnation is relatively new and immature. For example, at least two of the SPEC2000 benchmarks do not execute correctly when compiled with this compiler.
The SGI Pro64 compiler suite is unquestionably the most advanced open source compiler currently available. There is a huge learning curve for a large software base like this, but for many advanced compiler applications it seems likely that this compiler suite will displace the GNU compiler.
The license under which this software is released was not finalized at the time this note was written. SGI states that they plan to follow the Free Software Foundation's "copyleft" agreement, so any software that uses the SGI compiler software must be made available as open source as well.
SGI has fallen on hard times recently and has refashioned itself as a Linux supplier. Whether they can be profitable at this remains to be seen. It is remarkable to see the release of this compiler suite as open source. The only obvious motivation for releasing software representing millions of dollars worth of development effort and many years of work into the public domain is the culture of the Linux platform, where all software is open source. I have deep reservations about the long term viability of this model. From a financial perspective perhaps SGI views the development of this software as already paid for by the "old" SGI and written off at a loss. As a result, they don't seek any direct return on this software. On the other hand, there is an on-going cost in maintaining and enhancing these compilers. Since this technology can also be picked up by, say, Red Hat/Cygnus, it is not clear SGI will pay for these on-going costs.
Whether SGI ever makes money with this technology or not, this is a significant and very impressive software release. In particular, it appears that SGI has some deep insights into SSA optimization techniques.
OSUIF at UC Santa Barbara.
SUIF is an acronym for the Stanford University Intermediate Form. This is a common high level compiler intermediate aimed at compiler research. The "O" is for Object-oriented. The UC Santa Barbara project, lead by Professor Urs Hoelzle, has developed a Java .class file to SUIF front end.
Their paper j2s: A SUIF Java Compiler provides an interesting discussion of the issues involved with generating a flow graph from Java (or at least Java in byte code form). This compiler generates C output rather than native code. However, there are, at least in theory, SUIF backends that will generate native code.
The SUIF intermediate is a high level intermediate. It is very flexible and allows various compiler passes to be added. But part of the cost of this flexibility is performance. So SUIF does not appear to be appropriate for an industrial production compiler.
Zephyr: Tools for a National Compiler Infrastructure
The Zephyr project collaboration between Princeton University and University of Virginia. The Zypher site is hosted by the University of Virginia. The Zyphyr project has defined the Zephyr Abstract Syntax Description Language (ASDL) and a machine independent optimizer that works on ASDL intermediates. A set of code generators is also available.
The Stanford University SUIF Compiler System and the Zephyr project are funded by the national compiler infrastructure project. The Stanford SUIF project also defines the Stanford University Intermediate Format (SUIF). I am more familiar with the SUIF project at Stanford (since they are in my "backyard" and I knew some of the Stanford people when they were graduate students). From looking at the Zypher and SUIF web pages it is unclear to me what connection and collaboration between these two projects exists. SUIF also provides a set of optimizers and code generators. Apparently it is possible to define a SUIF intermediate using Zephyr. Zephyr provides some level of support for Java.
The national compiler infrastructure project is aimed at supporting compiler research. The software tends to be designed for flexibility and modularity. But is not designed for industrial use. Although the industrial compiler writer may learn from these projects, it seems unlikely that this software could be directly adapted for production quality compilers.
Impact Research Group at the University of Illinois
The University of Illionois has a long history of research work on high performance computing and advanced compilation techniques. The Impact group is working on compiler technology and evaluation tools for very wide instruction word (VLIW) instruction sets and other instruction level parallelism. The Impact group seems to have a very wide range of research topics, including Java runtime architecture and (my own favorite) native compilation from Java byte code. Given the range of topics and the size of the group it is not clear that the group can make equal progress in all areas.
Most of the compiler infrastructure packages concentrate on compiler front end and middle (AST generation) software. Some include target independent optimization. The Impact software seems to concentrate on backend and architecture dependent optimization.
Eventually the Impact group intends to make a compiler infrastructure and related documentation public. They write:
The IMPACT compiler, with its advanced features such as predicated compilation, instruction-level parallelism optimizations, compiler engineered speculation, profile-based optimizations, advanced machine description facilities, scheduling frameworks for resource sensitive code optimizations, and pointer-based dependence analysis and tracking facility, has become a premier compiler technology base for major U.S. companies as well as academic researchers. This project's software and documentation will be released in several phases over two years and the end product will be a high-quality and readily-available compiler environment that supports a wide variety of advanced instruction-level parallel processing research.
Currently this software is distributed via http://www.trimaran.org/.
Trimaran is a compiler infrastructure for supporting state of the art research in compiling for Instruction Level Parallel (ILP) architectures. The system is currently oriented towards EPIC (Explicitly Parallel Instruction Computing) architectures, and supports compiler research in what is typically considered to be "back end" techniques such as instruction scheduling, register allocation, and machine-dependent optimizations.
Trimaran is available for non-commercial applications. Presumably commercial applications are negotiated separately
Commercial Compiler Front End Products
Compiler front ends, consisting of at least a parser and in many cases symbol table and semantic analysis support are a lot of work to implement (the EDG front end, below, consists of almost 300K lines of source code and comments).
Over time there have been a number of companies that offer front end software. One of the most widely used is the Edison Design Group's C++ front end. This front end reads C++ and outputs ANSI C.
The Edison Design Group (EDG) has been in business for quite a while. However, many front end companies have come and gone. For example, Compass Design Automation sold a VHDL front end that has been used by several EDA companies. Compass was purchased by Avant!, which as far as I know has stopped selling the front end. Another VHDL front end was available from Leda (a French software company). However Leda has been purchased by Synopsys. Synopsys no longer seems to be selling the Leda VHDL front end (this is not a surprise, since Synopsys would be supplying potential competitors with an important component that could be used to build tools that compete with those sold by Synopsys. Listed below is a partial list of front end companies (other than EDG, mentioned above):
The Java Front End from Bear Products International
This front end is based on a parser generated with ANTLR. It reads Java source and class files and does full semantic analysis. The output of the front end is a decorated abstract syntax tree (AST) and a hierarchical symbol table (referenced from the AST). This is written by the author of these web pages.
C/C++ Front End and Parser from Semantic Designs Inc.
In a posting to comp.compilers, Ira Baxter describes this software as
Semantic Designs offers the DMS Reengineering Toolkit, with a robust C/C++ front end parser that handles preprocessor directives, automatically builds an AST retaining comments, literal formats, etc. can prettyprint the AST back to source form reproducing the comments/ literals, and can carry out pattern-directed transformations on the tree. (DMS also handles many other languages.)
Compiler creation software
The ANTLR parser generator has been discussed in detail elsewhere on this site. Parser generators are relatively general tools, since they can be used to build parsers for languages like HTML and scripting languages. Compiler creation software is specialized software for translating a language into native code. In addition to the parser generator, compiler creation software will usually include a standard intermediate that is shared by the language front ends, various optimization passes and code generators. At one time MetaWare sold compiler creation software. They don't seem to do this currently. Compass, a Boston area compiler and consulting company that worked on compilers for Thinking Machines and MasPar used to sell compiler creation software, but they are no longer in business. So the list included below is rather short.
Elegant compiler generation tool set from Philips Research.
To quote the Philips Research posting in comp.compiler, the Elegant compiler generator tool set is
an industrial compiler generator, used for many years within Philips. After development of Elegant was stopped, it was decided to make it available in the public domain under the GNU license. This includes all sources, so that you will be able to extend the system if you feel inspired by it.
Elegant is a compiler generator based on attribute grammars with the following features:
It accepts LL(1) and LALR(1) grammars.
It can generate a back-tracking parser for other grammars.
Any non-cyclic attribute dependency is allowed (!).
Comes with
front-end generator
scanner generator
postscript syntax diagram generator
Elegant programing language
Integrated automatic error recovery
According to Philips Research,
Compilers written in Elegant are very fast (several thousand lines of source text per second) and the attribute grammars offer a very expressive formalism on a high level of abstraction, without sacrificing performance.
The tool Front is a front-end generator that comes with Elegant. It accepts a mix of an EBNF context free grammar and a set of type definitions that describe the abstract syntax graph of the language. From this, Front generates an Elegant attribute grammar that maps an input string onto this data structure, including all scoping and symbol-table handling. Front allows the easy and fast construction of compiler front-ends, which makes it particularly suited for language design.
The Elegant programming language is smoothlessly integrated with the attribute grammar specification language (in fact, the latter is a subset of the former). The language is strongly inspired by functional programming languages, especially their type systems, yet, it is an imperative language that does not discourage side-effects. The language features:
Strong typing.
Subtyping (linear inheritance).
Polymorphic types.
Polymorphic functions.
Several different lazy types.
Overloading.
List comprehensions (they are overloaded, so they work for many other types as well, including your own).
Automatic and user definable coercions.
Pattern matching (= sub-type analysis).
Function (lambda) expressions.
Module system.
The whole system offers:
Lots of compile time and run-type checks.
Garbage collection.
Compiles onto ANSI C.
Interface to C.
Available on SunOS 5, HP_UX 11, Linux, IRIX.
Portable to anything with a decent ANSI C compiler.
Self generating, i.e. written in Elegant (of course!).
All in all a very professional and complete system, free for you to use!
Whether Elegant is really a "compiler generation tool set" depends on how one defines a compiler. There does not appear to be any optimization and code generations support (the compiler outputs C code). So Elegant is really software for the creation of compiler front ends and middles passes (where the middle pass follows the front end, but exists before control flow graph generation). On the other hand, this might be a great tool set to implement a language like Python.
Associated Compiler Engineers (ACE) CoSy Compilation System
Given the shortage of compiler creation software (compared to parser generators, for example) it would seem that there is not a huge market clamoring for these tools. This leaves one to puzzle over the deeper meaning of the quote below, from Associated Compiler Engineers:
Processors and hardware architectures are emerging more rapidly than ever before, have a shorter life-cycle, and are becoming increasingly complex. Compiler developers who still use the classical sequential operation approach are experiencing setbacks due to the complexity and in-chip parallelism of recent processors. The cost of hand-crafting optimal compilers is spiraling, and the classical approach therefore becomes less and less cost effective, and in the long run, impossible to bear.
While it is true that optimizing compilers are expensive to build, it is not clear that lots of people want to build them.
The CoSy Compilation System is ACE's compiler creation software toolset. This include front ends for a variety of languages, including Java. A full set of optimization passes are also available.