This week I proposed killing gcjx and replacing it with the Eclipse compiler. Per had looked into this before, but this proposal was triggered by a comment that Andrew made on irc that same morning. I surprised myself by taking to it with enthusiasm.
Since then I've done some more investigation. This project seems very practical, and I think it will let us have a 1.5 gcj much sooner. There are a couple of potential optimization regressions from going this route, but these are fixable in the compiler without too much work.
I also spent a little time hacking on the eclipse compiler's driver, trying to get it in shape for an experiment to test this plan. That turned out to be easy.
While doing this, though, I finally felt the sadness I knew would eventually arrive. The trigger was something very minor -- I was looking at the eclipse compiler driver, and realized that on a lexical level it is pretty ugly code. There aren't many comments, and the ones that are there aren't very good; the class I was hacking on didn't have a very good layout or even a consistent indentation style. And so I took a quick look at the corresponding code in gcjx... we're definitely losing something in this exchange. (But to be fair, the driver is not exactly a core part of the compiler. I doubt it gets much love.)
We're not losing much though, and I still think this is the best way forward. Plus, and this also surprised me, I seem to have gotten whatever emotional fix I was looking for from writing gcjx. I started it at a kind of professional local minimum, and writing it helped remind me that I'm reasonably competent at this programming thing. Now I'm on to feeling inadequate at a higher level.
Future compilers
I think some aspects of gcjx should be emulated in all future GCC front ends. For one thing, front ends should have their own representation, derived from the language being compiled -- they should treat GCC trees as a target format, not a high-level representation. Trees aren't statically typed, and they carry too much other baggage as well.
Second, front ends ought to be written as libraries. These days it isn't enough to write a traditional batch compiler -- you really want to look ahead a bit and consider IDE indexing, incremental compilation, and other uses of the parser and semantic analyzer.
Lately I've been interested in applying this treatment to the C++ compiler. I've been surprising myself quite a bit recently; I was never interested in C++ compilation at all until the last few months.
Last week I got gcjx to successfully parse and analyze the Classpath generics branch. I only needed a couple of hacks to get there :-). More recently I fixed some of the remaining problems with 1.5 code generation -- I added code to generate bridge methods and handle enum-typed switches. Now I'm going to switch back to the tree back end, with an eye toward merging gcjx to the trunk.
gcjx can now build all of libgcj, at least if you provide it with the correct flags. I've been finding compiler bugs by running a small gcjx-compiled program and looking into the crashes.
GCC as library?
Ranjit suggests that GCC might profitably be split into parts with well-defined APIs separating them. I think there's little disagreement on that point -- GCC has been moving in that direction. However, GCC's internals aren't really well-suited to this kind of thing. Still, hopefully someday GCC will end up there. It won't be soon enough to make libjava builds bearable, though; we must find some other solution to that problem. The plan on everybody's lips is splitting the library into multiple pieces; I'll probably look into it more seriously soon.
LLVM and Java
There's been some work on a JVM based on LLVM. Diversity in the VM space is nice, but at the moment we have too many VMs trying to inhabit the same niche. This makes no sense. Instead, we should be looking at sharing more code, just as we already share the class library, the test suite, and some random other bits. There is no reason we couldn't have a somewhat configurable core VM, implementing things like class layout and runtime linking, that would be shared among all VMs.
Bigger dreams aside, LLVM would have done better to simply pick an existing VM, say kaffe, and target it. That would be simple, even.
As Mark pointed out, last week I sent out a gcjx status note. I've done a lot of gcjx hacking recently, though, and some of this is now out of date.
In particular, I wrote most of the code for the binary
compatibility ABI, and I made the source-to-tree path robust enough to
compile all of libjava. I also wrote most of the tree-lowering
support for the new 1.5 language features; the only remaining things
are the new metadata (which requires work on libgcj as well), and
switch statements of enum type.
As a simple test, I got this program working when compiled to native (linked against a prebuilt libgcj, not one made with gcjx yet):
    public class q
    {
      public static void main(String[] args)
      {
        for (String x : args)
          System.out.println(x);
      }
    }
Also, I have patches to get a good part of the build working --
you can build the gcj driver now and get at least
partway into the libjava build.
So, expect another gcjx patch flood soon...
Last night, just in time for FOSDEM, I got working assembly code
out of gcjx for the first time. It was a do-nothing program, of
course, but nevertheless this is a big milestone. In particular this
means that a fair amount of tree lowering works; the driver works;
various lang hooks and interconnects with GCC work; and gcjx can write
out Class objects, vtables, and other forms of metadata.
So, what remains on this front is a long debugging war. Along the way I'll need to fix up some details; e.g. the current class format needs an upgrade to understand the new forms of metadata.
glibc wish
In Java it is possible to use class loaders to define multiple classes from a given representation of a class -- you can just pass the same bytes around; each class loader essentially has its own universe of types.
This doesn't translate too well to gcj at the moment, since
dlopen() doesn't do what we want when you try to open a
library more than once.
What would be cool (and I've heard that Solaris has this) is to be able to create new "dlopen contexts" that would allow us to load a given library once per context. Then in libgcj we could simply associate a context with each class loader, and avoid the nasty hack we have to do right now.
A few days ago I finally moved gcjx development from sourceforge
to gcc.gnu.org. The branch is named gcjx-branch. It
isn't fully hooked up to the build system yet, but you can build the
gcjx directory standalone and have a bytecode compiler.
I also recently ran jacks tests of both gcj and gcjx. The results are overwhelmingly in gcjx's favor:
    gcjx: Total 4928  Passed 4711  Skipped 45  Failed 172
    gcj:  Total 4928  Passed 4166  Skipped 44  Failed 718
What's funny is that their failures don't overlap very much, and yet they both manage to compile all of Classpath. Partly this can be explained by the fact that compilers tend to do better on correct code than incorrect code, but partly I just observe that even a fairly buggy Java compiler is still useful.
Andrew points out that, of course, gcjx will come with its own new undiscovered bugs as well -- and he said that without even looking at the incomplete tree-generating back end. Still, at this point we seem to have a lot of interesting code out there to use as test cases; I'm sure at merge time (I think optimistically it will be sometime this year) we'll have confidence in the result.
gcjx uses a simple version of the Visitor pattern for code generation. I've been thinking about this a bit lately, as experience with gcjx and random discussions with Graydon have been tweaking my interest in language design.
For those who don't know, visitors are basically a way to achieve dispatch on the dynamic type of an argument to a method. This is very handy for doing things like walking the model of a program that is built up inside a compiler.
In gcjx this takes a very simple form. There is an abstract visitor base class which has one abstract method for each object in the model, like:
    class visitor {
    public:
      virtual void visit_block (model_block *,
                                const std::list<ref_stmt> &) = 0;
      ...
    };
The arguments here are ad hoc, according to the particular object
being visited (it need not be done this way, but it was convenient
for gcjx).
Then each class in the model has its own visit method:
    class model_block {
      std::list<ref_stmt> statements;
    public:
      void visit (visitor *v) {
        v->visit_block (this, statements);
      }
    };
As you can see this results in a straightforward way to achieve
double dispatch. You simply call the visit method on
any element of the model, and the appropriate method in your visitor
will be called.
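To make the round trip concrete, here is a minimal, self-contained sketch of the same shape. Only visitor, model_block, visit_block and ref_stmt come from the snippets above; model_stmt, model_empty, dumper and dump_sample are names invented for this illustration, and ref_stmt is approximated with std::shared_ptr, which is an assumption rather than how gcjx defines it.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <string>

class model_stmt;
class model_block;
class model_empty;  // hypothetical second model class, for illustration only

// Assumption: a plain shared_ptr stands in for gcjx's ref_stmt.
typedef std::shared_ptr<model_stmt> ref_stmt;

// One abstract method per class in the model.
class visitor {
public:
  virtual ~visitor() {}
  virtual void visit_block(model_block *, const std::list<ref_stmt> &) = 0;
  virtual void visit_empty(model_empty *) = 0;
};

class model_stmt {
public:
  virtual ~model_stmt() {}
  virtual void visit(visitor *v) = 0;
};

class model_empty : public model_stmt {
public:
  void visit(visitor *v) override { v->visit_empty(this); }
};

class model_block : public model_stmt {
  std::list<ref_stmt> statements;
public:
  void add(const ref_stmt &s) { statements.push_back(s); }
  void visit(visitor *v) override { v->visit_block(this, statements); }
};

// A toy visitor that writes a textual outline of the model.
class dumper : public visitor {
public:
  std::string out;
  void visit_block(model_block *, const std::list<ref_stmt> &stmts) override {
    out += "{";
    for (const ref_stmt &s : stmts)
      s->visit(this);  // the virtual call picks visit_block or visit_empty
    out += "}";
  }
  void visit_empty(model_empty *) override { out += ";"; }
};

// Builds { ; { ; } } and walks it with the dumper.
std::string dump_sample() {
  std::shared_ptr<model_block> inner = std::make_shared<model_block>();
  inner->add(std::make_shared<model_empty>());
  model_block outer;
  outer.add(std::make_shared<model_empty>());
  outer.add(inner);
  dumper d;
  outer.visit(&d);
  return d.out;
}
```

The caller never inspects the type of a statement; the virtual visit call on the element and the virtual visit_* call on the visitor do the selection between them.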
One nice thing about this approach is that the compiler will tell you if your visitor is incomplete, since that can only happen if you didn't implement some abstract method. This also means it is easy to add a new class to the model -- all existing visitors will break, making it simple to figure out where to add new methods.
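Here is a small sketch of that compile-time check, using invented names (model_loop, old_visitor, new_visitor). It assumes a new class has just been added to the model along with its abstract method; the visitor that was not updated is still abstract, so any attempt to instantiate it is rejected by the compiler.

```cpp
#include <cassert>
#include <type_traits>

class model_block;
class model_loop;  // hypothetical class newly added to the model

class visitor {
public:
  virtual ~visitor() {}
  virtual void visit_block(model_block *) = 0;
  virtual void visit_loop(model_loop *) = 0;  // added along with model_loop
};

// Written before model_loop existed and never updated.
class old_visitor : public visitor {
public:
  void visit_block(model_block *) override {}
};

// Brought up to date by implementing the new method.
class new_visitor : public old_visitor {
public:
  void visit_loop(model_loop *) override {}
};

// old_visitor still has an unimplemented pure virtual method, so writing
// `old_visitor v;` anywhere would be a compile error.
static_assert(std::is_abstract<old_visitor>::value,
              "old_visitor is missing visit_loop");
static_assert(!std::is_abstract<new_visitor>::value,
              "new_visitor implements every method");

// Only the complete visitor can actually be created.
bool can_create_complete_visitor() {
  new_visitor v;
  (void) v;
  return true;
}
```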
The downside of this approach is that it is inflexible in a few
ways. For instance, consider the tree-generating back end in gcjx.
When compiling to trees, we want to build a new GCC tree object
representing each object in the model of the program. So, the obvious
way to do that would be to have the visit method
return a tree.
This is unsatisfactory, though, because it means you have to
modify every class in the model to allow this. This in turn means
that the declaration of tree must be visible globally --
it can no longer be segregated to a single back end. Of course this
could be worked around; e.g., visit could return
void*... but then you lose type safety and have to add
casts all over.
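One common way around this, sketched here as an assumption and not as a description of what gcjx does, is to leave visit returning void and have the visitor stash its result in a member. Every name below is invented for the example, and fake_tree is just a string standing in for GCC's tree type.

```cpp
#include <cassert>
#include <string>

class model_literal;
class model_negate;

class visitor {
public:
  virtual ~visitor() {}
  virtual void visit_literal(model_literal *) = 0;
  virtual void visit_negate(model_negate *) = 0;
};

class model_expr {
public:
  virtual ~model_expr() {}
  virtual void visit(visitor *v) = 0;  // still void: the model knows no trees
};

class model_literal : public model_expr {
public:
  int value;
  explicit model_literal(int v) : value(v) {}
  void visit(visitor *v) override { v->visit_literal(this); }
};

class model_negate : public model_expr {
public:
  model_expr *operand;
  explicit model_negate(model_expr *e) : operand(e) {}
  void visit(visitor *v) override { v->visit_negate(this); }
};

// Stand-in for GCC's tree type; only this back end ever sees it.
typedef std::string fake_tree;

class tree_builder : public visitor {
  fake_tree result;  // the "return value" of the most recent visit
public:
  fake_tree build(model_expr *e) {
    e->visit(this);
    return result;
  }
  void visit_literal(model_literal *l) override {
    result = std::to_string(l->value);
  }
  void visit_negate(model_negate *n) override {
    fake_tree inner = build(n->operand);  // recurse, then read the result
    result = "(neg " + inner + ")";
  }
};

fake_tree build_sample() {
  model_literal five(5);
  model_negate neg(&five);
  tree_builder b;
  return b.build(&neg);
}
```

This keeps the tree type private to one back end, at the cost of threading results through mutable state, which is easy to get wrong when a visit method forgets to set it.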
Another approach to this problem is multi-methods, which means doing dispatch on the runtime type of the arguments. This way you can use generic functions instead of visitors, and then easily add new kinds of visitors without modifying the classes in the model.
C++ doesn't directly support this, though apparently it can be done. One drawback I do see here is that it doesn't seem possible to determine when you haven't written a method. The compiler, seemingly, can't tell you... a classic sort of static/dynamic tradeoff. I'm not really all that familiar with existing multimethod implementations; maybe there is some nice way to inform compilers of one's intent here.
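For the curious, one way to fake multimethods in C++ is a table keyed on the runtime types of both arguments. This is only a rough sketch with invented names (generate_fn, the backend classes, and so on), not any particular library's API. It also shows the drawback: nothing was registered for model_loop, and that is only discovered when the call is made.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

struct model_node { virtual ~model_node() {} };
struct model_block : model_node {};
struct model_loop : model_node {};

struct backend { virtual ~backend() {} };
struct bytecode_backend : backend {};
struct tree_backend : backend {};

// A generic function that dispatches on the dynamic types of both arguments.
class generate_fn {
  typedef std::pair<std::type_index, std::type_index> key;
  std::map<key, std::function<std::string(model_node &, backend &)> > table;
public:
  // Register a method for the exact pair of types (N, B).
  template <typename N, typename B>
  void add(std::string (*f)(N &, B &)) {
    table[key(typeid(N), typeid(B))] = [f](model_node &n, backend &b) {
      return f(static_cast<N &>(n), static_cast<B &>(b));
    };
  }
  std::string operator()(model_node &n, backend &b) const {
    auto it = table.find(key(typeid(n), typeid(b)));
    if (it == table.end())
      throw std::logic_error("no method for this combination");
    return it->second(n, b);
  }
};

std::string block_to_tree(model_block &, tree_backend &) { return "block->tree"; }
std::string block_to_bytecode(model_block &, bytecode_backend &) { return "block->bytecode"; }

generate_fn make_generate() {
  generate_fn g;
  g.add(&block_to_tree);
  g.add(&block_to_bytecode);
  return g;  // note: no methods at all for model_loop
}

std::string dispatch_block_tree() {
  generate_fn g = make_generate();
  model_block b;
  tree_backend t;
  return g(b, t);
}

// The missing model_loop method compiles fine and only fails when called.
bool loop_method_is_missing() {
  generate_fn g = make_generate();
  model_loop l;
  tree_backend t;
  try {
    g(l, t);
  } catch (const std::logic_error &) {
    return true;
  }
  return false;
}
```

New back ends can be added without touching the model classes, but this toy version matches exact types only and does no inheritance-aware lookup.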
A third approach, taken in GCC, is to simply switch
on the type of the object. One advantage of this approach is that it
is often simpler to keep track of local state -- you can write
iterative code instead of recursive code in some places, you don't
have to invert a lot of logic to put things in separate functions,
etc. This also suffers from the problems that arise if you add a new
class: nothing forces you to revisit every switch.
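A toy version of the switch style, with made-up names and nothing taken from GCC itself, shows both halves of that tradeoff. The walk is an iterative loop over a worklist with its state in local variables, and the later-added KIND_LOOP quietly lands in the default case.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical node kinds; KIND_LOOP was added after count_returns was written.
enum node_kind { KIND_BLOCK, KIND_RETURN, KIND_LOOP };

struct node {
  node_kind kind;
  std::vector<const node *> children;
};

// Counts return statements iteratively; `unknown` counts nodes whose kind
// the switch does not handle.
int count_returns(const node &root, int &unknown) {
  int returns = 0;
  std::vector<const node *> work(1, &root);
  while (!work.empty()) {
    const node *n = work.back();
    work.pop_back();
    switch (n->kind) {
    case KIND_BLOCK:
      for (const node *c : n->children)
        work.push_back(c);
      break;
    case KIND_RETURN:
      ++returns;
      break;
    default:
      // KIND_LOOP ends up here; the code still compiles without complaint.
      ++unknown;
      break;
    }
  }
  return returns;
}

// Walks { return; loop; { return; } } and reports (returns, unknown).
std::pair<int, int> sample_counts() {
  node r1{KIND_RETURN, {}};
  node r2{KIND_RETURN, {}};
  node loop{KIND_LOOP, {}};
  node inner{KIND_BLOCK, {&r2}};
  node root{KIND_BLOCK, {&r1, &loop, &inner}};
  int unknown = 0;
  int returns = count_returns(root, unknown);
  return std::make_pair(returns, unknown);
}
```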
Coding styles that substitute programmer discipline for compiler errors don't seem to work that well for me. The ideal approach would look somewhat like multimethods, but would let me have the compiler check self-imposed constraints about which methods must exist.