Friday, February 19, 2010

Concurrency in maven ?

Within the maven community, there has been a push towards parallelizing maven itself to achieve better build performance in multi-module reactor builds. A number of strategies have been tried, and the two main ones are parallel reactor mode, which uses the module dependency graph to schedule modules that can be built concurrently, side by side, and "weave" mode, which traverses the modules phase-by-phase instead of module-by-module (you can read about it here).
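To make the parallel reactor idea concrete, here is a minimal sketch of dependency-graph scheduling. It is not maven's implementation: the class name, the module names and the "build" step (which only records the module name) are all hypothetical, and the graph is assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReactorSketch {

    /** Hypothetical module graph: module name -> modules it depends on. */
    public static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("core", List.of());
        deps.put("api", List.of("core"));
        deps.put("util", List.of("core"));
        deps.put("app", List.of("api", "util"));
        return deps;
    }

    /** Builds every module, starting each one as soon as all of its dependencies are done. */
    public static List<String> buildAll(Map<String, List<String>> deps, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, CompletableFuture<Void>> scheduled = new HashMap<>();
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        try {
            for (String module : deps.keySet()) {
                schedule(module, deps, scheduled, finished, pool);
            }
            // Wait until every module has been built.
            CompletableFuture.allOf(scheduled.values().toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(finished);
    }

    /** Creates (once per module) a future that runs after all upstream futures complete. */
    private static CompletableFuture<Void> schedule(String module,
                                                    Map<String, List<String>> deps,
                                                    Map<String, CompletableFuture<Void>> scheduled,
                                                    List<String> finished,
                                                    ExecutorService pool) {
        CompletableFuture<Void> existing = scheduled.get(module);
        if (existing != null) {
            return existing;
        }
        List<CompletableFuture<Void>> upstream = new ArrayList<>();
        for (String dep : deps.getOrDefault(module, List.of())) {
            upstream.add(schedule(dep, deps, scheduled, finished, pool));
        }
        CompletableFuture<Void> build = CompletableFuture
                .allOf(upstream.toArray(new CompletableFuture[0]))
                .thenRunAsync(() -> finished.add(module), pool); // stand-in for the real build
        scheduled.put(module, build);
        return build;
    }

    public static void main(String[] args) {
        System.out.println(buildAll(exampleGraph(), 4));
    }
}
```

In this sketch "api" and "util" may finish in either order, since neither depends on the other; "core" always finishes first and "app" last.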

Both have "fully" functional implementations available, and weave mode is quite a lot faster than parallel mode. The code is available at http://github.com/krosenvold/maven3. Just build and run with -Dmaven.threads.experimental=4

So what is this post about ? I am the primary author of "weave" mode, and for the last few weeks I've been chasing an elusive goal: 1000 consecutive green builds of a single project on my CI environment.

Initially I was quite afraid of the thread safety issues within maven; after all, retrofitting concurrency onto any non-concurrent code can be a daunting task. Fortunately there is a lot of state that is /copied/ in maven reactor mode. From a concurrency perspective, this saves the day.

So why am I not getting my 1000 greens ? Every 300-400 builds it would fail with strange errors. I asked a few questions (and this one) on stackoverflow.com. It's the file system. The java file system has no guarantees of /anything/ when it comes to concurrency. The only thing you can be sure about is that the single thread that wrote the file can also read it afterwards.
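The handoff in question can be pictured with a small sketch: an "upstream" thread writes and closes a file, signals that it is done, and a "downstream" reader then opens the same file. The class and method names are hypothetical, and the latch is only a stand-in for the build scheduler. The latch orders the two threads in memory; whether the file contents are fully visible at that moment is the part that, per the post, the file system does not promise.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

public class FileHandoffSketch {

    /**
     * Writes the file on a separate "upstream" thread, waits for its
     * completion signal, then reads the file back on the calling thread.
     */
    public static String writeThenRead(Path file, String content) throws Exception {
        CountDownLatch written = new CountDownLatch(1);
        Thread upstream = new Thread(() -> {
            try {
                // Files.write opens the file, writes the bytes and closes it.
                Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                written.countDown(); // signal "upstream is done"
            }
        });
        upstream.start();
        written.await(); // downstream starts only after the signal
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("handoff", ".bin");
        System.out.println(writeThenRead(file, "compiled bytes"));
    }
}
```

This only shows the shape of the handoff; it is not expected to reproduce the intermittent failure described here.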

javac uses the file system. And I was quite baffled by this; in weave mode the javacs are invoked on a pretty tight schedule; they typically come within just a few ms of each other. Every now and then the downstream javac would complain about "bad class files" from the upstream javac. But the scheduling is done properly, and the first javac /was/ done. How to solve it ? Turn on "forkMode" in maven for javac.

I had a chat with the nice folks at #kernel and they told me that all contents of a file should be concurrently visible to /everyone/ upon close() in a modern Linux kernel. When I turned on forkMode in javac the problem went away, because forkMode=true basically delegates the visibility issues to the OS.
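The difference between the two ways of running javac can be sketched as follows. This is an illustration using the standard javax.tools API and ProcessBuilder, not the compiler plugin's own code; the class and method names are hypothetical, and it assumes it runs on a JDK (not a bare JRE) with a javac executable under java.home or on the PATH.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class ForkSketch {

    /** Compiles inside the current JVM; returns javac's exit code (0 means success). */
    public static int compileInProcess(Path source, Path outDir) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler; run on a JDK, not a JRE");
        }
        // null streams mean System.in / System.out / System.err.
        return compiler.run(null, null, null, "-d", outDir.toString(), source.toString());
    }

    /** Compiles in a separate javac process; returns the process exit code. */
    public static int compileForked(Path source, Path outDir)
            throws IOException, InterruptedException {
        // Assumption: javac lives under java.home/bin; otherwise fall back to the PATH.
        Path javac = Paths.get(System.getProperty("java.home"), "bin", "javac");
        String exe = Files.exists(javac) ? javac.toString() : "javac";
        Process process = new ProcessBuilder(exe, "-d", outDir.toString(), source.toString())
                .inheritIO()
                .start();
        // All class files are written and closed by the child before it exits.
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("fork-sketch");
        Path source = dir.resolve("Hello.java");
        Files.write(source, "public class Hello {}".getBytes());
        Path inProcessOut = Files.createDirectories(dir.resolve("in-process"));
        Path forkedOut = Files.createDirectories(dir.resolve("forked"));
        System.out.println("in-process exit code: " + compileInProcess(source, inProcessOut));
        System.out.println("forked exit code: " + compileForked(source, forkedOut));
    }
}
```

In the forked variant every class file is written and closed by another process that has exited before the caller continues, so any later reader relies on the OS for visibility rather than on anything inside the JVM.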

You /can/ try this yourself if you check out from github and try to build a project. It works best if you do a "mvn -Dmaven.threads.experimental=4 clean install", since that'll write a lot of files.

I'm still scratching my head about what to do with this; given that forking delegates visibility to the underlying OS, one could just fork everything all the time. Or find some other option... Suggestions ?

3 comments:

  1. If you are running javac for all projects/modules anyways, why don't you just aggregate their source-paths and maybe even classpaths, so you'll save a good chunk on loading of classes and building class hierarchies between modules...

  2. That's a cool rewrite of the compiler plugin, actually fits quite well with weave mode too, since all projects are in "compile" at the same time.

  3. Wouldn't memory-mapped files, in Java represented by a MappedByteBuffer, provide stronger synchronization guarantees?

    Recent versions of the Eclipse Java compiler (ECJ) have also introduced parallel compilation and might be worth a study.
