Varfar: Which MARS?
- Performance hints
- Bad Benchmarks
<to Varfar's RedCoder
Before I started work on RedCoder I did a good search and inspect of several of the MARSs whose source is freely available. If you are a programmer, interested in corewars and interested in the code that runs corewars, then I hope that sharing my thoughts on each will save you time and augment your research.
In my quick reviews, I am not biased towards dissing them, despite seeming to list only bad things. I try to rectify the balance in the recommendations section.
Info: written in c, very small
Good: the (draft) spec contains the sourcecode for an example MARS. Small and clean. ICWS'94 compliant! (hopefully)
Bad: The MARS instruction set has be extended since the draft; therefore this example MARS isn't going to support a lot of the very latest and greatest things. The MARS is incomplete, and does not contain a loader.
Info: written in c, small
Good: pMARS is the original implementation of a MARS. It supports all the current functionality, it runs the tournaments and king-of-the-hill, and it runs on lots of platforms. If you use this as your engine, you'll know that your results are accurate! It supports several different client frontends (Xwindows, DOS windows, command-line, via email etc)
Bad: complicated, obfusicated. Difficult to untangle. It supports all these clients by having lots and lots of #ifdefs all over the place. If you are examining the sourcecode, then it really helps to preprocess that sourcecode to build a sensible view that suits you, and then examine it.
Info: written in c, small
Good: exhaust is a small MARS that is designed to be easily embedded in your applications. The sourcecode is well formatted and well thought out. It supports the same language features (pspace etc) as pMARS. It is continuously validated against pMARS, giving you good trust in it's accuracy.
Bad: it does not load proper redcode assembly files, only preprocessed ones. It does not provide hooks for graphical interfaces to examine games that are in-progress (easy to add, though).
Info: written in c++, provides a template class for you to create your own MARS; big executable, but very fast emulation
Good: nice and object-oriented, the biggest advantage of "quicker MARS" is it's speed. The 'behaviours' template makes it very straightforward to add/ammend the way that your code interacts with the MARS.
Bad: it uses template meta-programming, which is neat yet inflexible. It means that you compile-in your MARS parameters (size of core, number of warriors etc). (I am quite sure that is what the author intended, and for good reasons. See the recommendations where I describe how and when this is actually a good point.) The engine gets it's speed because it has unrolled the entire MARS into one massive switch-statement; the sourcecode is something like 160k lines, and this sourcecode is generated by another program! It takes an age to compile, depending on the way that your compiler works. Gcc 3.2 seems very good and fast at compiling it, but compilers that are good in other areas (e.g. Metrowerks CodeWarrior) might take an age. The engine also doesn't support some of the latest language features, such as pspace. It isn't as well validated as exhaust. It cannot load true redcode assembly programs, only preprocessed ones
Info: written in c, small
Good: the corewars program uses Gtk+ to run on Linux (Gtk+ is ported to win32 IIRC, so maybe a windows version is just a compile away?). The main switch statement is one of the cleanest and easiest to understand. It can parse redcode programs
Bad: Confusingly, it also supports a "corewars" language which isn't the same thing as "redcode". I am unsure if it is very validated, and how complete it's implementation of the latest extensions to the redcode language are. The parser might be slightly different in subtle ways to the pMARS parser.
RedCoder is hosted on this site
Info: written in Object Pascal, bit long
Good: very 'hookable', generates lots of callbacks as it executes. Can be 'debugged' by setting breakpoints and stepping forward one instruction at a time.
Bad: Does not support pspace, is not tested much, is not fast.
You will have different needs from others. Different engines are suited to different tasks, and your choice depends upon your needs. I'll break them down into two categories:
For building a nice GUI..
If you are building a nice GUI, you'll want easy to understand sourcecode from which to hook your GUI. Speed of the underlying engine isn't critical - after all, the responsiveness of the GUI and it's debugging features are critical and the bottleneck - drawing a small rectangle on screen is going to take far longer than the actual emulation of the instruction that small rectangle represents, so your first focus for optimisation should be your graphics routines. The most straightforward to integrate is probably exhaust; however, you really also need a proper redcode parser.
For evolving Warriors..
If you are evolving warriors (using the conventional gene=instruction level) then chances are you aren't interested in the mechanics of a fight, only the outcome. And you will have massive populations (1000s?) per generation, which is a lot of matches, so speed of your MARS is essential. For this, qMARS is ideal.
However, the speed of evolution isn't only limited by the speed of your MARS. Most evolver scripts are heavily file-based, with each warrior being generated into a file. This means that for each battle, two warriors have to be fetched from disk and parsed. In todays desktop computer, holding a few thousand warriors in RAM is trivial, and there is no need to have this binding to the filesystem I/O performance. My approach would be to have your whole generation ready-assembled (in the internal instruction representation of your chosen MARS) as well as a score table. Then, after each generation, you can serialise your generation to disk (in a binary format, probably still using the internal instruction representation of your chosen MARS) along with the scores. I'd make the evolver a standardalone program that was in your computer startup scripts so that it is continuously running in the background. As you shutdown, it can abandon any half-done generation, resuming from the last serialised generation generated. Just leave it a few days, and you'll be more productive than, say, seti! Look at the core war tribes project; pity it doesn't have public source (that I can find).
When evolving warriors, emphasis is upon speed. My recommendations for cutting the process to the bone to get maximum performance are:
- Avoid generating warriors in assembler, generate them in the internal instruction representation of your MARS instead; this avoids the costly parsing
- Keep all your warriors in memory whereever possible, e.g. the whole generation and any control warriors
- Cull still-born warriors by running them alone in the MARS for a set number of cycles (500 or 1000?) and seeing if they die
- If you have a rule such as 'only the n fittest warriors can breed each generation', have a sorted list of this top number. After the first n warriors
have been evaluated, you can start to drop unfit warriors as soon as they have crossed the threshold where even if they win every other match, they will not better
the nth warrior
Staring at the sourcecode of so many MARSs has made me aware of the basic formulae they imploy. They are all basically similiar, if not pretty much the same. I'll first describe how they work, and then the various alternatives that might be explored (but I can make no claim to having examined their merits).
The internal instruction representation
MARS generally encode the op-code, op-modifier and both operand addressing modes into a single word (32-bits is the word-size of the x386 target). The operands are then stored in two dedicated words.
The Big Switch
The Big Switch (or case-statement if you prefer) is the central loop of almost all MARSs. It consists of the following steps:
- load the instruction into a copy of the instruction
- for first the A-field, and then the B-field, evaluate them; this involves determining the line that the operand addresses, and the value found there (stored in a copy, since some addressing modes can modify the real underneath operand)
- use a switch statement (a big if-the-else-if device) based upon the instruction
- for that instruction, use a nested switch-statement to do the appropriate action based upon the op-modifier.
indeed, the sources of most of these switches is so similiar, it really had to come from the same source originally! QMARS is special in that it's speed is due to the fact that it unrolls the nesting switch statements and does one Absolutely-Whooping-Switch. Also, it is assumed that modern compilers (e.g. gcc 3.2) can actually turn a very large switch statement into an array of jump addresses, and therefore not actually do an if-then-else-if bit comparing the instruction to each possible instruction sequentially until a match is found. But this is hidden from the source view, fortunately.
Unrolling the Switch Explicitly In Code
Idea: Imagine the encoded instruction word being used as an offset into an array of function callbacks..; you can have null callbacks to fill the gaps that are illegal instruction combinations
Predetermining the Instruction Callback
Idea: For each instruction, as you assemble it, determine the callback that can execute it. Add the address of this callback to your internal binary instruction representation. When you use MOV.I, remember to copy this callback address too!
Using Proper Object Orientation
Idea: CInstruction is an abstract base class, CMOV and CDAT etc are concrete implementations, COperand is an abstract base class that is implemented by CDirect and CImmediate etc. Then the execution is just a case of calling virtual methods. This might not be efficient (although modern compilers like gcc 3.2 really try to minimise 'abstraction performance penalty') but it sure would be nice and readable!
Idea: For the c++ programmers, you might consider making the substeps of the big switch statement into templates, and then combining them into classes of each legal instruction. It's just an idea, ok! ;-)
Idea: Studying the big switch, you'll see the same things being done again and again in very subtle combinations; there is a lot of duplication. Instead of having the conventional one assembly statement = one internal instruction representation, have a simplier instruction set internally where one redcode assembly statement is actually represented by several internal instructions. I'll say no more, but this idea has merits that aren't immediately obvious!
Thanks to Martin Ankerl, author of QMars and contributor of optimisations to Exhaust for his valuable input.
Some instructions are more common than others. Extensive data has been gathered from warriors, showing that some instructions are far more common than all other instructions put together! The most common instructions, by percent, are:
Martin reports a 12% speed-up when treating
mov.i as a special case in the loop in Exhaust! Clearly a very real optimisation. (Some compilers may reorder switch statements or unroll them.)
Modern memory is relatively slow. Or rather, fast memory is relatively expensive. A compromise used by most desktop processors is caching. A small amount of fast expensive memory is near the processor and a large amount of slow cheap memory is further away. Inbetween there may be several graduations of cheapness/slowness. If you can manage to fit all your code and the core inside the processor's fastest memory, it will run that much faster. That or get an expensive processor with lots of fast, expensive close memory..
Someone someday might work out how to get a graphics card with it's relatively large amount of expensive, fast memory to run a MARS. The world of 'custom shaders' beckons closer examination ;-)
Be wary of benchmarks - that is a good maxim generally. Be extra careful of MARS benchmarks. Most new MARS are benchmarked against pMARS for a set series of warriors. However, most new MARS don't load proper redcode, but require that the assembly file be preprocessed (ironically by pMARS). This means that their parsing stage is much simpler and, hopefully, much quicker. Now there are circumstances (e.g. evolving warriors) where you may know you do not need to preprocess your input files, but otherwise beware of the difference in loading that pMARS and the new MARS undertake, and that benchmarks generally put the two together..
Great! If you want to chat about the internals of MARSs, I might not be very experienced (yet), but I sure am enthusiastic! Or if you have corrections for me, I apologise and please send them to me so I can ammend this and my other pages! My email address is howard_harry at hotmail dot youknow. If you don't get a response, it is because I've moved on because of spam. But I do monitor the corewars usenet, so post there! And thanks for reading right to the bottom ;-)