dollop, a Python Lisp interpreter

June 15, 2009 at 09:00 AM | categories: Uncategorized | View Comments

Via Hacker News, Hans Nowak's proof of concept:

The proof-of-concept implementation that uses this concept is called dollop and is available at github. (Requires Python 3.0.) The name is because it's only a "dollop of Lisp" (or rather, Scheme); it only supports a few special forms (begin, define, if, lambda), and a few functions for example programs (+, -, *, =, list). It cuts corners in other ways as well, as my goal was to get a working proof-of-concept out, not to write a complete Scheme interpreter.

Hans explains how he avoids stack overflow, due to tail recursion, by replacing the machine call stack with an explicit stack data structure in his interpreter. A tail call replaces the token at the end of this stack, instead of pushing a new one.

This is a technique you can apply in an interpreter based around an eval-apply cycle: with a full compiler, the principle is the same, but you have to detect tail calls in advance and generate the correct bytecode -- say, a .NET tail.call instruction or an x86 jmp opcode.

Read and Post Comments

Lisp compiler in F#: What's next?

June 09, 2009 at 11:12 PM | categories: Compiler | View Comments

This is part 5 of a series of posts on my Lisp compiler written in F#. Previous entries: Introduction, Parsing with fslex and fsyacc, Expression trees and .NET methods, IL generation | Browse the full source of the compiler on GitHub

This post marks the end of the first series of Lisp compiler posts, since we're at the point where the code does something useful while still being compact enough to explain in a few blog posts. In the future it should make an interesting test bed for learning about compiler techniques, which I hope to cover here.

Here's some of the ideas I'd like to try out:

  • Implement all the Lisp functionality from Jonathan Tang's Write Yourself a Scheme in 48 Hours tutorial. Jonathan explains how to implement a Scheme interpreter in Haskell, which is a similar goal to my Scheme-like compiler in F#. (It was actually his tutorial that first gave me the idea of doing it in F#.)
  • .NET integration:
    • Ability to create new .NET objects, call instance methods, and access properties and events. Currently we're restricted to static methods.
    • Ability to define your own .NET classes. One thing I'd like to be able to do is implement the NUnit tests for this project directly in Lisp, which means the compiler needs to be able to generate instance methods with custom attributes applied.
    • A full System.CodeDom.Compiler implementation: Lisp on ASP.NET anyone?
  • Optimisations:
    • Tail call optimisation: replace the call, ret sequence with tail.call, ret. This is less an optmisation and more a necessity, since recursion is currently the only way to implement looping, and we need tail calls to avoid overflowing the stack. The tail opcode prefix is recognised directly by the CLR: another approach would be for the compiler to implement a recursive function as a while loop. Tail calling on .NET is a moderately interesting topic in its own right: see the links from this Stack Overflow question to get an idea of the issues involved.

      Edit: as Paul points out, the trick is to optimise all tail calls, not just tail recursion. It's easy to come up with a pair of functions that call each other: if we just looked for recursive calls back to the same function, we'd blow up the stack in this situation. Luckily for us, the IL tail prefix is valid on any call, as long as it comes just before a ret, so we don't need to be too clever.

    • Arithmetic optimisations: something as simple as simplifying constant expressions at compile time
  • A command line compiler, with parameters similar to those of csc.exe or fsc.exe
  • Don't generate IL directly to ILGenerator; assemble F# data structures representing the IL, so that we can apply IL optimisations using standard F# constructs such as pattern matching
  • Implement flow control within the compiler on top of a flow control graph data structure, along the lines of basic blocks. This will make optimisations of program flow easier to implement.

Let me know if you've found this series useful so far, or if you have any corrections or suggestions on what I've been writing.

Read and Post Comments

Vague class names: "Manager" and "Helper"

June 06, 2009 at 06:48 PM | categories: Uncategorized | View Comments

From journal.stuffwithstuff.com:

Do not use "Manager" or "Helper" or other null words in a type name.

If you need to add "Manager" of "Helper" to a type name, the type is either poorly named or poorly designed. Likely the latter.

I always agreed with this: names like ConnectionManager and XmlHelper are too vague to be useful. But I never came up with a retort as pithy as this one of Robert Nystrom's:

Types should manage and help themselves.
Read and Post Comments

Lisp compiler in F#: IL generation

June 06, 2009 at 02:47 PM | categories: Compiler | View Comments

This is part 4 of a series of posts on my Lisp compiler written in F#. Previous entries: Introduction, Parsing with fslex and fsyacc, Expression trees and .NET methods | Browse the full source of the compiler on GitHub

What we've done up to this point is:

  • Taken a string representing a Lisp program and turned it into an F# data structure representing a list of s-expressions
  • Reformatted these s-expressions so we can take some shortcuts when we generate IL
  • Written a couple of helper functions that pick the right overload of a .NET method (which our hello world program needs in order to call Console.WriteLine)

In this post I'm going to cover the final step in the compilation process, the generation of the IL itself. By IL, I'm referring to the Common Intermediate Language, the low-level machine-independent opcodes that the .NET JIT compiler turns into native machine code at runtime. The virtual machine on which IL is based operates using a stack for operands and results, rather than the registers that are found on x86 processors. (Using a virtual stack makes IL more portable across CPUs with different register layouts; it's up to the JIT compiler to assign machine registers to stack locations as they're needed.) You can view the IL contained within any .NET assembly using the ildasm tool, which is included with Visual Studio and in the .NET SDK, or within .NET Reflector.

An IL hello world looks like this:

// Push a System.String instance onto the stack. 
// The words "Hello world" are embedded within the executable.
ldstr      "Hello, world"

// Pop one System.Object instance from the stack and call 
// System.Console.WriteLine. This method is declared void, so 
// nothing is pushed onto the stack after the method returns.
call       void [mscorlib]System.Console::WriteLine(object)

// Return to this method's caller
ret

Along with the IL, the assembly -- an EXE or a DLL -- contains various metadata, which describes the types and methods defined within the assembly (such as the Program class and its Main method that contains our code above, and the words "Hello world"), as well as the references the assembly's code makes to the outside world (such as the details of the System.Console.WriteLine method). We don't have to write any of this metadata ourselves, though: the System.Reflection.Emit namespace contains a nice set of types that allow us to construct assemblies, types and methods without worrying about putting together the bytes of the final EXE or DLL file.

We put together the IL itself using the ILGenerator class:

generator.Emit(OpCodes.Ldstr, "Hello world")
generator.Emit(OpCodes.Call, typeof<Console>.GetMethod("WriteLine", [| typeof<obj> |]))
generator.Emit(OpCodes.Ret)

For the purposes of this blog post, we're going to simplify the code generation even further, and use the DynamicMethod class, which allows us to generate a stream of IL opcodes in memory, then execute them, without having to write a full assembly file to disk. (The code on GitHub demonstrates both approaches -- see Compiler.fs for the code that writes an assembly file.)

We're going to write the compile function that I talked about in the last post:

let rec compile
    (generator : ILGenerator)
    (defineMethod : string -> Type -> Type list -> #MethodInfo * ILGenerator)
    (env : Map<string, LispVal>)
    (value : LispVal) 
    : (Map<string, LispVal>)

We'll have given to us: an ILGenerator instance; a function that creates a new method and returns an ILGenerator for that method; a map containing any variables and functions declared earlier on in this function; and a LispVal representing the line of code we're being asked to generate IL for. We'll return a map containing all of the variables and functions originally passed to us, plus any new variables or functions we might have declared in this line of code.

CodeGenerator.fs

let rec compile (generator : ILGenerator) defineMethod =

Detour: Currying and partial application

First of all, notice that the F# declaration of the compile function looks nothing like the function interface I just talked about. That's because we're going to define compile as a function that accepts two arguments and returns another function that accepts the next one, which returns yet another function that accepts the last argument and returns a value. In fact, all F# functions work this way, as functions that accept one argument and return another function that accepts the next. From the point of view of the caller, it's not possible to tell the difference between a function written like this:

<code>let f = 
    fun a ->
        fun b ->
            fun c ->
                fun d -> 
                    printfn "called function f with %d %d %d %d" a b c d
</code>

and one like this:

let f a b c d = 
    printfn "called function f with %d %d %d %d" a b c d

In the first example, the programmer defines f as a series of functions in which you pass a value to one function in order to obtain the next: in this case, the programmer is said to be currying. In the second example, the programmer states all of f's parameters on one line, yet we can still call f piece by piece ("partially apply") if we want to: the F# language performs the currying automatically.

let rec compile' env =

So far we've consumed the generator, defineMethod and env parameters. Now we're writing a compile' function, which we're going to use not only when the compile function itself is called, but also recursively from within the compile' function. We do this because we want to be able to generate IL using the same generator and defineMethod parameters originally passed to us, but with different values for env and value.

function
| ArgRef index -> 
    generator.Emit(OpCodes.Ldarg, index)
    env

ArgRef: The compile' function immediately performs pattern matching against its last argument: the F# function keyword acts like fun and match combined. We'll handle each of the LispVal cases in alphabetical order: first is ArgRef, which is a value we'll find within a function's environment. There's one of these for each of the function's parameters, and since IL refers to its arguments using numerical indices, so will we. We use the ldarg opcode to fetch the index'th argument and push it onto the VM stack. We don't change the function's environment, so we return env unchanged.

| Atom a -> 
    a |> ident env |> compile' env

Atom: An atom is a string that refers to a variable or function in the function's environment. We deal with these by calling ident to look up the string in the current environment, then recursively calling compile' to generate the IL for whatever ident finds. In practice this case is only used for variable and function argument lookups; function names are also atoms, but the only thing we do to functions is call them, and we call functions lower down when we encounter an atom within List node. (In a later version we might want to treat function names that appear outside of function calls like C# does, and turn them into delegates.)

| Bool b -> 
    let opCode = 
        if b 
        then OpCodes.Ldc_I4_1 
        else OpCodes.Ldc_I4_0
    generator.Emit opCode
    env

Bool: Bools are the first of our constants to appear. IL represents bools as integers, and it has built-in opcodes whose purpose is to load an integer between 0 and 8 onto the stack, so we generate either ldc.i4.1 for true or ldc.i4.0 for false here. (Our handling of true and false is completely wrong from a Lisp point of view, which uses nil for false and anything else for true.)

let emitIf opCode env thenValue elseValue =
    let thenLabel = generator.DefineLabel()
    let endLabel = generator.DefineLabel()
    generator.Emit(opCode, thenLabel)
    elseValue |> compile' env |> ignore
    generator.Emit(OpCodes.Br, endLabel)
    generator.MarkLabel thenLabel
    thenValue |> compile' env |> ignore
    generator.MarkLabel endLabel

| IfPrimitive (ListPrimitive (Equal, [ a; b ]), thenValue, elseValue) ->
    let env' = a |> compile' env
    let env'' = b |> compile' env'
    emitIf OpCodes.Beq env'' thenValue elseValue
    env''

| IfPrimitive (testValue, thenValue, elseValue) ->
    let env' = testValue |> compile' env
    emitIf OpCodes.Brtrue env' thenValue elseValue
    env'

IfPrimitive: The built-in (if test then else) form. There's a special case for (if (= a b) then else), because IL has its own combined branch-on-equal opcode, beq. If we defined more built-in comparison forms besides Equal, we'd have more special cases here for the other opcodes.

| LambdaDef _ -> 
    raise
        <| new NotImplementedException("didn't expect lambda outside variable")

| LambdaRef _ -> 
    raise <| Compiler("can't compile lambda - try invoking it instead")

LambdaDef, LambdaRef: Here we state two limitations on functions: the first happens if we try to use the (lambda) form outside of a variable declaration, and the second occurs if we try to use a function name outside of a function call, which I mentioned above. In a future version we might want to turn these into .NET delegates, although at the moment we have no way of specifying the delegate type (in fact, we have no way to instantiate instances of .NET objects at all).

| List (Atom a :: args) ->
    match lambdaIdent args env a with
    | LambdaRef (methodInfo, isParamArray, parameterTypes) -> 

List: Here's where we generate .NET method calls, which in Lisp appear as a list, with the function name (an atom) appearing first, followed by the function's arguments, if any. We start by looking up the function name in the environment and expecting it to resolve to a LambdaRef. I talked about the lambdaIdent function in the last post, and it is this function that picks the right method overload given a set of arguments.

        let emitBoxed (expectedType : #Type) env x =
            let env' = compile' env x
            match typeOf env x with
            | a when not expectedType.IsValueType && a.IsValueType -> 
                generator.Emit(OpCodes.Box, a)

            | _ ->
                ()

            env'

A helper function to automatically box instances of value types when needed, so that, for instance, we can pass ints to Console.WriteLine. Note that we need to annotate the F# function signature: expectedType : #Type denotes that expectedType is Type, or one of Type's subclasses. We do this in order to access its IsValueType property.

        let rec emitArgs (parameterTypes : #Type list) env args =
            match args, parameterTypes with
            | arg :: otherArgs, [ parameterType ] when isParamArray ->
                let elementType = parameterType.GetElementType()

                let rec emitArrayInit env position =
                    function
                    | value :: values ->
                        generator.Emit(OpCodes.Dup)
                        generator.Emit(OpCodes.Ldc_I4, int position)
                        let env' = emitBoxed elementType env value
                        generator.Emit(OpCodes.Stelem, elementType)
                        emitArrayInit env' (position + 1) values

                    | [ ] -> 
                        env

                generator.Emit(OpCodes.Ldc_I4, List.length args)
                generator.Emit(OpCodes.Newarr, elementType)
                emitArrayInit env 0 args

            | arg :: otherArgs, parameterType :: otherParameterTypes ->
                emitArgs otherParameterTypes (emitBoxed parameterType env arg) otherArgs

            | [ ], [ ] -> 
                env

            | _ :: _, [ ] -> 
                raise <| new InvalidOperationException(sprintf "got %d too many args" <| List.length args)

            | [ ], _ :: _ -> 
                raise <| new InvalidOperationException(sprintf "got %d too few args" <| List.length parameterTypes)

        let env' = args |> emitArgs parameterTypes env
        generator.Emit(OpCodes.Call, methodInfo)
        env'

The emitArgs helper function emits the IL to put the function's arguments on the stack. For functions that don't take a variable number of arguments we can do this by calling compile' on each node, since compile' leaves the node's value on the VM stack. Calling variable argument functions (again, such as Console.WriteLine) is slightly more intricate, since the last parameter is an array. We use the newarr opcode to instantiate the array, then stelem in a loop to insert arguments into the array one by one.

Since we only deal with static functions at the moment (whether they're .NET methods or our own lambdas) it's safe to use the call opcode in all circumstances. If we were calling instance methods we'd need to be able to call virtual methods using callvirt, and we'd need to automatically box value types when calling interface methods (such as 4.CompareTo(5)) and methods defined on System.Object (such as 4.ToString()). By limiting ourselves to static methods we've avoided these details.

    | v -> raise 
        <| new NotImplementedException(sprintf "can't invoke variable %A" v)

| List (fn :: args) -> 
    raise <| new NotImplementedException(sprintf "can't invoke value %A" fn)

| List [ ] -> 
    raise <| Compiler("can't invoke empty list")

Because we don't know about delegates, we can't call anything other than real functions.

| ListPrimitive (op, args) -> 
    match args with
    | arg :: otherArgs ->
        let opCode = 
            match op with
            | Add -> OpCodes.Add
            | Subtract -> OpCodes.Sub
            | Multiply -> OpCodes.Mul
            | Divide -> OpCodes.Div
            | Equal -> OpCodes.Ceq

        let coerceToInt env x = 
            let env' = compile' env x
            match typeOf env x with
            | t when t = typeof<obj> -> generator.Emit(OpCodes.Call, typeof<Convert>.GetMethod("ToInt32", [| typeof<obj> |]))
            | t when t = typeof<int> -> ()
            | t -> raise <| new NotImplementedException("expected int, got " + t.Name)
            env'

        let emitBinaryOp env arg =
            let env' = coerceToInt env arg
            generator.Emit opCode
            env'

        let env' = coerceToInt env arg
        otherArgs |> List.fold emitBinaryOp env'

    | l -> 
        raise <| Compiler(sprintf "cannot compile list %A" l)

ListPrimitive: The built-in forms for arithmetic and equality are defined as ListPrimitive nodes, and the IL for each of these is fairly similar. Note that our arithmetic operators can handle any number of arguments, and we handle this by pushing the first value onto the VM stack, then using F#'s fold function to apply one of IL's binary operators to each pair. We use Convert.ToInt32 to coerce any strange values to integers. (Note that List.fold is new in the F# May 2009 CTP -- previously this function was called List.fold_left, to match OCaml.)

| Number n -> 
    generator.Emit(OpCodes.Ldc_I4, n)
    env

| String s -> 
    generator.Emit(OpCodes.Ldstr, s)
    env

Number, String: Another couple of constant types: all numbers are integers, and we use the ldc.i4 opcode to push them onto the stack. We saw ldstr in the hello world example at the top of this post.

| VariableDef (name, value) ->
    match value with
    | LambdaDef (paramNames, body) ->
        let (lambdaInfo, lambdaGenerator) = 
            defineMethod
                name
                (typeOf env body)
                (List.replicate (List.length paramNames) (typeof<int>))

        let envWithLambda = 
            env 
            |> (LambdaRef (lambdaInfo, false, (List.map (fun _ -> typeof<int>) paramNames)) 
                |> Map.add name)

        let (envWithLambdaArgs, _) = 
            paramNames 
            |> ((envWithLambda, 0) 
                |> List.fold (fun (env, index) name -> (Map.add name (ArgRef index) env, index + 1)))

        body |> compile lambdaGenerator defineMethod envWithLambdaArgs |> ignore
        lambdaGenerator.Emit(OpCodes.Ret)
        envWithLambda

    | _ ->
        let local = generator.DeclareLocal(typeOf env value)
        let envWithVariable = Map.add name (VariableRef local) env
        compile' envWithVariable value |> ignore
        generator.Emit(OpCodes.Stloc, local)
        envWithVariable

VariableDef: We use VariableDef nodes to declare both functions and variables. There are in fact three possibilities here:

  1. (define func (arg1 arg2) body)
  2. (define func (lambda (arg1 arg2) body))
  3. (define variable value)

The first two cases are treated the same (in fact, the first case is transformed to the second by insertPrimitives, which we saw in the last post), and both of them define a new named function:

  1. We call defineMethod to obtain a MethodInfo and an ILGenerator for a new method
  2. So that the function can call itself recursively, we insert the function into the environment before we start generating the function's IL. This changed environment is the one that feeds into the next line of code in the function where the declaration appears.
  3. We set up a new environment based on the declaring function. This new environment gains a set of ArgRef values that represent the function's arguments.
  4. We call compile recursively, with the new ILGenerator that defineMethod gave us

Detour: Why do we keep a list of parameters if we've got a MethodInfo?

I had originally wanted to represent both our own functions and .NET methods with a .NET MethodInfo object: we need to know the types of the function's parameters in order to call it, and I assumed we could call MethodInfo.GetParameters to obtain them. Unfortunately we're only allowed to call GetParameters on real methods, not DynamicMethod or MethodBuilder, so I had to track the parameter types explicitly.

To declare a variable, we first use ILGenerator.DeclareLocal to allocate a new IL local variable, then use the stloc opcode to assign a value to it. As with function declarations, we change the function's environment, and return the changed environment so that it can be fed into the next line of code.

| VariableRef local -> 
    generator.Emit(OpCodes.Ldloc, local)
    env

VariableRef: Last in our alphabetical list of expression tree nodes is VariableRef, which is the node we inserted into the environment just a moment ago to represent local variables. IL gives us the ldloc opcode, which fetches the value from a local variable and places it into the VM stack.

compile'

In the last line of code, we're outside of the compile' definition and back in compile. Recall that we declared compile as a function that accepts two arguments and returns a function that accepts two more: here we're returning that function.

To finish up, let's compile the Lisp code I wrote in the last post, which generates most of the features in our code generator:

(define (fact n) (if (= n 0) 1 (* n (fact (- n 1)))))
(Console.WriteLine "6! = {0}" (fact 6))
(Console.WriteLine "What is your name?")
(Console.WriteLine "Hello, {0}" (Console.ReadLine))

IL disassembly

.class public auto ansi sealed Program
       extends [mscorlib]System.Object
{
  .method public static void  Main() cil managed
  {
    // (Console.WriteLine "6! = {0}" (fact 6))
    IL_0000:  ldstr      "6! = {0}"
    IL_0005:  ldc.i4     0x6
    IL_000a:  call       int32 Program::fact(int32)
    IL_000f:  box        [mscorlib]System.Int32
    IL_0014:  call       void [mscorlib]System.Console::WriteLine(string, object)

    // (Console.WriteLine "What is your name?")
    IL_0019:  ldstr      "What is your name\?"
    IL_001e:  call       void [mscorlib]System.Console::WriteLine(object)

    // (Console.WriteLine "Hello, {0}" (Console.ReadLine))
    IL_0023:  ldstr      "Hello, {0}"
    IL_0028:  call       string [mscorlib]System.Console::ReadLine()
    IL_002d:  call       void [mscorlib]System.Console::WriteLine(string, object)
    IL_0032:  ret
  }

  .method private static int32  fact(int32 A_0) cil managed
  {
    // (if (= n 0) ...
    IL_0000:  ldarg      A_0
    IL_0004:  nop
    IL_0005:  nop
    IL_0006:  ldc.i4     0x0
    IL_000b:  beq        IL_002d

    // else: (* n (fact (- n 1)))
    IL_0010:  ldarg      A_0
    IL_0014:  nop
    IL_0015:  nop
    IL_0016:  ldarg      A_0
    IL_001a:  nop
    IL_001b:  nop
    IL_001c:  ldc.i4     0x1
    IL_0021:  sub
    IL_0022:  call       int32 Program::fact(int32)
    IL_0027:  mul
    IL_0028:  br         IL_0032

    // then: 1
    IL_002d:  ldc.i4     0x1
    IL_0032:  ret
  }
}

And just for fun, using Reflector to disassemble into C#:

C# decompilation

public sealed class Program
{
    private static int fact(int num1)
    {
        return ((num1 == 0) ? 1 : (num1 * fact(num1 - 1)));
    }

    public static void Main()
    {
        Console.WriteLine("6! = {0}", fact(6));
        Console.WriteLine("What is your name?");
        Console.WriteLine("Hello, {0}", Console.ReadLine());
    }
}
Read and Post Comments

Moving a Perforce project to GitHub

June 02, 2009 at 08:41 PM | categories: Uncategorized | View Comments

I keep all of my home projects in a local Perforce instance. I wanted to put my Lisp compiler code onto GitHub; since I'm not paying for a Perforce licence, I'm limited to a single user.

The GitHub user account and repository setup process was pretty straightforward, until it came to adding files. I haven't been able to find a single set of instructions for this, but I pieced it together thanks to Stack Overflow:

  1. You will need the git-p4 Python script from here: http://repo.or.cz/w/git.git?a=tree;f=contrib/fast-import;hb=HEAD
  2. From the some empty directory -- not the one that contains the Perforce project -- run the following command to import the entire Perforce version history:
    python git-p4 clone --destination=. //depot/path/to/project/...@all
  3. You now have a brand new local git repository that contains a clone of the project from Perforce. You need to push it to GitHub:
    git remote add origin git@github.com:username/repository.git
    git push origin master
  4. To push subsequent Perforce changes to GitHub:
    git-p4 sync
    git push origin master

Edit: I almost forgot to include the URL of the compiler source

Edit 2: Turns out you mustn't run git commands from a Perforce local directory: as always, Perforce doesn't like it if something else (in this case git) interferes with its local files.

Read and Post Comments

« Previous Page -- Next Page »