Lisp compiler in F#: Parsing with fslex and fsyacc

May 31, 2009 at 05:12 PM | categories: Compiler | View Comments

This is part 2 of a series of posts on my Lisp compiler written in F#. Previous entry: Introduction | Browse the full source of the compiler on GitHub

In this post I'll talk about the fslex and fsyacc tools, which, as their names suggest, are similar to the widely used lex and yacc tools, which generate lexical scanners and parsers respectively.

These tools generate two F# modules that work together:

  • fslex generates code that splits a string into tokens, according to a set of rules similar to regular expressions
  • fsyacc generates code that recognises sequences of tokens, and does something with them, such as building an expression tree

fslex and fsyacc are both standalone executables that are installed in the bin subdirectory of an F# distribution. Although can invoke them by hand, their command line parameters and error handling are fairly basic. Luckily F# also ships with a pair of MSBuild tasks that you can put in your .fsproj file.

  1. Open your .fsproj file in a text editor (or, in Visual Studio, unload the project, then click 'Edit ProjectName.fsproj' on the unloaded project's right click menu)
  2. Towards the bottom you should see a line as follows:

    <Import Project="$(MSBuildExtensionsPath)\FSharp\1.0\Microsoft.FSharp.Targets" />

    Under this line, paste the following:

    <Import Project="$(MSBuildExtensionsPath)\FSharp\1.0\FSharp.PowerPack.Targets" />

  3. If you have the F# May 2009 CTP (version or newer, remove the Microsoft.FSharp.Targets line. If you have the September 2008 CTP (version or older, leave the Microsoft.FSharp.Targets line alone. (The May version of FSharp.PowerPack.Targets has its own import of Microsoft.FSharp.Targets, so leaving the import in the .fsproj file causes an MSBuild error.)
  4. Slightly higher up in the .fsproj file, you'll see an <ItemGroup> element that contains your source files. Add the following lines; put them at the top of the list so that fslex and fsyacc get run before any of the compilation steps:

    <FsYacc Include="YourParser.fsy" />
    <FsLex Include="YourScanner.fsl" />

  5. Save and close the .fsproj file in the text editor, then reload it in Visual Studio to ensure it still loads OK. (You did keep the original version in source control, right?) The .fsy and .fsl files will be missing at this point, but I'm going to discuss them in a moment.
  6. Extra step for the May CTP: The DLL that implements the <FsLex> and <FsYacc> tasks is installed to the wrong directory. You'll need to follow this set of instructions and copy the FSharp.PowerPack.Build.Tasks.dll file to the right place.

Once you've successfully run fslex and fsyacc and generated F# source for the first time, you'll need to include these F# source files in your project. Because the file generated by fslex relies on fsyacc's token definitions, you'll need to include the generated fsyacc file first -- remember that build order is significant in F#, unlike in C#. Feel free to check this generated source into source control, but remember to check it out before making changes to the .fsl and .fsy sources; these tools thrown an access denied exception if the generated source file is read only.

The .fsy and .fsl files should look familiar if you've used lex and yacc before. Here are the rules I'm using to parse Lisp s-expressions:

FSYacc.fsy -- generates FSYacc.fsi and FSYacc.fs

open Tim.Lisp.Core

%start parse
%token <string> Identifier
%token <string> Text
%token <int> Digits
%token Apostrophe LeftParen RightParen Eof
%type <Tim.Lisp.Core.LispVal list> parse


Expr: Identifier { Atom $1 }
    | Text { String $1 }
    | Digits { Number $1 }
    | LeftParen ExprList RightParen { $2 |> List.rev |> List }
    | Apostrophe Expr { List [ Atom "quote"; $2 ] }

ExprList: Expr { [ $1 ] }
        | ExprList Expr { $2 :: $1  }

parse: ExprList Eof { List.rev $1 }

Note that the fsyacc source defines the list of tokens that it expects the fslex source to emit. Because this is F#, the list of %code statements above results in the generation of one of our old friends, the discriminated union:

FSYacc.fsi -- generated from FSYacc.fsy

// Signature file for parser generated by fsyacc
type token = 
  | Apostrophe
  | LeftParen
  | RightParen
  | Eof
  | Digits of (int)
  | Text of (string)
  | Identifier of (string)
// snip rest of generated code

The rest of the file consists of BNF-like rules; here I'm recognising the key parts of s-expression syntax and emitting an expression tree in the form of my own LispVal data structure, which is another discriminated union.

The .fsl file takes text, applies regular expressions to it, and returns a stream of tokens:

FSLex.fsl -- generates FSLex.fs

open System
open System.Text
open Microsoft.FSharp.Text.Lexing
open FSYacc

let digit = ['0'-'9']
let alpha = ['a'-'z' 'A'-'Z']
let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')
let identifier = [^'"' '0'-'9' '(' ')' ' ' '\t' '\n' '\r']
rule tokenize = parse
| whitespace        { tokenize lexbuf }
| newline           { lexbuf.EndPos <- lexbuf.EndPos.NextLine; tokenize lexbuf }
| ['-']?digit+      { Digits <| Int32.Parse(Encoding.UTF8.GetString(lexbuf.Lexeme)) }
| '('               { LeftParen }
| ')'               { RightParen }
| '\''              { Apostrophe }
| '"' [^'"']* '"'   { let s = Encoding.UTF8.GetString(lexbuf.Lexeme) in Text <| s.Substring(1, s.Length - 2) }
| eof               { Eof }
| identifier+       { Identifier <| Encoding.UTF8.GetString(lexbuf.Lexeme) }

fslex regular expressions aren't entirely standard:

  • White space is ignored
  • Literal characters are enclosed in single quotes
  • You're allowed to use let to define aliases (see digit, alpha, whitespace, newline and identifier above)

Each rule block defines a function in the generated F# source that accepts an Microsoft.FSharp.Text.Lexing.LexBuffer<byte> and returns the token type defined by the generated fsyacc source. Although LexBuffer is a generic type, fslex prior to May only generates ASCII parsers, hence the calls to Encoding.UTF8.GetString above. The May CTP adds a --unicode command line parameter, which appears to enable the use of LexBuffer<char>, but I haven't found how to enable this option via the MSBuild task.

Given the generated source code, it's relatively straightforward to write a facade around it. Note that you'll need to add a reference to Microsoft.FSharp.PowerPack.dll in order to use the types in the Microsoft.FSharp.Text.Lexing namespace.


open System.Text
open Microsoft.FSharp.Text.Lexing

module Parser =
    // Accepts a string and either returns a list of LispVal objects or throws an exception
    let parseString (s : string) = FSYacc.parse FSLex.tokenize <| LexBuffer<_>.FromBytes(Encoding.UTF8.GetBytes(s))

In the next post, I'll take the LispVal objects returned by the parseString function and use them to generate actual IL.

Read and Post Comments

Lisp compiler in F#: Introduction

May 30, 2009 at 10:34 PM | categories: Compiler | View Comments

Browse the full source of the compiler on GitHub

I started learning functional programming in F# and Haskell around 6 months ago, and one of the programs I've been writing in order to learn F# is a small Lisp compiler, with syntax similar to Scheme:

(define (fact n) (if (= n 0) 1 (* n (fact (- n 1)))))
(Console.WriteLine "6! = {0}" (fact 6))
(Console.WriteLine "What is your name?")
(Console.WriteLine "Hello, {0}" (Console.ReadLine))

I'm planning on posting a series of articles explaining how the compiler demonstrates some useful F# features. By way of an introduction, I'd like to talk about two of my favourite features from functional languages: discriminated unions and pattern matching.

At their simplest, discriminated unions are equivalent to .NET enums, since they allow a strongly-types variable to hold one of a fixed set of values:

 * Rough C# equivalent:
 * public enum ListOp { Add, Subtract, Multiply, Divide, Equal }

type ListOp = Add
            | Subtract 
            | Multiply 
            | Divide 
            | Equal

But each of these possible values can be tagged with a data structure, which means that discriminated unions are useful where an object-orientated language would need a hierarchy of classes:

 * Rough C# equivalent:
 * public abstract class LispVal { }
 * public sealed class ArgRef : LispVal
 * {
 *     public ArgRef(int index)
 *     {
 *         Index = index;
 *     }
 *     public int Index { get; private set; }
 * }
 * public sealed class Atom : LispVal
 * {
 *     public Atom(string text)
 *     {
 *         Text = text;
 *     }
 *     public string Text { get; private set; }
 * }
 * etc., for the other cases
type LispVal = 
             | ArgRef of int                                
             | Atom of string
             | Bool of bool
             | IfPrimitive of LispVal * LispVal * LispVal
             | LambdaDef of string list * LispVal
             | LambdaRef of MethodInfo * bool * Type list
             | List of LispVal list
             | ListPrimitive of ListOp * LispVal list
             | Number of int
             | QuotePrimitive of LispVal
             | String of string
             | VariableDef of string * LispVal
             | VariableRef of LocalBuilder

When faced with the task of writing code that understands the C# class hierarchy above, I'd implement the Visitor pattern: I'd define an interface called ILispValVisitor, and every time I needed to process a LispVal somehow, I'd declare a nested class that implemented the right ILispValVisitor methods. Fairly straightforward, right? -- a boilerplate class implementation in every piece of code that needs to look inside a LispVal.

Functional languages have a much more elegant alternative, in the form of pattern matching operators. There's no real equivalent to pattern matching in C#, although the concept could be similar to if, switch or the visitor pattern depending on the situation:

 * Rough C# equivalent: an implementation of the visitor pattern
let rec typeOf (env : Map) = function
    | ArgRef _ -> typeof
    | Atom a -> a |> ident env |> typeOf env
    | Bool _ -> typeof
    | IfPrimitive (_, thenValue, elseValue) ->
        match typeOf env thenValue with
        | t when t = typeOf env elseValue -> t
        | _ -> raise &lt;| Compiler("expected 'then' and 'else' branches to have same type")

    | LambdaDef (_, body) -> typeOf env body
    | LambdaRef (methodBuilder, _, _) -> methodBuilder.ReturnType
    | List (Atom a :: args) -> a |> lambdaIdent args env |> typeOf env
    | List (fn :: _) -> raise &lt;| Compiler(sprintf "can't invoke %A" fn)
    | List [ ] -> raise &lt;| Compiler("can't compile empty list")
    | ListPrimitive _ -> typeof
    | Number _ -> typeof
    | String _ -> typeof
    | QuotePrimitive _ -> typeof
    | VariableDef _ -> typeof
    | VariableRef local -> local.LocalType

The Lisp compiler I wrote relies on a discriminated union, LispVal for storing expression trees, and pattern matching for most of its processing. I'll post more in-depth articles covering the source code in detail, including:

  • Parsing with fslex and fsyacc
  • Code generation using System.Reflection.Emit
  • Calling the compiler from C#
Read and Post Comments

Eric Lippert on infoof()

May 22, 2009 at 06:42 PM | categories: Uncategorized | View Comments

I previously mentioned some hypothetical C# fieldof and methodof operators, which would obtain FieldInfo and MethodInfo objects, in the same way that typeof retrieves Type at runtime.

Eric Lippert explains the pros and cons of a combined infoof operator, which mainly centre around how hard it would be to pick the right method overload each time.

In the comments, Jonathan Pryor points out my favourite replacement for methodof:

Fastest by far in my testing -- faster than string-based Reflection, in fact -- is Mike's "Poor's man infoof" using the Delegate.Method property. This effectively bypasses most of the Reflection infrastructure (no Type.GetMember(), etc.), and is the closest we can get to IL member lookup resolution.

Read and Post Comments