Phillip Trelford's Array

POKE 36879,255

Parsing C#

C# is a mainstream programming language based on Java, that runs on Windows via .Net and cross platform via Mono. On .Net , C# is used primarily in the enterprise for web and desktop applications. With Mono, C# can target a range of popular platforms including iOS and Andriod, and is primarily employed in mobile application development via Xamarin’s tools and game development via Unity.

Microsoft’s current shipping C# compiler is proprietary code written in C++. Mono’s C# compiler, is open source and written in C#. In 2008 Microsoft announced the development of a new proprietary C# compiler written in C#, codenamed Roslyn, with the last preview released back in 2012. One of Roslyn’s key aims to expose the output of the parser as an abstract syntax tree (AST) that can be inspected by third party developers for tooling.

To fill the void, it feels like every man and his dog has written a parser for C#, most commonly to provide developer productivity tools in Visual Studio.

Third party vendors with C# parsers include:

On the open source side, Irony is a particularly interesting project allowing grammars to be specified in C# and includes samples for parsing many languages including C# and Java.

Parsing C#

You can think of C# and Java as managed C with classes. That is C# at it’s core is an imperative programming language, composed of statement blocks with additional syntax for class declaration and a dot operator for invoking members on a class.

Under the hood instance members simply take the instance as the first argument:

C C#
struct Point { 
  float X; float Y; 
};
float length(Point p) {
  return sqrt(p.X*p.X + p.Y*p.Y);
}
class Point {
  float X; float Y;
  public float Length() {
    return Math.Sqrt(X * X + Y * Y);
  }
}

The dot operator is probably *the* killer feature of class based languages like Java and C# as it allows developers to explore existing APIs via code completion in an IDE.

Parsing C# statements is not hugely dissimilar to parsing Small Basic statements which I have covered in a recent article. The only real difference is that C style languages typically employ semicolons between statements and curly braces to delimit statement blocks.

Parsing C# files involves finding using directives and namespace blocks. Then inside the namespace blocks finding type blocks including class, interface, struct, enum and delegate constructs. Inside the class blocks the member fields, properties, methods and constructors. Finally inside the members the statement blocks.

A few weeks back I was stuck on a a slow running train for a couple of hours, and had a go at writing a simple C# parser in F# using discriminated unions for the abstract syntax tree (AST) and FParsec for parsing. The same approach I’d used to parse Small Basic. I managed to sketch out a rough AST and a parser for namespaces and class declarations with around 100 lines of code. After a few more train journeys I managed to cover members and statement blocks, getting close to parsing C# 1.1. The prototype C# AST and parser is available as an F# snippet that can be run as a script inside F# interactive.

As an aside Neil Danson has recently constructed a compiler for the AST in F# that builds .Net assemblies. Basically this requires a number of passes, including building classes and members and then emitting IL code for statements, which is mostly a one-to-one mapping.

AST

The smallest building block in the AST is an expression which includes values, variables, member invocations and operators:

type Expr = 
    | Value of Literal
    | Variable of VarName
    | MethodInvoke of MemberName * Arg list
    | PropertyGet of MemberName
    | Cast of TypeName * Expr
    | InfixOp of Expr * string * Expr
    | PrefixOp of string * Expr
    | PostfixOp of Expr * string
    | TernaryOp of Expr * Expr * Expr

Then comes statements including the standard imperative loops and control flow:

type Statement =
    | Assignment of Init
    | PropertySet of MemberName * Expr
    | Action of Expr
    | If of Expr * Block
    | IfElse of Expr * Block * Block
    | Switch of Expr * Case list
    | For of Init list * Condition * Iterator list * Block
    | ForEach of Define * Expr * Block
    | While of Expr * Block
    | DoWhile of Block * Expr
    | Throw of Expr
    | Try of Block
    | Catch of TypeName * Block
    | Finally of Block
    | Lock of Expr * Block    
    | Using of Expr * Block
    | Label of LabelName
    | Goto of LabelName
    | Break
    | Continue
    | Return of Expr

Statements are organized inside members with a myriad of optional modifiers:

// Modifiers
type Access = Public | Private | Protected | Internal
type Modifier = Static | Sealed | Override | Virtual | Abstract
type ParamType = ByValue | ByRef | Out | Params 
// Members
type Member =
    | Field of Access * Modifier option * IsReadOnly * 
               ReturnType * Name * Expr option
    | Property of MemberInfo * Block option * Block option
    | Method of MemberInfo * Param list * Block
    | Constructor of Access * Modifier option * Name * Param list * 
                     PreConstruct option * Block

Members are organized inside types:

type CSharpType = 
    | Class of Access * Modifier option * Name * Implements * Members
    | Struct of Access * Name * Member list
    | Interface of Access * Name * Implements * Member list
    | Enum of Access * TypeName * EnumValue list
    | Delegate of Access * Name * ReturnType * Param list    

Types are optionally organized inside namespaces:

type Import = 
    | Import of Name list
    | Alias of Name * Name list
type NamespaceScope =
    | Namespace of Import list * Name list * NamespaceScope list
    | Types of Import list * CSharpType list

You can see the full source for the AST and parser in the F# snippet.

Parser

The basic structure of the parser is very similar to the Small Basic parser implementation. The main difference I found is that C# has a lot more syntactic noise to support classes.

As an example the following code fragment parses property declarations:

let pget = str_ws "get" >>. pstatementblock
let pset = str_ws "set" >>. pstatementblock
let ppropertyblock =
    between (str_ws "{") (str_ws "}") ((opt pget) .>>. (opt pset))
let pproperty = 
    pipe2 pmemberinfo ppropertyblock
     (fun mi (gblock,sblock) -> Property(mi,gblock,sblock))

Building with FParsec iteratively in F# interactive made light work of sketching out a parser and refining the AST.

Conclusions

In the spirit of build one to throw away from the Mythical Man Month, it feels like the combination of F# and FParsec is ideal for quickly putting together an AST and proving it with a working prototype parser. Which I feel would be a good approach to take if you wanted to build a production C# compiler in a timely manner.

Sketching out an AST and parser has been an interesting, but not overly taxing exercise and has given me a few insights into C#’s syntax. My overriding feeling is that the original layout of the language was more influenced by ease of parsing than ease of use for the developer, with minimal type inference, and the use of constructs like ref and out parameters over support for returning multiple values via tuples.

C#’s var keyword gives simple type inference for local variables, however given the type of the right hand expression is already known inference here seems like a nop.

I don’t plan to take the C# parser any further, there is already a lot of offerings out there in this space. But if you’re interested in understanding how imperative languages can be constructed I think it’s probably a fun exercise to try.

Comments (2) -

  • MBR

    1/22/2014 3:01:14 PM |

    There is also the MetaSpec C# parser which seems similar in spirit to Inevitable's offering. (But which is pricier FWIR.) http://www.csharpparser.com/

  • Marc Sigrist

    1/30/2014 1:58:34 PM |

    [C# has...] "a myriad of modifiers": Yes, and it even has a combined "protected internal" access modifier (not mentioned in the AST). Thanks for this succinct introduction on writing a C# parser in F#.

Pingbacks and trackbacks (1)+

Comments are closed