Phillip Trelford's Array | POKE 36879,255

String and StringBuilder revisited

2. April 2015 .Net, C#, F# Comment (12)

I came across a topical .Net article by Dave M Bush, published towards the tail end of 2014 and entitled String and StringBuilder, where he correctly asserts that .Net’s built-in string type is a reference type and immutable. All good so far.

The next assertion is that StringBuilder will be faster than simple string concatenation when adding more than 3 strings together, which is probably a pretty good guess, but let’s put it to the test with 4 strings.

The test can be performed easily using F# interactive (built-in to Visual Studio) with the #time directive:

open System.Text

#time

let a = "abc"
let b = "efg"
let c = "hij"
let d = "klm"

for i = 1 to 1000000 do
   let e = StringBuilder(a)
   let f = e.Append(b).Append(c).Append(d).ToString() 
   ()
// Real: 00:00:00.317, CPU: 00:00:00.343, GC gen0: 101, gen1: 0, gen2: 0
   
for i = 1 to 1000000 do
   let e = System.String.Concat(a,b,c,d)
   ()
// Real: 00:00:00.148, CPU: 00:00:00.156, GC gen0: 36, gen1: 0, gen2: 0

What we actually see is that for concatenating 4 strings StringBuilder takes roughly twice as long as String.Concat (on this run 0.317s vs 0.148s for a million iterations) and generates approximately 3 times as much garbage (gen0: 101 vs gen0: 36)!
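Outside F# interactive the same comparison can be approximated with a Stopwatch. This is only a rough sketch: timeIt, viaBuilder and viaConcat are names I've made up for illustration, and the numbers will vary by machine and runtime.

```fsharp
open System
open System.Diagnostics
open System.Text

/// Run f n times and return the elapsed wall-clock time in milliseconds.
let timeIt n (f: unit -> string) =
    let sw = Stopwatch.StartNew()
    for _ in 1 .. n do
        f () |> ignore
    sw.Stop()
    sw.Elapsed.TotalMilliseconds

let a, b, c, d = "abc", "efg", "hij", "klm"

// The two approaches compared above, wrapped as functions.
let viaBuilder () = StringBuilder(a).Append(b).Append(c).Append(d).ToString()
let viaConcat () = String.Concat(a, b, c, d)

printfn "StringBuilder: %.0f ms" (timeIt 1000000 viaBuilder)
printfn "String.Concat: %.0f ms" (timeIt 1000000 viaConcat)
```

Unlike #time this reports no GC counts, so it only covers the timing half of the picture.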

Underneath the hood the StringBuilder is creating an array to append the strings into. When appending, if the current buffer capacity is exceeded (the default is 16 characters), a new array must be created. When ToString is called it may, based on a heuristic, decide to return the builder’s array or allocate a new array and copy the value into that. Therefore the performance of StringBuilder is dependent on the initial capacity of the builder and the number and lengths of the strings to append.
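If the total length is known up front, the builder can be given that capacity so that appending never has to grow the buffer. A minimal sketch, where parts, total and sb are illustrative names of my own:

```fsharp
open System.Text

let parts = [ "abc"; "efg"; "hij"; "klm" ]

// Size the buffer up front from the lengths of the strings to append.
let total = parts |> List.sumBy (fun s -> s.Length)
let sb = StringBuilder(total)

for p in parts do
    sb.Append(p) |> ignore

// Capacity is the size of the buffer, Length is how much of it is in use.
printfn "default capacity: %d" (StringBuilder().Capacity)
printfn "capacity: %d, length: %d" sb.Capacity sb.Length

let result = sb.ToString()
```

Whether this beats String.Concat in a given scenario is still something to measure rather than assume.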

In contrast, String.Concat (which is what the compiler resolves the ‘+’ operator to) calculates the length of the concatenated string from the lengths of the passed-in strings, then allocates a string of the required size and copies the values in. Ergo, in many scenarios it will require less copying and less allocation.

When concatenating 2, 3 or 4 strings we can take advantage of String.Concat’s optimized overloads. Beyond 4 strings the picture changes, as an array argument must be passed, which requires an additional allocation. However, String.Concat may still be faster than StringBuilder in some scenarios where the builder requires multiple reallocations.
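To make the overload point concrete, here is a small sketch. The fifth string e is my own addition for illustration, and the array is written out explicitly to show the extra allocation that the array-taking overload implies.

```fsharp
open System

let a, b, c, d, e = "abc", "efg", "hij", "klm", "nop"

// Four strings: binds to the fixed-arity overload
// Concat(string, string, string, string), so no array is needed.
let four = String.Concat(a, b, c, d)

// Five strings: there is no fixed-arity overload, so the values are
// passed to Concat(string[]) in an array, which is one more allocation.
let arr : string[] = [| a; b; c; d; e |]
let five = String.Concat(arr)

printfn "%s" four
printfn "%s" five
```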

But wait, there’s more. Going back to the ‘+’ operator: if we assign the integer literal expression 1 + 2 + 3, the compiler can reduce the value to 6. Equally, if we define the strings as const string, the compiler can apply the string concatenations at compile time, leading to, in this contrived example, no runtime cost whatsoever.
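The F# counterpart of const string is a [<Literal>] binding, and literals can be combined with ‘+’ into another literal. A small sketch, with A, B, AB and describe as illustrative names of my own:

```fsharp
let [<Literal>] A = "abc"
let [<Literal>] B = "efg"

// Concatenating two literals yields another compile-time constant.
let [<Literal>] AB = A + B

// Because AB is a constant, it can be used as a pattern in a match.
let describe (s: string) =
    match s with
    | AB -> "matched the constant"
    | _ -> "something else"

printfn "%s" (describe "abcefg")
```

Using AB as a match pattern is a handy check that it really is a constant, since only literals are allowed in that position.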

The moral of the story is when it comes to performance optimization - measure, measure, measure.

Parsing with SNOBOL

31. March 2015 Software Craftsmanship, F# Comment (1)

Just before Christmas I came across some Java source code by “Uncle” Bob Martin aimed at “demystifying compilers” which expends about 600 lines of code to parse the following simple finite state machine:

Actions: Turnstile
FSM: OneCoinTurnstile
Initial: Locked
{
Locked Coin Unlocked {alarmOff unlock}
Locked Pass Locked  alarmOn
Unlocked Coin Unlocked thankyou
Unlocked Pass Locked lock
}

For fun I knocked up a broadly equivalent parser in F# using FParsec which was just under 40 lines of code, and posted the code on this blog.
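To give a feel for the shape of the problem, here is a hand-rolled sketch that parses a single transition row by splitting on whitespace and braces. It is not the FParsec version from the earlier post; the Transition type and parseTransition function are illustrative names of my own, and there is no error reporting.

```fsharp
open System

/// One row of the FSM: current state, triggering event, next state, actions.
type Transition =
    { State: string; Trigger: string; Next: string; Actions: string list }

/// Parse a row such as "Locked Coin Unlocked {alarmOff unlock}".
/// Returns None when the row has fewer than three fields.
let parseTransition (line: string) =
    let fields =
        line.Split([| ' '; '\t'; '{'; '}' |], StringSplitOptions.RemoveEmptyEntries)
        |> List.ofArray
    match fields with
    | state :: trigger :: next :: actions ->
        Some { State = state; Trigger = trigger; Next = next; Actions = actions }
    | _ -> None

printfn "%A" (parseTransition "Locked Coin Unlocked {alarmOff unlock}")
```

The header lines and the lone braces have fewer than three fields, so they come back as None and would need handling separately.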

The post generated some interest, and I even got a mention on Twitter from “Uncle” Bob himself:

@djidja8 @ptrelford nice. It’s even smaller in snobol.
— Uncle Bob Martin (@unclebobmartin) December 10, 2014

I’d not seen SNOBOL before, but given Mr Martin’s recommendation I popped over to the SNOBOL page on Wikipedia and liked what I saw:

SNOBOL rivals APL for its distinctiveness in format and programming style, both being radically unlike more "standard" procedural languages such as BASIC, Fortran, or C.

SNOBOL first appeared in 1962 and appears to have been popular in US universities as a text manipulation language in the 70s and 80s. The language supports pattern matching over text combined with assembler-like control flow using labels and goto (like C#).

As a text manipulation language, SNOBOL code feels a little more readable than the new norm - regular expressions.

If you’d like to take it out for a spin there’s an open source SNOBOL IDE, with syntax colouring support, for Linux and Windows called TkS*LIDE.

SNOBOL Interpreter

Nowadays when I want to learn a new language I often start by implementing it. To this end, over the course of about a week, I built a SNOBOL interpreter with just enough functionality to run the samples on the Wikipedia page along with some more involved samples from other sources.

The SNOBOL interpreter is about 400 LOC and available as an F# Snippet.

Finite State Machine in SNOBOL

Armed with a basic knowledge of SNOBOL, I could now answer the question: is the implementation even smaller in SNOBOL?

The answer is a resounding yes, and here are the 34 lines of SNOBOL code that prove it:

Disclaimer: unlike the FParsec version there’s no error handling/error messages and the FSM must be laid out in a specific format.

Conclusions

The SNOBOL finite state machine parser, like the FParsec based parser, fits on a page and is an order of magnitude shorter than the broadly equivalent clean Java implementation written by Uncle Bob Martin that aimed to demystify compiler writing.

Will I be switching from FParsec to SNOBOL for parsing? Probably not: FParsec is at least as expressive, provides pretty good error messages for free and runs on the CLR.

Special thanks to Uncle Bob Martin for the SNOBOL tip :)

Top 100 .Net Bloggers from 2014

30. March 2015 .Net, F# Comment (1)

In my last post I covered the top 100 .Net bloggers since 2008, based on links posted on Alvin Ashcraft's Morning Dew. This (intentionally) captured many bloggers that are no longer actively blogging, but equally still have interesting content to consume.

For completeness here's the ranking for the years 2014 and 2015 (up to last Friday) which may better capture active .Net bloggers:

Rank	Name	2014	2015	Total
1	Sean Sexton	195	0	195
2	Raymond Chen	86	17	103
3	Greg Duncan	74	14	88
4	Scott Hanselman	50	7	57
5	Peter Vogel	44	12	56
6	Brian Harry	46	8	54
7	Ricardo Peres	38	13	51
8	Oren Eini	32	12	44
9	Eric Lippert	44	0	44
10	Sacha Barber	31	7	38
11	Martin Hinshelwood	25	5	30
12	Eric Battalio	27	2	29
13	Carl Franklin & Richard Campbell	16	10	26
14	Jonathan Allen	17	9	26
15	Sasha Goldshtein	19	7	26
16	Dhananjay Kumar	25	1	26
17	James Montemagno	17	7	24
18	Jimmy Bogard	18	6	24
19	Willy-P. Schaub	19	4	23
20	Mike Taulty	21	1	22
21	Nicholas Blumhardt	18	3	21
22	S.Somasegar	17	3	20
23	Rob Eisenberg	13	7	20
24	Kathleen Dollard	20	0	20
25	Jeremy Clark	10	9	19
26	Jon Skeet	16	3	19
27	Phillip Trelford	17	2	19
28	Michael Crump	13	5	18
29	Immo Landwerth	13	5	18
30	Rory Becker	18	0	18
31	Rowan Miller	15	2	17
32	Sanjay Sharma	17	0	17
33	Jesse Liberty	15	1	16
34	Charles Sterling	15	1	16
35	Miguel de Icaza	12	3	15
36	Steve Smith	15	0	15
37	Bnaya Eshet	5	9	14
38	Scott Guthrie	12	2	14
39	Gael Fraiteur	11	3	14
40	Bill Wagner	11	3	14
41	Mary Jo Foley	12	2	14
42	Rick Strahl	7	7	14
43	Kim Spilker	14	0	14
44	Tatworth	14	0	14
45	MS Downloads	13	0	13
46	John Montgomery	8	4	12
47	Jeff Martin	9	3	12
48	Kerry Meade	10	2	12
49	Latish Sehgal	12	0	12
50	Richard Carr	12	0	12
51	Jonathan Wood	8	3	11
52	K. Scott Allen	8	3	11
53	Susan Ibach	7	4	11
54	Filip Ekberg	11	0	11
55	Mads Kristensen	8	2	10
56	Robert Green	9	1	10
57	Bertrand Le Roy	8	2	10
58	Daria Dovzhikova	10	0	10
59	CodePlex	10	0	10
60	Laurent Bugnion	6	3	9
61	Erik EJ	8	1	9
62	Iris Classon	6	3	9
63	Pete D.	4	5	9
64	DevToolsGuy	3	6	9
65	Dave M. Bush	7	2	9
66	Cameron Taggart	8	1	9
67	Deborah Kurata	8	1	9
68	Julie Lerman	7	2	9
69	Anand Narayanaswamy	9	0	9
70	Philip Fu	9	0	9
71	Glenn Block	6	2	8
72	The .NET Team	6	2	8
73	Jeremy Likness	5	3	8
74	Shawn Wildermuth	6	2	8
75	Ondrej Balas	7	1	8
76	Kunal Chowdhury	6	2	8
77	Adam Anderson	8	0	8
78	Jeremy D. Miller	8	0	8
79	Schabse Laks	8	0	8
80	Sam Sabri	8	0	8
81	Frans Bouma	5	2	7
82	Jean-Marc Prieur	5	2	7
83	Sergio De Simone	6	1	7
84	David Voyles	4	3	7
85	Dmitri Nesteruk	2	5	7
86	Nick Randolph	5	2	7
87	Alois Kraus	6	1	7
88	Jef Claes	6	1	7
89	Eric Sink	6	1	7
90	Josh Morales	6	1	7
91	Terje Sandstrom	7	0	7
92	Xinyang Qiu	7	0	7
93	Jon Galloway	7	0	7
94	John Papa	7	0	7
95	Daniel Rubino	7	0	7
96	Matthieu Mezil	7	0	7
97	Angelos Petropoulos	3	3	6
98	Peter Kellner	3	3	6
99	Dror Helper	5	1	6
100	Tom Warren	3	3	6

This definitely brings up some new names alongside the old familiar ones :)

Script

For the analysis we employed a simple F# script, using FSharp.Data’s CSV Type Provider for types over the data set and Taha Hachana’s XPlot library for charting.

Here’s the code for the top 100:

open FSharp.Data

let [<Literal>] path = @"LinksTo2015.csv"
type Posts = CsvProvider<path>
let posts = Posts.Load(path)

let topAuthors n =
   posts.Rows
   |> Seq.where (fun row -> row.Year >= 2014)
   |> Seq.where (fun row -> row.Tag.Contains ".NET" || row.Tag.Contains "Top")
   |> Seq.groupBy (fun row -> row.Author) 
   |> Seq.map (fun (author,rows) -> author, rows |> Seq.toArray)
   |> Seq.sortBy (fun (_,rows) -> -rows.Length)
   |> Seq.take n
   |> Seq.toList

let top100 = topAuthors 100

For the table I simply used another short snippet to transform the results to text for an HTML table.
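That snippet isn't reproduced here, but a rough sketch of the idea might look like the following. The Link record stands in for the CSV provider's rows, the sample data is made up, and rank and toHtmlRows are hypothetical names rather than the original code.

```fsharp
open System.Net

// Hypothetical row shape standing in for the CSV provider's rows.
type Link = { Author: string; Year: int }

/// Count links per author for 2014 and 2015, ordered by total descending.
let rank (links: Link list) =
    links
    |> List.groupBy (fun l -> l.Author)
    |> List.map (fun (author, ls) ->
        let count year = ls |> List.filter (fun l -> l.Year = year) |> List.length
        author, count 2014, count 2015)
    |> List.sortBy (fun (_, y2014, y2015) -> -(y2014 + y2015))

/// Render ranked authors as the rows of an HTML table, one row per line.
let toHtmlRows ranked =
    ranked
    |> List.mapi (fun i (author: string, y2014: int, y2015: int) ->
        sprintf "<tr><td>%d</td><td>%s</td><td>%d</td><td>%d</td><td>%d</td></tr>"
            (i + 1) (WebUtility.HtmlEncode author) y2014 y2015 (y2014 + y2015))
    |> String.concat "\n"

// Made-up sample data, purely for illustration.
let sample =
    [ { Author = "A & B"; Year = 2014 }
      { Author = "C"; Year = 2014 }
      { Author = "C"; Year = 2015 }
      { Author = "C"; Year = 2014 } ]

printfn "%s" (sample |> rank |> toHtmlRows)
```

HtmlEncode matters for names containing an ampersand, such as the joint entry at rank 13.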