Fast StringReader for .NET
While driving home from work today, I was thinking of how to write an algorithm in C# 2.0 that takes a string representing a multi-line text file, and return the lines, one at a time.
I know Mono's System.IO.StringReader class can do this, but I was thinking there must be a more efficient way of doing it. Here's what I came up with:
public class LineReader { private string _source; public LineReader(string src) { _source = src; } public IEnumerator GetEnumerator() { bool finished=false; int nextChar=0; int readTo; string nextLine; while(!finished) { readTo = _source.IndexOf ('\n', nextChar); if (readTo==-1){ readTo=_source.Length-1; finished=true; if (readTo<nextChar){ yield break; } } nextLine = _source.Substring(nextChar,readTo-nextChar); nextChar = readTo+1; yield return nextLine; } } }
Timing tests show that with the following test harness running on my 1.33GHz iBook G4 running Mono 1.2.5, parsing a 1000 line text file 1000 times takes 2.7 seconds on average using this algorithm and 3.4 seconds on average using StringReader.
public static void Main(string[] args) { StreamReader sr = new StreamReader("sample.text"); string test = sr.ReadToEnd(); System.DateTime start = DateTime.Now; for (int i=0;i<1000;i++){ LineReader lineReader = new LineReader(test); foreach (string s in lineReader){ } } System.DateTime finish = DateTime.Now; double seconds = (((double)finish.Ticks) - ((double)start.Ticks)) / 10000000; Console.WriteLine("{0} seconds",seconds); }
Yes, StringReader handles both LF and CRLF line endings, and has a PushBack method to allow jumping back to the previous line, but for pure speed of parsing standard text with LF line endings, this algorithm seems to be around 20% faster than StringReader, and 30% faster than using Split to turn large string into an array of one string per line.
Update on 14th November 2007:
I ran the same tests on Windows XP with Microsoft .NET 2.0, on a dual core 3GHz PC, and produced this graph which compares various ways of splitting the large string into one string per line:























