Fast StringReader for .NET

While driving home from work today, I was thinking of how to write an algorithm in C# 2.0 that takes a string representing a multi-line text file, and return the lines, one at a time.

I know Mono's System.IO.StringReader class can do this, but I was thinking there must be a more efficient way of doing it. Here's what I came up with:

 public class LineReader
 {
     private string _source;
 
     public LineReader(string src)
     {
         _source = src;
     }
 
     public IEnumerator GetEnumerator()
     {
         bool finished=false;
         int nextChar=0;
         int readTo;
         string nextLine;
         while(!finished) {
             readTo = _source.IndexOf ('\n', nextChar);
             if (readTo==-1){
                 readTo=_source.Length-1;
                 finished=true;
                 if (readTo<nextChar){
                     yield break;
                 }
             }
             nextLine = _source.Substring(nextChar,readTo-nextChar);
             nextChar = readTo+1;
             yield return nextLine;
         }
     }
 }
 

Timing tests show that with the following test harness running on my 1.33GHz iBook G4 running Mono 1.2.5, parsing a 1000 line text file 1000 times takes 2.7 seconds on average using this algorithm and 3.4 seconds on average using StringReader.

 public static void Main(string[] args)
 {
     StreamReader sr = new StreamReader("sample.text");
     string test = sr.ReadToEnd();
     System.DateTime start = DateTime.Now;
     for (int i=0;i<1000;i++){
         LineReader lineReader = new LineReader(test);
         foreach (string s in lineReader){
         }
     }
     System.DateTime finish = DateTime.Now;
     double seconds = (((double)finish.Ticks) - ((double)start.Ticks)) / 10000000;
     Console.WriteLine("{0} seconds",seconds);
 }
 

Yes, StringReader handles both LF and CRLF line endings, and has a PushBack method to allow jumping back to the previous line, but for pure speed of parsing standard text with LF line endings, this algorithm seems to be around 20% faster than StringReader, and 30% faster than using Split to turn large string into an array of one string per line.

Update on 14th November 2007:
I ran the same tests on Windows XP with Microsoft .NET 2.0, on a dual core 3GHz PC, and produced this graph which compares various ways of splitting the large string into one string per line:

Graph

kick it on DotNetKicks.com

7 Comments
mcgurkWednesday, November 14th 2007
Hey, what about splitting via regex?

Also: captcha sucks. nine added to eleven is twenty, not 20.

Also: fangirl.
SandyThursday, November 15th 2007
Thanks for the feedback.

My test harness for the LineReader obtains a string object for each line. Doing this with a regular expression results in code like this (unless my understanding of .NET's regexes is not as good as it should be):

Regex re = new Regex("([^\n]*)\n",RegexOptions.Compiled);
MatchCollection col = re.Matches(test);
foreach (Match match in col)
{
string line = match.Value;
}

This is actually quite slow, as is re.Split. I wonder if there is a better way to get that string for each line using a regex.
KevinThursday, November 15th 2007
Using DateTime.Now isn't very accurate for performance analysis. Performance 1000 iterations of parsing a 16000 line file on my PC shows that your algorithm is just slight slower (~ 750ms) than the .NET StringReader. Here is the benchmark code I used.

------------

static void Main(string[] args)
{
// Setup
string data = File.ReadAllText("C:\\test.txt");
Stopwatch watch = new Stopwatch();

// StringReader
watch.Start();
for (int i = 0; i < 1000; i++)
{
using (StringReader reader = new StringReader(data))
{
string line;
while (null != (line = reader.ReadLine()))
{
}
}
}
watch.Stop();
Console.WriteLine("StringReader: " + watch.ElapsedMilliseconds.ToString() + " ms.");

// LineReader from (http://sorn.net/blog/2007/11/Fast-StringReader-for-.NET)
watch.Reset();
watch.Start();
for (int i = 0; i < 1000; i++)
{
LineReader reader = new LineReader(data);
foreach (string line in reader)
{
}
}
watch.Stop();
Console.WriteLine("LineReader: " + watch.ElapsedMilliseconds.ToString() + " ms.");
Console.ReadKey();
}
Sandy DunlopThursday, November 15th 2007
Kevin, here is the output I get from your code running on a 1.33GHz iBook with Mono:

StringReader: 2916 ms.
LineReader: 2703 ms.

Do you know why the StringReader timings are different than when I use DateTime.Now to time it?
KevinThursday, November 15th 2007
DateTime.Now has a lower resolution than using StopWatch, which can cause inaccurate results (especially for smaller intervals). I ran some more tests in both release and debug mode and it looked like the two were neck and neck the whole way. I saw some variations each way, which didn't exceed much more than half a second.

Kevin
SandyThursday, November 15th 2007
Oh well, I thought I'd found something useful. Thanks for having a look at it. I'll be using StopWatch from now on.
michaelSaturday, November 17th 2007
If you have a pre-compiled regex before you enter the test loop, it should run much faster.

Post a comment

What is fourteen added to four ?:
Name: URL:
Comment: