Split'ing Strings and the Performance Implications

Jason IM'd me last night with a question about how to split a string using a delimiter with multiple characters.  Of course, my first answer was to use the VB Split method (he's doing this from C#).  He also found that there is another way, using System.Text.RegularExpressions.Regex.Split (hmmm... that's easy to remember).  Since there were two ways to do it, he decided to do some performance testing.

Before getting the results from him, I decided to do a quick and dirty test as well.  In my testing, I also included System.String.Split.  System.String.Split doesn't allow multiple-character delimiters, which is the reason for the original question.  However, with single-character delimiters, System.String.Split is definitely faster than the VB Split method.  What I did notice, though, is that I didn't see any real performance difference between single-character and multiple-character delimiters when using VB Split.  System.Text.RegularExpressions.Regex.Split, on the other hand, was very noticeably slower.

Of course, after getting the results from Jason, our data didn't seem to match.  He was only showing a minor difference between the two.  It turns out we are measuring differently.  He's taking the difference between each test, summing those together and taking the average.  Mine is a summed difference in the total timing over several iterations.  To me, it's more important to look at how long 1000 iterations took.  We are both right to a degree, but I think my measurement more accurately shows the timing difference.  And as such, there is a huge difference between the two.  I've modified Jason's example to show both of these values, and it no longer requires opening the data in Excel.  You can get it here.
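To make the two ways of measuring concrete, here is a minimal sketch that computes both figures from the same per-iteration timings. The class and method names are hypothetical and this is not the code from either test program. When both figures come from the same data, they differ only by a factor of the iteration count, which is one reason the same results can look "minor" in one report and "huge" in another.

```csharp
using System;
using System.Linq;

// Hypothetical helper; not taken from either test program.
public static class TimingSummary
{
    // Mean of the per-iteration differences (slow - fast), in the unit of the inputs.
    public static double MeanPerIterationDifference(double[] slow, double[] fast)
    {
        if (slow.Length != fast.Length || slow.Length == 0)
            throw new ArgumentException("Both arrays need the same non-zero length.");
        return slow.Zip(fast, (s, f) => s - f).Average();
    }

    // Difference between the summed timings of the whole run.
    public static double TotalDifference(double[] slow, double[] fast)
    {
        return slow.Sum() - fast.Sum();
    }
}
```

For timings of { 3, 5, 4 } against { 1, 2, 3 }, the mean per-iteration difference is 2 while the total difference is 6, that is, the mean multiplied by the three iterations.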

So, what's the rule of thumb on this one?  Well, if you are using a single-character delimiter and are concerned with performance, use System.String.Split.  If you need to split a string using a multiple-character delimiter, use the VB Split method (or Microsoft.VisualBasic.Strings.Split for you C# folks).
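For the C# folks, the three calls discussed above look roughly like this. It is a sketch only: the wrapper class and method names are hypothetical, it assumes the project can reference the Microsoft.VisualBasic assembly, and it makes no performance claim of its own.

```csharp
using System;
using System.Text.RegularExpressions;
using Microsoft.VisualBasic;

// Hypothetical wrappers around the three APIs named in the post.
public static class SplitExamples
{
    // Single-character delimiter: System.String.Split.
    public static string[] SplitOnChar(string text, char delimiter)
    {
        return text.Split(delimiter);
    }

    // Multi-character delimiter: the VB runtime's Split, called from C#.
    // A limit of -1 means "return all substrings".
    public static string[] SplitOnStringVb(string text, string delimiter)
    {
        return Strings.Split(text, delimiter, -1, CompareMethod.Binary);
    }

    // Multi-character delimiter via Regex.Split; Regex.Escape keeps the
    // delimiter from being interpreted as a pattern.
    public static string[] SplitOnStringRegex(string text, string delimiter)
    {
        return Regex.Split(text, Regex.Escape(delimiter));
    }
}
```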

Also, I think it's interesting to point out how presenting data using one method vs. another kind of follows along with this document and the reasons why you should (as Jason pointed out on his site) take the time to validate the information yourself.

Published Wednesday, March 30, 2005 9:47 PM by CorySmith

Comments

# re: Split'ing Strings and the Performance Implications

Thursday, March 31, 2005 6:28 PM by Philip
I won't dispute your results until I have access to a place to compile and run it :)

Frankly, I'm not surprised if the vb split is faster for a fixed string, since the regex engine takes a pattern and is more general purpose, while the vb method simply uses .IndexOf iteratively to get the next split point which would not allow a pattern. I think it's a shame that the BCL string's split doesn't have a similar multi-char overload.

However, looking through your tests, I have a few questions that may change your results

0) your "percentage" is actually just the difference in seconds multiplied by 1000. So a .182 sec difference in a 218 second run would have the winner reported as 182% faster

1) you randomly generate the strings and delimiters for each call - that means that the vb and regex methods get a different string for each iteration. wouldn't it be best to use the same string to both calls for each iteration?

2) Any split scenarios I can think of would have a stable delimiter (like, always split on "-;-" or some such) instead of a random delimiter per string. It would be more common to create a new compiled regex item and split on that, but tests on a compiled regex aren't included. (and if you're only doing 1 split, it seems silly to optimize that line anyway)

3) It seems like the vast majority of the tests will be on string lengths between 10,000 and 999,999 (with random either true or false). That means that with 1000 "iterations", the 999-and-less bucket will probably be tested fewer than 10 times, which may not be enough to reduce the variance due to machine noise.
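Point 0 can be made concrete with a small sketch of one conventional way to express a speedup as a percentage, namely the time saved relative to the slower run. The class and method names are hypothetical, and this is not the formula used in the test program.

```csharp
using System;

// Hypothetical helper illustrating a relative-difference percentage.
public static class Percentages
{
    // Time saved by the faster run, as a percentage of the slower run's time.
    public static double PercentFaster(double slowerSeconds, double fasterSeconds)
    {
        if (slowerSeconds <= 0)
            throw new ArgumentOutOfRangeException(nameof(slowerSeconds));
        return (slowerSeconds - fasterSeconds) / slowerSeconds * 100.0;
    }
}
```

With these definitions, a 0.182 second difference on a 218 second run comes out at roughly 0.08 percent rather than 182 percent.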

# re: Split'ing Strings and the Performance Implications

Thursday, March 31, 2005 7:54 PM by Cory Smith
Philip, I agree with most of your points. I think my quick-n-dirty test is more realistic in showing the performance difference. The modified code (that's based on Jason's code) tries to show the difference based on many cases. In the end, as I stated above, there are really only two main categories for a multiple character delimiter... being new line delimiters and comma-separated typed lines. I'm sure there are others, but I think the tests for these two types would show what the "average" would be.

# re: Split'ing Strings and the Performance Implications

Friday, April 01, 2005 11:07 AM by Philip
Cory: once I got home I tested, and the VB split is faster even than a compiled regex (used the same delimiter for every pass, so used a compiled regex). So for any scenario I could find, the Split function from the VisualBasic namespace was fastest for splitting on a string instead of a character.

Thanks, good to know!

(I see that .net 2.0 has a String.Split(string[], StringSplitOptions) overload, I'll have to test that speed and see what it runs like.)
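For reference, a sketch of the two alternatives mentioned in this comment: a compiled regex built once for a stable delimiter, and the .NET 2.0 String.Split(string[], StringSplitOptions) overload. The class and method names are hypothetical, the "-;-" delimiter is borrowed from the earlier comment, and no timing is implied either way.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are for illustration only.
public static class StableDelimiterSplit
{
    private const string Delimiter = "-;-";

    // Built once and reused for every call, since the delimiter never changes.
    private static readonly Regex CompiledDelimiter =
        new Regex(Regex.Escape(Delimiter), RegexOptions.Compiled);

    public static string[] WithCompiledRegex(string text)
    {
        return CompiledDelimiter.Split(text);
    }

    // String.Split overload that accepts whole strings as delimiters.
    public static string[] WithStringOverload(string text)
    {
        return text.Split(new[] { Delimiter }, StringSplitOptions.None);
    }
}
```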

# Spliting Hairs about Splitting Strings

Friday, April 01, 2005 11:51 PM by Randomize

# re: Split'ing Strings and the Performance Implications

Wednesday, April 06, 2005 2:36 PM by Jason Bock
As I mentioned here:

http://www.jasonbock.net/JB/Default.aspx?blog=entry.20050330T173718

There's a bit of clarification to what I found. Essentially, for small strings, the "VB" way is much better. For very large strings, the difference becomes very small. So to say "he was only showing a minor difference between the two" isn't accurate, although my first result:

http://www.jasonbock.net/JB/Default.aspx?blog=entry.20050330T150618

hinted at that, but that was because I was looking at ALL of the data no matter how big the string was.

# Lightweight Searches in .NET?

Friday, June 03, 2005 4:53 AM by Randomize

# re: Split'ing Strings and the Performance Implications

Friday, June 16, 2006 10:58 AM by Jon Baggaley
Here's a simple alternative example where I have just replaced a set of characters with a single one that I know will not exist (in this case "¬"), and then I can just split using that...


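// Replace the multi-character delimiter with a placeholder that never occurs in the script.
string v_strScript = strScript.Replace("\nGO\r", " ¬ ");
// Split the replaced text on the single-character placeholder.
string[] BatchList = v_strScript.Split('¬');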