Revisiting C# and Java RegEx Benchmark 01/18/2004 03:45 AM
Last year, these benchmark results became hot points of contention between Java and .NET developers. What the results suggested was that Java regular expression engines are significantly faster than .NET's Regex. I thought it might be fun to port one of the fastest Java regular expression engines to J# and see how it performs compared to .NET's Regex. I chose the dk.brics.automaton engine because it seemed easiest to port. It was. When I ran a straightforward C# port of regtest.java on the J# version of dk.brics.automaton and on a compiled, singleline Regex, I got these results:
I also ran regtest.java on the original dk.brics.automaton and Java's built-in regular expression engine. Results were:
Based on these admittedly informal results, Regex's slower performance is probably caused not by bad design or implementation of the regular expression engine but by performance issues that may exist within the CLR and core classes. Since I lack the enthusiasm to dig into the innards except in pursuit of a critical bug, I'll leave it up to the CLR team to chase further. IMHO, .NET performance is 'good enough' for server-side use at this time, so please don't misinterpret this post as an attempt to pull .NET down in favor of Java. BTW, I won't be using my port of dk.brics.automaton in production because it seems to miss some patterns that it should have found.

Grok Headline matches for Revisiting C# and Java RegEx Benchmark

Java Regex Wrangling 08/22/2004 05:32 PM
I needed a quick and dirty tokenizer for a big chunk of XML-ish text to feed into some Java code so I was going to fire up Perl, then I remembered that modern Java comes with its own regular-expression library. Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran over a 100M file, finding fifteen million tokens, in about three minutes of CPU time on my 1.25GHz PowerBook. Quite respectable, but, I thought with a snicker, I bet Perl can beat that. (Perl’s regex engine is generally regarded as the state of the art.) So I whacked together a Perl version (I’m considerably more practiced at Perl than Java) and hrumph, Perl 5.8.1 fell over and refused to run the big regex. Experts may read on to look at the gory details. [Update: some have, the perl version is flawed, stand by for further reports.]...

Perl vs. Java RegEx 08/28/2004 01:03 AM
Tim Bray compares Perl and Java regular expression performance and finds Java running twice as fast as Perl when output performance is factored out. Fantastic. I knew Java's regular expression library was fast, but I didn't know it was this fast.
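As a rough illustration of what "output performance factored out" means in a comparison like this, the sketch below counts regex tokens in XML-ish text with java.util.regex and times only the matching, never printing the tokens themselves. The pattern, the class and method names, and the generated input are all assumptions made for this example; it is not Tim Bray's tokenizer or his benchmark, and a single timed pass like this is far too informal to support any conclusion about engine speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerTiming {
    // Hypothetical token pattern: an opening or closing tag,
    // or a run of characters containing neither '<' nor whitespace.
    static final Pattern TOKEN = Pattern.compile("</?[A-Za-z][^>]*>|[^<\\s]+");

    // Count tokens without printing them, so I/O cost stays out of the timing.
    static int countTokens(CharSequence text) {
        Matcher m = TOKEN.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Build some made-up XML-ish input in memory.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append("<item id=\"").append(i).append("\">some text here</item>\n");
        }

        long start = System.nanoTime();
        int tokens = countTokens(sb);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Only the summary line is printed, after the clock has stopped.
        System.out.println(tokens + " tokens in " + elapsedMs + " ms");
    }
}
```

Compiling the pattern once and reusing it mirrors the usual advice for both java.util.regex and .NET's Regex; recompiling inside the loop would mostly measure compilation rather than matching.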
Even more encouraging, there are even faster third-party regular expression libraries for Java. I wonder if .NET 2.0 makes up for the lackluster RegEx performance in .NET 1.1.

Revisiting SGML on the web 05/23/2002 10:39 PM

Revisiting The Unsubscribe Link 06/01/2004 02:03 PM
In just about every silly "profile of a spammer," you tend to hear them say two things: (1) they don't send out porn spam and (2) they really do remove those who unsubscribe from their spam. Of course, most people are unlikely to believe either of those claims (for good reason), but with the passage of CAN-SPAM (which requires a "working" opt-out link) the debate keeps returning to whether or not you actually should "opt-out" of spam - since it's well known that many spammers only use that information to confirm that you're a "live one," and make sure you get plenty more spam. Sooner or later, someone had to test it out, and now an anti-spam company is claiming that only 10 to 15% of opt-out spam links are invalid - which sounds impossibly low. Of course, they don't break out just how much additional spam you will get for the few untrustworthy opt-out links. In fact, it's unclear how they really know if the opt-out works. You may not get spam from the identical spammer, but they could just as easily resell your live info to other spammers, and you have no way of knowing it was because you "opted-out." Or, more likely, they'll just start spamming you from one of the hundred other identities they have, so they can claim that you're no longer receiving spam from that one entity, but you never opted out of the other 99.
Revisiting SGML on the Web (xmlhack) 05/30/2002 02:41 PM

revisiting dunbar's number 06/26/2004 08:42 PM
always good to see a site where the ideas are as pretty as the presentation

Regex-Iterator-0.2 08/22/2004 05:58 AM

Regex-Iterator-0.1 08/21/2004 05:04 PM

Test-Regex-0.01 08/01/2004 11:30 PM

Regex-PreSuf-1.16 05/12/2004 05:10 PM

Revisiting Barcode Replacement Satire 05/11/2004 03:16 PM
A little over a year ago there was a huge media frenzy over a site that let people view and print out barcodes. It was really just a database of barcodes, but the site presented a satirical commercial showing how you could use the site to "name your own price" and re-code any product to a price you preferred. Of course, actually doing the re-coding would be illegal. Running a database telling people how seems perfectly legal... unless you're lawyers at a big company like Wal-Mart. Wal-Mart and a number of other big companies forced the site to shut down, and the folks have now set up the site as a Wal-Mart spoof. John submitted a story about the whole mess one year later. It sounds like those involved didn't expect the level of backlash they got - especially from the press, who labeled them as thieves. Still, they've now got other plans up their sleeves for satirical projects.

Revisiting the Nasdaq's Past Pain 11/13/2003 08:50 AM
TheStreet.com Nov 13 2003 8:40AM ET

Congress Revisiting Spam Plans 05/25/2004 11:55 AM
When the CAN SPAM law was first passed, anyone who thought through what the law actually said realized that it wouldn't work, and some people started asking what plan B was.
Rather than watch the sponsors just pat themselves on the back, we wanted to know exactly how they would measure the success or failure of the bill, and what they would do in the very likely event that it made the problem worse, not better. The sponsors of the bill never really responded to that question, but just talked about how wonderful it was that they were now banning spam. Except that, only five months into the law being in effect, the spam problem is clearly worse, not better. For once, however, it appears that even some folks in Congress realize this and are already interested in revisiting the law. Some of this article is just repeating the things that we posted last week about the FTC exploring other options such as a bounty system encouraging people to track down spammers, but the fact that more politicians are realizing CAN SPAM isn't working is a good thing. Of course, we still haven't heard from the Senators who were so proud of themselves for coming up with the law in the first place.

Revisiting "Table Layouts, Revisited" 09/06/2002 08:42 PM
A response/reaction to Zeldman's recent reflections on the experience of table versus CSS layout.

Open-Source Regex 08/27/2004 02:07 PM
A few days ago I wrote a little report on regular-expression performance; it drew a surprising amount of feedback, including one piece that throws an interesting sidelight on the trade-offs around Java and Open Source...

Apache Regex Problems 12/29/2003 08:07 PM
Noel Davis looks at problems in Apache, mod_php, XDM, Goahead Web Server, Xerox Document Center, SARAH, phpBB2, OpenBB, SquirrelMail, and pServ.

Revisiting the "hardware is free" vision