stargeek


Revisiting C# and Java RegEx Benchmark








Revisiting C# and Java RegEx Benchmark 01/18/2004 03:45 AM

Last year, these benchmark results became hot points of contention between Java and .NET developers. What the results suggested was that Java regular expression engines are significantly faster than .NET's Regex.

I thought it might be fun to port one of the fastest Java regular expression engines to J# and see how it performs compared to .NET's Regex. I chose the dk.brics.automaton engine because it seemed easiest to port. It was. When I ran a straightforward C# port of regtest.java against both the J# version of dk.brics.automaton and a compiled singleline Regex, I got these results:

dk.brics.automaton 2303 milliseconds
Regex 2894 milliseconds

I also ran regtest.java on the original dk.brics.automaton and Java's built-in regular expression engine.  Results were:

dk.brics.automaton 511 milliseconds
java.util.regex 1061 milliseconds
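For anyone who wants to repeat the Java half of this comparison, here is a minimal sketch of a timing loop around java.util.regex. It is not the original regtest.java: the class name RegexTiming, the pattern, and the input text are all made up for illustration, and the numbers it prints will depend on the machine and JVM.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical micro-benchmark; NOT the original regtest.java.
final class RegexTiming {

    // Counts non-overlapping matches of a precompiled pattern in the input.
    static int countMatches(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Runs the search `iterations` times and returns elapsed wall-clock milliseconds.
    static long timeMillis(Pattern pattern, CharSequence input, int iterations) {
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += countMatches(pattern, input);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000L;
        // Use the accumulated count so the work cannot be optimized away.
        if (total < 0) {
            throw new IllegalStateException("match count overflowed");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Pattern and input are illustrative assumptions, not the benchmark's data.
        Pattern pattern = Pattern.compile("[a-z]+@[a-z]+\\.com");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("mail bob@example.com or eve@test.com now ");
        }
        String input = sb.toString();

        timeMillis(pattern, input, 5); // warm-up pass before the measured run
        long ms = timeMillis(pattern, input, 20);
        System.out.println("java.util.regex " + ms + " milliseconds");
    }
}
```

The pattern is compiled once outside the loop, so the measurement covers matching rather than pattern construction. A single run like this is as informal as the figures above.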

Based on these admittedly informal results, Regex's slower performance is probably not caused by bad design or implementation of the regular expression engine itself but by performance issues that may exist within the CLR and core classes: the same dk.brics.automaton code took 2303 ms as a J# port on .NET versus 511 ms on Java, roughly 4.5 times slower. Since I lack the enthusiasm to dig into the innards except in pursuit of a critical bug, I'll leave it up to the CLR team to chase this further.

IMHO, .NET performance is 'good enough' for server-side use at this time, so please don't misinterpret this post as an attempt to pull .NET down in favor of Java. BTW, I won't be using my port of dk.brics.automaton in production because it seems to miss some patterns that it should have found.









Similar Items

Revisiting C# and Java RegEx Benchmark

Grok Headline matches for Revisiting C# and Java RegEx Benchmark

Java Regex Wrangling


Java Regex Wrangling 08/22/2004 05:32 PM
I needed a quick and dirty tokenizer for a big chunk of XML-ish text to feed into some Java code so I was going to fire up Perl, then I remembered that modern Java comes with its own regular-expression library. Hey, it’s good! I put it together in quick-n-dirty hacker style, and it ran over a 100M file, finding fifteen million tokens, in about three minutes of CPU time on my 1.25GHz PowerBook. Quite respectable, but, I thought with a snicker, I bet Perl can beat that. (Perl’s regex engine is generally regarded as the state of the art.) So I whacked together a Perl version (I’m considerably more practiced at Perl than Java) and hrumph, Perl 5.8.1 fell over and refused to run the big regex. Experts may read on to look at the gory details. [Update: some have, the perl version is flawed, stand by for further reports.]...

Perl vs. Java RegEx


Perl vs. Java RegEx 08/28/2004 01:03 AM

Tim Bray compares Perl and Java regular expression performance and finds Java running twice as fast as Perl once output performance is factored out. Fantastic. I knew Java's regular expression library was fast, but I didn't know it was this fast. Even more encouraging, there are even faster third-party regular expression libraries for Java. I wonder if .NET 2.0 makes up for the lackluster RegEx performance in .NET 1.1.


Revisiting SGML on the web


Revisiting SGML on the web 05/23/2002 10:39 PM

Revisiting The Unsubscribe Link


Revisiting The Unsubscribe Link 06/01/2004 02:03 PM
In just about every silly "profile of a spammer," you tend to hear them say two things: (1) they don't send out porn spam and (2) they really do remove those who unsubscribe from their spam. Of course, most people are unlikely to believe either of those claims (for good reason), but with the passage of CAN-SPAM (which requires a "working" opt-out link) the debate keeps returning to whether or not you actually should "opt-out" of spam - since it's well known that many spammers only use that information to confirm that you're a "live one," and make sure you get plenty more spam. Sooner or later, someone had to test it out, and now an anti-spam company is claiming that only 10 to 15% of opt-out spam links are invalid - which sounds impossibly low. Of course, they don't break out just how much additional spam you will get for the few untrustworthy opt-out links. In fact, it's unclear how they really know if the opt-out works. You may not get spam from the identical spammer, but they could just as easily resell your live info to other spammers, and you have no way of knowing it was because you "opted-out." Or, more likely, they'll just start spamming you from one of the hundred other identities they have, so they can claim that you're no longer receiving spam from that one entity, but you never opted out of the other 99.

Revisiting SGML on the Web (xmlhack)


Revisiting SGML on the Web (xmlhack) 05/30/2002 02:41 PM

revisiting dunbar's number


revisiting dunbar's number 06/26/2004 08:42 PM
always good to see a site where the ideas are as pretty as the presentation

Regex-Iterator-0.2


Regex-Iterator-0.2 08/22/2004 05:58 AM

Regex-Iterator-0.1


Regex-Iterator-0.1 08/21/2004 05:04 PM

Test-Regex-0.01


Test-Regex-0.01 08/01/2004 11:30 PM

Regex-PreSuf-1.16


Regex-PreSuf-1.16 05/12/2004 05:10 PM

Revisiting Barcode Replacement Satire


Revisiting Barcode Replacement Satire 05/11/2004 03:16 PM
A little over a year ago there was a huge media frenzy over a site that let people view and print out barcodes. It was really just a database of barcodes, but the site presented a satirical commercial showing how you could use the site to "name your own price" and re-code any product to a price you preferred. Of course, actually doing the re-coding would be illegal. Running a database telling people how seems perfectly legal... unless you're lawyers at a big company like Wal-Mart. Wal-Mart and a number of other big companies forced the site to shut down, and the folks have now set up the site as a Wal-Mart spoof. John submitted a story about the whole mess one year later. It sounds like those involved didn't expect the level of backlash they got - especially from the press who labeled them as the thieves. Still, they've now got other plans up their sleeves for satirical projects.

Revisiting the Nasdaq's Past Pain


Revisiting the Nasdaq's Past Pain 11/13/2003 08:50 AM
TheStreet.com Nov 13 2003 8:40AM ET

Congress Revisiting Spam Plans


Congress Revisiting Spam Plans 05/25/2004 11:55 AM
When the CAN SPAM law was first passed, anyone who thought through what the law actually said realized that it wouldn't work, and some people started asking what was plan B? Instead of just patting themselves on the back, we wanted to know exactly how they would measure the success or failure of the bill, and what they would do in the very likely event that it made the problem worse, not better. The sponsors of the bill never really responded to that question, but just talked about how wonderful it was that they were now banning spam. Except, only five months into the law being in effect and the spam problem is clearly worse, not better. For once, however, it appears that even some folks in Congress realize this and are already interested in revisiting the law. Some of this article is just repeating the things that we posted last week about the FTC exploring other options such as a bounty system encouraging people to track down spammers, but the fact that more politicians are realizing CAN SPAM isn't working is a good thing. Of course, we still haven't heard from the Senators who were so proud of themselves for coming up with the law in the first place.

Revisiting "Table Layouts, Revisited"


Revisiting "Table Layouts, Revisited" 09/06/2002 08:42 PM
A response/reaction to Zeldman's recent reflections on the experience of table versus CSS layout.

Open-Source Regex


Open-Source Regex 08/27/2004 02:07 PM
A few days ago I wrote a little report on regular-expression performance; it drew a surprising amount of feedback, including one piece that throws an interesting sidelight on the trade-offs around Java and Open Source...

Apache Regex Problems


Apache Regex Problems 12/29/2003 08:07 PM
Noel Davis looks at problems in Apache, mod_php, XDM, Goahead Web Server, Xerox Document Center, SARAH, phpBB2, OpenBB, SquirrelMail, and pServ.

Revisiting the "hardware is free" vision of the future


Revisiting the "hardware is free" vision of the future 06/01/2004 11:21 PM
You may recall back at the end of March that we had a little ditty on Bill Gates' proclamation that "hardware will be free" in the future. Now Sun is saying that same thing, leaving us to wonder: what will we ever do with all this free hardware?

U.S. Supreme Court Revisiting Intel, AMD Spat


U.S. Supreme Court Revisiting Intel, AMD Spat 11/11/2003 06:58 AM
SiliconValley.Internet.com Nov 11 2003 6:29AM ET

C++ Regex Engine 0.05b (Default branch)


C++ Regex Engine 0.05b (Default branch) 04/14/2005 04:12 AM
C++ Regex Engine provides a robust regular expression library for use in C++. The syntax of the regular expression language is very close to the Perl 5 standard. The classes given somewhat mirror those found in Java's regular expression library, but have several improvements. The intended audience of this package is anyone who does a lot of data parsing (e.g. sysadmins), people tired of trying to decipher Perl scripts, and regular expression beginners who know C++. Some knowledge of the STL's vector and string will come in handy, but isn't necessary.
Changes:
Possibly very buggy Unicode support was added. Preliminary tests show that both Unicode and ASCII matching works, though the Unicode aspects have not yet been tested thoroughly.

Crawl Master (OO Perl, LWP, regex)


Crawl Master (OO Perl, LWP, regex) 04/04/2005 10:24 AM
Open List, Inc. - United States, N.Y., New York (2005-04-04)

Data Globbing with MySQL Regex


Data Globbing with MySQL Regex 07/13/2004 11:50 PM

As I become a more experienced developer, I'm learning when you should and shouldn't break the rules. While following every rule of programming and data modeling is wonderful, sometimes you need to bend the rules for the sake of simplicity and expediency.

Always remember, an app in the hand is worth a thousand on the white board.

This being the case, lately I've been known to "glob" up data in database fields. Yes, I know this breaks the first normal form (that of atomicity), but there are times when doing it right would involve three more queries, two more database tables, another UI screen, etc. Often it makes the cure worse than the disease.

For instance, consider this little XML document as the contents of the "children" field for one of the records in my "church_attender" table:


<children>
  <child>
    <first_name>Isabella</first_name>
    <last_name>Barker</last_name>
  </child>
  <child>
    <first_name>Gabrielle</first_name>
    <last_name>Barker</last_name>
  </child>
</children>

Now if I never wanted to search for individual children, I would make no excuses for this. It saves us a database table, a join, and a ton of complexity in the interface. Life is good.

Searching a globbed up field is a problem, though. We alluded to it in this post when we said:

However, the problem is that the XML field is a black box that, on most database platforms, you can't look inside. What if you want a list of articles written by a particular author? Well, you need to use SQL to get all the XML back, spin through that collection, XPath into every single one to find the value of the author node, then keep that record if it matches.
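That client-side approach (fetch every XML field, XPath into each one, keep the records that match) can be sketched in Java with the standard javax.xml.xpath API. This is only an illustration: the class GlobFilter and its method are hypothetical names, the database fetch is left out, and the list of strings stands in for whatever the query returned.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: filters globbed XML fields on the client side.
final class GlobFilter {

    // Keeps every XML string in which some node selected by `path`
    // has text content equal to `wanted`.
    static List<String> keepMatching(List<String> xmlFields, String path, String wanted) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> kept = new ArrayList<>();
        for (String xml : xmlFields) {
            NodeList nodes;
            try {
                // Parse this one field and select the candidate nodes.
                nodes = (NodeList) xpath.evaluate(
                        path, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("bad XPath or malformed XML field", e);
            }
            for (int i = 0; i < nodes.getLength(); i++) {
                if (wanted.equals(nodes.item(i).getTextContent().trim())) {
                    kept.add(xml);
                    break; // one hit is enough for this record
                }
            }
        }
        return kept;
    }
}
```

For the "children" document shown earlier, a call such as GlobFilter.keepMatching(rows, "/children/child/first_name", "Gabrielle") would keep only the records with a child of that first name. Every field gets parsed in application code, which is exactly the cost the quoted passage is complaining about.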

So what if I want to find a person with a child named Gabrielle? Some databases (Oracle, for one), will let you do something like this:


SELECT * FROM church_attender WHERE XPATH(children,'/child/first_name') = 'Gabrielle'

That'd be great, but I don't have Oracle. However, given our experience this week with MySQL regular expressions, how unacceptable would this be:


SELECT * FROM church_attender WHERE children LIKE '%Gabrielle%' AND children RLIKE '<children>.*<child>.*<first_name> Gabrielle </first_name>.*</child>.*</children>'

(Note that there are some extra spaces in there just so the lines would wrap.)
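To see what that pattern accepts and rejects without a database at hand, here is a rough Java approximation. java.util.regex is a different engine from the one inside MySQL, so treat this as a sketch of the idea rather than a statement about MySQL's behavior; the class and method names are hypothetical, and the extra wrap-only spaces have been removed from the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the LIKE + RLIKE query, run against one field value.
final class ChildNameRegex {

    // Same shape as the RLIKE pattern, minus the spaces added for line wrapping.
    // DOTALL lets '.' cross line breaks inside the stored XML.
    static final Pattern GABRIELLE = Pattern.compile(
            "<children>.*<child>.*<first_name>Gabrielle</first_name>.*</child>.*</children>",
            Pattern.DOTALL);

    static boolean hasChildNamedGabrielle(String childrenXml) {
        // Cheap substring pre-check first, mirroring the LIKE '%Gabrielle%' clause,
        // then the stricter structural match.
        return childrenXml.contains("Gabrielle") && GABRIELLE.matcher(childrenXml).find();
    }
}
```

The substring check alone would also accept a record whose last_name happens to be Gabrielle; the regular expression is what ties the value to a first_name element.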

Yes, yes, I know the database gods would frown on this, but given the enormous amount of complexity it would save us, is it acceptable? Does the good outweigh the bad?

Fishing for opinions here. Let's hear them.



Regex Wizard Needed for Web Site Migration


Regex Wizard Needed for Web Site Migration 08/16/2004 02:38 PM
Federation of American Scientists - United States, DC, Washington (2004-08-16)

Far Cry Benchmark


Far Cry Benchmark 07/20/2004 04:07 PM

VIA AES Benchmark


VIA AES Benchmark 05/20/2004 10:08 AM

XML Benchmark 1.3.0


XML Benchmark 1.3.0 02/10/2004 02:55 PM
A C/C++/Java XML parsers benchmarking tool set.

PHP Benchmark


PHP Benchmark 11/18/2003 10:13 PM
Sebastian is doing some neat work on testing performance. I found it hard to decipher the data, so I graphed it in Excel.

PHP Sebastian Benchmark:

There seems to be some drop-off in performance in PHP 4.3.4. I guess the core developers are putting their energies into PHP 5.

3DX: Benchmark


3DX: Benchmark 04/27/2004 03:59 PM
New Website

Far Cry Benchmark 1.2


Far Cry Benchmark 1.2 07/23/2004 04:21 PM

XML Benchmark 1.2.2


XML Benchmark 1.2.2 10/29/2003 12:11 AM
A C/C++/Java XML parsers benchmarking tool set.

XML Benchmark


XML Benchmark 02/10/2004 01:26 PM
New Results Published

First Athlon64 X2 Benchmark


First Athlon64 X2 Benchmark 04/17/2005 06:17 AM

Benchmark::Timer 0.6


Benchmark::Timer 0.6 09/02/2004 08:55 PM
A Perl extension to benchmark code, with or without statistical confidence.

HardwareOC Far Cry Benchmark 1.3


HardwareOC Far Cry Benchmark 1.3 08/10/2004 10:38 AM

Test-Benchmark-0.003


Test-Benchmark-0.003 12/22/2003 06:32 PM

HardwareOC Far Cry benchmark v1.2.1


HardwareOC Far Cry benchmark v1.2.1 07/26/2004 08:59 AM

Benchmark-Timer-0.6.1


Benchmark-Timer-0.6.1 09/16/2004 05:08 PM

Test-Benchmark-0.002


Test-Benchmark-0.002 12/20/2003 06:05 PM

Test-Benchmark-0.001


Test-Benchmark-0.001 12/16/2003 11:05 PM

OCaml Benchmark


OCaml Benchmark 08/18/2004 05:01 PM
Initial import