stargeek


Screen-scraping with WWW::Mechanize







Screen-scraping with WWW::Mechanize 03/13/2003 10:23 AM

Screen-scraping is the job of programmatically navigating through a usually visual task - like a web site - and then dealing with the result; and WWW::Mechanize is the best screen scraper out there for Perl! Chris Ball puts the two things together, to ensure that he never misses his favourite TV shows again...









Similar Items

Screen-scraping with WWW::Mechanize

Grok Headline matches for Screen-scraping with WWW::Mechanize

Rhetorics of scraping


Rhetorics of scraping 06/06/2005 12:11 AM
Michael Fry does not like people syndicating his comic strip, Over the Hedge:
You are stealing. You are taking money out of my pocket just as surely as if you held a gun to my head and demanded my wallet. By making Hedge so easily and freely available you are undermining the economics that make the comics you so obviously love possible.

United Media does not offer RSS feeds of their strips, with or without advertisements, so these scraped feeds are the only way to follow such comics. Fine, they don't want this scraping to happen, and that is their right, but I do find the rhetoric used here completely and utterly stupid.

Why the fuck would removing advertisements be the same as holding a gun against someone's head!?! That is blatantly absurd - the former is the same as going to the toilet during commercial breaks, the latter is a threat to take a person's life! There is nothing similar in these two cases. There's also the delusion of a "lost sale" here... If the Hedge is not available to me via RSS, I'll simply stop reading it. There is no "lost sale" in advertisements in this case - and even if I went to that site, I would have ad blockers in my browser.

The other side of me just wonders, why is "making Hedge so easily and freely available" undermining economics? If your economics consists of making life difficult and expensive for the users, then perhaps yes, but if your point is to sell books - aren't you better off telling everyone about your great thing? You know, advertising?

Anyway. There are many services that still do this scraping thing, all over the world. All it requires is a few lines of Perl or Python for anyone with an inch of coding ability. If you can read the HTML, you can scrape it. My fear is that once content producers realize this, they will start to offer their products embedded inside Flash files, or custom image plugins, or perhaps in DRM-protected videos (containing nothing but the image). Perhaps all text will be sent as images to stop scraping, or all sites will be turned to Flash. This will kill usability on so many fronts it's not funny anymore, and drive away users instead of getting more of them.

But what should be understood is that scraping as such is not legal. You can shut down an offending site by sending a simple email. You can stop it, once it starts to happen, using normal legal recourse. You just can't prevent it without losing your customers. Please don't even try...
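To make the "few lines of Perl or Python" remark concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. The sample markup, the `strip` class name, the image paths, and the `StripFinder` / `find_strips` names are all hypothetical and chosen for illustration; a real page would have its own structure, and this is not how any particular feed scraper is known to work.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <h1>Daily strips</h1>
  <img class="strip" src="/strips/2005-06-06.gif" alt="Strip for 6 June">
  <img class="banner" src="/ads/banner.gif" alt="Advertisement">
  <img class="strip" src="/strips/2005-06-05.gif" alt="Strip for 5 June">
</body></html>
"""


class StripFinder(HTMLParser):
    """Collect the src of every <img> whose class attribute is 'strip'."""

    def __init__(self):
        super().__init__()
        self.strip_urls = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("class") == "strip":
            self.strip_urls.append(attributes.get("src"))


def find_strips(html):
    """Return the image paths of all strips found in the given HTML."""
    parser = StripFinder()
    parser.feed(html)
    parser.close()
    return parser.strip_urls


if __name__ == "__main__":
    for url in find_strips(SAMPLE_HTML):
        print(url)
```

The point of the sketch is only that anything readable in the HTML is equally readable by a short program; it also shows why such scrapers are fragile, since renaming the assumed `strip` class would silently return an empty list.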


Web Scraping Proxy


Web Scraping Proxy 04/16/2004 03:51 PM
DDJ Apr 16 2004 8:13PM GMT

Scraping the Web for Implied Data


Scraping the Web for Implied Data 07/11/2004 07:02 AM

Dr. Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, thinks that there is more implied data (or inferable metadata) than "raw" data on the Web, and that we are barely scratching the surface of it. "Today, all search engines are scraping for some simple forms of implied data: language, locality, etc. What's missing from this list is a nearly infinite collection of relationships that are obvious to most any human reader but extremely difficult to infer from a single document." He gives the example of a very technical document about protein folding, which assumes that the reader would know the specification language and much else about the material being presented. An ordinary reader might sense the document "makes reference to physics in a non-trivial way," while an expert would note even more implied facts ("the article may be out-dated by now," "the author is considered an authority in this domain," or "there's an expectation that diseases will be curable if these advances continue," etc.). Flake says: "In total, all of the implied data amounts to the stuff that all of us carry in our heads but no one bothers to write down; yet these factoids are essential to understanding and meaning. Some people in AI have been trying to codify these factoids for decades (and in many forms, from ontologies to databases of common sense). We are now starting to scrape the web for these subtle relationships. The key insight is that it is not enough to look at words, concepts, or documents; one must also look at how all of these things relate to one another." This article has been added to the articles section of Deep Web Research Subject Tracer™ Information Blog.
http://searchenginewatch.com/searchday/article.php/3374821

XRay Web Scraping Tool 2.0


XRay Web Scraping Tool 2.0 12/04/2003 08:26 PM
A GUI-based HTTP monitoring and Web scraping tool.

Scraping the Senate, turning US govt into structured data


Scraping the Senate, turning US govt into structured data 09/03/2004 06:29 AM
Cory Doctorow: Paul Ford has written an article for XML.com about his plan to scrape all the information he can about the Senate and convert it into searchable, structured data (much like the UK's brilliant They Work For You project, which does the same for Parliament). He's planning to document his process of converting the Senate's sloppy html into clean XML, and turn the process into a tutorial on how to make the Semantic Web come alive.
Of course screen-scraping is itself a dubious process. When the Senate decides to change its page design, moves the page, or alters the suffix, I'm out of luck. At the same time, it's hard to argue against the fact that the Senate's own web site is a definitive source for up-to-date, reliable information about the current composition of the Senate. This is a situation that we're likely to encounter again: the best, most reliable site to get some information is the worst place to get useful data. Hopefully, as we go forward, we'll have multiple sources of information on various members of the government, and can use them all together.
Link (via Kottke)

Scraping 'Cloth Cap' Deck Chairs (Reuters)


Scraping 'Cloth Cap' Deck Chairs (Reuters) 05/17/2004 11:49 AM
Reuters - The quintessentially British seaside town of Blackpool wants to get rid of its candy-striped deck chairs because they are too old-fashioned.

WWW-Mechanize-1.00


WWW-Mechanize-1.00 04/10/2004 06:27 AM

WWW-Mechanize-1.04


WWW-Mechanize-1.04 09/16/2004 07:41 AM

WWW-Mechanize-1.02


WWW-Mechanize-1.02 04/13/2004 11:37 PM

WWW-Mechanize-1.08


WWW-Mechanize-1.08 12/24/2004 12:11 PM

WWW-Mechanize-0.66


WWW-Mechanize-0.66 11/13/2003 06:23 PM

Gopher-Mechanize-0.27


Gopher-Mechanize-0.27 01/24/2004 05:35 AM

WWW-Mechanize-1.03_02


WWW-Mechanize-1.03_02 08/17/2004 12:15 AM

WWW-Mechanize-1.03_01


WWW-Mechanize-1.03_01 05/27/2004 04:48 PM

Test-WWW-Mechanize-0.04


Test-WWW-Mechanize-0.04 07/13/2004 12:23 AM

WWW-Mechanize-Timed-0.42


WWW-Mechanize-Timed-0.42 01/23/2004 02:26 PM

WWW-Mechanize-Shell-0.30


WWW-Mechanize-Shell-0.30 10/31/2003 06:21 PM

WWW-Mechanize-1.13_01


WWW-Mechanize-1.13_01 04/12/2005 04:55 PM

Win32-IE-Mechanize-0.005


Win32-IE-Mechanize-0.005 04/12/2004 04:50 PM

Test-WWW-Mechanize-0.02


Test-WWW-Mechanize-0.02 07/05/2004 06:29 AM

WWW-Mechanize-Shell-0.12


WWW-Mechanize-Shell-0.12 03/20/2003 11:04 PM

WWW-Mechanize-0.71_01


WWW-Mechanize-0.71_01 12/22/2003 05:34 AM

Win32-IE-Mechanize-0.006


Win32-IE-Mechanize-0.006 04/25/2004 12:16 AM

WWW-Mechanize-Cached-1.32


WWW-Mechanize-Cached-1.32 04/11/2004 11:46 PM

WWW-Mechanize-Shell-0.11


WWW-Mechanize-Shell-0.11 03/19/2003 10:24 PM

WWW-Mechanize-Cached-1.24


WWW-Mechanize-Cached-1.24 01/19/2004 05:05 AM

WWW-Mechanize-0.71_02


WWW-Mechanize-0.71_02 12/22/2003 06:32 PM

Test-WWW-Mechanize-Catalyst-0.30


Test-WWW-Mechanize-Catalyst-0.30 03/25/2005 02:13 AM

Win32-IE-Mechanize-0.007_4


Win32-IE-Mechanize-0.007_4 01/04/2005 02:12 AM

Test-WWW-Mechanize-1.05_02


Test-WWW-Mechanize-1.05_02 04/03/2005 04:07 PM

Test-WWW-Mechanize-Catalyst-0.31


Test-WWW-Mechanize-Catalyst-0.31 04/17/2005 10:59 AM

Win32-IE-Mechanize-0.007_3


Win32-IE-Mechanize-0.007_3 12/30/2004 02:17 AM

Too Much Fun for Just One Screen


Too Much Fun for Just One Screen 06/17/2004 05:27 AM
The Legend of Zelda: Four Swords Adventures requires more hardware, and has less-spectacular graphics, than any other GameCube game. But it's still one of the best multiplayer games for any console. A review by Lore Sjöberg.

Screen


Screen 10/02/2002 09:26 AM

When I Do Have A Screen


When I Do Have A Screen 02/05/2005 09:39 PM

When Flash memory's price starts dropping and capacity starts increasing, Apple might well be forced to put a display on its iPod shuffle. And this is what it might look like.


New iMac -- all screen, no box


New iMac -- all screen, no box 09/01/2004 07:37 AM

A LCD Screen to dream for


A LCD Screen to dream for 06/29/2004 01:00 AM

This is one time in my life when I wished I had a rich family member who I could beg for some pocket change. Engadget has a review of a soon-to-be-released 1 billion LCD by NEC. All I can say is wow, and, as the reviewer at Engadget is predicting, the price will probably be on the extreme high side. But it's always nice to dream. For those of you who are design artists and photographers, you need to check this bad boy out. [Engadget]


New Splash Screen


New Splash Screen 03/13/2003 10:14 AM
According to Bug 194291, Mozilla will soon have a new splash screen. No more green lizard everyone loves/hates. From the bug description: It is possible that this bug will receive a certain amount of attention. Before you comment here, listen carefully: This is not bug 32218. There will be no...

MPs to get £1.3m security screen


MPs to get £1.3m security screen 04/22/2004 01:15 PM
Plans to install a bullet proof screen in the House of Commons are approved by MPs.
















