Screen-scraping with WWW::Mechanize 03/13/2003 10:23 AM
Screen-scraping is the job of programmatically navigating through a usually visual task - like a web site - and then dealing with the result; and WWW::Mechanize is the best screen scraper out there for Perl! Chris Ball puts the two things together, to ensure that he never misses his favourite TV shows again...

Grok Headline matches for Screen-scraping with WWW::Mechanize

Rhetorics of scraping 06/06/2005 12:11 AM
Michael Fry does not like people syndicating his comic strip, Over the Hedge:
You are stealing. You are taking money out of my pocket just as surely
as if you held a gun to my head and demanded my wallet. By making
Hedge so easily and freely available you are undermining the economics
that make the comics you so obviously love possible.
United Media does not offer RSS feeds of their strips, with or without advertisements, so these scraped feeds are the only way to follow such comics. Fine, they don't want this scraping to happen, and that is their right, but I do find the rhetoric used here completely and utterly stupid. Why the fuck would removing advertisements be the same as holding a gun against someone's head!?! That is blatantly absurd - the former is the same as going to the toilet during commercial breaks, the latter is a threat to take a person's life! There is nothing similar in these two cases.

There's also the delusion of the "lost sale" here... If the Hedge is not available to me via RSS, I'll simply stop reading it. There is no "lost sale" in advertisements in this case - and even if I went to that site, I would have ad blockers in my browser.

The other side of me just wonders: why is "making Hedge so easily and freely available" undermining economics? If your economics consists of making life difficult and expensive for the users, then perhaps yes, but if your point is to sell books - aren't you better off telling everyone about your great thing? You know, advertising?

Anyway. There are many services that still do this scraping thing, all over the world. All it requires is a few lines of Perl or Python from anyone with an inch of coding ability. If you can read the HTML, you can scrape it. My fear is that once content producers realize this, they will start to offer their products embedded inside Flash files, or custom image plugins, or perhaps in DRM-protected videos (containing nothing but the image). Perhaps all text will be sent as images to stop scraping, or all sites will be turned to Flash. This will kill usability on so many fronts it's not funny anymore, and drive away users instead of getting more of them.

But what should be understood is that scraping as such is not legal. You can, by sending a simple email, shut down an offending site.
You can stop it, once it starts to happen, using normal legal recourses. You just can't prevent it without losing your customers. Please don't even try...

Web Scraping Proxy 04/16/2004 03:51 PM
DDJ Apr 16 2004 8:13PM GMT

Scraping the Web for Implied Data 07/11/2004 07:02 AM
http://searchenginewatch.com/searchday/article.php/3374821
Dr. Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, thinks that there is more implied data (or inferable metadata) than "raw" data on the Web, and that we are barely scratching the surface of it. "Today, all search engines are scraping for some simple forms of implied data: language, locality, etc. What's missing from this list is a nearly infinite collection of relationships that are obvious to most any human reader but extremely difficult to infer from a single document." He gives the example of a very technical document about protein folding, which assumes that the reader knows the specification language and much else about the material being presented. An ordinary reader might sense that the document "makes reference to physics in a non-trivial way," while an expert would note even more implied facts ("the article may be out-dated by now," "the author is considered an authority in this domain," or "there's an expectation that diseases will be curable if these advances continue," etc.). Flake says: "In total, all of the implied data amounts to the stuff that all of us carry in our heads but no one bothers to write down; yet these factoids are essential to understanding and meaning. Some people in AI have been trying to codify these factoids for decades (and in many forms, from ontologies to databases of common sense). We are now starting to scrape the web for these subtle relationships. The key insight is that it is not enough to look at words, concepts, or documents; one must also look at how all of these things relate to one another."
This article has been added to the articles section of Deep Web Research Subject Tracer™ Information Blog. http://searchenginewatch.com/searchday/article.php/3374821

XRay Web Scraping Tool 2.0 12/04/2003 08:26 PM
A GUI-based HTTP monitoring and Web scraping tool.

Scraping the Senate, turning US govt