CTDATA: Amazon.com Says You Can't Spider Them, But Companies Do It Anyway

Dave Aiello wrote, "Over the past few days, I've been looking at Amazon.com Web Services, with the idea of using them to quickly answer some questions about books that come up at my office on a daily basis. The Amazon.com web service interface provides a lot of useful information about the books that Amazon sells, but not everything that I need to find out. So, I began to wonder if I could write a program to get that information from Amazon as well."

"This type of program is a special-purpose web client. It connects to a web site in much the same way that Microsoft Internet Explorer, Netscape, or Mozilla does, but it retrieves the information programmatically rather than interactively. Search engines use web clients that digest entire web pages and follow HTML links; these clients are called spiders."

"Amazon's Conditions of Use say that you are not supposed to run spiders against its website. But, I believe I've found a number of situations where spiders are being permitted, either because they help promote Amazon, or they are of great value to a company affiliated with Amazon. Read on for more details...."

Dave Aiello continued:

Amazon.com's Conditions of Use say:

This license does not include any resale or commercial use of this site or its contents; any collection and use of any product listings, descriptions, or prices; any derivative use of this site or its contents; any downloading or copying of account information for the benefit of another merchant; or any use of data mining, robots, or similar data gathering and extraction tools.

But an article written by Tim O'Reilly in April described a data-mining tool that his company developed and uses on Amazon's website:

There are now dozens of Amazon rank spiders that will help authors keep track of their book's Amazon rank. We have a very powerful one at O'Reilly that provides many insights valuable to our business that are not available in the standard Amazon interface. It allows us to summarize and study things like pricing by publisher and topic, rank trends by publisher and topic over a two-year period, correlation between pricing and popularity, relative market share of publishers in each technology area, and so on. We combine this data with other data gleaned from Google link counts on technology sites, traffic trends on newsgroups, and other Internet data, to provide insights into tech trends that far outstrip what's available from traditional market research firms.

CTDATA.com pointed to O'Reilly's article when it was published. In our article, we said:

The web spidering trend describes the construction of customized web clients (i.e. robots) to traverse web sites and gather data which is assembled and displayed differently from the original presentation. Good examples are search engines like Google, software survey sites like Netcraft, and price comparison sites like ISBN.nu. O'Reilly suggests that many of these spiders could be eliminated if major database-driven web sites built SOAP or XML-RPC interfaces and published APIs to them. But, we would argue that this is unlikely because there is no revenue model for many such interfaces, and unless one emerges, it's hard to imagine large sites willingly giving up the ability to display ads directly to the site visitor.

So you can see why I draw the conclusion that I do. It appears that Amazon permits spiders that are run by its friends. It may even permit spiders written by people it has no contractual relationship with, provided the traffic they generate stays below the level at which its webmasters would start paying attention.
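
For readers who have not written one, here is a rough sketch of the step that separates a spider from a single-page web client: pulling the links out of a page so they can be visited next. It is only an illustration in Python, not the tool O'Reilly or anyone else mentioned here actually runs. The page, the example.com URLs, and the "example-spider" name are all made up, and nothing is fetched over the network. The robots.txt filter is my own addition for illustration; robots.txt is a separate convention from a site's Conditions of Use, and passing one says nothing about the other.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the links found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def allowed_links(links, robots_txt, user_agent):
    """Keep only the links that the given robots.txt text permits."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in links if rules.can_fetch(user_agent, url)]


# Hypothetical page and robots.txt, inlined so that nothing is fetched.
SAMPLE_PAGE = """
<html><body>
  <a href="/books/1">First book</a>
  <a href="/account/settings">Account</a>
  <a href="about.html">About</a>
</body></html>
"""

SAMPLE_ROBOTS = """
User-agent: *
Disallow: /account/
"""

if __name__ == "__main__":
    found = extract_links(SAMPLE_PAGE, "https://example.com/catalog/")
    for url in allowed_links(found, SAMPLE_ROBOTS, "example-spider"):
        print(url)
```

A real spider would repeat this for each allowed link, keep track of pages it has already seen, and pause between requests so that its traffic stays modest.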
