Big Brother

'Scrapers' Dig Deep for Data on Web

Julia Angwin and Steve Stecklow, WSJ

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping, a trade that involves personal information as well as many other types of data, is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web (which may include personal information contained in news articles and blog postings) that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data, some of it scraped, to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online; they just do it on a much larger scale.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions. "If we don't think they're going to use it for illegal purposes (they often don't tell us what they're going to use it for), generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page, as well as their friends, so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.
