US appeals court decides scraping public web data is fine
In 2017, employment analytics firm hiQ filed a lawsuit challenging LinkedIn’s efforts to block it from scraping data from users’ profiles.
The district court barred LinkedIn from stopping hiQ’s scraping after deciding that the Computer Fraud and Abuse Act (CFAA), which criminalises accessing a protected computer without authorisation, does not apply because the information is public.
LinkedIn appealed the case and in 2019 the Ninth Circuit Court sided with hiQ and upheld the original decision.
In March 2020, LinkedIn once again appealed the decision, arguing that implementing technical barriers and sending a cease-and-desist letter revokes authorisation, so any subsequent attempts to scrape data are unauthorised and break the CFAA.
“At issue was whether, once hiQ received LinkedIn’s cease-and-desist letter, any further scraping and use of LinkedIn’s data was ‘without authorization’ within the meaning of the CFAA,” reads the filing (PDF).
“The panel concluded that hiQ raised a serious question as to whether the CFAA ‘without authorization’ concept is inapplicable where, as here, prior authorization is not generally required but a particular person—or bot—is refused access.”
The filing highlights several of LinkedIn’s technical measures to protect against data-scraping:
- Prohibiting search engine crawlers and bots – aside from certain allowed entities, like Google – from accessing LinkedIn’s servers via the website’s standard ‘robots.txt’ file.
- ‘Quicksand’ system that detects non-human activity indicative of scraping.
- ‘Sentinel’ system that slows (or blocks) activity from suspicious IP addresses.
- ‘Org Block’ system that generates a list of known malicious IP addresses linked to large-scale scraping.
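To illustrate the first measure in that list, here is a minimal sketch of how a well-behaved crawler might consult a ‘robots.txt’ file before fetching a page. The file contents, the example.com URL, the `ExampleScraperBot` name, and the `can_fetch` helper are all hypothetical and are not taken from LinkedIn’s actual configuration; only Python’s standard `urllib.robotparser` module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (NOT LinkedIn's real file):
# one named crawler is allowed everywhere, every other agent is refused.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/in/some-profile"  # hypothetical profile URL
    for agent in ("Googlebot", "ExampleScraperBot"):
        verdict = "allowed" if can_fetch(agent, url) else "disallowed"
        print(f"{agent}: {verdict}")
```

Under these assumed rules the named crawler is allowed and any other agent is disallowed. A ‘robots.txt’ file is only a request that crawlers are expected to honour voluntarily, which may be why the filing also describes detection and blocking systems alongside it.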
Overall, LinkedIn claims to block approximately 95 million automated attempts to scrape data every day.
The appeals court once again ruled in favour of hiQ, upholding the conclusion that “the balance of hardships tips sharply in hiQ’s favor” and that the company’s existence would be threatened without access to LinkedIn’s public data.
“hiQ’s entire business depends on being able to access public LinkedIn member profiles,” hiQ’s CEO argued. “There is no current viable alternative to LinkedIn’s member database to obtain data for hiQ’s Keeper and Skill Mapper services.”
However, LinkedIn’s petition (PDF) counters that the ruling has wider implications.
“Under the Ninth Circuit’s rule, every company with a public portion of its website that is integral to the operation of its business – from online retailers like Ticketmaster and Amazon to social networking platforms like Twitter – will be exposed to invasive bots deployed by free-riders unless they place those websites entirely behind password barricades,” wrote the company’s attorneys.
“But if that happens, those websites will no longer be indexable by search engines, which will make information less available to discovery by the primary means by which people obtain information on the Internet.”
AI companies that often rely on mass data-scraping will undoubtedly be pleased with the court’s decision.
Clearview AI, for example, has regularly been targeted by authorities and privacy campaigners for scraping billions of images from public websites to power its facial recognition system.
“Common law has never recognised a right to privacy for your face,” Clearview AI lawyer Tor Ekeland once argued.
Clearview AI recently made headlines for offering its services to Ukraine to help the country identify both Ukrainian defenders and Russian assailants who’ve lost their lives in the brutal conflict.
Mass data scraping will remain a controversial subject. Supporters will back the appeals court’s ruling, while opponents will share LinkedIn’s attorneys’ concerns about normalising the practice.