CAPTCHAs these days are often deployed as a way to stop web scraping, but do they really work, and are they worth the cost to website creators?
A Brief History of CAPTCHA
CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. CAPTCHAs started off in the early 2000s as puzzles to stop bots from interacting with web pages. Back then, the big problem was spammers sending emails through online forms. By 2014, spam was estimated to make up 90% of all email traffic.
CAPTCHAs were a great solution because humans can do things that bots can't. At the beginning, bots weren't able to recognize text that had been slightly distorted, but as bots became smarter, CAPTCHAs evolved to use distorted images, image recognition, and other techniques. They even started to be used to solve real-world challenges.
Originally developed at Carnegie Mellon University, reCAPTCHA was designed as a way to use CAPTCHAs to digitize physical books. Two words would be presented to the user: one was a known control word, while the other was an unclear word from a scanned book that the crowd's answers helped decipher.
Google bought reCAPTCHA in 2009, and since then it has been used to help digitize Google Books. As Google built out its mapping database, images from Street View began to be introduced, and these are the image CAPTCHAs most of us encounter these days.
Google now calls reCAPTCHA a “frictionless fraud detection service”, but you'll probably disagree if you've ever struggled to work out what that system considers a bike or a crosswalk.
To be fair to Google, these days “No Captcha reCaptcha” means that a lot of the time you just need to check a box to confirm “I am not a robot” and Google's previous tracking of your internet usage will tell it whether this is the case.
But most of us still come across plenty of CAPTCHAs to solve as we use the web. So are CAPTCHAs here to stay, and do they work against modern bots?
Why CAPTCHAs Are Bad
From a user's point of view, a CAPTCHA is the exact opposite of good user experience (UX) design. Good UX is supposed to make things easier for the user and make sure that they don't get blocked from interacting with a website. When that website decides to check whether you're a bot, you get stopped dead in your tracks. And sometimes you'll need to make several attempts to solve more than one CAPTCHA. Unless that website is one you really want to interact with, there's a chance that you'll just rage quit and leave.
If you have disabilities, CAPTCHAs are an even more serious problem. W3C argues that CAPTCHAs are effectively a “denial of service to these users”. Audio solutions can help, but they are often very hard to make out, and of course they are also an additional block to people who already have to struggle with enough challenges in their day-to-day lives.
For the majority of people on Earth who don't speak English or use the Western Latin alphabet, ending up on a website where they have to complete a text-based CAPTCHA means a dead end: they simply can't get the information they need.
CAPTCHAs also slow down page loads. Combine that with the UX frustrations and you can be certain that CAPTCHAs are affecting your conversion rate. Getting visitors isn't easy, or cheap, these days, so making it slower and harder for them to use your site is just bad business. Why pay for ads or work hard to get links when the CAPTCHA is going to make a percentage of your visitors just give up?
Why CAPTCHAs Don't Work Anyway
Since the beginning of the CAPTCHA era, spammers, scammers, hackers, and even just programmers who love a challenge have been finding ways to bypass them, such as using a CAPTCHA-solving service.
At first, this was often just simple ways to exploit the fact that the earliest CAPTCHAs used a limited set of words. Later on, machine learning began to be used to defeat more complex examples. Deep learning is now extremely advanced and can easily deal with the old-school CAPTCHAs.
Headless browsers and the tools that control them, such as Puppeteer, Playwright, and Selenium, can even make it possible for coders to program their automated bots to behave like human users, by pausing at suitable times, not visiting too many pages too fast, and so on. Google is smart, but it can be tricked by experts.
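To illustrate the pacing trick, here is a minimal sketch in Python. The helper names and timing defaults are my own, not from any particular library; the `visit` callable is where a real bot would wrap a Playwright or Selenium page load.

```python
import random
import time

def human_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Return a randomized pause length (in seconds) that roughly
    mimics the irregular pacing of a human reader."""
    return base + random.uniform(0, jitter)

def paced_visit(urls, visit, min_gap: float = 2.0, jitter: float = 1.5):
    """Visit each URL via the supplied `visit` callable, sleeping a
    human-like interval between requests so the traffic pattern
    doesn't look machine-generated. In a real bot, `visit` would
    wrap something like a Playwright page.goto() call."""
    results = []
    for url in urls:
        results.append(visit(url))
        time.sleep(human_delay(base=min_gap, jitter=jitter))
    return results
```

The randomized jitter matters more than the absolute delay: perfectly regular intervals are one of the easiest bot signatures to detect.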
But the easiest method is good old human labor. There are plenty of countries in the world where people live on so little money per day that sitting in front of a computer manually solving CAPTCHAs for a few cents each can seem like a dream job. This has been a problem, or opportunity, depending on your point of view, since the early days of CAPTCHA and it still exists as a way to bypass them.
And that disparity between pay across the globe means that companies like Anti Captcha can make it easy to get around even image-based CAPTCHAs, with a vast pool of workers making it fast and cheap to use. This solution is easy to integrate, with other companies using it in ready-made tools such as Apify's Anti Captcha Recaptcha. Combining a tool like that with headless browsers and modern web scraping techniques makes it possible to extract vast amounts of information from sites that consider themselves protected.
While a lot of websites use CAPTCHAs these days to prevent web scraping, this is different from the old days of preventing spam. Web scraping is just automatically reading content from websites with bots and extracting data. It might sound like some kind of hacking, but it is used by some of the biggest companies in the world to get data fast and effectively.
In fact, when Google bots travel around the internet indexing web pages, they're doing much the same thing. Web scraping and automation platforms like Apify are designed to make it possible to automate anything that can be done manually in a web browser and extract data at any scale. That data can then be used in cool new software, applications, or repackaged into tools that add value to internet users everywhere.
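At its core, that "automatically reading content" is just parsing HTML and pulling out the pieces you want, which is much the same whether it's an indexing crawler or a scraper. A minimal illustration using only Python's standard library, run against an inline snippet rather than a live page (a real scraper would fetch the HTML over HTTP first):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href on a page -- the core of what an indexing
    bot or a scraper does as it reads a site."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Inline sample page standing in for a fetched response body.
page = ('<ul><li><a href="/pricing">Pricing</a></li>'
        '<li><a href="/docs">Docs</a></li></ul>')
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/pricing', '/docs']
```

Swap the inline string for a fetched page and follow the collected links, and you have the skeleton of a crawler.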
Web scraping is the inevitable result of increased automation. While the web was designed for humans, the amount of valuable public data out there can only be efficiently processed by machines. CAPTCHAs are no longer doing what they were designed to do and in fact they have been roped into defending against an imaginary new enemy that is the natural evolution of the drive to automate and optimize.
Ultimately, CAPTCHAs don't work against web scraping, and maybe that's actually okay. Until websites accept this, though, there will continue to be effective ways to bypass and defeat CAPTCHAs.