The state of web scraping in the AI era: still relevant or a relic?

Every few months someone declares web scraping dead. APIs cover more ground than they did, and the models can read a page and hand you structured data, so why keep writing parsers? I had the same thought right up until I needed a few hundred pages of reference data for a side project, found no API that exposed it, and went back to the same two Python libraries I’ve used for a decade. Scraping didn’t die. The gap between “the data is on a page” and “the data is in my pipeline” never closed, and that gap is the whole job.

Where it still earns its keep

An API is the first thing I look for, and most of the time there isn’t one — or there is, and it’s rate-limited, paywalled, or missing the exact fields I want. The jobs that still land on scraping:

A model needs training or evaluation data from a domain nobody ships an export for. You’re tracking prices or competitors and the numbers live on the page but not in any feed. You want to archive something before a redesign or a shutdown takes it offline. Or it’s a personal automation — a dashboard, a scratch experiment — where standing up a “proper” integration is more work than the thing is worth.

None of that is glamorous. All of it is the kind of work that quietly keeps showing up.

How I pick: Beautiful Soup or Scrapy

These get framed as rivals. They’re not — they’re sized for different jobs, and I reach for them at different moments.

Beautiful Soup is what I open when the job is one site and a handful of pages. I pair it with requests, hand it the HTML, and walk the tree. The reference-data scrape for my side project was forty lines, done in an evening. It’s forgiving with the malformed markup real sites serve, and there’s no ceremony, which is also its limit. The moment I need concurrency, retries, or polite rate-limiting, I’m writing that scaffolding by hand, and every new script starts from a blank file.
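For a sense of scale, here is a minimal sketch of that kind of script. It is not the actual scrape described above: the table layout and the `name`/`value` fields are assumptions made up for illustration, and `parse_rows` and `scrape` are hypothetical helper names.

```python
import requests
from bs4 import BeautifulSoup


def parse_rows(html: str) -> list[dict]:
    """Pull (name, value) pairs out of table rows in an HTML string.

    Assumes a hypothetical page whose data rows hold at least two <td> cells.
    """
    # "html.parser" is the parser bundled with Python; lxml is a common swap.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Header rows use <th>, so they produce no <td> cells and are skipped.
        if len(cells) >= 2:
            rows.append({"name": cells[0], "value": cells[1]})
    return rows


def scrape(url: str) -> list[dict]:
    """Fetch one page and parse it. No retries, no rate limiting."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return parse_rows(response.text)
```

Keeping the parsing in a function that takes a string, separate from the fetch, means the parser can be exercised against saved HTML without touching the network.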

That hand-rolled scaffolding is the tell. The first time I catch myself rebuilding a retry loop and a request queue around Beautiful Soup, I’ve outgrown it.
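A rough sketch of what that scaffolding tends to look like, using only the standard library. The names `fetch_with_retries` and `crawl` and the default delays are illustrative assumptions, and `fetch` stands in for whatever function actually downloads a page.

```python
import time
from typing import Callable, Iterable


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying with exponential backoff on failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:  # in real code, narrow this to network errors
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x, ... base_delay between attempts.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error


def crawl(
    fetch: Callable[[str], str],
    urls: Iterable[str],
    *,
    pause: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch URLs one at a time, pausing between requests to stay polite."""
    pages = {}
    for url in urls:
        pages[url] = fetch_with_retries(fetch, url, sleep=sleep)
        sleep(pause)
    return pages
```

This is sequential, has no persistent queue, and loses its place if the process dies, which is roughly the point: each of those gaps is another chunk of framework to write by hand.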

Scrapy is the framework for when the job has scale or a schedule. Request scheduling, concurrency, retries, and a pipeline for cleaning and persisting data as you go all come in the box. The cost is that it has opinions — a project layout, a way it wants spiders and pipelines arranged — and you pay the learning curve up front. For a one-off that tax isn’t worth it. For something that crawls hundreds of pages nightly and feeds a downstream table, it pays for itself the first time a run dies halfway and Scrapy just retries the failures instead of me babysitting it.
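To make "in the box" concrete, here is a hedged sketch of the two pieces a Scrapy project typically configures: a settings fragment and an item pipeline. The setting names are real Scrapy settings, but the values, the `myproject` module path, the `SqlitePipeline` class, and the `name`/`value` item fields are assumptions for illustration. The pipeline also assumes spiders yield plain dicts.

```python
import sqlite3

# --- settings.py (fragment; values are illustrative, not recommendations) ---
CONCURRENT_REQUESTS = 8      # how many requests Scrapy keeps in flight
DOWNLOAD_DELAY = 1.0         # seconds between requests to the same site
RETRY_ENABLED = True         # retry failed requests automatically
RETRY_TIMES = 3              # retries per request on top of the first try
ROBOTSTXT_OBEY = True        # respect robots.txt rules
ITEM_PIPELINES = {"myproject.pipelines.SqlitePipeline": 300}  # hypothetical path


# --- pipelines.py ---
class SqlitePipeline:
    """Clean each scraped item and persist it as it arrives."""

    def __init__(self, db_path: str = "items.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (name TEXT PRIMARY KEY, value TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        name = str(item["name"]).strip()
        value = str(item["value"]).strip()
        self.conn.execute(
            "INSERT OR REPLACE INTO items (name, value) VALUES (?, ?)",
            (name, value),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

Because each item is committed as it is processed, a run that dies halfway still leaves the rows it already collected in the table.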

So the rule I actually use: prototype and one-offs in Beautiful Soup, anything recurring or large in Scrapy, and treat “I’m reimplementing Scrapy’s features in a script” as the signal to migrate.

What the LLMs actually changed

The models are good at the half of the problem that was never the hard part. Hand a model clean scraped text and it’ll summarize, classify, and structure it beautifully. But it still needs the page, and the page still has to be fetched, rendered if it’s JavaScript-heavy, parsed, and cleaned before any of that synthesis happens. The model sits at the output end. Scraping is the input side, and AI made the output side cheaper without touching the input side at all.
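The "cleaned" step is the one most easily skipped. A small sketch of it, assuming Beautiful Soup is already in the toolchain; the list of tags treated as noise and the `clean_text` name are assumptions, and real pages usually need site-specific tuning.

```python
import re

from bs4 import BeautifulSoup

# Tags whose contents are rarely useful as model input (an assumed list).
NOISE_TAGS = ["script", "style", "noscript", "nav", "header", "footer"]


def clean_text(html: str) -> str:
    """Reduce an HTML page to plain text suitable for handing to a model."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove noisy elements together with everything inside them.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    # Join the remaining text nodes with spaces, then collapse whitespace.
    text = soup.get_text(separator=" ", strip=True)
    return re.sub(r"\s+", " ", text).strip()
```

Whatever summarizing or classifying happens next works on the returned string, so every script tag or navigation menu left in here is noise the model has to read past.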

If anything the demand went up. More data-hungry systems means more pipelines that start with “get the page.”

The two shifts worth tracking are on the fetch side, not the parse side. First, more sites render content with JavaScript, which pushes you toward browser-automation tools like Playwright instead of a plain requests call. Second, the etiquette matters more than ever: read robots.txt, rate-limit yourself, identify your bot, and don’t hammer someone’s server because a model made the downstream work easy. The skill is the same as it was. The pages just got a little harder to reach, and there are more reasons to reach them.
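The etiquette half can be sketched with the standard library's `urllib.robotparser`. The user-agent string, the bot-info URL, and the helper names below are hypothetical; a real crawler would load the live file with `set_url()` and `read()` rather than parsing a string.

```python
from urllib import robotparser

# Hypothetical identity: a name plus a URL where the site owner can learn more.
USER_AGENT = "example-research-bot/0.1 (+https://example.com/bot-info)"


def load_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: robotparser.RobotFileParser, url: str) -> bool:
    """True if the rules permit this bot to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_for(parser: robotparser.RobotFileParser, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default
```

The same `USER_AGENT` string would also go in the `User-Agent` header of each request, so the name checked against robots.txt matches the name the server sees.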

Author

Chandler Thompson

I lead engineering teams and coach the people who run them. This is where I write down what actually worked.
