The state of web scraping in the AI era: still relevant or a relic?

Web scraping isn’t going away. APIs cover more ground than they used to, and LLMs can do a lot with a paragraph of prose, but neither closes the gap between “data exists on a page” and “data is in your pipeline.” That gap is where scraping lives.

This post covers when scraping still earns its keep, and how to choose between the two Python libraries you’ll keep reaching for: Beautiful Soup and Scrapy.

Why scraping still matters

APIs are the first option when they exist. But not every site has one, and the ones that do are often rate-limited, paywalled, or missing the fields you actually want. Scraping fills the gap for:

  1. Data collection for model fine-tuning. Domain-specific datasets that no API ships.
  2. Market research. Pricing, competitor monitoring, and trend tracking: all jobs where the data is on the page but not in an export.
  3. Archival. Capturing content before it disappears behind a redesign, paywall, or shutdown.
  4. Personal projects. Automation, dashboards, side experiments.

The skill is unglamorous and consistently useful.

Beautiful Soup vs. Scrapy

Both libraries are good. They’re tools for different jobs, not competitors.

Beautiful Soup

A library for parsing HTML and XML. You hand it a document, it gives you a tree you can navigate, search, and modify.

Pros:

  • Easy to set up and read.
  • Lightweight: it only parses, so you pair it with an HTTP client such as requests when you need to fetch pages, and it stays out of the way otherwise.
  • Forgiving with messy markup.

Cons:

  • Not built for scale. Concurrency, retries, and rate limiting are on you.
  • No project structure. Every job starts from a blank script.

Use it when: the job covers one site or a handful of pages, or you’re prototyping.
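
Here is a minimal sketch of the hand-it-a-document, search-the-tree workflow, assuming the beautifulsoup4 package is installed. The HTML snippet, the class names, and the `extract_products` helper are made up for illustration; in a real job the document would come from an HTTP client such as requests.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<ul id="products">
  <li class="product"><span class="name">Widget</span> <span class="price">$4.50</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$12.00</span></li>
</ul>
"""


def extract_products(html: str) -> list[dict[str, str]]:
    """Return one dict per product row found in the document."""
    soup = BeautifulSoup(html, "html.parser")  # parser bundled with Python
    products = []
    for row in soup.select("li.product"):
        name = row.select_one(".name")
        price = row.select_one(".price")
        if name is None or price is None:
            continue  # skip rows that don't match the expected shape
        products.append({
            "name": name.get_text(strip=True),
            "price": price.get_text(strip=True),
        })
    return products


if __name__ == "__main__":
    for product in extract_products(SAMPLE_HTML):
        print(product)
```

Everything beyond parsing (fetching, retrying, pacing) would sit in your own script, which is the trade-off the cons list describes.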

Scrapy

A full framework. Comes with request scheduling, concurrency, middleware, pipelines, and a project layout.

Pros:

  • Built for scale. Concurrency and retries out of the box.
  • Pipelines for cleaning and persisting data as you scrape.
  • Extensible through middleware and signals.

Cons:

  • Steeper learning curve. The framework has opinions and you’ll spend time learning them.
  • Overkill for a quick one-off.

Use it when: the job spans hundreds of pages, runs on a schedule, or feeds a downstream pipeline.

Picking between them

The short version:

  • Beautiful Soup for small, one-off, or exploratory work.
  • Scrapy for scale, schedule, and pipeline integration.

If you find yourself rebuilding concurrency and retry logic around Beautiful Soup, that’s the signal to switch.
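
To make that signal concrete, here is a sketch of the plumbing that tends to pile up around a Beautiful Soup script: retries with backoff plus a thread pool. It uses only the standard library. The names `fetch_with_retry` and `fetch_all` and their parameters are hypothetical, and the fetch function is passed in so the sketch doesn't assume any particular HTTP client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def fetch_all(
    fetch: Callable[[str], str],
    urls: list[str],
    max_workers: int = 4,
    **retry_kwargs,
) -> list[str]:
    """Fetch many URLs concurrently, returning bodies in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda u: fetch_with_retry(fetch, u, **retry_kwargs), urls)
        )
```

Once a script grows helpers like these, plus per-site delays and error logging, you are maintaining a small framework of your own.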

The future

LLMs are good at synthesizing scraped data, not replacing the scrape. The model still needs the page, and the page still needs to be fetched, parsed, and cleaned. Scraping is the input side of any data-hungry system, AI included.

Two things are extending what the skill covers rather than retiring it: ethical-scraping norms (respect robots.txt, rate-limit yourself, identify your bot) and browser-automation tools like Playwright, which reach pages that render their content with JavaScript.
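
The three ethical-scraping habits (respecting robots.txt, rate-limiting, identifying the bot) can be sketched with the standard library alone. The bot name, the contact URL, the inline robots.txt rules, and the `RateLimiter` and `polite_request` helpers are all hypothetical; a real crawler would download the site's actual robots.txt instead of parsing a string.

```python
import time
import urllib.request
import urllib.robotparser
from typing import Optional

# Hypothetical identity: a bot name plus a URL where site owners can learn more.
USER_AGENT = "example-hobby-bot/0.1 (+https://example.com/bot-info)"

# Inline rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def polite_request(
    parser: urllib.robotparser.RobotFileParser,
    limiter: RateLimiter,
    url: str,
) -> Optional[urllib.request.Request]:
    """Return a prepared (unsent) request, or None if robots.txt disallows the URL."""
    if not parser.can_fetch(USER_AGENT, url):
        return None
    limiter.wait()  # pace ourselves before every allowed request
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

A caller would build the limiter from the site's declared delay when there is one, for example `RateLimiter(parser.crawl_delay(USER_AGENT) or 1.0)`, and pass the returned request to whatever client does the fetching.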

What changed in your scraping workflow when LLMs landed? Drop a note in the comments.

Author: Chandler Thompson
Perpetual Hobbyist.
