Sitemap Crawler

A background-powered sitemap crawler that extracts meta titles, descriptions, and JSON-LD schema from every URL in a sitemap — built for long-running crawls without blocking the UI.


What is Sitemap Crawler?

Sitemap Crawler is a web tool that takes a sitemap URL, crawls every page listed in it, and extracts three things: the meta title, the meta description, and any JSON-LD structured data. It runs the crawl in the background using a job queue, streams progress back to the UI in real time, and lets you export the results as a CSV when it's done.

You paste a sitemap URL, hit start, and watch a progress bar fill as the worker chews through hundreds of pages. The results populate a table as the job completes — URL, title, description, and whether schema markup is present. One click exports it all.

The architecture is a Next.js frontend with two API routes for job management, a BullMQ queue backed by Redis for job state, and a separate Node.js worker process that does the actual crawling. This separation means the UI never freezes, the server never times out, and crawls of 300+ pages run reliably.

The Problem

If you manage SEO for a site with hundreds of pages, you need to audit meta tags and structured data regularly. The manual approach — opening each page, inspecting the source, copying the values into a spreadsheet — doesn't scale past a dozen pages.

There are desktop tools and paid SaaS platforms that do this, but they're either heavy (Screaming Frog requires a local install and a license for large crawls), expensive (Ahrefs and SEMrush charge monthly subscriptions), or limited in what they extract. If you just want meta titles, descriptions, and schema from your own sitemap — nothing more — you're paying for a lot of features you don't need.

The deeper technical problem is that sitemap crawling is inherently long-running. A sitemap with 300 URLs, crawled at a polite 1-second interval, takes 5 minutes. You can't do that in a single API request — serverless functions time out, browsers disconnect, and users lose patience staring at a spinner with no feedback.

Sitemap Crawler solves this with a queue-based architecture. The API route accepts the sitemap URL, queues a job, and returns immediately with a job ID. The worker picks up the job in a separate process, crawls pages one by one, and updates progress after each page. The frontend polls the job status every 2 seconds, updating the progress bar and rendering results when the job completes. The user sees exactly how far along the crawl is at every moment.

How It Works

Queueing the Crawl

When you submit a sitemap URL, the frontend POSTs to /api/crawl. This route adds a job to the BullMQ queue with the sitemap URL as payload and returns the job ID immediately. No crawling happens in this request — it just schedules the work.

const job = await crawlQueue.add("crawl", { sitemapUrl });
return NextResponse.json({ status: "queued", jobId: job.id });

The queue is backed by Redis (Upstash or local), which stores the job state, progress, and results. Jobs are configured to auto-clean — completed jobs are removed after an hour, failed jobs after 24 hours.

The Worker Process

A separate Node.js process runs the worker using tsx. It listens on the crawl-queue and processes jobs as they arrive. For each job, it first parses the sitemap XML using xml2js to extract all URLs. Then it crawls each URL sequentially with a 1-second delay between requests to avoid hammering the target server.

For each page, the worker fetches the HTML with Axios and parses it with Cheerio. It extracts the <title> tag content, the <meta name="description"> content attribute, and any <script type="application/ld+json"> blocks. If a page has multiple JSON-LD blocks, they're concatenated with a pipe separator.

After each page, the worker calls job.updateProgress() with the percentage complete. This writes the progress value to Redis, where the frontend can read it on its next poll.

Live Progress Tracking

The frontend polls /api/status?jobId=... every 2 seconds using a setInterval inside a useEffect. Each poll reads the job state and progress from Redis. The progress value drives a Radix UI progress bar, and the state determines when to stop polling — either on completed (results are rendered) or failed (an error is logged).
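The polling loop can be sketched framework-agnostically, with the fetch injected so the browser code can pass the real `/api/status` call. The names and the status shape are assumptions, not the project's actual hook:

```typescript
export interface JobStatus {
  state: "waiting" | "active" | "completed" | "failed";
  progress: number;
  results?: unknown[];
}

// Polls until the job reaches a terminal state, reporting progress on
// each tick. In the app this would run inside a useEffect, with the
// interval cleared on unmount.
export async function pollUntilDone(
  fetchStatus: () => Promise<JobStatus>,
  onProgress: (pct: number) => void,
  intervalMs = 2000
): Promise<JobStatus> {
  for (;;) {
    const status = await fetchStatus();
    onProgress(status.progress);
    if (status.state === "completed" || status.state === "failed") return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```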

This polling approach is intentionally simple. WebSockets or server-sent events would give instant updates, but polling every 2 seconds is good enough for a crawl that takes minutes, and it avoids the complexity of managing socket connections, reconnection logic, and long-lived streams.

CSV Export

Once results are in, the export button generates a CSV client-side using PapaParse. It maps each result to a row with URL, meta title, meta description, and schema columns, creates a Blob, generates an object URL, and triggers a download via a programmatic link click. No server round-trip needed.

Tech Stack

The frontend is Next.js 16 with the app router, using shadcn/ui components for the input, button, table, and progress bar. The job queue uses BullMQ, a Redis-backed queue library designed for exactly this kind of background processing. Redis runs on Upstash for production or locally for development. The worker runs as a standalone Node.js process using tsx for TypeScript execution. HTML parsing uses Cheerio for DOM traversal and xml2js for sitemap XML parsing. HTTP requests go through Axios with a 15-second timeout and a custom user agent. CSV generation uses PapaParse on the client side.

Architecture

The system has three distinct processes that communicate through Redis. The Next.js server handles the frontend and two API routes — one to queue jobs, one to check status. The BullMQ worker runs separately and does all the crawling. Redis sits in the middle, storing job state, progress percentages, and final results.

This separation is the key design decision. The Next.js server never crawls anything — it just reads and writes job metadata. The worker never serves HTTP — it just processes the queue. Redis is the shared state layer that connects them. This means you can deploy the frontend to Vercel and the worker to Railway (or any long-running process host) without either one blocking the other.

Why I Built This

I needed a quick way to audit meta tags across an entire site without installing desktop software or signing up for a paid tool. The data I wanted was simple — title, description, schema — but getting it at scale meant solving the background processing problem properly.

Most crawling scripts I've seen are either synchronous (they block until done) or fire-and-forget (they run but you have no idea where they are). I wanted something in between — a crawl that runs in the background but reports progress in real time, with a clean UI for reviewing and exporting results. The queue + worker pattern handles this well, and it's a pattern that scales to much larger crawls if needed.
