xhs_video_crawler/docs/superpowers/specs/2026-05-27-xhs-long-queue-downloader-design.md
2026-05-27 16:30:06 +08:00

XHS Long Queue Downloader Design

Goal

Add a resumable long-task downloader for collecting large numbers of Xiaohongshu videos, such as 1000 videos, without relying on a single recommendation page pass.

Scope

The feature stays within the existing manually logged-in browser model. It does not automate login, bypass verification, spoof device fingerprints, or call private APIs directly; it only reads what the loaded web pages expose. Its improvements are limited to task durability, source density, and progress tracking.

Architecture

The downloader becomes two-phase while preserving the current one-command UX:

  1. Queue discovery collects note detail URLs from configured sources and writes them to a JSONL queue.
  2. Queue processing opens pending note URLs, extracts video URLs from page state or feed responses, downloads valid videos, and updates each queue item status.

The queue file stores one JSON object per note:

{"note_id":"...","url":"...","source":"video-channel","status":"pending","attempts":0,"downloaded_path":"","last_error":"","updated_at":"..."}

Statuses are pending, downloaded, skipped_image, and failed.
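
As an illustration of the record format above, here is a minimal sketch of a queue item and its JSONL round trip. The field names and status values come from the example record; the names QueueItem, parse_queue, and dump_queue are hypothetical, and keying the queue by note_id is an assumption made to simplify deduplication.

```python
import json
from dataclasses import asdict, dataclass

# Status values listed in this spec.
STATUSES = ("pending", "downloaded", "skipped_image", "failed")


@dataclass
class QueueItem:
    """One queue record; field names mirror the JSON example in this spec."""

    note_id: str
    url: str
    source: str
    status: str = "pending"
    attempts: int = 0
    downloaded_path: str = ""
    last_error: str = ""
    updated_at: str = ""

    def __post_init__(self):
        # Reject statuses the spec does not define.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def parse_queue(text: str) -> dict[str, QueueItem]:
    """Parse JSONL text into a dict keyed by note_id (assumption: later lines win)."""
    items: dict[str, QueueItem] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = QueueItem(**json.loads(line))
        items[item.note_id] = item
    return items


def dump_queue(items: dict[str, QueueItem]) -> str:
    """Serialize items back to JSONL, one JSON object per line."""
    return "".join(
        json.dumps(asdict(item), ensure_ascii=False) + "\n" for item in items.values()
    )
```

Keeping parsing and serialization as pure string functions makes the load/save behavior easy to unit test without touching the filesystem.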

Sources

The first implementation supports:

  • explore: current recommendation page.
  • video-channel: https://www.xiaohongshu.com/explore?channel_id=video as a best-effort source. If Xiaohongshu redirects or changes channel routing, the collector still reads visible /explore/ cards.
  • current-page: process the current browser page.

Future search keyword sources can be added after the queue engine is stable.
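
Source selection could be sketched roughly as follows. The video-channel URL is the one given above; the plain explore URL, the helper names, and the rule that a note ID is the path segment following /explore/ are assumptions for illustration, not confirmed behavior of the site or of the existing code.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.xiaohongshu.com"

# None means "do not navigate; use whatever page the browser already shows".
SOURCE_URLS = {
    "explore": BASE_URL + "/explore",  # assumed URL of the recommendation page
    "video-channel": BASE_URL + "/explore?channel_id=video",  # from this spec
    "current-page": None,
}


def resolve_source_url(source: str) -> Optional[str]:
    """Return the URL to open for a source name, or None for current-page."""
    if source not in SOURCE_URLS:
        raise ValueError(f"unsupported source: {source!r}")
    return SOURCE_URLS[source]


def note_id_from_href(href: str) -> Optional[str]:
    """Extract a note ID from a card link of the form /explore/<id> (assumed shape)."""
    path = urlsplit(urljoin(BASE_URL, href)).path
    parts = [part for part in path.split("/") if part]
    if len(parts) == 2 and parts[0] == "explore":
        return parts[1]
    return None  # not a note detail link, e.g. /explore?channel_id=video
```

Because the collector only looks at /explore/ card links, the same extraction works whether or not the video channel redirects back to the plain explore page.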

Runtime Behavior

A command such as:

./.venv/bin/python XHS.py --source video-channel --target-videos 1000 --queue-file data/xhs_queue.jsonl --max-runtime 7200

will:

  1. Load existing queue records.
  2. Count already downloaded items.
  3. Open the selected source page and collect visible note URLs.
  4. Append new pending records, preserving existing statuses.
  5. Process pending records until target_videos is reached, max_runtime elapses, or the queue is exhausted.
  6. If the queue is exhausted before the target is reached, return to the source page, scroll, collect more URLs, and continue.
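
The steps above can be sketched as a single loop. This is a simplified illustration, not the actual implementation: collect and process are hypothetical callables standing in for the browser-driven collector and downloader, records are plain dicts, and stopping when the source yields no new URLs is an assumption the spec does not state.

```python
import time
from typing import Callable, Iterable


def merge_new_urls(queue: dict, urls: Iterable[tuple[str, str]], source: str) -> int:
    """Append pending records for unseen note IDs; existing records stay untouched."""
    added = 0
    for note_id, url in urls:
        if note_id in queue:
            continue  # preserve the existing status (step 4)
        queue[note_id] = {
            "note_id": note_id, "url": url, "source": source, "status": "pending",
            "attempts": 0, "downloaded_path": "", "last_error": "", "updated_at": "",
        }
        added += 1
    return added


def run(queue: dict, collect: Callable[[], Iterable[tuple[str, str]]],
        process: Callable[[dict], str], source: str,
        target_videos: int, max_runtime: float,
        clock: Callable[[], float] = time.monotonic) -> int:
    """Process the queue until the target, the runtime limit, or exhaustion."""
    deadline = clock() + max_runtime
    # Step 2: items downloaded in earlier runs count toward the target.
    downloaded = sum(1 for r in queue.values() if r["status"] == "downloaded")
    # Steps 3-4: initial collection from the source page.
    merge_new_urls(queue, collect(), source)
    while downloaded < target_videos and clock() < deadline:
        pending = [r for r in queue.values() if r["status"] == "pending"]
        if not pending:
            # Step 6: go back to the source for more URLs.
            if merge_new_urls(queue, collect(), source) == 0:
                break  # assumption: give up when nothing new appears
            continue
        for record in pending:  # step 5
            if downloaded >= target_videos or clock() >= deadline:
                break
            record["status"] = process(record)  # returns the new status
            if record["status"] == "downloaded":
                downloaded += 1
    return downloaded
```

Injecting clock, collect, and process keeps the loop testable without a browser; a test can pass fake callables and check the final statuses and count.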

Error Handling

  • Non-video notes become skipped_image.
  • Download failures increment attempts; a record becomes failed once the retry limit is reached.
  • The queue is rewritten atomically after status changes.
  • Progress logs include downloaded count, skipped count, failed count, and pending count.
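
A minimal sketch of the failure bookkeeping, atomic rewrite, and progress summary described above. The retry limit of 3, the function names, and the log format are assumptions; the atomic rewrite uses the common pattern of writing a temporary file in the same directory and swapping it in with os.replace.

```python
import json
import os
import tempfile
from collections import Counter

MAX_ATTEMPTS = 3  # hypothetical retry limit; the spec does not fix a value


def record_failure(record: dict, error: str, max_attempts: int = MAX_ATTEMPTS) -> None:
    """Increment attempts and mark the record failed once the limit is reached."""
    record["attempts"] += 1
    record["last_error"] = error
    if record["attempts"] >= max_attempts:
        record["status"] = "failed"


def save_queue_atomic(path: str, records: list) -> None:
    """Rewrite the JSONL queue so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for record in records:
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push data to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave stray temp files behind
        raise


def progress_line(records: list) -> str:
    """Summarize queue state for progress logs (format is an assumption)."""
    counts = Counter(record["status"] for record in records)
    return "downloaded={} skipped={} failed={} pending={}".format(
        counts["downloaded"], counts["skipped_image"],
        counts["failed"], counts["pending"],
    )
```

If the process is interrupted mid-write, the previous queue file is still intact, which is what makes a long-running task resumable.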

Testing

Unit tests cover JSONL queue load/save, deduplication, status updates, source URL selection, target counting, and CLI argument plumbing. Existing download and parsing tests remain in place.
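
As one example of the CLI plumbing tests, the sketch below builds a parser for the four flags shown in the sample command and checks that they parse as expected. The flag names come from that command; build_parser, the default values, and the integer types are assumptions, and the unit of --max-runtime is not specified here.

```python
import argparse
import unittest

# Source names listed under "Sources".
SOURCES = ("explore", "video-channel", "current-page")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the flags from the sample command."""
    parser = argparse.ArgumentParser(prog="XHS.py")
    parser.add_argument("--source", choices=SOURCES, default="explore")
    parser.add_argument("--target-videos", type=int, default=0)
    parser.add_argument("--queue-file", default="data/xhs_queue.jsonl")
    parser.add_argument("--max-runtime", type=int, default=0)
    return parser


class CliPlumbingTest(unittest.TestCase):
    def test_sample_command_parses(self):
        args = build_parser().parse_args([
            "--source", "video-channel",
            "--target-videos", "1000",
            "--queue-file", "data/xhs_queue.jsonl",
            "--max-runtime", "7200",
        ])
        self.assertEqual(args.source, "video-channel")
        self.assertEqual(args.target_videos, 1000)
        self.assertEqual(args.queue_file, "data/xhs_queue.jsonl")
        self.assertEqual(args.max_runtime, 7200)

    def test_unknown_source_is_rejected(self):
        # argparse reports invalid choices by exiting with SystemExit.
        with self.assertRaises(SystemExit):
            build_parser().parse_args(["--source", "search"])
```

Saved in a test module, this can be run with python -m unittest alongside the existing download and parsing tests.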