# XHS Long Queue Downloader Design

## Goal

Add a resumable long-task downloader for collecting large numbers of Xiaohongshu videos, such as 1000 videos, without relying on a single recommendation page pass.

## Scope

The feature stays within the existing manually logged-in browser model. It does not automate login, bypass verification, spoof device fingerprints, or call private APIs directly outside what the loaded web pages expose. It improves task durability, source density, and progress tracking.

## Architecture

The downloader becomes two-phase while preserving the current one-command UX:

1. Queue discovery collects note detail URLs from configured sources and writes them to a JSONL queue.
2. Queue processing opens pending note URLs, extracts video URLs from page state or feed responses, downloads valid videos, and updates each queue item status.

The queue file stores one JSON object per note:

```json
{"note_id":"...","url":"...","source":"video-channel","status":"pending","attempts":0,"downloaded_path":"","last_error":"","updated_at":"..."}
```

Statuses are `pending`, `downloaded`, `skipped_image`, and `failed`.

## Sources

The first implementation supports:

- `explore`: current recommendation page.
- `video-channel`: `https://www.xiaohongshu.com/explore?channel_id=video` as a best-effort source. If Xiaohongshu redirects or changes channel routing, the collector still reads visible `/explore/` cards.
- `current-page`: process the current browser page.

Future search keyword sources can be added after the queue engine is stable.

## Runtime Behavior

A command such as:

```bash
./.venv/bin/python XHS.py --source video-channel --target-videos 1000 --queue-file data/xhs_queue.jsonl --max-runtime 7200
```

will:

1. Load existing queue records.
2. Count already downloaded items.
3. Open the selected source page and collect visible note URLs.
4. Append new pending records, preserving existing statuses.
5. Process pending records until `target_videos`, `max_runtime`, or queue exhaustion.
6. If queue is exhausted before target, return to source, scroll, collect more URLs, and continue.

## Error Handling

- Non-video notes become `skipped_image`.
- Download failures increment attempts and become `failed` after retry limit.
- The queue is rewritten atomically after status changes.
- Progress logs include downloaded count, skipped count, failed count, and pending count.

## Testing

Unit tests cover JSONL queue load/save, deduplication, status updates, source URL selection, target counting, and CLI argument plumbing. Existing download and parsing tests remain in place.
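
As an illustration of the queue behavior these tests target (JSONL load/save, deduplication, status updates, and the atomic rewrite), here is a minimal sketch. The function names (`load_queue`, `add_pending`, `record_failure`, `save_queue`, `count_by_status`) and the default retry limit of 3 are hypothetical and chosen only for this example; the design does not fix them. The atomic rewrite is assumed to use a temporary file in the same directory followed by `os.replace`, which is one common way to get that property.

```python
import json
import os
import tempfile
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path


def _now():
    # ISO-8601 UTC timestamp for the `updated_at` field (format is an assumption).
    return datetime.now(timezone.utc).isoformat()


def load_queue(path):
    """Return {note_id: record}. A missing queue file yields an empty queue."""
    records = {}
    p = Path(path)
    if not p.exists():
        return records
    with p.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rec = json.loads(line)
                records[rec["note_id"]] = rec
    return records


def add_pending(records, note_id, url, source):
    """Add a pending record; existing records (and their statuses) are kept."""
    if note_id in records:
        return False
    records[note_id] = {
        "note_id": note_id, "url": url, "source": source,
        "status": "pending", "attempts": 0,
        "downloaded_path": "", "last_error": "", "updated_at": _now(),
    }
    return True


def record_failure(records, note_id, error, max_attempts=3):
    """Increment attempts; mark `failed` once the (assumed) retry limit is hit."""
    rec = records[note_id]
    rec["attempts"] += 1
    rec["last_error"] = error
    if rec["attempts"] >= max_attempts:
        rec["status"] = "failed"
    rec["updated_at"] = _now()


def count_by_status(records):
    """Counts per status, e.g. for progress logs and target counting."""
    return Counter(rec["status"] for rec in records.values())


def save_queue(path, records):
    """Write to a temp file in the same directory, then swap it into place."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(p.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            for rec in records.values():
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, str(p))  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Keying the in-memory queue by `note_id` makes deduplication a dictionary lookup, so re-collecting the same card from a source never resets a `downloaded` or `failed` record to `pending`.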