xhs_video_crawler/docs/superpowers/specs/2026-05-27-xhs-search-source-design.md
2026-05-27 16:49:36 +08:00

# XHS Search Source Design
## Goal
Allow the resumable queue downloader to use Xiaohongshu search results as a source, so queries such as `猫咪` or `猫咪 搞笑` can collect and download related video notes.
## Scope
This feature reuses the existing manually logged-in Chrome session, queue persistence, page card collection, detail-page video extraction, validation, and human-paced browsing cadence. It does not automate login, bypass verification, or call hidden APIs directly.
## CLI
```bash
./.venv/bin/python XHS.py --source search --keyword 猫咪 --target-videos 100 --queue-file data/search_cat_queue.jsonl
```
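A minimal sketch of how the flags in this invocation might be parsed, including the rule that `--source search` needs a keyword. The function names `build_parser` and `parse_args` are hypothetical, and the real parser in `XHS.py` may define defaults, choices, and additional flags differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the CLI example (illustrative only)."""
    parser = argparse.ArgumentParser(prog="XHS.py")
    # The set of valid sources is not listed in this spec, so no `choices` are enforced here.
    parser.add_argument("--source", default=None)
    parser.add_argument("--keyword", default=None)
    parser.add_argument("--target-videos", type=int, default=None)
    parser.add_argument("--queue-file", default=None)
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv and reject `--source search` when no keyword is given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.source == "search" and not args.keyword:
        # parser.error prints a usage message and exits with status 2.
        parser.error("--source search requires --keyword")
    return args
```

Validating after `parse_args` rather than marking `--keyword` as required keeps the flag optional for every other source.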
## Behavior
- `--source search` requires `--keyword`.
- The source URL is `https://www.xiaohongshu.com/search_result?keyword=<encoded keyword>&source=web_search_result_notes&type=51`, which opens the video-filtered search results page.
- Search result cards are collected from both `/explore/<note_id>` and tokenized `/search_result/<note_id>` links.
- Detail links are polled briefly after navigation because Xiaohongshu search result cards are rendered asynchronously.
- Queue mode handles videos, images, failures, retries, and resume semantics exactly like other sources.
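The URL construction and link handling described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper names are hypothetical, spaces are assumed to be percent-encoded as `%20`, and note IDs are assumed to be plain alphanumeric path segments.

```python
import re
from typing import Iterable, List, Optional
from urllib.parse import quote, urlsplit

SEARCH_URL_TEMPLATE = (
    "https://www.xiaohongshu.com/search_result"
    "?keyword={keyword}&source=web_search_result_notes&type=51"
)

# Assumption: a note ID is a single alphanumeric path segment.
NOTE_PATH_RE = re.compile(r"^/(?:explore|search_result)/([0-9A-Za-z]+)/?$")


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword (UTF-8) and place it in the search URL."""
    return SEARCH_URL_TEMPLATE.format(keyword=quote(keyword, safe=""))


def extract_note_id(href: str) -> Optional[str]:
    """Return the note ID from an /explore/ or /search_result/ link, else None."""
    # urlsplit drops the query string, so token parameters do not affect matching.
    match = NOTE_PATH_RE.match(urlsplit(href).path)
    return match.group(1) if match else None


def collect_note_ids(hrefs: Iterable[str]) -> List[str]:
    """Collect unique note IDs in first-seen order, skipping non-note links."""
    seen = set()
    ordered = []
    for href in hrefs:
        note_id = extract_note_id(href)
        if note_id is not None and note_id not in seen:
            seen.add(note_id)
            ordered.append(note_id)
    return ordered
```

Keying on the note ID rather than the full link means the same note reached through `/explore/` and through a tokenized `/search_result/` link is queued only once. The bare `/search_result` page URL has no ID segment, so it is ignored.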
## Testing
Unit tests cover search URL encoding, parser defaults, queue-mode CLI plumbing for `--keyword`, `/search_result/` note ID extraction, tokenized search link normalization, and async result-link polling.
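One way to keep the polling behavior unit-testable without a browser is to inject the link fetcher, the sleep function, and the clock. The sketch below assumes a hypothetical helper `poll_for_links`; the timeout and interval values are placeholders, not the project's settings.

```python
import time
from typing import Callable, List


def poll_for_links(
    fetch_links: Callable[[], List[str]],
    timeout: float = 5.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Call fetch_links until it returns a non-empty list or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        links = fetch_links()
        if links:
            return links
        if clock() >= deadline:
            # Give up quietly; the caller decides how to treat an empty result.
            return []
        sleep(interval)


# Test-style usage: cards "render" on the third poll, and sleeping is a no-op.
responses = iter([[], [], ["/search_result/abc123"]])
found = poll_for_links(lambda: next(responses), sleep=lambda _seconds: None)
```

In the real crawler, `fetch_links` would wrap whatever page query collects the card links, while tests substitute a fake as shown.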