xhs_video_crawler/docs/superpowers/specs/2026-05-27-xhs-browser-feed-download-design.md
2026-05-27 13:59:00 +08:00


Xiaohongshu Browser Feed Download Design

Goal

Build a first usable Xiaohongshu video downloader that attaches to a manually logged-in Chrome session, listens for official site feed responses, extracts video URLs that the page already received, and downloads a limited number of videos.

Scope

The first version supports https://www.xiaohongshu.com/explore and videos surfaced through feed responses while the visible browser is open. It does not automate login, bypass captcha, generate signatures, replay private APIs directly, or attempt to defeat platform protections.

Architecture

The tool mirrors the existing Douyin project pattern:

  • login_xhs.py starts a visible Chrome instance with a fixed profile directory and a remote debugging port.
  • XHS.py connects to that existing Chrome through DrissionPage and listens for responses whose URL contains feed. It recursively extracts mp4 URLs from fields such as master_url and backup_urls, deduplicates them, and downloads the videos through requests.
  • Unit tests cover pure parsing, filename, URL choice, and login command construction.
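
The recursive extraction and deduplication step described for XHS.py could look roughly like the sketch below. It is only an illustration: the function names `iter_mp4_urls` and `unique_urls` and the nested sample payload are hypothetical, and the real feed response may be shaped differently.

```python
from typing import Any, Iterator, List


def iter_mp4_urls(node: Any) -> Iterator[str]:
    """Walk arbitrarily nested dicts/lists and yield URL strings found
    under the master_url and backup_urls keys (names taken from the spec)."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "master_url" and isinstance(value, str):
                yield value
            elif key == "backup_urls" and isinstance(value, list):
                for url in value:
                    if isinstance(url, str):
                        yield url
            else:
                # Unknown key: keep descending, since the schema is not fixed.
                yield from iter_mp4_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_mp4_urls(item)


def unique_urls(payload: Any) -> List[str]:
    """Return HTTP(S) URLs in first-seen order with duplicates removed."""
    seen = set()
    ordered: List[str] = []
    for url in iter_mp4_urls(payload):
        if url.startswith("http") and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


# Hypothetical payload, invented for illustration only.
sample = {
    "data": {
        "items": [
            {
                "video": {
                    "stream": [
                        {
                            "master_url": "https://example.com/a.mp4",
                            "backup_urls": [
                                "https://example.com/b.mp4",
                                "https://example.com/a.mp4",
                            ],
                        }
                    ]
                }
            }
        ]
    }
}
print(unique_urls(sample))
```

Because the walker only depends on key names and container types, it keeps working if the surrounding nesting changes.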

Data Flow

  1. The user runs python3 login_xhs.py.
  2. Chrome opens Xiaohongshu Explore with a persistent local profile.
  3. The user logs in manually and handles any verification.
  4. The user runs python3 XHS.py --max-videos 10.
  5. XHS.py attaches to the Chrome debugging port and starts network listening.
  6. The script opens or refreshes Explore, waits for feed packets, extracts video metadata and downloadable mp4 URLs, and writes files to video/.
  7. The script scrolls gently between waits to trigger more feed responses from the page, and stops once it has downloaded the requested number of videos or hit its empty-response limit.
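
The stop condition in the last step can be sketched without any browser code. In this minimal sketch, `next_batch` is a hypothetical callable standing in for "scroll, wait for a feed packet, extract URLs", and `max_empty_rounds` is an assumed name for the empty-response limit, counted here as consecutive empty rounds.

```python
from typing import Callable, Iterable, List


def collect_until_done(
    next_batch: Callable[[], Iterable[str]],
    max_videos: int,
    max_empty_rounds: int = 3,
) -> List[str]:
    """Gather unique URLs until the video limit is reached or too many
    consecutive rounds produce nothing new."""
    collected: List[str] = []
    seen = set()
    empty_rounds = 0
    while len(collected) < max_videos and empty_rounds < max_empty_rounds:
        added = 0
        for url in next_batch():
            if url in seen:
                continue
            seen.add(url)
            collected.append(url)
            added += 1
            if len(collected) >= max_videos:
                break
        # A round that adds something resets the empty counter.
        empty_rounds = 0 if added else empty_rounds + 1
    return collected


# Stand-in for the browser: a fixed sequence of batches, then nothing.
_batches = iter([["a", "b"], [], ["b", "c"]])
print(collect_until_done(lambda: next(_batches, []), max_videos=10, max_empty_rounds=2))
```

Keeping the loop independent of the browser means it can be unit-tested with canned batches, in line with the testing approach in this spec.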

CLI

python3 login_xhs.py
python3 XHS.py --max-videos 10
python3 XHS.py --browser-port 9224 --max-videos 20 --output-dir video
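
One way to parse these flags is with argparse, as in the sketch below. The flag names come from the commands above. The default values (port 9224, 10 videos, output directory video) are assumptions inferred from the examples, not confirmed defaults, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the XHS.py argument parser (defaults are assumed, see above)."""
    parser = argparse.ArgumentParser(
        prog="XHS.py",
        description="Download videos seen in Xiaohongshu feed responses.",
    )
    parser.add_argument("--browser-port", type=int, default=9224,
                        help="Chrome remote debugging port (assumed default)")
    parser.add_argument("--max-videos", type=int, default=10,
                        help="stop after this many downloads (assumed default)")
    parser.add_argument("--output-dir", default="video",
                        help="directory that receives the downloaded files")
    return parser


args = build_parser().parse_args(["--browser-port", "9224", "--max-videos", "20"])
print(args.browser_port, args.max_videos, args.output_dir)
```

Exposing the parser through a function lets tests check default CLI values without starting a browser.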

Error Handling

  • If the browser debugging port is closed, print an actionable message pointing to login_xhs.py.
  • If optional dependencies are missing, print the commands needed to install them.
  • If no feed data is observed, explain that the user should confirm login, page loading, and scrolling.
  • If one video download fails, continue with later videos.
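
Two of these rules, the closed-port message and continuing after a failed download, can be sketched with the standard library alone. The helper names, the message wording, and the injected `download_one` callable are hypothetical; the real script may structure this differently.

```python
import socket
from typing import Callable, Iterable, List


def debug_port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def port_closed_message(port: int) -> str:
    """Actionable hint pointing the user back to login_xhs.py."""
    return (
        f"Cannot reach Chrome on debugging port {port}. "
        "Run `python3 login_xhs.py` first, log in, and keep that window open."
    )


def download_all(urls: Iterable[str], download_one: Callable[[str], None]) -> List[str]:
    """Try every URL; report and skip failures instead of aborting.
    Returns the URLs that failed."""
    failed: List[str] = []
    for url in urls:
        try:
            download_one(url)
        except Exception as exc:  # one bad video must not stop the run
            print(f"Skipping {url}: {exc}")
            failed.append(url)
    return failed
```

A caller would check `debug_port_open(port)` before attaching and print `port_closed_message(port)` if it returns False.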

Testing

Use Python unittest without requiring browser dependencies at import time. Tests should not launch Chrome or make network requests.

Coverage targets:

  • Safe filename generation and byte truncation.
  • Recursive extraction of video candidates from nested Xiaohongshu-like JSON.
  • URL selection preference for master_url and fallback URLs.
  • Output path generation.
  • Browser launch command construction and default CLI values.
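
As an illustration of the first target, here is a minimal sketch of a filename helper with byte truncation and a matching unittest case. The name `safe_filename`, the 120-byte default, and the `untitled` fallback are assumptions rather than values from the project.

```python
import re
import unittest


def safe_filename(title: str, max_bytes: int = 120, fallback: str = "untitled") -> str:
    """Replace characters that are unsafe in filenames, then cut the name
    to at most max_bytes of UTF-8 without splitting a character."""
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip(" .")
    clipped = cleaned.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    name = clipped.decode("utf-8", errors="ignore").strip(" .")
    return name or fallback


class SafeFilenameTests(unittest.TestCase):
    def test_replaces_unsafe_characters(self):
        self.assertEqual(safe_filename("a/b:c"), "a_b_c")

    def test_truncates_on_character_boundary(self):
        name = safe_filename("小红书" * 10, max_bytes=10)
        self.assertEqual(name, "小红书")
        self.assertLessEqual(len(name.encode("utf-8")), 10)

    def test_falls_back_when_empty(self):
        self.assertEqual(safe_filename("  .."), "untitled")


if __name__ == "__main__":
    unittest.main()
```

Truncating by bytes rather than characters matters because filesystem name limits are usually expressed in bytes, and Chinese titles use three bytes per character in UTF-8.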

Open Constraints

The exact Xiaohongshu response shape may vary. The parser should be tolerant and recursive instead of hard-coding one complete schema from a screenshot.
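
A tolerant URL chooser in this spirit might look like the sketch below. It prefers master_url and falls back to backup_urls, accepting a few shape variations. The function name `choose_url` and the specific variations handled (a missing key, a bare string instead of a list) are assumptions for illustration.

```python
from typing import Any, Mapping, Optional


def choose_url(candidate: Mapping[str, Any]) -> Optional[str]:
    """Pick one download URL from a video candidate, or None if unusable."""
    master = candidate.get("master_url")
    if isinstance(master, str) and master.startswith("http"):
        return master
    backups = candidate.get("backup_urls")
    if isinstance(backups, str):
        # Tolerate a single string where a list was expected.
        backups = [backups]
    if isinstance(backups, (list, tuple)):
        for url in backups:
            if isinstance(url, str) and url.startswith("http"):
                return url
    return None


print(choose_url({"master_url": "", "backup_urls": ["https://example.com/b.mp4"]}))
```

Returning None instead of raising lets the caller skip a malformed candidate and move on to the next one.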