Add XHS downloader design and plan
parent 5ca45ecc8c
commit ec5d174bdc
@@ -0,0 +1,63 @@
# Xiaohongshu Browser Feed Download Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Build a first usable Xiaohongshu downloader that attaches to a manually logged-in Chrome session, listens for feed responses, extracts mp4 URLs, and downloads a limited number of videos.

**Architecture:** Add `login_xhs.py` for visible Chrome startup and `XHS.py` for browser attachment, feed listening, parsing, scrolling, and downloading. Keep browser/runtime imports lazy so unit tests run without Chrome automation dependencies.

**Tech Stack:** Python 3, unittest, requests, DrissionPage, macOS Chrome remote debugging.

---

## File Structure

- Create `login_xhs.py`: Chrome launch CLI, debug-port readiness wait, user-facing next command.
- Create `XHS.py`: parsing helpers, browser attachment helpers, downloader CLI, feed collection loop.
- Create `test_login_xhs.py`: command construction and CLI behavior tests.
- Create `test_xhs.py`: parser, filename, output path, browser address, and port-check tests.
- Modify `README.md`: install and usage instructions.

## Task 1: Login Browser Entrypoint

**Files:**
- Create: `test_login_xhs.py`
- Create: `login_xhs.py`

- [ ] Write failing tests for Chrome command construction, parser defaults, profile creation, and missing Chrome path handling.
- [ ] Run `python3 -m unittest test_login_xhs.py -v` and verify the tests fail because `login_xhs` does not exist.
- [ ] Implement `login_xhs.py` with lazy, testable functions matching the Douyin project pattern.
- [ ] Run `python3 -m unittest test_login_xhs.py -v` and verify that all tests pass.

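A minimal sketch of what "Chrome command construction" and "profile creation" could look like as pure, testable functions. The names `build_chrome_command` and `ensure_profile_dir`, the default Chrome path, and the extra `--no-first-run` flag are assumptions for illustration; the plan does not fix them.

```python
from pathlib import Path

# Assumption: the usual install location of Chrome on macOS.
DEFAULT_CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
EXPLORE_URL = "https://www.xiaohongshu.com/explore"


def build_chrome_command(chrome_path, profile_dir, port, url=EXPLORE_URL):
    """Return the argv list for a visible Chrome with remote debugging enabled."""
    return [
        str(chrome_path),
        f"--remote-debugging-port={port}",
        f"--user-data-dir={Path(profile_dir).expanduser()}",
        "--no-first-run",
        url,
    ]


def ensure_profile_dir(profile_dir):
    """Create the persistent profile directory if needed and return its path."""
    path = Path(profile_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Because the function only builds a list, a test can assert on the flags without ever starting Chrome; the real entrypoint would pass the list to `subprocess.Popen`.
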
## Task 2: Pure XHS Parsing Helpers

**Files:**
- Create: `test_xhs.py`
- Create: `XHS.py`

- [ ] Write failing tests for filename sanitization, UTF-8 truncation, choosing video URLs, recursive extraction from a nested feed payload, output path generation, browser address construction, and debug-port validation.
- [ ] Run `python3 -m unittest test_xhs.py -v` and verify the tests fail because `XHS` does not exist.
- [ ] Implement pure helpers and lazy dependency imports in `XHS.py`.
- [ ] Run `python3 -m unittest test_xhs.py -v` and verify that all tests pass.

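The filename and truncation helpers from this task might be sketched as follows. The function names, the set of replaced characters, the 120-byte default, and the `untitled` fallback are all assumptions, not values taken from the plan.

```python
import re

# Characters that are unsafe in filenames on common filesystems, plus control chars.
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]+')


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def safe_filename(title, max_bytes=120, fallback="untitled"):
    """Replace unsafe runs with "_", trim edges, and cap the UTF-8 byte length."""
    cleaned = _UNSAFE.sub("_", title).strip(" ._")
    cleaned = truncate_utf8(cleaned, max_bytes).strip(" ._")
    return cleaned or fallback
```

Truncating by bytes rather than characters matters for Chinese titles, where each character typically takes three bytes and filesystem limits are expressed in bytes.
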
## Task 3: Browser Feed Collection CLI

**Files:**
- Modify: `XHS.py`
- Modify: `test_xhs.py`

- [ ] Write failing tests for feed payload extraction from packet response objects and the argument parser defaults.
- [ ] Run `python3 -m unittest test_xhs.py -v` and verify the new tests fail.
- [ ] Implement DrissionPage attachment, feed listening, gentle scrolling, download loop, and CLI `main`.
- [ ] Run `python3 -m unittest test_xhs.py test_login_xhs.py -v` and verify that all tests pass.

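One way to make "feed payload extraction from packet response objects" testable without a browser is to rely only on duck typing. The sketch below assumes a packet exposes `response.body`, and that the body may arrive already parsed or as raw text or bytes; the helper name `payload_from_packet` and that packet shape are assumptions, so the real listener objects should be checked against it.

```python
import json


def payload_from_packet(packet):
    """Return the decoded JSON body of a listened packet, or None if unusable."""
    response = getattr(packet, "response", None)
    body = getattr(response, "body", None)
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            return None
    # Only containers can hold feed items; scalars and None are rejected.
    return body if isinstance(body, (dict, list)) else None
```

In tests, `types.SimpleNamespace(response=types.SimpleNamespace(body=...))` is enough to stand in for a packet.
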
## Task 4: README and Final Verification

**Files:**
- Modify: `README.md`

- [ ] Update README with setup, login, download, output, and compliance notes.
- [ ] Run `python3 -m unittest test_xhs.py test_login_xhs.py -v`.
- [ ] Run `git status --short --branch`.
- [ ] Commit all implementation changes.
- [ ] Push to `origin/main`.

@@ -0,0 +1,58 @@
# Xiaohongshu Browser Feed Download Design

## Goal

Build a first usable Xiaohongshu video downloader that attaches to a manually logged-in Chrome session, listens for official site feed responses, extracts video URLs that the page already received, and downloads a limited number of videos.

## Scope

The first version supports `https://www.xiaohongshu.com/explore` and videos surfaced through feed responses while the visible browser is open. It does not automate login, bypass captcha, generate signatures, replay private APIs directly, or attempt to defeat platform protections.

## Architecture

The tool mirrors the existing Douyin project pattern:

- `login_xhs.py` starts a visible Chrome instance with a fixed profile directory and a remote debugging port.
- `XHS.py` connects to that existing Chrome through DrissionPage, listens for responses whose URL contains `feed`, recursively extracts mp4 URLs such as `master_url` and `backup_urls`, deduplicates them, and downloads videos through `requests`.
- Unit tests cover pure parsing, filename, URL choice, and login command construction.

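The recursive extraction and deduplication described above could be sketched like this. Only the key names `master_url` and `backup_urls` come from the design; the function names and the nested sample shape used in any test are hypothetical.

```python
def iter_video_urls(node):
    """Yield URLs stored under master_url / backup_urls anywhere inside node."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "master_url" and isinstance(value, str):
                yield value
            elif key == "backup_urls" and isinstance(value, list):
                yield from (u for u in value if isinstance(u, str))
            else:
                yield from iter_video_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_urls(item)


def unique_video_urls(payload):
    """Return http(s) URLs in first-seen order with duplicates removed."""
    seen = set()
    result = []
    for url in iter_video_urls(payload):
        if url.startswith("http") and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Walking every dict and list, instead of indexing a fixed path, keeps the parser working when the surrounding response shape changes.
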
## Data Flow

1. The user runs `python3 login_xhs.py`.
2. Chrome opens Xiaohongshu Explore with a persistent local profile.
3. The user logs in manually and handles any verification.
4. The user runs `python3 XHS.py --max-videos 10`.
5. `XHS.py` attaches to the Chrome debugging port and starts network listening.
6. The script opens or refreshes Explore, waits for feed packets, extracts video metadata and downloadable mp4 URLs, and writes files to `video/`.
7. The script scrolls gently between waits to trigger more page-loaded feed responses until it has downloaded the requested number of videos or reaches its empty-response limit.

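Steps 5 to 7 amount to a bounded loop. The sketch below expresses it with injected callables so it runs without a browser; `collect_videos`, its parameters, and the meaning of `max_empty_rounds` (consecutive rounds with no new URLs) are assumptions for illustration. In the real script the callables would presumably wrap the DrissionPage listener, the page scroll, and a `requests` download.

```python
def collect_videos(wait_for_payload, extract_urls, download, scroll,
                   max_videos=10, max_empty_rounds=3):
    """Listen, extract, download, and scroll until a limit is hit.

    Returns the number of successful downloads.
    """
    seen = set()
    downloaded = 0
    empty_rounds = 0
    while downloaded < max_videos and empty_rounds < max_empty_rounds:
        payload = wait_for_payload()  # None means the wait timed out
        fresh = []
        for url in (extract_urls(payload) if payload else []):
            if url not in seen:
                seen.add(url)
                fresh.append(url)
        # Reset the counter on any new URL, otherwise count an empty round.
        empty_rounds = 0 if fresh else empty_rounds + 1
        for url in fresh:
            if downloaded >= max_videos:
                break
            if download(url):  # a failed download is skipped, not fatal
                downloaded += 1
        if downloaded < max_videos:
            scroll()
    return downloaded
```

Keeping the loop free of browser objects lets a unit test drive it with a list of canned payloads and assert on both the download count and the stop condition.
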
## CLI

```bash
python3 login_xhs.py
python3 XHS.py --max-videos 10
python3 XHS.py --browser-port 9224 --max-videos 20 --output-dir video
```

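An argument parser matching the flags shown above might look like the following sketch. The flag names come from the commands; the defaults (port 9224, 10 videos) and the help strings are assumptions, with `video` chosen because the data flow writes to `video/`.

```python
import argparse


def build_parser():
    """Build the XHS.py argument parser (defaults here are assumed, not specified)."""
    parser = argparse.ArgumentParser(
        description="Download videos seen in the Xiaohongshu Explore feed."
    )
    parser.add_argument("--browser-port", type=int, default=9224,
                        help="Chrome remote debugging port to attach to")
    parser.add_argument("--max-videos", type=int, default=10,
                        help="stop after this many downloads")
    parser.add_argument("--output-dir", default="video",
                        help="directory for downloaded files")
    return parser
```

Exposing `build_parser()` separately from `main` lets the parser-default tests call `build_parser().parse_args([])` directly.
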
## Error Handling

- If the browser debugging port is closed, print an actionable message pointing to `login_xhs.py`.
- If optional dependencies are missing, print install commands.
- If no feed data is observed, explain that the user should confirm login, page loading, and scrolling.
- If one video download fails, continue with later videos.

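The closed-port case can be detected with a plain TCP connect before any browser library is touched. This is a sketch under assumptions: the helper names, the `127.0.0.1` host, the 9224 default, and the message wording are illustrative only.

```python
import socket


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def browser_address(host, port):
    """Format the host:port string used to attach to the running browser."""
    return f"{host}:{port}"


def require_browser(host="127.0.0.1", port=9224):
    """Return the browser address, or exit with a hint if nothing is listening."""
    if not is_port_open(host, port):
        raise SystemExit(
            f"Chrome is not listening on {host}:{port}. "
            "Run `python3 login_xhs.py` first, log in, and keep that window open."
        )
    return browser_address(host, port)
```

Raising `SystemExit` with a message prints it to stderr and exits non-zero, which keeps the failure readable for a CLI user.
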
## Testing

Use Python `unittest` without requiring browser dependencies at import time. Tests should not launch Chrome or make network requests.

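One way to keep browser dependencies out of import time is to resolve them on first use. The helper below is a sketch; `load_dependency` and the exact install hint are assumptions rather than part of the design.

```python
import importlib


def load_dependency(module_name, pip_name=None):
    """Import a runtime dependency on first use, with an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = pip_name or module_name
        raise SystemExit(
            f"Missing dependency {module_name!r}. "
            f"Install it with: python3 -m pip install {hint}"
        ) from exc
```

A download function would call `load_dependency("requests")` inside its body, so importing `XHS` in a test never touches `requests` or DrissionPage.
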
Coverage targets:

- Safe filename generation and byte truncation.
- Recursive extraction of video candidates from nested Xiaohongshu-like JSON.
- URL selection preference for `master_url` and fallback URLs.
- Output path generation.
- Browser launch command construction and default CLI values.

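As an illustration of the URL-selection target, here is a possible helper together with a `unittest` case of the kind the plan calls for. `choose_video_url` and the shape of the `stream` dict are assumptions; only the `master_url` and `backup_urls` keys come from the design.

```python
import unittest


def choose_video_url(stream):
    """Prefer master_url; otherwise return the first usable backup URL, else None."""
    candidates = [stream.get("master_url")]
    backups = stream.get("backup_urls")
    if isinstance(backups, list):
        candidates.extend(backups)
    for url in candidates:
        if isinstance(url, str) and url.startswith(("http://", "https://")):
            return url
    return None


class ChooseVideoUrlTests(unittest.TestCase):
    def test_prefers_master_url(self):
        stream = {"master_url": "https://a/1.mp4", "backup_urls": ["https://b/1.mp4"]}
        self.assertEqual(choose_video_url(stream), "https://a/1.mp4")

    def test_falls_back_to_backup(self):
        stream = {"master_url": "", "backup_urls": [None, "https://b/1.mp4"]}
        self.assertEqual(choose_video_url(stream), "https://b/1.mp4")

    def test_returns_none_without_candidates(self):
        self.assertIsNone(choose_video_url({}))
```

The tests use only in-memory dicts, so they satisfy the rule that no test launches Chrome or makes a network request.
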
## Open Constraints

The exact Xiaohongshu response shape may vary. The parser should be tolerant and recursive instead of hard-coding one complete schema from a screenshot.