douyin-crawler-poc/docs/superpowers/plans/2026-05-06-douyin-recommendation-crawling.md

# Douyin Recommendation Feed Video Crawling Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
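
**Goal:** Extend `Douyin.py` to crawl videos from the Douyin recommendation feed (the "For You" page), record each video's creator (author) info, and support scroll-based loading of up to 50 videos.

**Architecture:** Add a new `collect_recommendations()` function that handles the recommendation feed and reuses the existing download and utility functions. Extend `parse_target_input()` to recognize recommendation feed URLs.

**Tech Stack:** Python 3, DrissionPage, requests, unittest
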
---
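## File Structure
| File | Action | Description |
|------|--------|-------------|
| `Douyin.py` | Modify | Add recommendation feed recognition, parsing, and crawling logic |
| `test_douyin.py` | Modify | Add recommendation feed tests |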
---
## Task 1: Recommendation Feed URL Recognition
**Files:**
- Modify: `Douyin.py:17-19` (add the regular expression)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing test**
```python
def test_is_recommendation_url_accepts_douyin_homepage(self) -> None:
    module = importlib.import_module("Douyin")
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com/"))
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com"))
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com/?from=web"))
    self.assertFalse(module.is_recommendation_url("https://www.douyin.com/user/xxx"))
    self.assertFalse(module.is_recommendation_url("https://www.douyin.com/video/123"))
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_is_recommendation_url_accepts_douyin_homepage -v`
Expected: FAIL with "module has no attribute 'is_recommendation_url'"
- [ ] **Step 3: Write the minimal implementation**
Add to `Douyin.py`:
```python
# Matches the bare homepage, with an optional trailing slash and query string.
RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")


def is_recommendation_url(value: str) -> bool:
    return bool(RECOMMENDATION_URL_PATTERN.match(value.strip()))
```
- [ ] **Step 4: Run the test to confirm it passes**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_is_recommendation_url_accepts_douyin_homepage -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: add recommendation URL recognition"
```
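The pattern from Step 3 can be exercised on its own to see which inputs it accepts. The sketch below copies the regex and the helper unchanged; the extra sample URLs (a fragment, a host without `www`, surrounding whitespace) are illustrative inputs that the plan's test does not cover, not cases the plan specifies.

```python
import re

# Same pattern as in Step 3: homepage only, optional trailing slash and query string.
RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")


def is_recommendation_url(value: str) -> bool:
    return bool(RECOMMENDATION_URL_PATTERN.match(value.strip()))


# Illustrative inputs (assumptions for this sketch, not taken from the plan).
samples = [
    "  https://www.douyin.com/  ",       # surrounding whitespace is stripped first
    "http://www.douyin.com?from=web",    # query string without a trailing slash
    "https://www.douyin.com/#recommend", # a fragment is not accepted by the pattern
    "https://douyin.com/",               # the pattern requires the www. host
]

for url in samples:
    print(f"{url!r}: {is_recommendation_url(url)}")
```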
---
## Task 2: Extend Target Parsing to Support the Recommendation Feed
**Files:**
- Modify: `Douyin.py:52-68` (update `parse_target_input`)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing test**
```python
def test_parse_target_input_classifies_recommendation_url(self) -> None:
    module = importlib.import_module("Douyin")
    target = module.parse_target_input("https://www.douyin.com/", source="manual")
    self.assertEqual(target.kind, "recommendation")
    self.assertEqual(target.value, "https://www.douyin.com/")
    self.assertEqual(target.source, "manual")
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_target_input_classifies_recommendation_url -v`
Expected: FAIL with "不支持的目标" ("unsupported target")
- [ ] **Step 3: Update `parse_target_input`**
```python
def parse_target_input(value: str, source: str) -> ResolvedTarget:
    normalized = value.strip()
    # Check the recommendation feed first, then fall through to the existing kinds.
    if is_recommendation_url(normalized):
        return ResolvedTarget(kind="recommendation", value=normalized, source=source)
    if is_creator_url(normalized):
        return ResolvedTarget(kind="creator", value=normalized, source=source)
    # ... the rest stays unchanged
```
- [ ] **Step 4: Run the test to confirm it passes**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_target_input_classifies_recommendation_url -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: extend target parsing to support recommendation URLs"
```
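For readers without the existing module at hand, the classification in Step 3 can be sketched as a standalone script. This is a rough sketch under stated assumptions: the `ResolvedTarget` field names come from the test above, but its definition as a frozen dataclass, the simplified `CREATOR_URL_PATTERN`, and the English error message are hypothetical stand-ins rather than the project's real code.

```python
import re
from dataclasses import dataclass

RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")
# Assumption: a simplified creator-profile pattern, not the project's real one.
CREATOR_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/user/[^/?#]+")


@dataclass(frozen=True)
class ResolvedTarget:
    """Hypothetical shape; only the three field names are taken from the plan's tests."""

    kind: str
    value: str
    source: str


def parse_target_input(value: str, source: str) -> ResolvedTarget:
    normalized = value.strip()
    if RECOMMENDATION_URL_PATTERN.match(normalized):
        return ResolvedTarget(kind="recommendation", value=normalized, source=source)
    if CREATOR_URL_PATTERN.match(normalized):
        return ResolvedTarget(kind="creator", value=normalized, source=source)
    # The real function handles more target kinds before giving up.
    raise RuntimeError(f"unsupported target: {normalized}")


print(parse_target_input("https://www.douyin.com/", source="manual"))
print(parse_target_input("https://www.douyin.com/user/xxx", source="manual"))
```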
---
## Task 3: Enhance Data Parsing to Extract Creator Info
**Files:**
- Modify: `Douyin.py:140-170` (update `parse_aweme_items`)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing test**
```python
def test_parse_aweme_items_extracts_author_info(self) -> None:
    module = importlib.import_module("Douyin")
    payload = {
        "aweme_list": [
            {
                "aweme_id": "7619989983668240802",
                "desc": "测试视频",
                "author": {
                    "nickname": "测试博主",
                    "uid": "123456789",
                },
                "video": {
                    "play_addr": {
                        "url_list": ["https://v26-web.douyinvod.com/example/video.mp4"]
                    }
                },
            }
        ]
    }
    items = module.parse_aweme_items(payload)
    self.assertEqual(len(items), 1)
    self.assertEqual(items[0]["author_name"], "测试博主")
    self.assertEqual(items[0]["author_id"], "123456789")
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_aweme_items_extracts_author_info -v`
Expected: FAIL with `KeyError: 'author_name'`
- [ ] **Step 3: Update `parse_aweme_items`**
```python
def parse_aweme_items(body: Any) -> list[dict[str, str]]:
    # ... existing code ...
    for aweme in aweme_list:
        # ... existing video extraction code ...
        # Fall back to "unknown" when the author block or a field is missing or blank.
        author = aweme.get("author") or {}
        author_name = str(author.get("nickname") or "").strip() or "unknown"
        author_id = str(author.get("uid") or "").strip() or "unknown"
        items.append(
            {
                "title": title,
                "video_id": video_id,
                "video_url": choose_video_url([str(url) for url in url_list]),
                "author_name": author_name,
                "author_id": author_id,
            }
        )
    return items
```
- [ ] **Step 4: Run the test to confirm it passes**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_aweme_items_extracts_author_info -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: extract author info from aweme items"
```
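The fallback chain used for the author fields in Step 3 is easier to reason about in isolation. The sketch below lifts the two expressions into a helper named `extract_author_fields`, which is a hypothetical name introduced only for this illustration, and runs it against a few malformed payloads that are likewise made up for the example.

```python
from typing import Any


def extract_author_fields(aweme: dict[str, Any]) -> tuple[str, str]:
    """Hypothetical helper wrapping the two fallback expressions from Step 3."""
    # `or {}` covers both a missing "author" key and an explicit None value.
    author = aweme.get("author") or {}
    # str(...) tolerates non-string values; a blank result falls back to "unknown".
    author_name = str(author.get("nickname") or "").strip() or "unknown"
    author_id = str(author.get("uid") or "").strip() or "unknown"
    return author_name, author_id


# Illustrative payloads (assumptions for this sketch).
print(extract_author_fields({"author": {"nickname": "测试博主", "uid": "123456789"}}))
print(extract_author_fields({"author": None}))
print(extract_author_fields({"author": {"nickname": "   ", "uid": 42}}))
print(extract_author_fields({}))
```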
---
## Task 4: Support Building Filenames That Include Creator Info
**Files:**
- Modify: `Douyin.py:102-104` (update `build_output_path`)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing tests**
```python
def test_build_output_path_with_author_uses_bracket_format(self) -> None:
    module = importlib.import_module("Douyin")
    output_path = module.build_output_path(
        title="测试标题",
        video_id="123456",
        author_name="测试博主",
    )
    self.assertEqual(output_path.as_posix(), "video/[测试博主]测试标题-123456.mp4")

def test_build_output_path_without_author_uses_original_format(self) -> None:
    module = importlib.import_module("Douyin")
    output_path = module.build_output_path("测试标题", "123456")
    self.assertEqual(output_path.as_posix(), "video/测试标题-123456.mp4")
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_output_path_with_author_uses_bracket_format -v`
Expected: FAIL with unexpected keyword argument 'author_name'
- [ ] **Step 3: Update `build_output_path`**
```python
def build_output_path(
    title: str,
    video_id: str,
    output_dir: Path = Path("video"),
    author_name: str | None = None,
) -> Path:
    safe_title = sanitize_filename(title, fallback="untitled")
    if author_name:
        # Prefix the creator's nickname in brackets: [author]title-id.mp4
        safe_author = sanitize_filename(author_name, fallback="unknown")
        filename = f"[{safe_author}]{safe_title}-{video_id}.mp4"
    else:
        filename = f"{safe_title}-{video_id}.mp4"
    return output_dir / filename
```
- [ ] **Step 4: Run the tests to confirm they pass**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_output_path_with_author_uses_bracket_format test_douyin.py::DouyinModuleTests::test_build_output_path_without_author_uses_original_format -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: support author prefix in output filename"
```
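`build_output_path` relies on the existing `sanitize_filename` helper, which this plan does not show. To make the filename logic runnable on its own, the sketch below pairs the Step 3 function with a hypothetical `sanitize_filename`. Its replacement rules (swap characters that are invalid on Windows for `_`, trim whitespace and trailing dots) are an assumption and may differ from the project's actual helper.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: characters rejected by Windows filesystems, plus ASCII control characters.
_UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def sanitize_filename(name: str, fallback: str) -> str:
    """Hypothetical stand-in for the project's helper of the same name."""
    cleaned = _UNSAFE_CHARS.sub("_", name).strip().rstrip(".")
    return cleaned or fallback


def build_output_path(
    title: str,
    video_id: str,
    output_dir: Path = Path("video"),
    author_name: Optional[str] = None,
) -> Path:
    safe_title = sanitize_filename(title, fallback="untitled")
    if author_name:
        safe_author = sanitize_filename(author_name, fallback="unknown")
        filename = f"[{safe_author}]{safe_title}-{video_id}.mp4"
    else:
        filename = f"{safe_title}-{video_id}.mp4"
    return output_dir / filename


print(build_output_path("测试标题", "123456", author_name="测试博主").as_posix())
print(build_output_path("a/b?", "1", author_name="x:y").name)
```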
---
## Task 5: Implement the `collect_recommendations()` Function
**Files:**
- Modify: `Douyin.py` (add the new function)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing test**
```python
def test_collect_recommendations_downloads_videos_with_author_prefix(self) -> None:
    module = importlib.import_module("Douyin")
    packet = FakePacket(
        {
            "aweme_list": [
                {
                    "aweme_id": "7619989983668240802",
                    "desc": "推荐视频1",
                    "author": {"nickname": "博主A", "uid": "111"},
                    "video": {
                        "play_addr": {
                            "url_list": ["https://v26-web.douyinvod.com/example/video1.mp4"]
                        }
                    },
                }
            ]
        }
    )
    page = FakeRuntimePage("https://www.douyin.com/", packet)
    with mock.patch.object(module, "import_runtime_dependencies", return_value=(object(), object(), object())):
        with mock.patch.object(module, "create_page", return_value=page):
            with mock.patch.object(module, "download_video") as mocked_download:
                downloaded = module.collect_recommendations(
                    max_videos=50,
                    timeout=10,
                    output_dir=module.Path("video"),
                    browser_port=None,
                )
    self.assertEqual(downloaded, 1)
    # Verify that the filename carries the creator prefix.
    call_kwargs = mocked_download.call_args[1]
    self.assertIn("[博主A]", str(call_kwargs["output_path"]))
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_collect_recommendations_downloads_videos_with_author_prefix -v`
Expected: FAIL with "module has no attribute 'collect_recommendations'"
- [ ] **Step 3: Implement `collect_recommendations`**
```python
def collect_recommendations(
    max_videos: int,
    timeout: int,
    output_dir: Path,
    browser_port: int | None,
) -> int:
    requests_module, chromium_page_cls, chromium_options_cls = import_runtime_dependencies()
    headers = build_headers("https://www.douyin.com/")
    if browser_port is not None:
        ensure_browser_debug_port_ready(browser_port)
    page = create_page(chromium_page_cls, chromium_options_cls, browser_port)
    page.listen.start(LISTEN_TARGET)
    print("[INFO] 正在打开抖音推荐流。若出现登录或验证码,请先在浏览器窗口里完成。")
    page.get("https://www.douyin.com/")
    time.sleep(3)
    downloaded = 0
    seen_ids: set[str] = set()
    # Stop after this many consecutive rounds that yield no newly saved video.
    consecutive_empty = 0
    max_consecutive_empty = 3
    while downloaded < max_videos:
        packet = wait_for_aweme_packet(page, timeout=timeout)
        if packet is None:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                print("[INFO] 连续多次未获取到新数据,结束抓取。")
                break
            scroll_to_next_page(page)
            continue
        try:
            payload = extract_aweme_payload(packet.response)
            items = parse_aweme_items(payload)
        except Exception as exc:
            print(f"[WARN] 解析接口数据失败: {exc}")
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
            scroll_to_next_page(page)
            continue
        if not items:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
            scroll_to_next_page(page)
            continue
        new_items_in_batch = 0
        for item in items:
            if item["video_id"] in seen_ids:
                continue
            if downloaded >= max_videos:
                break
            seen_ids.add(item["video_id"])
            output_path = build_output_path(
                title=item["title"],
                video_id=item["video_id"],
                output_dir=output_dir,
                author_name=item.get("author_name"),
            )
            try:
                download_video(
                    requests_module=requests_module,
                    headers=headers,
                    video_url=item["video_url"],
                    output_path=output_path,
                )
            except Exception as exc:
                print(f"[WARN] 下载失败 {item['video_id']}: {exc}")
                continue
            downloaded += 1
            new_items_in_batch += 1
            print(f"[OK] 已保存: {output_path}")
        if new_items_in_batch == 0:
            # Batches of duplicates or failed downloads count as empty rounds.
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
        else:
            # Reset only when the batch produced a new video; resetting on every
            # non-empty batch would let duplicate-only batches loop forever.
            consecutive_empty = 0
        scroll_to_next_page(page)
    return downloaded
```
- [ ] **Step 4: Run the test to confirm it passes**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_collect_recommendations_downloads_videos_with_author_prefix -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: implement collect_recommendations() for For You page"
```
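The bookkeeping inside `collect_recommendations()` (deduplication, the `max_videos` cap, and the consecutive-empty stop) can be checked without a browser. The sketch below is a simplified model, not the plan's implementation: `drain_feed` is a hypothetical name, batches are plain lists of video IDs with `None` standing in for a timed-out wait, and a batch counts as empty whenever it contributes no new ID, including batches made up entirely of duplicates.

```python
from typing import Iterable, Optional


def drain_feed(
    batches: Iterable[Optional[list[str]]],
    max_videos: int,
    max_consecutive_empty: int = 3,
) -> list[str]:
    """Collect unique IDs until the cap or the empty-round limit is reached."""
    collected: list[str] = []
    seen: set[str] = set()
    consecutive_empty = 0
    for batch in batches:
        new_in_batch = 0
        for video_id in batch or []:  # None models a wait that timed out
            if len(collected) >= max_videos:
                break
            if video_id in seen:
                continue
            seen.add(video_id)
            collected.append(video_id)
            new_in_batch += 1
        if len(collected) >= max_videos:
            break
        if new_in_batch == 0:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
        else:
            # Only a round that produced something new resets the counter.
            consecutive_empty = 0
    return collected


# Three unproductive rounds in a row (timeout, duplicates, empty list) end the run
# before the final batch is ever read.
print(drain_feed([["a"], None, ["a"], [], ["b"]], max_videos=50))
```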
---
## Task 6: Add the `--max-videos` Command-Line Argument
**Files:**
- Modify: `Douyin.py:295-305` (update `build_parser`)
- Modify: `Douyin.py:310-350` (update `main`)
- Test: `test_douyin.py` (add tests)
- [ ] **Step 1: Write the failing tests**
```python
def test_build_parser_has_max_videos_argument(self) -> None:
    module = importlib.import_module("Douyin")
    args = module.build_parser().parse_args(["--max-videos", "30"])
    self.assertEqual(args.max_videos, 30)

def test_main_dispatches_recommendation_flow_for_recommendation_url(self) -> None:
    module = importlib.import_module("Douyin")
    stdout = io.StringIO()
    recommendation_target = module.ResolvedTarget(
        kind="recommendation",
        value="https://www.douyin.com/",
        source="current-page",
    )
    with redirect_stdout(stdout):
        with mock.patch.object(module, "resolve_cli_target", return_value=recommendation_target):
            with mock.patch.object(module, "collect_recommendations", return_value=5) as mocked_collect:
                exit_code = module.main([])
    self.assertEqual(exit_code, 0)
    mocked_collect.assert_called_once_with(
        max_videos=50,
        timeout=10,
        output_dir=module.Path("video"),
        browser_port=9223,
    )
```
- [ ] **Step 2: Run the test to confirm it fails**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_parser_has_max_videos_argument -v`
Expected: FAIL with "unrecognized arguments: --max-videos"
- [ ] **Step 3: Update `build_parser` and `main`**
```python
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="附着抖音登录浏览器并下载当前页面或指定目标的视频")
    parser.add_argument(
        "target",
        nargs="?",
        default=None,
        help="可选:博主主页 URL、单视频 URL 或 aweme_id;不传则读取当前浏览器页面",
    )
    parser.add_argument("--pages", type=int, default=1, help="创作者抓取最多处理多少页;默认 1")
    parser.add_argument("--timeout", type=int, default=10, help="单次等待接口响应秒数,默认 10")
    parser.add_argument(
        "--output-dir",
        default="video",
        help="视频输出目录,默认 video",
    )
    parser.add_argument(
        "--browser-port",
        type=int,
        default=DEFAULT_BROWSER_PORT,
        help="附着到已启动 Chrome 的调试端口,默认 9223",
    )
    parser.add_argument(
        "--max-videos",
        type=int,
        default=50,
        help="推荐流最大抓取数量,默认 50",
    )
    return parser


def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.pages <= 0:
        parser.error("--pages 必须大于 0")
    if args.timeout <= 0:
        parser.error("--timeout 必须大于 0")
    if args.browser_port is not None and args.browser_port <= 0:
        parser.error("--browser-port 必须大于 0")
    if args.max_videos <= 0:
        parser.error("--max-videos 必须大于 0")
    try:
        target = resolve_cli_target(args.target, browser_port=args.browser_port)
        # Dispatch on the resolved target kind.
        if target.kind == "creator":
            total = collect_videos(
                user_url=target.value,
                max_pages=args.pages,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
                auto_scroll=args.pages > 1,
            )
        elif target.kind == "recommendation":
            total = collect_recommendations(
                max_videos=args.max_videos,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
            )
        elif target.kind == "single-video":
            total = collect_single_video(
                target=target,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
            )
        else:
            raise RuntimeError(f"不支持的目标类型: {target.kind}")
    except RuntimeError as exc:
        print(f"[ERROR] {exc}")
        return 1
    except KeyboardInterrupt:
        print("\n[INFO] 用户中断。")
        return 130
    print(f"[INFO] 处理结束,共下载 {total} 个视频。")
    return 0
```
- [ ] **Step 4: Run the tests to confirm they pass**
Run: `python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_parser_has_max_videos_argument test_douyin.py::DouyinModuleTests::test_main_dispatches_recommendation_flow_for_recommendation_url -v`
Expected: PASS
- [ ] **Step 5: Commit**
```bash
git add Douyin.py test_douyin.py
git commit -m "feat: add --max-videos argument and wire recommendation flow in main"
```
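The plan validates `--max-videos` after parsing, with `parser.error(...)`. A possible alternative, shown here only as a sketch and not as part of the plan, is to push the check into an argparse `type=` callable so that the parser rejects bad values itself. The names `positive_int` and `build_demo_parser` are hypothetical and exist only in this example.

```python
import argparse


def positive_int(raw: str) -> int:
    """Hypothetical argparse type: accept only integers greater than 0."""
    value = int(raw)  # a ValueError here is reported by argparse as an invalid value
    if value <= 0:
        raise argparse.ArgumentTypeError(f"must be greater than 0, got {raw}")
    return value


def build_demo_parser() -> argparse.ArgumentParser:
    # A cut-down parser holding only the flag under discussion.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("--max-videos", type=positive_int, default=50)
    return parser


args = build_demo_parser().parse_args(["--max-videos", "30"])
print(args.max_videos)
```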
---
## Task 7: Run the Full Test Suite and Verify
- [ ] **Step 1: Run all tests**
Run: `python3 -m pytest test_douyin.py -v`
Expected: all tests pass
- [ ] **Step 2: Confirm the main script's help output**
Run: `python3 Douyin.py --help`
Expected: help text that includes `--max-videos`
- [ ] **Step 3: Commit**
```bash
git add -A
git commit -m "test: verify all tests pass for recommendation crawling feature"
```
---
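## Completion Criteria
1. ✅ `Douyin.py` recognizes `https://www.douyin.com/` as a recommendation feed target
2. ✅ `collect_recommendations()` implements scroll loading, a cap of 50 videos by default, and deduplication
3. ✅ Video filenames include the creator's nickname: `[creator name]title-aweme_id.mp4`
4. ✅ The `--max-videos` command-line argument is available
5. ✅ All existing tests continue to pass
6. ✅ New tests cover the recommendation feed feature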