douyin-crawler-poc/docs/superpowers/plans/2026-05-06-douyin-recommendation-crawling.md

Douyin Recommendation Feed Video Crawling Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

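Goal: Extend Douyin.py to crawl videos from the Douyin recommendation feed (the For You page), record author information for each video, and support scroll loading of up to 50 videos.

Architecture: Add a new collect_recommendations() function to handle the recommendation feed, reusing the existing download and utility functions. Extend parse_target_input() so that it recognizes the recommendation feed URL.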

Tech Stack: Python 3, DrissionPage, requests, unittest


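File structure

| File | Action | Description |
| --- | --- | --- |
| Douyin.py | Modify | Add recommendation feed recognition, parsing, and crawling logic |
| test_douyin.py | Modify | Add tests for the recommendation feed |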

Task 1: Recommendation feed URL recognition

Files:

  • Modify: Douyin.py:17-19 (add the regular expression)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing test

def test_is_recommendation_url_accepts_douyin_homepage(self) -> None:
    module = importlib.import_module("Douyin")
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com/"))
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com"))
    self.assertTrue(module.is_recommendation_url("https://www.douyin.com/?from=web"))
    self.assertFalse(module.is_recommendation_url("https://www.douyin.com/user/xxx"))
    self.assertFalse(module.is_recommendation_url("https://www.douyin.com/video/123"))
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_is_recommendation_url_accepts_douyin_homepage -v
Expected: FAIL with "module has no attribute 'is_recommendation_url'"

  • Step 3: Implement the minimal code

Add to Douyin.py:

RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")

def is_recommendation_url(value: str) -> bool:
    return bool(RECOMMENDATION_URL_PATTERN.match(value.strip()))
  • Step 4: Run the test to confirm it passes

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_is_recommendation_url_accepts_douyin_homepage -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: add recommendation URL recognition"
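
As a quick check of what the Step 3 pattern does and does not accept, the sketch below runs it against a few inputs beyond those in the test. The extra URLs (the http variant, the fragment, the host without www, and the /discover path) are illustrative assumptions, not cases taken from this plan.

```python
import re

# Same pattern as in Step 3: bare homepage, optional trailing slash, optional query string.
RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")


def is_recommendation_url(value: str) -> bool:
    return bool(RECOMMENDATION_URL_PATTERN.match(value.strip()))


# Illustrative inputs (assumed for this sketch) mapped to the result the pattern gives.
samples = {
    "http://www.douyin.com/": True,            # plain http is accepted
    "  https://www.douyin.com/  ": True,       # surrounding whitespace is stripped first
    "https://www.douyin.com/#top": False,      # fragments are not covered by the pattern
    "https://douyin.com/": False,              # a host without "www." is not covered
    "https://www.douyin.com/discover": False,  # any path beyond "/" is rejected
}
results = {url: is_recommendation_url(url) for url in samples}
```

If fragment or non-www URLs should also count as the recommendation feed, the pattern would need to be widened; the plan does not say either way.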

Task 2: Extend target parsing to support the recommendation feed

Files:

  • Modify: Douyin.py:52-68 (modify parse_target_input)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing test

def test_parse_target_input_classifies_recommendation_url(self) -> None:
    module = importlib.import_module("Douyin")
    target = module.parse_target_input("https://www.douyin.com/", source="manual")
    self.assertEqual(target.kind, "recommendation")
    self.assertEqual(target.value, "https://www.douyin.com/")
    self.assertEqual(target.source, "manual")
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_target_input_classifies_recommendation_url -v
Expected: FAIL with "不支持的目标" (unsupported target)

  • Step 3: Modify parse_target_input
def parse_target_input(value: str, source: str) -> ResolvedTarget:
    normalized = value.strip()
    if is_recommendation_url(normalized):
        return ResolvedTarget(kind="recommendation", value=normalized, source=source)
    if is_creator_url(normalized):
        return ResolvedTarget(kind="creator", value=normalized, source=source)
    # ... the rest remains unchanged
  • Step 4: Run the test to confirm it passes

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_target_input_classifies_recommendation_url -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: extend target parsing to support recommendation URLs"
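
The classification order in Step 3 can be tried in isolation with the self-contained sketch below. It is only a sketch under assumptions: the ResolvedTarget fields are inferred from the test in Step 1, CREATOR_URL_PATTERN is a hypothetical stand-in for the existing creator check, and raising RuntimeError for unknown input is assumed rather than confirmed by this plan.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedTarget:
    # Field names follow the test in Step 1; the real class may carry more.
    kind: str
    value: str
    source: str


RECOMMENDATION_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/?(?:\?.*)?$")
# Hypothetical stand-in for the existing creator URL check in Douyin.py.
CREATOR_URL_PATTERN = re.compile(r"^https?://www\.douyin\.com/user/[^/?#]+")


def parse_target_input(value: str, source: str) -> ResolvedTarget:
    normalized = value.strip()
    if RECOMMENDATION_URL_PATTERN.match(normalized):
        return ResolvedTarget("recommendation", normalized, source)
    if CREATOR_URL_PATTERN.match(normalized):
        return ResolvedTarget("creator", normalized, source)
    # The real function has further branches that this plan elides.
    # The exception type here is an assumption.
    raise RuntimeError(f"unsupported target: {normalized}")
```

Because the recommendation pattern only matches the bare homepage, it cannot shadow creator or video URLs, so checking it first does not change how the existing targets are classified.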

Task 3: Enhance data parsing to extract author information

Files:

  • Modify: Douyin.py:140-170 (modify parse_aweme_items)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing test

def test_parse_aweme_items_extracts_author_info(self) -> None:
    module = importlib.import_module("Douyin")
    payload = {
        "aweme_list": [
            {
                "aweme_id": "7619989983668240802",
                "desc": "测试视频",
                "author": {
                    "nickname": "测试博主",
                    "uid": "123456789"
                },
                "video": {
                    "play_addr": {
                        "url_list": ["https://v26-web.douyinvod.com/example/video.mp4"]
                    }
                },
            }
        ]
    }
    items = module.parse_aweme_items(payload)
    self.assertEqual(len(items), 1)
    self.assertEqual(items[0]["author_name"], "测试博主")
    self.assertEqual(items[0]["author_id"], "123456789")
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_aweme_items_extracts_author_info -v
Expected: FAIL with a KeyError for the missing author_name key

  • Step 3: Modify parse_aweme_items
def parse_aweme_items(body: Any) -> list[dict[str, str]]:
    # ... existing code ...
    
    for aweme in aweme_list:
        # ... existing video extraction code ...
        
        author = aweme.get("author") or {}
        author_name = str(author.get("nickname") or "").strip() or "unknown"
        author_id = str(author.get("uid") or "").strip() or "unknown"
        
        items.append(
            {
                "title": title,
                "video_id": video_id,
                "video_url": choose_video_url([str(url) for url in url_list]),
                "author_name": author_name,
                "author_id": author_id,
            }
        )
    
    return items
  • Step 4: Run the test to confirm it passes

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_parse_aweme_items_extracts_author_info -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: extract author info from aweme items"
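
To make the fallback behavior of the author lookup in Step 3 concrete, here is a minimal sketch that isolates it in a hypothetical helper, extract_author_fields (not part of the plan), and feeds it a complete, a missing, and a blank author object. The sample payloads are made up for illustration.

```python
from typing import Any


def extract_author_fields(aweme: dict[str, Any]) -> dict[str, str]:
    """Hypothetical helper isolating the author lookup from Step 3."""
    # "or {}" covers both a missing "author" key and an explicit None value.
    author = aweme.get("author") or {}
    # Empty or whitespace-only values fall back to "unknown".
    author_name = str(author.get("nickname") or "").strip() or "unknown"
    author_id = str(author.get("uid") or "").strip() or "unknown"
    return {"author_name": author_name, "author_id": author_id}


complete = extract_author_fields({"author": {"nickname": " 测试博主 ", "uid": 123456789}})
missing = extract_author_fields({"aweme_id": "1"})
blank = extract_author_fields({"author": {"nickname": "   ", "uid": None}})
```

With this shape every item carries both keys as strings, so downstream code can read item["author_name"] without checking for absence.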

Task 4: Support filenames that include author information

Files:

  • Modify: Douyin.py:102-104 (modify build_output_path)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing tests

def test_build_output_path_with_author_uses_bracket_format(self) -> None:
    module = importlib.import_module("Douyin")
    output_path = module.build_output_path(
        title="测试标题", 
        video_id="123456",
        author_name="测试博主"
    )
    self.assertEqual(output_path.as_posix(), "video/[测试博主]测试标题-123456.mp4")

def test_build_output_path_without_author_uses_original_format(self) -> None:
    module = importlib.import_module("Douyin")
    output_path = module.build_output_path("测试标题", "123456")
    self.assertEqual(output_path.as_posix(), "video/测试标题-123456.mp4")
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_output_path_with_author_uses_bracket_format -v
Expected: FAIL with unexpected keyword argument 'author_name'

  • Step 3: Modify build_output_path
def build_output_path(
    title: str, 
    video_id: str, 
    output_dir: Path = Path("video"),
    author_name: str | None = None,
) -> Path:
    safe_title = sanitize_filename(title, fallback="untitled")
    if author_name:
        safe_author = sanitize_filename(author_name, fallback="unknown")
        filename = f"[{safe_author}]{safe_title}-{video_id}.mp4"
    else:
        filename = f"{safe_title}-{video_id}.mp4"
    return output_dir / filename
  • Step 4: Run the tests to confirm they pass

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_output_path_with_author_uses_bracket_format test_douyin.py::DouyinModuleTests::test_build_output_path_without_author_uses_original_format -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: support author prefix in output filename"
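
The sketch below runs the Step 3 function outside the module to show how the author prefix interacts with sanitization. The sanitize_filename shown here is a hypothetical stand-in, since the real one in Douyin.py is not reproduced in this plan; its exact replacement rules are an assumption.

```python
from __future__ import annotations

import re
from pathlib import Path


def sanitize_filename(name: str, fallback: str) -> str:
    # Hypothetical stand-in: replace characters that are unsafe in filenames,
    # then trim spaces, dots, and underscores from both ends.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', "_", name).strip(" ._")
    return cleaned or fallback


def build_output_path(
    title: str,
    video_id: str,
    output_dir: Path = Path("video"),
    author_name: str | None = None,
) -> Path:
    safe_title = sanitize_filename(title, fallback="untitled")
    if author_name:
        # Sanitize the author too, since nicknames may contain path separators.
        safe_author = sanitize_filename(author_name, fallback="unknown")
        filename = f"[{safe_author}]{safe_title}-{video_id}.mp4"
    else:
        filename = f"{safe_title}-{video_id}.mp4"
    return output_dir / filename


with_author = build_output_path("测试标题", "123456", author_name="测试博主")
without_author = build_output_path("测试标题", "123456")
unsafe_author = build_output_path("title", "1", author_name="a/b")
```

Making author_name an optional keyword with a None default keeps existing two-argument callers on the original filename format, which is what the second test in Step 1 checks.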

Task 5: Implement the collect_recommendations() function

Files:

  • Modify: Douyin.py (add the new function)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing test

def test_collect_recommendations_downloads_videos_with_author_prefix(self) -> None:
    module = importlib.import_module("Douyin")
    packet = FakePacket(
        {
            "aweme_list": [
                {
                    "aweme_id": "7619989983668240802",
                    "desc": "推荐视频1",
                    "author": {"nickname": "博主A", "uid": "111"},
                    "video": {
                        "play_addr": {
                            "url_list": ["https://v26-web.douyinvod.com/example/video1.mp4"]
                        }
                    },
                }
            ]
        }
    )
    page = FakeRuntimePage("https://www.douyin.com/", packet)
    
    with mock.patch.object(module, "import_runtime_dependencies", return_value=(object(), object(), object())):
        with mock.patch.object(module, "create_page", return_value=page):
            with mock.patch.object(module, "download_video") as mocked_download:
                downloaded = module.collect_recommendations(
                    max_videos=50,
                    timeout=10,
                    output_dir=module.Path("video"),
                    browser_port=None,
                )
    
    self.assertEqual(downloaded, 1)
    # Verify that the filename contains the author prefix
    call_kwargs = mocked_download.call_args[1]
    self.assertIn("[博主A]", str(call_kwargs["output_path"]))
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_collect_recommendations_downloads_videos_with_author_prefix -v
Expected: FAIL with "module has no attribute 'collect_recommendations'"

  • Step 3: Implement collect_recommendations
def collect_recommendations(
    max_videos: int,
    timeout: int,
    output_dir: Path,
    browser_port: int | None,
) -> int:
    requests_module, chromium_page_cls, chromium_options_cls = import_runtime_dependencies()
    headers = build_headers("https://www.douyin.com/")
    if browser_port is not None:
        ensure_browser_debug_port_ready(browser_port)
    page = create_page(chromium_page_cls, chromium_options_cls, browser_port)
    page.listen.start(LISTEN_TARGET)

    print("[INFO] 正在打开抖音推荐流。若出现登录或验证码,请先在浏览器窗口里完成。")
    page.get("https://www.douyin.com/")
    time.sleep(3)

    downloaded = 0
    seen_ids: set[str] = set()
    consecutive_empty = 0
    max_consecutive_empty = 3

    while downloaded < max_videos:
        packet = wait_for_aweme_packet(page, timeout=timeout)
        if packet is None:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                print("[INFO] 连续多次未获取到新数据,结束抓取。")
                break
            scroll_to_next_page(page)
            continue

        try:
            payload = extract_aweme_payload(packet.response)
            items = parse_aweme_items(payload)
        except Exception as exc:
            print(f"[WARN] 解析接口数据失败: {exc}")
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
            scroll_to_next_page(page)
            continue

        if not items:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
            scroll_to_next_page(page)
            continue

        new_items_in_batch = 0

        for item in items:
            if item["video_id"] in seen_ids:
                continue

            if downloaded >= max_videos:
                break

            seen_ids.add(item["video_id"])
            output_path = build_output_path(
                title=item["title"],
                video_id=item["video_id"],
                output_dir=output_dir,
                author_name=item.get("author_name"),
            )

            try:
                download_video(
                    requests_module=requests_module,
                    headers=headers,
                    video_url=item["video_url"],
                    output_path=output_path,
                )
            except Exception as exc:
                print(f"[WARN] 下载失败 {item['video_id']}: {exc}")
                continue

            downloaded += 1
            new_items_in_batch += 1
            print(f"[OK] 已保存: {output_path}")

        # Reset the empty counter only when this batch produced a new download.
        # Resetting it for every non-empty batch would let a feed that keeps
        # returning already-seen videos loop forever.
        if new_items_in_batch > 0:
            consecutive_empty = 0
        else:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break

        scroll_to_next_page(page)

    return downloaded
  • Step 4: Run the test to confirm it passes

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_collect_recommendations_downloads_videos_with_author_prefix -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: implement collect_recommendations() for For You page"
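
The stop conditions of the crawl loop (the video cap, deduplication, and the consecutive-empty limit) are easier to reason about without a browser. The sketch below is a simplified model, not the implementation: drain_feed is a hypothetical function, each batch stands for one wait on the feed (None for a timeout, otherwise a list of video IDs), and a batch that yields nothing new is assumed to count toward the limit whether it was a timeout, empty, or made up only of already-seen IDs.

```python
from typing import Iterable, Optional


def drain_feed(
    batches: Iterable[Optional[list[str]]],
    max_videos: int,
    max_consecutive_empty: int = 3,
) -> list[str]:
    """Return the video IDs that would be downloaded, in order."""
    accepted: list[str] = []
    seen_ids: set[str] = set()
    consecutive_empty = 0

    for batch in batches:
        if len(accepted) >= max_videos:
            break
        new_in_batch = 0
        for video_id in batch or []:  # a timeout (None) behaves like an empty batch
            if video_id in seen_ids:
                continue
            if len(accepted) >= max_videos:
                break
            seen_ids.add(video_id)
            accepted.append(video_id)
            new_in_batch += 1
        if new_in_batch > 0:
            consecutive_empty = 0
        else:
            consecutive_empty += 1
            if consecutive_empty >= max_consecutive_empty:
                break
    return accepted


mixed = drain_feed([["a", "b"], ["b", "c"], None, ["c"], ["d"]], max_videos=50)
stalled = drain_feed([["a"], ["a"], ["a"], ["a"], ["b"]], max_videos=50)
capped = drain_feed([["a", "b", "c"]], max_videos=2)
```

In the second call the feed repeats the same ID three times in a row, so the model stops before ever reaching "b"; that is the behavior the consecutive-empty limit is meant to provide.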

Task 6: Add the --max-videos command-line argument

Files:

  • Modify: Douyin.py:295-305 (modify build_parser)

  • Modify: Douyin.py:310-350 (modify main)

  • Test: test_douyin.py (add tests)

  • Step 1: Write the failing tests

def test_build_parser_has_max_videos_argument(self) -> None:
    module = importlib.import_module("Douyin")
    args = module.build_parser().parse_args(["--max-videos", "30"])
    self.assertEqual(args.max_videos, 30)

def test_main_dispatches_recommendation_flow_for_recommendation_url(self) -> None:
    module = importlib.import_module("Douyin")
    stdout = io.StringIO()
    recommendation_target = module.ResolvedTarget(
        kind="recommendation",
        value="https://www.douyin.com/",
        source="current-page",
    )
    with redirect_stdout(stdout):
        with mock.patch.object(module, "resolve_cli_target", return_value=recommendation_target):
            with mock.patch.object(module, "collect_recommendations", return_value=5) as mocked_collect:
                exit_code = module.main([])
    self.assertEqual(exit_code, 0)
    mocked_collect.assert_called_once_with(
        max_videos=50,
        timeout=10,
        output_dir=module.Path("video"),
        browser_port=9223,
    )
  • Step 2: Run the test to confirm it fails

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_parser_has_max_videos_argument -v
Expected: FAIL with "unrecognized arguments: --max-videos"

  • Step 3: Modify build_parser and main
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="附着抖音登录浏览器并下载当前页面或指定目标的视频")
    parser.add_argument(
        "target",
        nargs="?",
        default=None,
        help="可选:博主主页 URL、单视频 URL 或 aweme_id不传则读取当前浏览器页面",
    )
    parser.add_argument("--pages", type=int, default=1, help="创作者抓取最多处理多少页;默认 1")
    parser.add_argument("--timeout", type=int, default=10, help="单次等待接口响应秒数,默认 10")
    parser.add_argument(
        "--output-dir",
        default="video",
        help="视频输出目录,默认 video",
    )
    parser.add_argument(
        "--browser-port",
        type=int,
        default=DEFAULT_BROWSER_PORT,
        help="附着到已启动 Chrome 的调试端口,默认 9223",
    )
    parser.add_argument(
        "--max-videos",
        type=int,
        default=50,
        help="推荐流最大抓取数量,默认 50",
    )
    return parser


def main(argv: list[str] | None = None) -> int:
    parser = build_parser()
    args = parser.parse_args(argv)

    if args.pages <= 0:
        parser.error("--pages 必须大于 0")
    if args.timeout <= 0:
        parser.error("--timeout 必须大于 0")
    if args.browser_port is not None and args.browser_port <= 0:
        parser.error("--browser-port 必须大于 0")
    if args.max_videos <= 0:
        parser.error("--max-videos 必须大于 0")

    try:
        target = resolve_cli_target(args.target, browser_port=args.browser_port)
        if target.kind == "creator":
            total = collect_videos(
                user_url=target.value,
                max_pages=args.pages,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
                auto_scroll=args.pages > 1,
            )
        elif target.kind == "recommendation":
            total = collect_recommendations(
                max_videos=args.max_videos,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
            )
        elif target.kind == "single-video":
            total = collect_single_video(
                target=target,
                timeout=args.timeout,
                output_dir=Path(args.output_dir),
                browser_port=args.browser_port,
            )
        else:
            raise RuntimeError(f"不支持的目标类型: {target.kind}")
    except RuntimeError as exc:
        print(f"[ERROR] {exc}")
        return 1
    except KeyboardInterrupt:
        print("\n[INFO] 用户中断。")
        return 130

    print(f"[INFO] 处理结束,共下载 {total} 个视频。")
    return 0
  • Step 4: Run the tests to confirm they pass

Run: python3 -m pytest test_douyin.py::DouyinModuleTests::test_build_parser_has_max_videos_argument test_douyin.py::DouyinModuleTests::test_main_dispatches_recommendation_flow_for_recommendation_url -v
Expected: PASS

  • Step 5: Commit
git add Douyin.py test_douyin.py
git commit -m "feat: add --max-videos argument and wire recommendation flow in main"
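
The new option and its range check can be exercised on their own with the reduced parser below. This is a sketch: parse_max_videos is a hypothetical wrapper, the parser carries only the option added in this task, and the error text is in English for illustration rather than copied from the plan.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Reduced parser: only the option added in this task.
    parser = argparse.ArgumentParser(prog="Douyin.py")
    parser.add_argument("--max-videos", type=int, default=50)
    return parser


def parse_max_videos(argv: list[str]) -> int:
    """Hypothetical wrapper: parse argv and apply the same check as main()."""
    parser = build_parser()
    args = parser.parse_args(argv)  # "--max-videos" is exposed as args.max_videos
    if args.max_videos <= 0:
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("--max-videos must be greater than 0")
    return args.max_videos


default_value = parse_max_videos([])
explicit_value = parse_max_videos(["--max-videos", "30"])
```

Validating after parse_args, as main does for --pages and --timeout, keeps all range checks in one place and reports them through the parser's usual usage message.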

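Task 7: Run the full test suite and verify

  • Step 1: Run all tests

Run: python3 -m pytest test_douyin.py -v
Expected: all tests pass

  • Step 2: Check the main script's help output

Run: python3 Douyin.py --help
Expected: help output that includes --max-videos

  • Step 3: Commit any remaining changes (skip if the working tree is clean)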
git add -A
git commit -m "test: verify all tests pass for recommendation crawling feature"

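Completion criteria

  1. Douyin.py recognizes https://www.douyin.com/ as a recommendation feed target
  2. collect_recommendations() implements scroll loading, deduplication, and a cap of --max-videos (default 50)
  3. Video filenames include the author's nickname: [author name]title-aweme_id.mp4
  4. The --max-videos command-line argument is available
  5. All existing tests continue to pass
  6. New tests cover the recommendation feed feature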