Java Playwright crawler

# Official documentation

See the official Playwright website.

# Adding the dependency

<dependency>
  <groupId>com.microsoft.playwright</groupId>
  <artifactId>playwright</artifactId>
  <version>1.48.0</version>
</dependency>

api "com.microsoft.playwright:playwright:1.48.0"

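# Pitfalls

Read these together with the comments in the code.

  • playwright.chromium().launch vs. playwright.chromium().connectOverCDP
    • launch starts a local browser; playwright.close() shuts down every process of the browser it started.
    • connectOverCDP attaches to a browser that is already running with remote debugging enabled; playwright.close() stops the local Node.js process and closes the contexts and pages it opened, but it does not close the remote browser.
  • Register page.onResponse early, before navigating; otherwise requests and responses that fire before the handler is attached are not captured.
  • If you set a wait timeout on a page.locator operation and the element is still not found when the timeout expires, an exception is thrown.
  • When fetching elements with page.locator, mind the timing: the element may not exist yet at the moment you query it, which causes errors.
  • Unless you have special requirements, leave Node.js and Chrome unset and use the versions bundled with Playwright. If you deploy to CentOS 7 or another environment that no longer receives updates, you need to set them yourself; otherwise you may see startup failures, large numbers of leftover processes, and large numbers of open files.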
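The last point can be sketched as a small helper that adds the overrides only when they are actually configured, so a default deployment keeps the bundled Node.js and browser. This is a minimal sketch, not part of Playwright: the class name `PlaywrightEnv` and the method `build` are hypothetical, and the resulting map is meant to be passed to `Playwright.CreateOptions().setEnv(...)`. Only the two environment variable names come from this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Playwright class): builds the environment map
// intended for Playwright.CreateOptions().setEnv(...).
class PlaywrightEnv {

    // nodePath: path to a custom Node.js binary, or null/blank to keep the bundled one.
    // skipBrowserDownload: true when a browser binary is already installed on the host.
    static Map<String, String> build(String nodePath, boolean skipBrowserDownload) {
        Map<String, String> env = new LinkedHashMap<>();
        if (skipBrowserDownload) {
            // "1" tells Playwright not to download its own browsers
            env.put("PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD", "1");
        }
        if (nodePath != null && !nodePath.trim().isEmpty()) {
            // Only override Node.js when a path was explicitly configured
            env.put("PLAYWRIGHT_NODEJS_PATH", nodePath);
        }
        return env;
    }

    public static void main(String[] args) {
        // Default deployment: no overrides
        System.out.println(build(null, false));
        // Legacy host such as CentOS 7: custom Node.js and a pre-installed browser
        System.out.println(build("/usr/local/bin/node", true));
    }
}
```

On a current system you would call `PlaywrightEnv.build(null, false)` and get an empty map; on a legacy host you would pass the Node.js path and also set the Chrome executable path in the launch options.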

# Running locally

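// Required imports: com.microsoft.playwright.*, java.nio.file.Paths, java.util.HashMap, java.util.Map
// Assumes `log` (an SLF4J-style logger) and `finalUrl` (the target URL) are defined elsewhere.
Playwright playwright = null;
Browser browser = null;
try {
    // Environment variables for the Playwright driver
    Map<String, String> env = new HashMap<>();
    // Set to "1" to skip the browser download; otherwise set to "0" or leave unset
    env.put("PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD", "1");
    // Set only if you need a specific Node.js binary or version; otherwise the bundled one is used.
    // Recommended on old systems such as CentOS 7.
    env.put("PLAYWRIGHT_NODEJS_PATH", "/usr/local/bin/node");
    Playwright.CreateOptions createOptions = new Playwright.CreateOptions().setEnv(env);
    // Start the Playwright driver process
    playwright = Playwright.create(createOptions);

    BrowserType.LaunchOptions launchOptions = new BrowserType.LaunchOptions()
            .setTimeout(5000);
    // Set only if you want to launch a specific Chrome. Recommended on old systems such as CentOS 7.
    // Replace the placeholder with the path to the Chrome executable,
    // e.g. C:\\apps\\chrome\\chrome.exe on Windows or /usr/bin/chrome on Linux.
    launchOptions.setExecutablePath(Paths.get("path/to/chrome"));
    // Launch the browser locally
    browser = playwright.chromium().launch(launchOptions);
    // Or attach to a remote browser instead. Note that connectOverCDP takes
    // BrowserType.ConnectOverCDPOptions, not LaunchOptions.
    // browser = playwright.chromium().connectOverCDP("Chrome remote-debugging endpoint URL");
    // Create a new page
    Page page = browser.newPage();
    // Listen for responses. Register this before navigate(); otherwise some early responses are missed.
    page.onResponse(response -> {
        log.info("Intercepted response: {}", response.url());
    });

    // Open the URL
    page.navigate(finalUrl, new Page.NavigateOptions().setTimeout(3000));

    if (!page.isClosed()) {
        page.close();
    }

} catch (Exception e) {
    log.error("Exception: {}", e.getMessage(), e);
} finally {
    // Close the browser first, then the Playwright driver
    if (browser != null) {
        browser.close();
    }
    if (playwright != null) {
        playwright.close();
    }
}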
Last updated: 2024-11-06, 19:27:10