Java Playwright Web Scraping
# Official documentation
# Adding the dependency
    <dependency>
        <groupId>com.microsoft.playwright</groupId>
        <artifactId>playwright</artifactId>
        <version>1.48.0</version>
    </dependency>
Or, with Gradle:

    api "com.microsoft.playwright:playwright:1.48.0"
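# Pitfalls
Read these together with the comments in the code below.
- playwright.chromium().launch vs. playwright.chromium().connectOverCDP
  - launch starts a local browser; playwright.close() shuts down every process belonging to the browser it started.
  - connectOverCDP attaches to a browser that is already running with remote debugging enabled; playwright.close() stops the local Node.js process and closes the contexts and pages it opened, but it does not close the remote browser.
- Register page.onResponse early (before navigating); otherwise requests and responses that occur before the handler is registered are not captured.
- If you set a timeout when waiting for an element obtained through page.locator and the element still has not appeared when the timeout expires, an exception is thrown.
- Mind the timing when fetching elements through page.locator (for example, an element that has not rendered yet); otherwise errors occur.
- Node.js and Chrome do not need to be configured unless you have special requirements; the versions bundled with Playwright work. If you deploy to an environment that no longer receives updates, such as CentOS 7, you need to point Playwright at your own Node.js and Chrome; otherwise you may see failures to start, large numbers of lingering processes, and large numbers of open files.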
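
The page.onResponse pitfall in the list above comes down to ordering: a listener only sees events that fire after it has been registered. The following is a minimal toy model of that behaviour, not Playwright code. `FakePage` and both helper methods are hypothetical stand-ins, written so the sketch runs without the Playwright dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ListenerOrderSketch {

    /** Hypothetical stand-in for a page: it notifies only the listeners registered so far. */
    static class FakePage {
        private final List<Consumer<String>> listeners = new ArrayList<>();

        void onResponse(Consumer<String> listener) {
            listeners.add(listener);
        }

        void navigate(String url) {
            // In this toy model the response is delivered during navigation.
            for (Consumer<String> listener : listeners) {
                listener.accept(url);
            }
        }
    }

    /** Registers the listener first, then navigates: the response is captured. */
    public static List<String> registerThenNavigate(String url) {
        FakePage page = new FakePage();
        List<String> seen = new ArrayList<>();
        page.onResponse(seen::add);
        page.navigate(url);
        return seen;
    }

    /** Navigates first, then registers: the response has already gone by. */
    public static List<String> navigateThenRegister(String url) {
        FakePage page = new FakePage();
        List<String> seen = new ArrayList<>();
        page.navigate(url);
        page.onResponse(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(registerThenNavigate("https://example.com")); // prints [https://example.com]
        System.out.println(navigateThenRegister("https://example.com")); // prints []
    }
}
```

In real Playwright code the same rule applies to the call order of page.onResponse and page.navigate: registering the handler first is the safe choice.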
# Local launch
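    // Imports needed: com.microsoft.playwright.*, java.nio.file.Paths, java.util.HashMap, java.util.Map.
    // Assumed to exist in the surrounding class: an SLF4J-style logger `log` and the target URL `finalUrl`.
    Playwright playwright = null;
    Browser browser = null;
    try {
        // Environment variables passed to the Playwright driver
        Map<String, String> env = new HashMap<>();
        // Set to "1" to skip the browser download; otherwise set "0" or leave it unset
        env.put("PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD", "1");
        // Set only if you need a specific Node.js; otherwise the bundled one is used. Recommended on old systems such as CentOS 7
        env.put("PLAYWRIGHT_NODEJS_PATH", "/usr/local/bin/node");
        Playwright.CreateOptions createOptions = new Playwright.CreateOptions().setEnv(env);
        // Start the Playwright driver process
        playwright = Playwright.create(createOptions);
        BrowserType.LaunchOptions launchOptions = new BrowserType.LaunchOptions()
                .setTimeout(5000);
        // Set only if you want to launch a specific Chrome. Recommended on old systems such as CentOS 7.
        // Replace the placeholder with the Chrome executable path, e.g. C:\apps\chrome\chrome.exe on Windows or /usr/bin/chrome on Linux
        launchOptions.setExecutablePath(Paths.get("path to the Chrome executable"));
        // Launch a local browser
        browser = playwright.chromium().launch(launchOptions);
        // Or connect to a remote browser instead (connectOverCDP takes ConnectOverCDPOptions, not LaunchOptions)
        // browser = playwright.chromium().connectOverCDP("Chrome remote-debugging endpoint");
        // Create a new page
        Page page = browser.newPage();
        // Register the listener before navigate; otherwise some requests are intercepted too late
        page.onResponse(response -> {
            log.info("Intercepted response for: {}", response.url());
        });
        // Visit the URL
        page.navigate(finalUrl, new Page.NavigateOptions().setTimeout(3000));
        if (!page.isClosed()) {
            page.close();
        }
    } catch (Exception e) {
        log.error("Exception:", e);
    } finally {
        if (browser != null) {
            browser.close();
        }
        if (playwright != null) {
            playwright.close();
        }
    }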
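
The finally block above closes the browser first and then Playwright. Playwright and Browser both implement AutoCloseable, so try-with-resources is one possible alternative to the manual null checks. The sketch below only illustrates the ordering guarantee that makes this work: resources are closed in reverse order of declaration, and before any catch block runs. `Recording` is a hypothetical stand-in class, used so the example runs without the Playwright dependency.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {

    /** Hypothetical stand-in for an AutoCloseable resource; records when it is closed. */
    static class Recording implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Recording(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Opens two resources, fails inside the body, and returns the recorded event order. */
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Recording playwright = new Recording("playwright", log);
             Recording browser = new Recording("browser", log)) {
            // Simulate a failure while the resources are open.
            throw new IllegalStateException("navigation failed");
        } catch (IllegalStateException e) {
            // By the time this runs, both resources have already been closed.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
        // prints [close browser, close playwright, caught navigation failed]
    }
}
```

Declaring the Playwright instance first and the browser second therefore reproduces the close order of the manual finally block.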
Last updated: 2024-11-06, 19:27:10