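Why do we need a headless browser? An ordinary crawler works fine for static pages, but when a page's data is loaded dynamically by JavaScript after the initial response, fetching the HTML alone does not return that data. Take this site, for example:
https://www.binance.com/zh-CN/markets/overview
The market data on it cannot be fetched that way. This is where a headless browser comes in: it loads the page the way a real browser does, scripts included, so the rendered data can be extracted.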
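One quick way to tell whether a page needs a headless browser is to download its raw HTML and look for a value you can see in the real browser. The sketch below is only an illustration: the class name `StaticFetchCheck` and both helper methods are made up for this post, and the URL and the text to search for are passed in as arguments rather than assumed.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Selenium or of the project in this post.
class StaticFetchCheck {

    /** Downloads the raw HTML the way a plain crawler would: no JavaScript is executed. */
    static String fetchRawHtml(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** True if the text visible in the rendered page is already present in the raw HTML. */
    static boolean containsExpectedText(String rawHtml, String expectedText) {
        return rawHtml != null && rawHtml.contains(expectedText);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: StaticFetchCheck <url> <text-visible-in-browser>");
            return;
        }
        String html = fetchRawHtml(args[0]);
        System.out.println("Found in raw HTML: " + containsExpectedText(html, args[1]));
    }
}
```

If the text is missing from the raw HTML but visible in the browser, the data is most likely filled in by scripts, and that is the case the rest of this post deals with.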
I use Java here to extract the data.
- Add the Selenium dependency to the Maven project:
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.13.0</version>
</dependency>
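- Download the matching ChromeDriver:
https://developer.chrome.com/docs/chromedriver/downloads?hl=zh-cn
Pick the version that matches your installed Chrome, then unzip the archive; it contains three files.
With that in place, you can write the crawling code.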
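"Matching" here means, as far as I know, that ChromeDriver and Chrome share the same major version (the first number of the dotted version string). A small sketch of that check follows; `DriverVersionCheck` is a made-up helper, and the version strings in the usage note are just example values.

```java
// Hypothetical helper for comparing Chrome and ChromeDriver versions.
class DriverVersionCheck {

    /** Extracts the major version from a dotted version string such as "126.0.6478.126". */
    static int majorVersion(String version) {
        String trimmed = version.trim();
        int dot = trimmed.indexOf('.');
        String major = dot < 0 ? trimmed : trimmed.substring(0, dot);
        return Integer.parseInt(major);
    }

    /** True when both version strings start with the same major number. */
    static boolean sameMajor(String chromeVersion, String driverVersion) {
        return majorVersion(chromeVersion) == majorVersion(driverVersion);
    }
}
```

For example, `DriverVersionCheck.sameMajor("126.0.6478.126", "126.0.6478.61")` is true, while pairing the same Chrome with a `114.x` driver is false. The driver's own version can be read with `chromedriver --version`.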
- Write the code:
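// Assumed imports: `log` from Lombok's @Slf4j, ArrayUtil from Hutool, Environment from Spring.
import cn.hutool.core.util.ArrayUtil;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.NoSuchSessionException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.core.env.Environment;

import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

@Slf4j
public class WebDriverPool {
    private final BlockingQueue<WebDriver> pool;
    private final Environment env;
    private final int poolSize = 2; // tune to the server's capacity

    public WebDriverPool(Environment env) {
        this.env = env;
        this.pool = new LinkedBlockingQueue<>(poolSize);
        initializePool();
    }

    // Fill the pool with poolSize ready-to-use browser instances
    private void initializePool() {
        for (int i = 0; i < poolSize; i++) {
            pool.add(createNewDriver());
        }
    }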
    private WebDriver createNewDriver() {
        if (ArrayUtil.contains(env.getActiveProfiles(), "dev")) {
            // Path to the local ChromeDriver binary
            System.setProperty("webdriver.chrome.driver", "/Users/zts/Downloads/chromedriver-mac-arm641/chromedriver");
            log.info("Using the local ChromeDriver on Mac");
        }
        //System.setProperty("webdriver.chrome.driver", "/root/chromedriver-linux64/chromedriver");
        ChromeOptions options = new ChromeOptions();
        //options.setBinary("/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"); // explicitly point at the ARM build of Chrome
        options.addArguments("--no-default-browser-check");
        options.setExperimentalOption("excludeSwitches", Arrays.asList("enable-automation", "enable-logging"));
        options.addArguments(
                "--headless=new",
                "--disable-gpu",
                "--no-sandbox",
                "--disable-dev-shm-usage",
                "--remote-allow-origins=*", // accept DevTools connections from any origin
                "--disable-features=EnableNetworkService",
                "--force-renderer-accessibility",
                "--disable-extensions", // disable extensions
                "--disable-software-rasterizer", // disable the software rasterizer fallback
                "--disable-images", // do not load images
                "--blink-settings=imagesEnabled=false",
                // Caution: if Chrome honors this flag, dynamically loaded data cannot render;
                // remove it when the target page depends on JavaScript.
                "--disable-javascript"
        );
        log.info("Creating a new Chrome instance");
        return new ChromeDriver(options);
    }
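    public WebDriver borrowDriver() throws InterruptedException {
        WebDriver driver = pool.take(); // blocks until a driver is free
        // Reset the browser state
        try {
            driver.get("about:blank");
        } catch (Exception e) {
            log.error("Failed to reset browser state", e);
            // The driver is already out of the pool: quit it and hand out a fresh
            // instance directly, so the total number of drivers stays at poolSize.
            try {
                driver.quit();
            } catch (Exception quitError) {
                log.warn("Failed to quit broken driver", quitError);
            }
            driver = createNewDriver();
        }
        return driver;
    }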
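    public void markAsBroken(WebDriver driver) {
        if (driver == null) return;
        try {
            driver.quit();
        } catch (Exception e) {
            log.warn("Failed to quit broken driver", e);
        }
        try {
            pool.remove(driver); // remove it from the pool if it is still there
            log.info("remove driver");
            WebDriver replacement = createNewDriver();
            if (pool.offer(replacement)) { // add a replacement instance
                log.info("add driver");
            } else {
                replacement.quit(); // pool is already full: do not leak the extra browser
            }
        } catch (Exception e) {
            log.error("Failed to replace broken driver", e);
        }
    }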
    // public void returnDriver(WebDriver driver) {
    //     try {
    //         // Basic health check
    //         if (driver.getTitle() == null) {
    //             throw new RuntimeException("Browser instance is unhealthy");
    //         }
    //         pool.put(driver);
    //     } catch (Exception e) {
    //         log.error("Failed to return driver to the pool, creating a new instance", e);
    //         driver.quit();
    //         pool.offer(createNewDriver());
    //     }
    // }
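    public void returnDriver(WebDriver driver) {
        try {
            // Step 1: verify that the driver session is still alive
            if (!isDriverAlive(driver)) {
                throw new IllegalStateException("Driver session is dead");
            }
            // Step 2: run a lightweight health check (no page navigation)
            ((JavascriptExecutor) driver).executeScript("return true;");
            // Step 3: put the driver back into the pool
            pool.put(driver);
        } catch (Exception e) {
            log.error("Failed to return driver to the pool, replacing it", e);
            try {
                driver.quit();
            } catch (Exception quitError) {
                log.warn("Failed to quit broken driver", quitError);
            }
            try {
                Thread.sleep(2000);
            } catch (InterruptedException t) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                log.error("Interrupted during the 2-second wait", t);
            }
            pool.offer(createNewDriver());
        }
    }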
    // Check whether the driver session is still alive
    private boolean isDriverAlive(WebDriver driver) {
        try {
            driver.getWindowHandle(); // try to read the window handle
            return true;
        } catch (NoSuchSessionException | NullPointerException e) {
            return false;
        }
    }

    public void replaceBrokenDriver() {
        pool.offer(createNewDriver());
    }

    // Quit every pooled browser and empty the pool
    public void shutdown() {
        pool.forEach(driver -> {
            try {
                driver.quit();
            } catch (Exception e) {
                log.warn("Exception while closing WebDriver", e);
            }
        });
        pool.clear();
    }
}
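The pool only works if every borrowed driver goes back, even when the crawl throws. Here is a minimal, Selenium-free sketch of that borrow/return discipline. `BorrowReturnSketch` is a made-up generic stand-in where `T` plays the role of `WebDriver`; it is not how `WebDriverPool` above is implemented, just the same idea reduced to the standard library.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical generic pool: T stands in for WebDriver, factory for createNewDriver().
class BorrowReturnSketch<T> {
    private final BlockingQueue<T> pool;
    private final Supplier<T> factory;

    BorrowReturnSketch(int size, Supplier<T> factory) {
        this.factory = factory;
        this.pool = new LinkedBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(factory.get());
        }
    }

    /** Borrows one instance, runs the task, and always puts an instance back. */
    <R> R withResource(Function<T, R> task) {
        T resource;
        try {
            resource = pool.take(); // blocks while every instance is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
        boolean healthy = false;
        try {
            R result = task.apply(resource);
            healthy = true;
            return result;
        } finally {
            // take() freed one slot, so this offer always fits. After a failure a fresh
            // instance goes back instead; a real pool would also quit() the broken one.
            pool.offer(healthy ? resource : factory.get());
        }
    }

    /** Number of instances currently idle in the pool. */
    int available() {
        return pool.size();
    }
}
```

With the class from this post, the same shape is `WebDriver d = pool.borrowDriver(); try { ... } finally { pool.returnDriver(d); }`, so the pool never runs dry because of a failed crawl.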
Next, crawl the data from the page:
private List<List<String>> performCrawling(WebDriver driver) {
    try {
        driver.get(TARGET_URL);
        // Wait for the dynamic content to load (adjust the selector to the actual page).
        // The Duration-based constructor is the Selenium 4 API; on selenium-java 3.x
        // use new WebDriverWait(driver, 1) instead.
        WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(1));
        WebElement priceTable = wait.until(ExpectedConditions.visibilityOfElementLocated(
                By.cssSelector("div.quote-page-content div.quote-price-table")
        ));
        log.info("Page loaded, starting to scrape data...");
        // Extract all price rows
        List<List<String>> goldData = getLists((JavascriptExecutor) driver);
        log.info("Scrape finished, row count: " + goldData.size());
        // Print the gold data
        // System.out.println("======= Gold price data =======");
        // goldData.forEach(row -> {
        //     System.out.printf("Product: %-10s | Buyback: %-8s col: %-10s| Sell: %-8s col: %-10s| High: %-8s col: %-10s| Low: %-8s col: %-10s",
        //             row.get(0), row.get(1), row.get(2), row.get(3), row.get(4), row.get(5), row.get(6), row.get(7), row.get(8));
        //     System.out.println();
        // });
        //driver.quit();
        return goldData;
    } catch (Exception e) {
        log.error("Unexpected error during crawling", e);
        throw new RuntimeException("Crawl failed", e);
    }
}
TARGET_URL is the URL of the page to extract data from.
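The explicit wait in the crawl method is what makes dynamic pages workable: instead of reading the DOM right after `driver.get`, it re-checks a condition until it holds or the timeout runs out. The sketch below shows that polling idea with the standard library only. `PollingWaitSketch.waitUntil` is a made-up helper for illustration, not Selenium's actual implementation.

```java
import java.time.Duration;
import java.util.function.Supplier;

// Hypothetical illustration of an explicit wait; not Selenium's WebDriverWait.
class PollingWaitSketch {

    /** Re-evaluates the condition until it returns non-null or the timeout elapses. */
    static <T> T waitUntil(Supplier<T> condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = condition.get();
            if (value != null) {
                return value; // condition met: hand back whatever it produced
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }
}
```

As far as I know, Selenium's `WebDriverWait` polls every 500 ms by default, so a 1-second timeout leaves room for only a few checks. If the table is slow to render, a longer timeout is probably the first thing to try.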