Node.js Crawler Example Code

Handling sites without anti-crawler protection

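Sites without anti-crawler protection are simple to handle: send the request directly. For protected sites, a direct request usually returns either encrypted data or an intercepted response that does not contain the data we need. For those, use the second approach (the section on sites with anti-crawler protection), which works for sites in general.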
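
One way to tell the two cases apart is to check the direct response before parsing it. The sketch below is only a heuristic under stated assumptions: `isUsableResponse` is a hypothetical helper, and the `marker` string (a fragment you expect in the real page) is something you would pick yourself for each site.

```javascript
// Hypothetical helper: decide whether a directly fetched page looks usable.
// `status` is the HTTP status code, `html` is the response body, and
// `marker` is a substring the real page is expected to contain (an assumption).
function isUsableResponse(status, html, marker) {
  const okStatus = status >= 200 && status < 300;
  const hasMarker = typeof html === 'string' && html.includes(marker);
  return okStatus && hasMarker;
}

// Made-up inputs for illustration only.
console.log(isUsableResponse(200, '<div class="block">...</div>', 'class="block"')); // true
console.log(isUsableResponse(200, '<script>challenge()</script>', 'class="block"')); // false
console.log(isUsableResponse(403, '', 'class="block"'));                             // false
```

If the check fails, the direct request was probably intercepted and the browser-based approach is the safer choice.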

Install dependencies

npm install axios cheerio

Code

const axios = require('axios');
const cheerio = require('cheerio');
(async () => {
  const base = 'https://www.isouju.com';
  const { data: html } = await axios.get(base, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' }
  });

  const $ = cheerio.load(html);

  /* 1. Locate the "短剧" (short drama) block first */
  // Select elements matching .block; .has() keeps only those that contain a .nav img[alt="短剧"] descendant
  const dramaBlock = $('.block').has('.nav img[alt="短剧"]');

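  /* 2. Iterate over each entry */
  const list = [];
  // find() selects every .list a.item element inside dramaBlock
  // each() iterates over them: _ is the index, el is the current element
  dramaBlock.find('.list a.item').each((_, el) => {
    const a   = $(el); // el is the raw DOM node; $(el) wraps it as a Cheerio object
    const img = a.find('div.img img'); // the poster <img> element
    const src = img.attr('src'); // the img's src attribute
    // Append the record to the array
    list.push({
      title: img.attr('alt') || a.find('p').text().trim(),   // fall back to the <p> text if alt is missing
      url: new URL(a.attr('href') || '', base).href,         // resolves relative and absolute hrefs alike
      poster: src ? src.replace(/^http:/, 'https:').split('?')[0] : '' // force https and drop the query string
    });
  });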

  console.log(`Total: ${list.length} items`);
  console.log(JSON.stringify(list, null, 2));
})();
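
The field clean-up inside the loop can be easier to follow (and to test) when pulled out into small functions. The following is a minimal sketch, not part of the original script: `resolveUrl` and `normalizePoster` are hypothetical names, and the sample paths and image host are made up for illustration.

```javascript
// Hypothetical helpers mirroring the clean-up done inside the loop.

// Resolve a possibly relative href against the site root using the
// built-in WHATWG URL class; absolute hrefs are returned unchanged.
function resolveUrl(href, base) {
  return new URL(href || '', base).href;
}

// Force https and drop the query string from a poster URL.
// Returns an empty string when the src attribute is missing.
function normalizePoster(src) {
  return src ? src.replace(/^http:/, 'https:').split('?')[0] : '';
}

const siteRoot = 'https://www.isouju.com';
console.log(resolveUrl('/detail/1.html', siteRoot));
// https://www.isouju.com/detail/1.html
console.log(normalizePoster('http://img.example.com/p.jpg?w=200'));
// https://img.example.com/p.jpg
```

Resolving against the base with `URL` also copes with hrefs that are already absolute, which plain string concatenation would turn into an invalid address.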

Run

node ./index.js

Handling sites with anti-crawler protection

The main idea is to simulate a real browser visit, let the page render, and then return the resulting HTML.

Install the package

npm install playwright
# After the command above finishes, run the following to download the browser binaries
npx playwright install

index.js

// index.js
const { chromium } = require('playwright');

(async () => {
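  const browser = await chromium.launch({ headless: true }); // true = run without a visible window
  const page = await browser.newPage({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  });
  // Open the target page (replace the placeholder with the real URL)
  await page.goto('TARGET_PAGE_URL', {
    waitUntil: 'networkidle' // wait until the network has been idle (no connections for at least 500 ms)
  });

  const html = await page.content(); // the final, rendered HTML
  console.log(html);               // print it directly
  await browser.close();           // close the browser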
})();

Run

node ./index.js
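
The two approaches can also be combined: try the cheap direct request first and start a browser only when the result looks unusable. The sketch below is one possible arrangement, not the original author's code. `fetchHtmlWithFallback` is a hypothetical function, and the three callbacks (`direct`, `browser`, `isUsable`) are assumptions that the caller supplies, for example by wrapping the axios and playwright snippets.

```javascript
// Try the direct request first; fall back to a real browser only when the
// result looks unusable. All three callbacks are supplied by the caller,
// so this sketch itself does not depend on axios or playwright.
async function fetchHtmlWithFallback(url, { direct, browser, isUsable }) {
  try {
    const html = await direct(url);
    if (isUsable(html)) return { html, via: 'direct' };
  } catch (err) {
    // A failed direct request (for example a 403) also triggers the fallback.
  }
  const html = await browser(url);
  return { html, via: 'browser' };
}

// Usage with stand-in fetchers (no network access). Replace them with
// functions built from the axios and playwright code.
fetchHtmlWithFallback('https://example.com', {
  direct: async () => '<html>blocked</html>',
  browser: async () => '<html><div class="block"></div></html>',
  isUsable: (html) => html.includes('class="block"'),
}).then((result) => console.log(result.via)); // browser
```

Keeping the fetchers as parameters means the slower browser path runs only when it is needed, and the decision logic can be exercised without touching the network.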
