Writing a Web Crawler with Node.js

何小勺, 2018/05/05

Main tech stack

request: a simplified HTTP client.
cheerio: a fast, flexible, and lean implementation of core jQuery designed specifically for the server.

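Both are npm packages. Here is a brief look at what each one does.

`request`: a simplified HTTP client.

With this module, making HTTP requests in Node.js becomes very simple.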

const request = require('request');
request('http://www.baidu.com', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.log(body) // Print the Baidu home page
  }
})

Docs: https://www.npmjs.com/package/request

If you prefer Promises, or want to use the async/await syntax from ES2017, you can use request-promise:

const request = require('request-promise');
request('http://www.baidu.com').then(function (body) {
  console.log(body) // Print the Baidu home page
});

Docs: https://www.npmjs.com/package/request-promise
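The async/await style mentioned above could look like the following minimal sketch. `fetchBody` is a hypothetical helper name, not part of request-promise, and the optional `client` parameter is an assumption added here so the HTTP client can be swapped out (for example in a test). The error handling relies on request-promise's default behavior of rejecting on network errors and on non-2xx status codes.

```javascript
// Hypothetical helper: fetch a page body using async/await.
// `client` is any function (url) => Promise<string>; it defaults to
// request-promise, which is only loaded when no client is passed in.
async function fetchBody(url, client = require('request-promise')) {
  try {
    // request-promise resolves with the response body by default
    return await client(url);
  } catch (err) {
    // Rejections cover network failures and non-2xx responses
    console.error(`Request to ${url} failed: ${err.message}`);
    return null;
  }
}

// Example call (needs network access and request-promise installed):
// fetchBody('http://www.baidu.com').then(function (body) {
//   if (body !== null) console.log(body.length);
// });
```

Returning `null` on failure is only one possible design; rethrowing the error is equally reasonable if the caller wants to handle it.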

cheerio: jQuery designed for the server.

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
//=> "<html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>"

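Example

With these two tools in hand, you can probably see how powerful they are together without me spelling it out.

Suppose we want to fetch the article information from the Jianshu home page. We could do it like this: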

const request = require('request-promise');
const cheerio = require('cheerio');

// Request the Jianshu home page
request('https://www.jianshu.com/').then(function (body) {

  // Parse the HTML into a jQuery-like object
  const $ = cheerio.load(body);

  // Use map to iterate over each article element
  const list = $('.note-list .have-img').map(function (i, el) {

    // The current article element
    const $item = $(el);

    return {
      // Title
      title: $item.find('.title').text(),

      // Abstract
      abstract: $item.find('.abstract').text(),
    };

  }).get();

  // Result (the article list data from the home page)
  return list;

});
