Scrapy Usage Guide 2

Published on 2018-05-08 · Scrapy

Scrapy Documentation Notes – 2

Scrapy Tutorial

This guide assumes Scrapy is already installed. If it is not, see Scrapy Documentation Notes 1.

We are going to scrape quotes.toscrape.com. This post walks through the following tasks:

  1. Create a new Scrapy project
  2. Write a spider to crawl a site and extract data
  3. Export the scraped data using the command line
  4. Change the spider to recursively follow links
  5. Use spider arguments

Scrapy is written in Python. If you are new to Python but already know other languages, read the official Python tutorial or Dive Into Python 3 first. If this is your first contact with programming, you may want to work through Learn Python the Hard Way, or consult the list of Python resources for non-programmers.

Creating a project

Before you start scraping, you have to set up a new Scrapy project. Enter the directory where you want to create the project and run the following command:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first spider

Spiders are classes that you define and that Scrapy uses to scrape information from websites. They must subclass scrapy.Spider and define the initial requests to make; optionally, they can also define how to follow links in the pages and how to parse the downloaded page content to extract data.

Here is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, our spider subclasses scrapy.Spider and defines some attributes and methods in QuotesSpider:

  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

    The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them; a minimal sketch of such a parse() follows this list.
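
To make that last point concrete, here is a minimal sketch of a parse() that yields scraped data as dicts and follows a "next page" link. The CSS selectors used here (div.quote, span.text, small.author, li.next a) are assumptions based on quotes.toscrape.com's markup; treat this as an illustration of the idea, not the tutorial's final spider, and keep quotes_spider.py as shown above for the next step.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Yield the scraped data as plain dicts.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the "next page" link, if there is one, by yielding a new Request.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)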

Running our spider

Go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider named quotes, which sends some requests to the quotes.toscrape.com domain. You will get output similar to this:

2018-05-02 16:34:31 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot:
2018-05-02 16:34:31 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, lib
it (AMD64)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptograph
2018-05-02 16:34:31 [scrapy.crawler] INFO: Overridden settings: {'BOT_NA
2018-05-02 16:34:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-05-02 16:34:32 [scrapy.middleware] INFO: Enabled downloader middlew
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-02 16:34:32 [scrapy.middleware] INFO: Enabled spider middlewares
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-02 16:34:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-05-02 16:34:32 [scrapy.core.engine] INFO: Spider opened
2018-05-02 16:34:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (
2018-05-02 16:34:32 [scrapy.extensions.telnet] DEBUG: Telnet console lis
2018-05-02 16:34:35 [scrapy.core.engine] DEBUG: Crawled (404) <GET http:
2018-05-02 16:34:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http:
2018-05-02 16:34:35 [quotes] DEBUG: Saved file quotes-1.html
2018-05-02 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http:
2018-05-02 16:34:36 [quotes] DEBUG: Saved file quotes-2.html
2018-05-02 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-02 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 678,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 5976,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 2, 8, 34, 36, 220916),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 5, 2, 8, 34, 32, 396434)}
2018-05-02 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)

Now check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, containing the content of the respective URLs, just as our parse method instructs.

If you are wondering why we haven't parsed the HTML yet, hold on; we will cover that soon.

What just happened?

Scrapy schedules the scrapy.Request objects returned by the spider's start_requests method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request, passing the response as its argument.
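
The "callback associated with the request" is simply whatever method you pass as callback= when creating the Request. As a rough sketch (the spider name callback-demo and the method parse_page are hypothetical, used only for illustration):

import scrapy


class CallbackDemoSpider(scrapy.Spider):
    name = "callback-demo"  # hypothetical name, for illustration only

    def start_requests(self):
        # callback= ties this request to parse_page below.
        yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                             callback=self.parse_page)

    def parse_page(self, response):
        # Scrapy calls this method with the downloaded response as its argument.
        self.log('Got %d bytes from %s' % (len(response.body), response.url))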

A shortcut for the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can simply define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com/page/1",
        "http://quotes.toscrape.com/page/2",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
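
A tiny sketch of that default behaviour (again a hypothetical demo spider, not part of the tutorial): a request created without an explicit callback ends up in parse().

import scrapy


class DefaultCallbackSpider(scrapy.Spider):
    name = "default-callback-demo"  # hypothetical name, for illustration only
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # The request generated from start_urls carries no explicit callback,
        # so Scrapy falls back to parse() when the response arrives.
        self.log('parse() handled %s' % response.url)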

Extracting data

The best way to learn how Scrapy extracts data is to try selectors in the Scrapy shell. Run the following command:

scrapy shell 'http://quotes.toscrape.com/page/1'
# On Windows, use double quotes instead of single quotes

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs containing arguments (i.e. the & character) will not work.
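
Once the shell has fetched the page, it is available as response and you can try selectors interactively. For example (the comments describe what each expression selects; the exact strings depend on the live page):

response.css('title::text').extract_first()           # the page <title> text
response.css('div.quote span.text::text').extract()   # every quote text on the page
response.xpath('//title/text()').extract_first()      # the same title, via XPath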
