Published on 2018-05-29 in Scrapy

Scrapy Documentation Notes, Part 3

Scrapy shell

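The Scrapy shell is an interactive shell where you can quickly debug your scraping code (mainly the data extraction part) without having to run your spider.

It is mainly used for testing XPath or CSS expressions, to see how they work and what data they extract from the web pages you are scraping. Better still, it lets you test expressions interactively while you write your spider, so you don't have to run the spider after every change.

Once you are familiar with the Scrapy shell, you'll find it a powerful tool for developing and debugging your spiders.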

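Configuring the shell

If you have IPython installed, great: Scrapy will use it instead of the standard Python console, because IPython is more powerful and has colorized output :) If you don't have IPython, installing it is strongly recommended (a certain senior classmate, Maolin, is a die-hard IPython fan).

Of course, if you can't install IPython, bpython or the standard Python console will do as well.

You can choose which shell Scrapy uses by setting the SCRAPY_PYTHON_SHELL environment variable or by adding a setting to scrapy.cfg: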

[settings]
shell = ipython
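
The two configuration routes can be pictured with a small stdlib-only sketch. It is not Scrapy's own code: pick_shell is a hypothetical helper, and the lookup order (environment variable first, then scrapy.cfg) is my understanding of how Scrapy behaves, so treat it as an assumption.

```python
import configparser
import os


def pick_shell(cfg_text, environ=None):
    """Hypothetical helper: return the configured shell name, or None if unset.

    Assumption: the SCRAPY_PYTHON_SHELL environment variable wins over
    the [settings] section of scrapy.cfg.
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get("SCRAPY_PYTHON_SHELL", "").strip()
    if env_value:
        return env_value.lower()

    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if parser.has_option("settings", "shell"):
        return parser.get("settings", "shell").strip().lower()
    return None  # nothing configured: Scrapy falls back to its own default


SCRAPY_CFG = """
[settings]
shell = ipython
"""

from_cfg = pick_shell(SCRAPY_CFG, environ={})
from_env = pick_shell(SCRAPY_CFG, environ={"SCRAPY_PYTHON_SHELL": "bpython"})
print(from_cfg, from_env)
```

Passing environ explicitly keeps the sketch independent of whatever happens to be set in your real environment.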

Launching the shell

Launch the Scrapy shell with the shell command:

scrapy shell <url>

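where <url> is the URL you want to scrape.

The shell also works with local files, which is handy when you want to experiment with a saved copy of a web page. It understands the following path syntaxes: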

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

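Note:

With UNIX-style relative paths, be explicit and never leave out the leading ./ or ../ because a bare name such as index.html is treated as a domain name rather than a local file.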
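
If you would rather sidestep the relative-path pitfall altogether, one option is to hand the shell a File URI. The sketch below only uses pathlib from the standard library; to_file_uri is a made-up helper name, and the sample path is the one from the listing above (the file does not need to exist for the conversion to work).

```python
from pathlib import Path


def to_file_uri(path_str):
    """Turn a relative or absolute local path into a file:// URI."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path_str).resolve().as_uri()


uri = to_file_uri("./path/to/file.html")
print(uri)
```

The printed value depends on your current working directory; you can then pass it to scrapy shell as a single argument.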

Using the shell

The Scrapy shell is just a regular Python console with a few extra shortcut functions added for convenience.

Available Shortcuts

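shelp() - print help listing the available objects and shortcuts
fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirections not to be followed by passing redirect=False.
fetch(request) - fetch a new response from the given request and update all related objects accordingly
view(response) - open the given response in your local web browser. It adds a <base> tag to the response body so that external links (such as images and CSS) display correctly. Note that this creates a temporary file on your machine, and that file is not removed automatically.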
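
To make the view(response) description more concrete, here is a rough stdlib-only approximation of the two steps it describes: inserting a <base> tag and writing the page to a temporary file. It is not Scrapy's actual implementation. The helper name save_for_viewing and the sample page are made up for illustration, and the step that opens the browser is left out.

```python
import tempfile


def save_for_viewing(body, url):
    """Insert a <base> tag after <head> and write the page to a temp file."""
    base_tag = '<base href="{}">'.format(url)
    if "<head>" in body:
        # Only the first <head> is touched; relative links now resolve against url.
        body = body.replace("<head>", "<head>" + base_tag, 1)

    # delete=False: the file is left on disk after it is closed,
    # so cleaning it up is the caller's job.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(body)
    return handle.name


SAMPLE_PAGE = "<html><head><title>demo</title></head><body></body></html>"
saved_path = save_for_viewing(SAMPLE_PAGE, "http://example.com/")
print(saved_path)
```

Because nothing deletes the file, repeated calls leave one .html file per call in your temp directory, which matches the caveat above.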

Available Scrapy objects

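The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and the Selector objects (for both HTML and XML content).

crawler - the current Crawler object
spider - the spider known to handle the URL, or a plain Spider object if no spider is found for the URL
request - the Request object of the last fetched page. You can modify this request with replace(), or fetch a new request with the fetch() shortcut
response - the Response object containing the last fetched page
sel - a Selector object built from the last fetched response (it is not in the object list printed in the session below, where response.xpath() is used directly)
settings - the current Scrapy settings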

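Example shell session

Here is a typical shell session. We start by scraping the http://scrapy.org page and then move on to the https://reddit.com page. Finally, we change the (Reddit) request method to POST and re-fetch it, which gets us an error. We end the session by typing Ctrl-D (on Unix systems) or Ctrl-Z on Windows.

Keep in mind that these pages are not static and change over time, so the data extracted in this example may differ from what you get when you try it. The only purpose of the example is to get you familiar with the Scrapy shell.

First, we launch the shell: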

scrapy shell 'http://scrapy.org' --nolog

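Then the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you'll notice that these lines all start with the [s] prefix):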

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000029C9C734860>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x0000029C9DA47828>
[s]   spider     <DefaultSpider 'default' at 0x29c9dcea2b0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

After that, we can start working with the objects:

>>> response.xpath('//title/text()').extract_first()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("http://reddit.com")

>>> response.xpath('//title/text()').extract()
['reddit: the front page of the internet']

>>> request = request.replace(method="POST")

>>> fetch(request)

>>> response.status
404

>>> from pprint import pprint

>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
 'Cache-Control': ['max-age=0, must-revalidate'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
 'Server': ['snooserv'],
 'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
 'Vary': ['accept-encoding'],
 'Via': ['1.1 varnish'],
 'X-Cache': ['MISS'],
 'X-Cache-Hits': ['0'],
 'X-Content-Type-Options': ['nosniff'],
 'X-Frame-Options': ['SAMEORIGIN'],
 'X-Moose': ['majestic'],
 'X-Served-By': ['cache-cdg8730-CDG'],
 'X-Timer': ['S1481214079.394283,VS0,VE159'],
 'X-Ua-Compatible': ['IE=edge'],
 'X-Xss-Protection': ['1; mode=block']}
>>>

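Invoking the shell from a spider to inspect responses

Sometimes you want to inspect the response being processed at a certain point in your spider, to confirm that the response you expect is the one arriving there.

This can be done with the scrapy.shell.inspect_response function.

Here is an example of how to call it from your spider: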

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

When you run the spider, you will get output similar to this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

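Then you can test the extraction code:

>>> response.xpath('//h1[@class="fn"]')
[]

You can open the response in your web browser to check whether it is the one you expected:

>>> view(response)
True

Finally, press Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume the crawl: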

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...

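Note that you can't use the fetch shortcut here, because the Scrapy engine is blocked by the shell. However, after you leave the shell, the spider continues crawling from where it stopped, as shown above.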
