Scrapy Documentation Notes – 1
PRE Preface
Here's how it all started: around Tomb-Sweeping Day this year, two classmates and I entered a data-mining competition, and we picked a movie recommendation system as our topic. Cozy and relaxed, right? The data the organizers gave us was pitifully small, with hardly any features, so we turned our gaze (and our knives) toward Douban, so pitiful, young and helpless. After a bit of exploration (struggle), we threw ourselves without hesitation into the arms of the Scrapy framework.
INTRO Introduction
Scrapy is an application framework for crawling websites and extracting structured data; it can also be used to extract data by calling APIs.
EXAMPLE Example
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Save the code as quotes_spider.py and run it with the runspider command:
```shell
scrapy runspider quotes_spider.py -o quotes.json
```
When the run finishes, you will find a new quotes.json file in the directory, looking something like this:
```json
[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
...]
```
What just happened?
When you run scrapy runspider quotes_spider.py, Scrapy looks for a spider definition inside the file and runs it through its crawler engine.

The crawl starts by making requests to the URLs defined in the start_urls attribute and calling the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as its callback.
One of Scrapy's main advantages you will notice is that requests are scheduled and processed asynchronously. This means Scrapy does not wait for one request to finish and be processed before sending another; in the meantime it can send other requests or do other work. It also means other requests can keep going even if one request fails or an error occurs while handling it.

While this lets you crawl very fast (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you some settings to control the politeness of the crawl (my understanding: don't fire so many simultaneous requests that you bring down the other side's server). You can set a download delay between requests, limit the number of concurrent requests per domain or per IP, and even use the auto-throttling extension to figure these settings out automatically.
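The politeness knobs mentioned above live in a project's settings.py. A sketch of the relevant settings (the setting names are Scrapy's, but the values here are arbitrary examples you would tune per site):

```python
# settings.py (excerpt) -- example values only

# wait this many seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2

# cap the number of in-flight requests per domain / per IP
CONCURRENT_REQUESTS_PER_DOMAIN = 4
CONCURRENT_REQUESTS_PER_IP = 4

# or let the AutoThrottle extension adjust the delay automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, Scrapy adapts the delay based on observed server latency, so the fixed DOWNLOAD_DELAY acts only as a lower bound.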
This example uses feed exports to generate the JSON file; you can easily change the export format (to XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
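An item pipeline is just a plain class with a process_item method. A minimal sketch (the class name and output file are my own choices, and a real project would still need to enable the pipeline via ITEM_PIPELINES in settings.py):

```python
import json


class JsonLinesWriterPipeline:
    """Illustrative pipeline: append every scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("quotes.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        # (or raise DropItem) so later pipelines can keep processing it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Storing to a database instead would only change the bodies of these three methods, e.g. opening a connection in open_spider and inserting a row in process_item.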
What else?
Scrapy provides many powerful features:
- Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
- An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
- Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
- Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
- Wide range of built-in extensions and middlewares for handling:
- cookies and session handling
- HTTP features like compression, authentication, caching
- user-agent spoofing
- robots.txt
- crawl depth restriction
- and more
- A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
- Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
While following along I hit a small problem; a quick search online showed it was caused by my Scrapy version being too old:
AttributeError: 'HtmlResponse' object has no attribute 'follow'
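response.follow was added in Scrapy 1.4; on older versions the pagination step can be written with an explicit scrapy.Request instead. The URL joining that response.urljoin performs is essentially urllib.parse.urljoin against the page's own URL (the URLs below are just the ones from the example):

```python
from urllib.parse import urljoin

# On Scrapy < 1.4, instead of:
#     yield response.follow(next_page, self.parse)
# you can write:
#     yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
#
# response.urljoin resolves the (possibly relative) href against the page URL:
page_url = "http://quotes.toscrape.com/tag/humor/"
next_page = "/tag/humor/page/2/"
print(urljoin(page_url, next_page))  # http://quotes.toscrape.com/tag/humor/page/2/
```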
INSTALL GUIDE Installation Guide
Scrapy supports Python 2.7 and Python 3.4 or above.
If you are using Anaconda or Miniconda, you can get the latest scrapy package as follows:
```shell
conda install -c conda-forge scrapy
```
Alternatively, if you are familiar enough with installing and managing Python packages, you can install Scrapy and its dependencies with pip:
```shell
python -m pip install --upgrade pip  # upgrade pip first
pip install scrapy
```
Note that, depending on your operating system, installing Scrapy sometimes requires solving compilation issues for some of its dependencies, so be sure to check the Platform specific installation notes.
We strongly recommend installing Scrapy in a dedicated virtualenv to avoid conflicts with your system packages.
Things that are good to know
Scrapy is written in pure Python and depends on a few key Python packages:
- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml,
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs
The minimal versions which Scrapy is tested against are:
- Twisted 14.0
- lxml 3.4
- pyOpenSSL 0.14
Scrapy may work with older versions of these packages but it is not guaranteed it will continue working because it’s not being tested against them.
Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check the platform-specific guides below.
In case of any trouble related to these dependencies, please refer to their respective installation instructions:
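When an install goes wrong, it helps to see which of these dependencies are importable and in which version. A small stdlib-only check (note the import names: the pyOpenSSL package is imported as OpenSSL):

```python
import importlib


def dependency_versions(names=("lxml", "parsel", "w3lib", "twisted",
                               "cryptography", "OpenSSL")):
    """Return a {module: version or status} report for Scrapy's key deps."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = "not installed"
    return report


for module, version in dependency_versions().items():
    print(f"{module}: {version}")
```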
Using a virtual environment (recommended)
We recommend installing Scrapy inside a dedicated virtual environment on all platforms.
- Install virtualenv:
```shell
pip install virtualenv
```
Install Scrapy
Windows
```shell
# inside the virtualenv
conda install -c conda-forge scrapy
```
Ubuntu 14.04 or above
```shell
# do not use the python-scrapy package shipped with Ubuntu, it is too old
# install the build dependencies first
sudo apt-get install python3 python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
# then, inside the virtualenv
pip install scrapy
```
Virtualenv user guide
Installation
```shell
pip install virtualenv
```
Usage
Step 1: create a working directory
```shell
mkdir pythonV
```
Step 2: create an isolated Python environment named scrapyEnv
```shell
virtualenv --no-site-packages scrapyEnv
```
Step 3: enter (activate) the scrapyEnv environment
```shell
# Ubuntu
source scrapyEnv/bin/activate
# Windows
scrapyEnv\Scripts\activate
```
Step 4: install third-party packages
```shell
pip install scrapy
```
I ran into a problem at this point: installing Scrapy failed with the error
error: [WinError 3] The system cannot find the path specified.: 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\PlatformSDK\\lib'
Searching online turned up two reasonable solutions:
1. Install the Visual Studio build tools, i.e. VS2015/2017. Obviously this option's feasibility is rather low.
2. Download the .whl file of the package you need from this website and put it into the corresponding folder; in my case that was
G:\PYTHON\VirtualEnv\scrapyEnv\Lib\site-packages\Twisted-18.4.0-cp36-cp36m-win_amd64.whl
Then run
pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
which reported success. After that,
pip install scrapy
also reported success. OK!
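When picking a .whl by hand, the tag in the filename (cp36-cp36m-win_amd64 above) has to match your interpreter: the cpXY part is the CPython version and win32/win_amd64 is the pointer width. A quick stdlib way to check what your interpreter needs:

```python
import struct
import sys

# "cpXY" part of the wheel tag: CPython major/minor version
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# 32-bit vs 64-bit interpreter (win32 vs win_amd64 wheels on Windows)
bits = struct.calcsize("P") * 8

print(py_tag, f"{bits}-bit")  # e.g. cp36 64-bit on a 64-bit Python 3.6
```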
Step 5: exit the current environment
```shell
deactivate
```
Some paths within the virtualenv are slightly different on Windows: scripts and executables go in ENV\Scripts\ instead of ENV/bin/, and libraries go in ENV\Lib\ rather than ENV/lib/. To create a virtualenv under a path with spaces in it on Windows, you'll need the win32api library installed.