Scrapy Documentation Notes – 1
PRE Preface
Here's how it all started: around Tomb-Sweeping Day this year, two classmates and I entered a data-mining competition, and we picked a movie recommendation system as our topic. Cozy and relaxed, right? The data the organizers gave us was pitifully small, with hardly any features, so we turned our gaze (and our knives) toward Douban, so pitiful, young and helpless. After a bit of exploration (struggle), we threw ourselves without hesitation into the arms of the Scrapy framework.
INTRO Introduction
Scrapy is an application framework for crawling websites and extracting structured data; it can also be used to extract data by calling APIs.
EXAMPLE Example
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Save the code as quotes_spider.py and run it with the runspider command:
```shell
scrapy runspider quotes_spider.py -o quotes.json
```
When the run finishes, you will find a new quotes.json file in the directory, looking something like this:
```json
[{
    "author": "Jane Austen",
    "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"
},
...]
```
What just happened?
When you run scrapy runspider quotes_spider.py, Scrapy looks for a spider definition inside the file and runs it through its crawler engine.

The crawl starts by making requests to the URLs defined in the start_urls attribute and calling the default callback method parse, passing the response object as an argument. In the parse callback, we loop through the quote elements using a CSS Selector, yield a Python dict with the extracted quote text and author, look for a link to the next page, and schedule another request using the same parse method as its callback.
One of Scrapy's main advantages you will notice is that requests are scheduled and processed asynchronously. This means Scrapy does not wait for one request to finish and be processed before sending another; in the meantime it can send other requests or do other work. It also means other requests can keep going even if one request fails or an error occurs while handling it.

While this lets you crawl very fast (sending multiple concurrent requests at the same time, in a fault-tolerant way), Scrapy also gives you some settings to control the politeness of the crawl (my understanding: don't fire so many simultaneous requests that you bring down the other side's server). You can set a download delay between requests, limit the number of concurrent requests per domain or per IP, and even use the auto-throttling extension to figure these settings out automatically.
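The politeness knobs mentioned above live in a project's settings.py. A sketch of the relevant settings (the setting names are Scrapy's, but the values here are arbitrary examples you would tune per site):

```python
# settings.py (excerpt) -- example values only

# wait this many seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2

# cap the number of in-flight requests per domain / per IP
CONCURRENT_REQUESTS_PER_DOMAIN = 4
CONCURRENT_REQUESTS_PER_IP = 4

# or let the AutoThrottle extension adjust the delay automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

With AutoThrottle enabled, Scrapy adapts the delay based on observed server latency, so the fixed DOWNLOAD_DELAY acts only as a lower bound.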
This example uses feed exports to generate the JSON file; you can easily change the export format (to XML or CSV, for example) or the storage backend (FTP or Amazon S3, for example). You can also write an item pipeline to store the items in a database.
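An item pipeline is just a plain class with a process_item method. A minimal sketch (the class name and output file are my own choices, and a real project would still need to enable the pipeline via ITEM_PIPELINES in settings.py):

```python
import json


class JsonLinesWriterPipeline:
    """Illustrative pipeline: append every scraped item to a JSON-lines file."""

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("quotes.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        # (or raise DropItem) so later pipelines can keep processing it
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Storing to a database instead would only change the bodies of these three methods, e.g. opening a connection in open_spider and inserting a row in process_item.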
What else?
Scrapy provides many powerful features:
- Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
- An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
- Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
- Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
- Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
- Wide range of built-in extensions and middlewares for handling:
- cookies and session handling
- HTTP features like compression, authentication, caching
- user-agent spoofing
- robots.txt
- crawl depth restriction
- and more
- A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
- Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
While following along I hit a small problem; a quick search online showed it was caused by my Scrapy version being too old:
AttributeError: 'HtmlResponse' object has no attribute 'follow'
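response.follow was added in Scrapy 1.4; on older versions the pagination step can be written with an explicit scrapy.Request instead. The URL joining that response.urljoin performs is essentially urllib.parse.urljoin against the page's own URL (the URLs below are just the ones from the example):

```python
from urllib.parse import urljoin

# On Scrapy < 1.4, instead of:
#     yield response.follow(next_page, self.parse)
# you can write:
#     yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
#
# response.urljoin resolves the (possibly relative) href against the page URL:
page_url = "http://quotes.toscrape.com/tag/humor/"
next_page = "/tag/humor/page/2/"
print(urljoin(page_url, next_page))  # http://quotes.toscrape.com/tag/humor/page/2/
```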
INSTALL GUIDE Installation Guide
Scrapy supports Python 2.7 and Python 3.4 or above.
If you are using Anaconda or Miniconda, you can get the latest scrapy package as follows:
```shell
conda install -c conda-forge scrapy
```
Alternatively, if you are familiar enough with installing and managing Python packages, you can install Scrapy and its dependencies with pip:
```shell
python -m pip install --upgrade pip  # upgrade pip first
pip install scrapy
```
Note that, depending on your operating system, installing Scrapy sometimes requires solving compilation issues for some of its dependencies, so be sure to check the Platform specific installation notes.
We strongly recommend installing Scrapy in a dedicated virtualenv to avoid conflicts with your system packages.
Things that are good to know
Scrapy is written in pure Python and depends on a few key Python packages:
- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml,
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- twisted, an asynchronous networking framework
- cryptography and pyOpenSSL, to deal with various network-level security needs
The minimal versions which Scrapy is tested against are:
- Twisted 14.0
- lxml 3.4
- pyOpenSSL 0.14
Scrapy may work with older versions of these packages but it is not guaranteed it will continue working because it’s not being tested against them.
Some of these packages themselves depend on non-Python packages that might require additional installation steps depending on your platform. Please check the platform-specific guides below.
In case of any trouble related to these dependencies, please refer to their respective installation instructions:
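When an install goes wrong, it helps to see which of these dependencies are importable and in which version. A small stdlib-only check (note the import names: the pyOpenSSL package is imported as OpenSSL):

```python
import importlib


def dependency_versions(names=("lxml", "parsel", "w3lib", "twisted",
                               "cryptography", "OpenSSL")):
    """Return a {module: version or status} report for Scrapy's key deps."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = "not installed"
    return report


for module, version in dependency_versions().items():
    print(f"{module}: {version}")
```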
Using a virtual environment (recommended)
We recommend installing Scrapy inside a dedicated virtual environment on all platforms.
- Install virtualenv:
```shell
pip install virtualenv
```
Install Scrapy
Windows
```shell
# inside the virtualenv
conda install -c conda-forge scrapy
```
Ubuntu 14.04 or above
```shell
# do not use the python-scrapy package shipped with Ubuntu, it is too old
# install the build dependencies first
sudo apt-get install python3 python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
# then, inside the virtualenv
pip install scrapy
```
Virtualenv user guide
Installation
```shell
pip install virtualenv
```
Usage
Step 1: create a working directory
```shell
mkdir pythonV
```
Step 2: create an isolated Python environment named scrapyEnv
```shell
virtualenv --no-site-packages scrapyEnv
```
Step 3: enter (activate) the scrapyEnv environment
```shell
# Ubuntu
source scrapyEnv/bin/activate
# Windows
scrapyEnv\Scripts\activate
```
Step 4: install third-party packages
```shell
pip install scrapy
```
I ran into a problem at this point: installing Scrapy failed with the error
error: [WinError 3] The system cannot find the path specified.: 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\PlatformSDK\\lib'
Searching online turned up two reasonable solutions:
1. Install the Visual Studio build tools, i.e. VS2015/2017. Obviously this option's feasibility is rather low.
2. Download the .whl file of the package you need from this website and put it into the corresponding folder; in my case that was
G:\PYTHON\VirtualEnv\scrapyEnv\Lib\site-packages\Twisted-18.4.0-cp36-cp36m-win_amd64.whl
Then run
pip install Twisted-18.4.0-cp36-cp36m-win_amd64.whl
which reported success. After that,
pip install scrapy
also reported success. OK!
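When picking a .whl by hand, the tag in the filename (cp36-cp36m-win_amd64 above) has to match your interpreter: the cpXY part is the CPython version and win32/win_amd64 is the pointer width. A quick stdlib way to check what your interpreter needs:

```python
import struct
import sys

# "cpXY" part of the wheel tag: CPython major/minor version
py_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# 32-bit vs 64-bit interpreter (win32 vs win_amd64 wheels on Windows)
bits = struct.calcsize("P") * 8

print(py_tag, f"{bits}-bit")  # e.g. cp36 64-bit on a 64-bit Python 3.6
```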
Step 5: exit the current environment
```shell
deactivate
```
Some paths within the virtualenv are slightly different on Windows: scripts and executables go in ENV\Scripts\ instead of ENV/bin/, and libraries go in ENV\Lib\ rather than ENV/lib/. To create a virtualenv under a path with spaces in it on Windows, you'll need the win32api library installed.