Development · April 17, 2022

Scrapy Crawler Outline


**Scrapy crawler**
· **Components**
o [Pipelines (storage middleware)](https://docs.scrapy.org/en/latest/topics/item-pipeline.html)
§ [Image processing](https://docs.scrapy.org/en/latest/topics/media-pipeline.html)
§ Data storage
· [sqlite3](https://github.com/napoler/scrapy_baidu/blob/master/scrapy_baidu/scrapy_baidu/db.py)
· csv
· [JSON](#write-items-to-a-json-file)
· [MongoDB](#write-items-to-mongodb)
· [ScrapyElasticSearch](https://github.com/jayzeng/scrapy-elasticsearch)
§ [Duplicates filter](#duplicates-filter)
o [Items (field definitions)](https://docs.scrapy.org/en/latest/topics/items.html)
o [Item Loaders](https://docs.scrapy.org/en/latest/topics/loaders.html)
§ Item Loaders are designed to provide a flexible, efficient, and easy mechanism for extending and overriding field parsing rules, either per spider or per source format (HTML, XML, etc.), without becoming a maintenance nightmare.
o Middlewares (downloader middleware)
§ [requests](https://docs.python-requests.org/en/latest/)
§ grequests (adds concurrency)
§ Headless browsers
· [Splash](https://splash.readthedocs.io/en/stable/)
· [Selenium](https://www.selenium.dev/)
o Selenium IDE
o [Selenium WebDriver](https://www.selenium.dev/documentation/webdriver/)
o Selenium Grid
· [requests-HTML](https://docs.python-requests.org/projects/requests-html/en/latest/)
§ Proxies
· [httpproxy](https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/httpproxy.html)
· [scrapy-rotating-proxies](https://pypi.org/project/scrapy-rotating-proxies/)
o [Spiders](https://docs.scrapy.org/en/latest/topics/spiders.html)
§ [Selectors](https://docs.scrapy.org/en/latest/topics/selectors.html)
· XPath
· beautifulsoup4
· CSS
§ [Spider arguments](#spider-arguments)
§ Spider classes
· XMLFeedSpider
· CrawlSpider
· CSVFeedSpider
· SitemapSpider
o settings
§ ROBOTSTXT_OBEY: whether to obey robots.txt
§ URLFilter: URL filtering
§ [USER_AGENT](https://github.com/lorien/user_agent): browser identification
o [commands](https://docs.scrapy.org/en/latest/topics/commands.html)
o [Link Extractors](https://docs.scrapy.org/en/latest/topics/link-extractors.html)
· **Extensions**
o General text extraction
§ [html2text](https://github.com/Alir3z4/html2text)
§ [Text processing](https://github.com/napoler/Terry-toolkit/blob/master/Terry_toolkit/text.py)
§ [CxExtractor](https://github.com/napoler/Terry-toolkit/blob/master/Terry_toolkit/CxExtractor.py)
§ [html2markdown](https://github.com/baynezy/Html2Markdown)
§ [readability (automatic text extraction)](https://github.com/buriy/python-readability)
· readability-lxml
§ pandas
o Keyword extraction
§ textrank4zh
· TextRank4Keyword: keyword extraction
· TextRank4Sentence: sentence splitting and key-sentence extraction
· jieba
o HTML parsing
§ beautifulsoup4
· [**Scrapy shell**](https://docs.scrapy.org/en/latest/topics/shell.html)
· **Documentation**
o [Official documentation](https://docs.scrapy.org/)
· **Examples**
o [Baidu search](https://github.com/napoler/MagicBaidu)
o [Toutiao](https://github.com/napoler/scrapy_news/tree/master/news_toutiao)
· **Other examples**
o [Baidu search](https://github.com/napoler/MagicBaidu)
· [**Running as a service**](https://docs.scrapy.org/en/latest/topics/deploy.html)
o [Scrapyd](#deploy-scrapyd)
o https://github.com/scrapy-plugins/scrapy-jsonrpc
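
The outline lists sqlite3 storage and a duplicates filter under Pipelines. As a rough illustration of how the two could be combined, here is a minimal sketch of an item pipeline that writes to SQLite and drops items whose URL was already stored. The class name `SQLiteDedupPipeline`, the `pages` table, and the `url`/`title` fields are assumptions made for this example and are not taken from the linked repositories. The `open_spider`, `close_spider`, and `process_item(item, spider)` hooks follow the interface long documented for Scrapy item pipelines.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


class SQLiteDedupPipeline:
    """Hypothetical pipeline: store items in SQLite, drop repeated URLs."""

    def __init__(self, db_path=":memory:"):
        # ":memory:" keeps the example self-contained; use a file path in practice.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the database and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called when the spider finishes: persist and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # The PRIMARY KEY on url makes a second insert of the same URL fail,
        # which is used here as the duplicate check.
        try:
            self.conn.execute(
                "INSERT INTO pages (url, title) VALUES (?, ?)",
                (item["url"], item.get("title")),
            )
        except sqlite3.IntegrityError:
            raise DropItem(f"duplicate url: {item['url']}")
        return item
```

Using the primary key as the duplicate check keeps the filter persistent across runs when a file-backed database is used, whereas an in-memory set of seen URLs would be lost when the spider stops.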
[Kingsoft Docs] Scrapy crawler
https://kdocs.cn/l/cabRhYAh1nV1
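
For the settings named in the outline, a project's `settings.py` might contain a fragment like the one below. This is only a sketch: the project name `myproject`, the pipeline path, the user-agent string, and the chosen values are placeholders assumed for illustration. `ROBOTSTXT_OBEY`, `USER_AGENT`, `DOWNLOAD_DELAY`, and `ITEM_PIPELINES` are standard Scrapy setting names.

```python
# settings.py (fragment); all values are illustrative assumptions

BOT_NAME = "myproject"  # hypothetical project name

# Respect robots.txt rules on the sites being crawled.
ROBOTSTXT_OBEY = True

# Identify the crawler to servers; replace with a string suited to your project.
USER_AGENT = "Mozilla/5.0 (compatible; myproject/0.1; +https://example.com/bot)"

# Wait between requests to the same site, in seconds.
DOWNLOAD_DELAY = 1.0

# Enable item pipelines; lower numbers run first (range 0-1000).
# "myproject.pipelines.StoragePipeline" is a hypothetical class path.
ITEM_PIPELINES = {
    "myproject.pipelines.StoragePipeline": 300,
}
```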