### CrawlSpider:
Creating a CrawlSpider spider:
Up to now we created spiders with `scrapy genspider [spider name] [domain]`. To create a CrawlSpider spider instead, use the following command:
scrapy genspider -t crawl [spider name] [domain]
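For reference, the generated file looks roughly like the skeleton below (the spider name, domain, and the `Items/` pattern are just the template's placeholders from a Scrapy 1.x install; your version may differ slightly):

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'                    # placeholder spider name
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/']

    # The template ships with one placeholder rule; you replace it with your own.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').get()
        return item
```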
LinkExtractors (link extractors):
With LinkExtractors you no longer have to extract the URLs you want and send the requests yourself. All of that work can be handed over to a LinkExtractor, which finds every URL on the crawled pages that matches your rules, so the crawl proceeds automatically. A brief introduction to the LinkExtractor class:
class scrapy.linkextractors.LinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
Main parameters:

- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: forbidden URLs. URLs matching this regular expression are never extracted.
- allow_domains: allowed domains. Only URLs belonging to the domains listed here are extracted.
- deny_domains: forbidden domains. URLs belonging to the domains listed here are never extracted.
- restrict_xpaths: restricting XPath expressions. Used together with allow to filter links; only links found inside the matching regions are considered.
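A LinkExtractor can also be used on its own inside a callback. A minimal sketch (the allow/deny patterns and the XPath below are made-up examples):

```python
from scrapy.linkextractors import LinkExtractor

# Hypothetical extractor: keep list-page links, skip member pages,
# and only look for links inside one region of the page.
link_extractor = LinkExtractor(
    allow=r'portal\.php\?mod=list',
    deny=r'member\.php',
    restrict_xpaths=("//div[@class='bm_c']",),
)


def show_links(response):
    # extract_links() returns Link objects with .url and .text attributes.
    for link in link_extractor.extract_links(response):
        print(link.url, link.text)
```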
The Rule class:
The class that defines a crawling rule for the spider. A brief introduction:
class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
Main parameters:

- link_extractor: a LinkExtractor object that defines which links to extract.
- callback: the callback function to run for URLs that match this rule. Because CrawlSpider uses parse for its own logic, do not override parse or use it as your callback.
- follow: whether the links extracted from the response by this rule should themselves be followed.
- process_links: a function that receives the links extracted by link_extractor; use it to filter out links you do not want to crawl.
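Putting the two classes together, here is a minimal sketch of how CrawlSpider rules are usually laid out (the domain, URL patterns, and the process_links filter are all hypothetical):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def drop_print_pages(links):
    # Hypothetical process_links filter: discard "print view" links before they are scheduled.
    return [link for link in links if 'print' not in link.url]


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'              # placeholder name
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/list-1.html']

    rules = (
        # List pages: no callback, just keep following links that match.
        Rule(LinkExtractor(allow=r'list-\d+\.html'), follow=True,
             process_links=drop_print_pages),
        # Detail pages: hand them to a custom callback (never named "parse").
        Rule(LinkExtractor(allow=r'detail-\d+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}
```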
1. How to set the allow rule: the regular expression should match exactly the URLs you want and must not accidentally match other URLs as well.
2. When to use follow: if the links matched by the current rule need to be followed further while crawling, set it to True; otherwise set it to False.
3. When to use callback: if you need to extract detailed data from the matched page, specify a callback; otherwise there is no need to specify one.
Now let's look at the code:
wxapp_spider.py
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-']
    start_urls = ['http://www.wxapp-/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: follow the pagination links, no callback needed.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Article pages: parse them with parse_detail, do not follow further.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        article_content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(article_content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
```
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        self.fp = open("wxjc.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
```
settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# /en/latest/topics/settings.html
# /en/latest/topics/downloader-middleware.html
# /en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
start.py
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())
```
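start.py is just a convenience script: it invokes the crawl command programmatically, which is equivalent to running `scrapy crawl wxapp_spider` from the project directory in a terminal.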
Here is a screenshot of the saved data:
I have to say, WeChat mini programs look pretty friendly.
Next time I'd like to give WeChat mini programs a try.