### CrawlSpider:
Creating a CrawlSpider spider:
Up to now we created spiders with `scrapy genspider [spider name] [domain]`. To create a CrawlSpider spider instead, use the following command:
scrapy genspider -t crawl [spider name] [domain]
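For reference, the generated file looks roughly like the skeleton below (the spider name, domain, and the `Items/` pattern are just the template's placeholders from a Scrapy 1.x install; your version may differ slightly):

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'                    # placeholder spider name
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/']

    # The template ships with one placeholder rule; you replace it with your own.
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item['name'] = response.xpath('//div[@id="name"]').get()
        return item
```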
LinkExtractors (link extractors):
With LinkExtractors you no longer have to extract the URLs you want and send the requests yourself. All of that work can be handed over to a LinkExtractor, which finds every URL on the crawled pages that matches your rules, so the crawl proceeds automatically. A brief introduction to the LinkExtractor class:
class scrapy.linkextractors.LinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
Main parameters:

- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: forbidden URLs. URLs matching this regular expression are never extracted.
- allow_domains: allowed domains. Only URLs belonging to the domains listed here are extracted.
- deny_domains: forbidden domains. URLs belonging to the domains listed here are never extracted.
- restrict_xpaths: restricting XPath expressions. Used together with allow to filter links; only links found inside the matching regions are considered.
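A LinkExtractor can also be used on its own inside a callback. A minimal sketch (the allow/deny patterns and the XPath below are made-up examples):

```python
from scrapy.linkextractors import LinkExtractor

# Hypothetical extractor: keep list-page links, skip member pages,
# and only look for links inside one region of the page.
link_extractor = LinkExtractor(
    allow=r'portal\.php\?mod=list',
    deny=r'member\.php',
    restrict_xpaths=("//div[@class='bm_c']",),
)


def show_links(response):
    # extract_links() returns Link objects with .url and .text attributes.
    for link in link_extractor.extract_links(response):
        print(link.url, link.text)
```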
The Rule class:
The class that defines a crawling rule for the spider. A brief introduction:
class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
Main parameters:

- link_extractor: a LinkExtractor object that defines which links to extract.
- callback: the callback function to run for URLs that match this rule. Because CrawlSpider uses parse for its own logic, do not override parse or use it as your callback.
- follow: whether the links extracted from the response by this rule should themselves be followed.
- process_links: a function that receives the links extracted by link_extractor; use it to filter out links you do not want to crawl.
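Putting the two classes together, here is a minimal sketch of how CrawlSpider rules are usually laid out (the domain, URL patterns, and the process_links filter are all hypothetical):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def drop_print_pages(links):
    # Hypothetical process_links filter: discard "print view" links before they are scheduled.
    return [link for link in links if 'print' not in link.url]


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'              # placeholder name
    allowed_domains = ['example.com']   # placeholder domain
    start_urls = ['http://example.com/list-1.html']

    rules = (
        # List pages: no callback, just keep following links that match.
        Rule(LinkExtractor(allow=r'list-\d+\.html'), follow=True,
             process_links=drop_print_pages),
        # Detail pages: hand them to a custom callback (never named "parse").
        Rule(LinkExtractor(allow=r'detail-\d+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.xpath('//title/text()').get()}
```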
1. How to set the allow rule: the regular expression should match exactly the URLs you want and must not accidentally match other URLs as well.
2. When to use follow: if the links matched by the current rule need to be followed further while crawling, set it to True; otherwise set it to False.
3. When to use callback: if you need to extract detailed data from the matched page, specify a callback; otherwise there is no need to specify one.
Now let's look at the code:
wxapp_spider.py
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-']
    start_urls = ['http://www.wxapp-/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: follow the pagination links, no callback needed.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Article pages: parse them with parse_detail, do not follow further.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        article_content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(article_content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
```
items.py
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        self.fp = open("wxjc.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
```
settings.py
```python
# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# /en/latest/topics/settings.html
# /en/latest/topics/downloader-middleware.html
# /en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
start.py
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())
```
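start.py is just a convenience script: it invokes the crawl command programmatically, which is equivalent to running `scrapy crawl wxapp_spider` from the project directory in a terminal.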
Here is a screenshot of the saved data:
I have to say, WeChat mini programs look pretty friendly.
Next time I'd like to give WeChat mini programs a try.