
Python crawler: scraping Lianjia housing-price listings with the Scrapy framework and storing them in MongoDB


1. Target page: /ershoufang/

2. Fields to scrape: ① title ② total price ③ community name ④ district ⑤ house detail string ⑥ floor area parsed from the detail string

3. Storage: MongoDB

The link above is for Dongguan second-hand housing. To scrape a different city or district, just change the URL; the page structure is the same:

/ershoufang/ Beijing second-hand housing listings

/ershoufang/ Guangzhou second-hand housing listings

/ershoufang/tianhe Guangzhou Tianhe district second-hand housing listings
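For reference, page numbers are appended as a pgN path segment, which is the pattern the spider below relies on. A tiny illustrative helper, with the domain left out just as in the links above:

def listing_path(district="", page=1):
    # Build a relative Lianjia ershoufang listing path, e.g. /ershoufang/tianhe/pg3
    district_part = district.strip("/") + "/" if district else ""
    return "/ershoufang/" + district_part + "pg" + str(page)

# listing_path("tianhe", 3) -> "/ershoufang/tianhe/pg3"
# listing_path(page=2)      -> "/ershoufang/pg2"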

The full code follows:

ershoufang_spider.py:

import scrapy
from lianjia_dongguan.items import LianjiaDongguanItem  # the class defined in items.py


class lianjiadongguanSpider(scrapy.Spider):
    name = "ershoufang"  # the spider's name, needed when running it later
    global start_page
    start_page = 1
    # Domain omitted here as in the original post; prepend the full Lianjia
    # host when you actually run the spider.
    start_urls = ["/ershoufang/haizhu/pg" + str(start_page)]

    def parse(self, response):
        for item in response.xpath('//div[@class="info clear"]'):
            yield {
                "title": item.xpath('.//div[@class="title"]/a/text()').extract_first().strip(),
                "Community": item.xpath('.//div[@class="positionInfo"]/a[1]/text()').extract_first(),
                "district": item.xpath('.//div[@class="positionInfo"]/a[2]/text()').extract_first(),
                "price": item.xpath('.//div[@class="totalPrice"]/span/text()').extract_first().strip(),
                "area": item.xpath('.//div[@class="houseInfo"]/text()').re(r"\d室\d厅 \| (.+)平米")[0],
                "info": item.xpath('.//div[@class="houseInfo"]/text()').extract_first().replace("平米", "㎡").strip()
            }
        # Queue the next 15 pages. Note that start_page never changes and
        # dont_filter=True bypasses the duplicate filter, so every response
        # re-queues pages 2-16; drop dont_filter=True if you want each page
        # crawled only once.
        i = 1
        while i <= 15:
            j = i + start_page
            i = i + 1
            next_url = "/ershoufang/haizhu/pg" + str(j)
            yield scrapy.Request(next_url, dont_filter=True, callback=self.parse)

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class LianjiaDongguanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
    location = scrapy.Field()
    price = scrapy.Field()
    Community = scrapy.Field()
    pass
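One thing to note: the spider above yields plain dicts, so this Item class is imported but never actually used, and its location field does not match the district/area keys the spider produces. If you would rather route everything through the Item, a sketch like the following (with the fields renamed to match the spider's output) is one way to do it:

import scrapy

class LianjiaDongguanItem(scrapy.Item):
    # Fields aligned with the keys the spider actually yields.
    title = scrapy.Field()
    Community = scrapy.Field()
    district = scrapy.Field()
    price = scrapy.Field()
    area = scrapy.Field()
    info = scrapy.Field()

# ...and in parse() you would then yield
# LianjiaDongguanItem(title=..., Community=..., district=..., price=..., area=..., info=...)
# instead of a plain dict.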

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# /en/latest/topics/spider-middleware.html

from scrapy import signals


class LianjiaDongguanSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class LianjiaDongguanDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def __init__(self, crawler):
        super().__init__()
        self.ua = UserAgent()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # This is one way to write into the request headers, and it does take effect.
        request.headers['User-Agent'] = self.ua.random
        # request.headers['Cookie'] = {"ws": "wer"}  # a Cookie can be written the same way
        # getlist() returns a list; this way of reading a header back is reliable.
        print('User-Agent:' + str(request.headers.getlist('User-Agent')))
        # Indexing returns a str; it only works if the header was written as above,
        # otherwise it raises even when the header actually exists in another form.
        print('User-Agent:' + str(request.headers['User-Agent']))

    def process_response(self, request, response, spider):
        print("Request Cookie header: " + str(request.headers.getlist('Cookie')))
        print("Response Set-Cookie header: " + str(response.headers.getlist('Set-Cookie')))
        print("Full response headers: " + str(response.headers))
        print("Full request headers: " + str(request.headers))
        return response

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class LianjiaPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)  # connect to MongoDB
        db = client[db_name]  # select the database
        self.post = db[settings["MONGODB_DOCNAME"]]  # self.post is the collection

    def process_item(self, item, spider):
        zufang = dict(item)  # convert the yielded item to a plain dict
        # insert() was removed in PyMongo 4; insert_one() works on both 3.x and 4.x.
        self.post.insert_one(zufang)  # insert into the collection
        return item
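As a variant (just a sketch; the class name LianjiaMongoPipeline is my own, not part of the project), the same pipeline can read its settings through from_crawler and close the MongoDB connection when the spider finishes, instead of holding a module-level settings object:

import pymongo


class LianjiaMongoPipeline(object):
    """Sketch only: same idea as LianjiaPipeline, with explicit open/close."""

    def __init__(self, host, port, db_name, doc_name):
        self.host, self.port = host, port
        self.db_name, self.doc_name = db_name, doc_name

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(s["MONGODB_HOST"], s.getint("MONGODB_PORT"),
                   s["MONGODB_DBNAME"], s["MONGODB_DOCNAME"])

    def open_spider(self, spider):
        # Open the connection once when the spider starts.
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.post = self.client[self.db_name][self.doc_name]

    def close_spider(self, spider):
        # Close the connection cleanly when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item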

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for lianjia_dongguan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# /en/latest/topics/settings.html
# /en/latest/topics/downloader-middleware.html
# /en/latest/topics/spider-middleware.html

BOT_NAME = 'lianjia_dongguan'

SPIDER_MODULES = ['lianjia_dongguan.spiders']
NEWSPIDER_MODULE = 'lianjia_dongguan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lianjia_dongguan (+)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
DOWNLOAD_DELAY_RANDOM = True  # note: Scrapy's built-in setting for this is RANDOMIZE_DOWNLOAD_DELAY
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# When COOKIES_ENABLED is left commented out, Scrapy does not manage cookies.
# When COOKIES_ENABLED is False, the Cookie set in the settings below is what gets sent.
# When COOKIES_ENABLED is True, Scrapy ignores the Cookie in the settings and manages its own cookies.
COOKIES_ENABLED = False
COOKIES_DEBUG = False  # whether to log Set-Cookie headers

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Note: an HTTP Cookie header is normally a single "k1=v1; k2=v2" string; depending on the
# Scrapy version, a dict value here may be ignored or rejected, so joining it into one
# string is the safer option.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Cookie': {
        'TY_SESSION_ID': '84ad2c08-9ea5-4193-927a-221fd9bae52b',
        'lianjia_uuid': '4fc8b1ad-d17f-45c5-85a3-6e0072d35e1e',
        'UM_distinctid': '170a9b9075c4d0-0efd9a9b0cf32d-34564a7c-e1000-170a9b9075d52f',
        '_jzqc': '1',
        '_jzqy': '1.1583395441.1583395441.1.jzqsr',
        '_jzqckmp': '1',
        '_smt_uid': '5e60b270.14483188',
        'sajssdk__cross_new_user': '1',
        '_ga': 'GA1.2.887972367.1583395443',
        '_gid': 'GA1.2.365215785.1583395443',
        'select_city': '441900',
        '_qzjc': '1',
        'Hm_lvt_9152f8221cb6243a53c83b956842be8a': '1583395559',
        'sensorsdatajssdkcross': '%7B%22distinct_id%22%3A%22170a9b909b4715-0769015f66a016-34564a7c-921600-170a9b909b553%22%2C%22%24device_id%22%3A%22170a9b909b4715-0769015f66a016-34564a7c-921600-170a9b909b553%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D',
        'CNZZDATA1254525948': '426683749-1583390573-https%253A%252F%%252F%7C1583401373',
        'CNZZDATA1255604082': '143512665-1583390710-https%253A%252F%%252F%7C1583401510',
        'lianjia_ssid': 'd4d26773-ce0d-8cfe-ab64-d39d54960c3c',
        'CNZZDATA1255633284': '04093-1583390793-https%253A%252F%%252F%7C1583401593',
        '_jzqa': '1.637230317292238100.1583395441.1583401552.1583405268.4',
        'Hm_lpvt_9152f8221cb6243a53c83b956842be8a': '1583405272',
        '_qzja': '1.884599034.1583395479079.1583401552386.1583405267952.1583405267952.1583405271972.0.0.0.13.4',
        '_qzjb': '1.1583405267952.2.0.0.0',
        '_qzjto': '13.4.0',
        '_jzqb': '1.2.10.1583405268.1'
    },
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lianjia_dongguan.middlewares.LianjiaDongguanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'lianjia_dongguan.middlewares.LianjiaDongguanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'lianjia_dongguan.pipelines.LianjiaDongguanPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

MONGODB_HOST = "127.0.0.1"
MONGODB_PORT = 27017
MONGODB_DBNAME = "lianjia"
MONGODB_DOCNAME = "ershoufang"
ITEM_PIPELINES = {"lianjia_dongguan.pipelines.LianjiaPipeline": 300}

# Whether to retry when a request fails
RETRY_ENABLED = True
# Retry many times since proxies often fail (this is the total retry count, not per IP)
RETRY_TIMES = 1000
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 404, 403, 408, 301, 302]

DOWNLOADER_MIDDLEWARES = {
    'lianjia_dongguan.middlewares.UserAgentMiddleware': 200,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# This is where the proxy IP list file lives
PROXY_LIST = 'C:/Users/wzq1643/Desktop/HTTPSip.txt'

# Proxy mode
# 0 = Every requests have different proxy
# 1 = Take only one proxy from the list and assign it to every requests
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

import random
# If using mode 2, uncomment the line below:
CUSTOM_PROXY = "http://49.81.190.209"

HTTPERROR_ALLOWED_CODES = [301, 302]
MEDIA_ALLOW_REDIRECTS = True

That is all of the Scrapy code. It was written a year ago, and I have tested it myself; it still works:

(base) PS C:\Users\wzq1643\scrapy\lianjia_dongguan\lianjia_dongguan\spiders> scrapy runspider ershoufang_spider.py
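Since the spider declares name = "ershoufang", it can also be started from the project root with scrapy crawl; the optional -o flag additionally dumps the items to a feed file (the file name here is just an example), independently of the MongoDB pipeline:

scrapy crawl ershoufang -o ershoufang.json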

A few words on the anti-crawling countermeasures used here:

Random User-Agent: the Python library fake_useragent generates random UA strings automatically; see the UserAgentMiddleware in middlewares.py.

Reference: /cnmnui/article/details/99852347

IP pool: with limited resources, I just grabbed free proxy IPs from the web (quality may not be great), put them in a txt file, and wired them up in settings.py; a rough way to weed out dead proxies first is sketched after the reference link below.

Reference: /p/c656ad21c42f
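Free proxies tend to die quickly, so it can be worth filtering the txt file before scrapy_proxies reads it. A rough sketch using requests (the file name and test URL are placeholders, not part of the project):

import requests

PROXY_FILE = "HTTPSip.txt"            # placeholder path to the proxy list
TEST_URL = "https://httpbin.org/ip"   # any cheap page works as a probe

def alive(proxy, timeout=5):
    # A proxy counts as alive if it can fetch the probe page within the timeout.
    try:
        r = requests.get(TEST_URL, proxies={"http": proxy, "https": proxy}, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

with open(PROXY_FILE) as f:
    proxies = [line.strip() for line in f if line.strip()]

good = [p for p in proxies if alive(p)]

with open(PROXY_FILE, "w") as f:
    f.write("\n".join(good) + "\n")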

Cookie pool: there is no real cookie pool here; I just collected a few extra cookies and put them in settings.py. A minimal cookie-rotation middleware is sketched below.
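If you ever want something closer to an actual cookie pool, a minimal downloader middleware that picks a random ready-made Cookie header per request could look like this. COOKIE_POOL is an assumed setting holding a list of cookie strings (it is not part of the project above), and COOKIES_ENABLED = False is assumed so the raw header is sent as-is:

import random

class RandomCookieMiddleware(object):
    """Sketch of a tiny cookie pool: rotate ready-made Cookie header strings."""

    def __init__(self, cookie_pool):
        self.cookie_pool = cookie_pool

    @classmethod
    def from_crawler(cls, crawler):
        # COOKIE_POOL is a hypothetical setting: a list of "k1=v1; k2=v2" strings.
        return cls(crawler.settings.getlist("COOKIE_POOL"))

    def process_request(self, request, spider):
        if self.cookie_pool:
            request.headers["Cookie"] = random.choice(self.cookie_pool)

# It would be enabled like the other middlewares, e.g.
# DOWNLOADER_MIDDLEWARES = {"lianjia_dongguan.middlewares.RandomCookieMiddleware": 210, ...}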

Finally, here is a script that uses pymongo and pandas to export the data from MongoDB to Excel:

import pandas as pd
import pymongo

# 1. Connect to MongoDB
client = pymongo.MongoClient(host='127.0.0.1', port=27017)

# 2. Select the database and collection (both are created automatically if missing)
db = client['lianjia']          # attribute access (client.lianjia) also works
collection = db.ershoufang      # or collection = db["ershoufang"]

# 3. Pull every document into a list of dicts
records = []
for doc in collection.find():
    records.append(dict(doc))
print(records)

# 4. Build a DataFrame and write it to Excel
df = pd.DataFrame(records)
print(df)
# Newer pandas versions no longer write legacy .xls files; an .xlsx path
# (with openpyxl installed) is the safer choice.
df.to_excel("C:/Users/wzq1643/Desktop/gz_ershoufang.xls")

Opened in Excel, the exported file looks like this:
