
A Small Python Crawler Exercise: Finding Missing Persons by Scraping Missing-Children Listings into a CSV File for Easy Database Import

Posted: 2018-06-27 09:59:36



A couple of days ago someone messaged me asking me to scrape the missing-children information on this site's /f... board, with the idea of using the locations where the children went missing to help find them. That is the kind of request one should not refuse; if the crawl put any load on the site's server, I apologise.

As before, the information is scraped with the third-party parser BeautifulSoup, plus Selenium driving either Chrome or PhantomJS.
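Side note: the scripts below open a visible Chrome window and leave PhantomJS as a commented-out alternative, but newer Selenium releases no longer ship PhantomJS support, so headless Chrome is the more durable choice if you reproduce this. The following is only a setup sketch of mine, not part of the original code; it assumes a local chromedriver on the PATH and a reasonably recent Selenium, and the helper name make_driver is made up:

# A minimal, optional driver-setup sketch (assumption: chromedriver is on the PATH).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver(headless=True, page_load_timeout=10):
    options = Options()
    if headless:
        options.add_argument("--headless")      # run Chrome without opening a window
    driver = webdriver.Chrome(options=options)  # assumes a local chromedriver
    driver.set_page_load_timeout(page_load_timeout)
    return driver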

After studying the site's structure, the job again breaks down into three steps; a rough sketch of how they chain together follows this list.

Step 1: collect the links of all paginated pages of the /f... board.

Step 2: from each page link, collect the links of the posts published on it.

Step 3: from each post link, extract the fields to be scraped: registration number, name, sex, date of birth, height when missing, time of disappearance, place of disappearance, and whether a police report was filed.
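Here is the rough sketch of how the three steps chain together. It leans on the three functions and the module-level CSV writer defined in the complete program at the end of this post; the crawl wrapper itself is only illustrative, not part of the original code:

# Illustrative driver loop; relies on the three functions and the global csv writer
# defined in the complete program at the end of this post.
import time

def crawl(board_url):
    for page_url in GetALLPageUrl(board_url):             # step 1: every page of the board
        tiezi_urls = GetCurrentPageTieziUrl(page_url)      # step 2: every post on that page
        if tiezi_urls == 1:                                # the function returns 1 on failure
            continue
        for tiezi_url in tiezi_urls:
            CurrentPageMissingPopulationInformation(tiezi_url)  # step 3: extract fields, write one CSV row
            time.sleep(1)                                  # throttle requests to go easy on the server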

At first I used plain BeautifulSoup requests, but the administrators had turned on a page-reload redirect, so I switched to Selenium. Once again, my apologies to the site's administrators.

1. Collect the links of all paginated pages of the /f... board

Inspection shows that the pagination links sit under <div class="pg">, so the following code extracts them.

BeautifulSoup version:

def GetALLPageUrl(siteUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'https': '111.76.129.200:808'})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page
    req = request.Request(siteUrl, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Split /forum-191-1.html into its parts so the page links can be reassembled
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex+1]        # everything up to and including the last "/"
    tempbianhaoqian = siteUrl[siteindex+1:-6]   # "forum-191-"

    # Collect the page numbers
    bianhao = []   # page numbers
    pageUrl = []   # page links
    templist1 = bsObj.find("div", {"class": "pg"})
    for templist2 in templist1.findAll("a", href=re.compile("forum-([0-9]+)-([0-9]+).html")):
        lianjie = templist2.attrs['href']
        # print(lianjie)
        index1 = lianjie.rfind("-")   # position of "-" in the link
        index2 = lianjie.rfind(".")   # position of "." in the link
        tempbianhao = lianjie[index1+1:index2]
        bianhao.append(int(tempbianhao))
    bianhaoMax = max(bianhao)   # largest page number

    for i in range(1, bianhaoMax+1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"   # assemble the page link
        # print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl   # list of page links

Selenium version:

# Get all the page links of the current board
# siteUrl is the URL of the board
def GetALLPageUrl(siteUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '123.207.143.51:8080'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        # driver.implicitly_wait(30)
        try:
            driver.get(siteUrl)   # load the page twice to get past the reload redirect
            driver.get(siteUrl)
        except TimeoutError:
            driver.refresh()

        # print(driver.page_source)
        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            # req = request.Request(tieziUrl, headers=headers5)
            # html = urlopen(req)
            bsObj = BeautifulSoup(html, "html.parser")
            # print(bsObj.find('title').get_text())
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", siteUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", siteUrl)
        except socket.timeout as e:
            print("-----socket timeout:", siteUrl)

        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(siteUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        # time.sleep()

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Split /forum-191-1.html into its parts so the page links can be reassembled
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex+1]        # everything up to and including the last "/"
    tempbianhaoqian = siteUrl[siteindex+1:-6]   # "forum-191-"

    # Collect the page numbers
    bianhao = []   # page numbers
    pageUrl = []   # page links

    templist1 = bsObj.find("div", {"class": "pg"})
    # if templist1 == None:
    #     return
    for templist2 in templist1.findAll("a", href=re.compile("forum-([0-9]+)-([0-9]+).html")):
        if templist2 == None:
            continue
        lianjie = templist2.attrs['href']
        # print(lianjie)
        index1 = lianjie.rfind("-")   # position of "-" in the link
        index2 = lianjie.rfind(".")   # position of "." in the link
        tempbianhao = lianjie[index1+1:index2]
        bianhao.append(int(tempbianhao))
    bianhaoMax = max(bianhao)   # largest page number

    for i in range(1, bianhaoMax+1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"   # assemble the page link
        print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl   # list of page links
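A quick usage sketch (the host part of every URL is elided throughout this post, so substitute the real board address):

# Hypothetical call; replace the elided host with the real board address.
page_urls = GetALLPageUrl("/forum-191-1.html")
print(len(page_urls), "page links found, for example:", page_urls[:3])

Note that the function only reads the largest page number out of the pagination bar and then rebuilds every forum-191-N.html link itself, so a single request is enough to enumerate the whole board.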

2. Collect the links of the posts on each page

Each post link sits in the href of an anchor inside its thread row (each row is a tbody with an id of the form "normalthread_<id>"), so the following code collects them:

BeautifulSoup version:

# Get the links of all posts on the current board page
def GetCurrentPageTieziUrl(PageUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '121.22.252.85:8000'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page
    req = request.Request(PageUrl, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Split /forum-191-1.html so the post links can be assembled
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex+1]   # everything up to and including the last "/"
    # print(tempsiteurl)
    TieziUrl = []
    # Collect the post links
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            tempteiziUrl = tempsiteurl + templist2.attrs['href']   # assemble the post link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl   # list of post links

Selenium version:

# Get the links of all posts on the current board page
def GetCurrentPageTieziUrl(PageUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '110.73.30.157:8123'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        try:
            driver.get(PageUrl)   # load the page twice to get past the reload redirect
            driver.get(PageUrl)
        except TimeoutError:
            driver.refresh()

        # print(driver.page_source)
        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            # req = request.Request(tieziUrl, headers=headers5)
            # html = urlopen(req)
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", PageUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", PageUrl)
        except socket.timeout as e:
            print("-----socket timeout:", PageUrl)

        n = 0
        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(PageUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
            n = n + 1
            if n == 10:
                driver.close()   # close the current window
                driver.quit()    # quit the Chrome browser
                return 1

    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        time.sleep(1)

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Split /forum-191-1.html so the post links can be assembled
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex+1]   # everything up to and including the last "/"
    # print(tempsiteurl)
    TieziUrl = []
    # Collect the post links
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        if templist1 == None:
            continue
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            if templist2 == None:
                continue
            tempteiziUrl = tempsiteurl + templist2.attrs['href']   # assemble the post link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl   # list of post links
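A matching usage sketch for this step (host again elided); the function returns 1 when the page keeps showing the reload screen, so that case has to be checked:

# Hypothetical call; GetCurrentPageTieziUrl returns 1 if the page never gets past the reload screen.
tiezi_urls = GetCurrentPageTieziUrl("/forum-191-2.html")
if tiezi_urls != 1:
    for url in tiezi_urls:
        print(url)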

3. From each post, extract the information to be scraped (registration number, name, sex, date of birth, height when missing, time of disappearance, place of disappearance, and whether a police report was filed) and write it to the CSV file

Inspecting individual posts shows that the missing-person details all sit inside a <ul> tag within the post body (the <td class="t_f"> cell), so the following code extracts them.

BeautifulSoup version:

# Extract the missing-person details from the current post page
# tieziUrl is the link of the post
def CurrentPageMissingPopulationInformation(tieziUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '210.136.17.78:8080'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # Fetch the page
    req = request.Request(tieziUrl, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    html.close()
    # Locate the information we want
    templist1 = bsObj.find("td", {"class": "t_f"}).ul
    if templist1 == None:   # the post body has no <ul>, so give up on this post
        return
    mycsv = ['NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL']   # initialise the row of extracted fields
    for templist2 in templist1.findAll("font", size=re.compile("^([0-9]+)*$")):
        if len(templist2) == 0:
            continue
        tempText = templist2.get_text()
        if "宝贝回家编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "寻亲编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "登记编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "姓" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[1] = tempText
        if "性" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[2] = tempText
        if "出生日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[3] = tempText
        if "失踪时身高" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[4] = tempText
        if "失踪时间" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪地点" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[6] = tempText
        if "是否报案" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[7] = tempText
    try:
        writer.writerow((str(mycsv[0]), str(mycsv[1]), str(mycsv[2]), str(mycsv[3]),
                         str(mycsv[4]), str(mycsv[5]), str(mycsv[6]), str(mycsv[7])))   # write one CSV row
    finally:
        time.sleep(1)   # pause after scraping each post; 1 second for now

Selenium version:

# Extract the missing-person details from the current post page
# tieziUrl is the link of the post
def CurrentPageMissingPopulationInformation(tieziUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '128.199.169.17:80'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        # driver.implicitly_wait(30)
        try:
            driver.get(tieziUrl)   # load the page twice to get past the reload redirect
            driver.get(tieziUrl)
        except TimeoutError:
            driver.refresh()

        # print(driver.page_source)
        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            # req = request.Request(tieziUrl, headers=headers5)
            # html = urlopen(req)
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", tieziUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", tieziUrl)
        except socket.timeout as e:
            print("-----socket timeout:", tieziUrl)

        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(tieziUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        time.sleep(0.5)

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Locate the information we want
    templist1 = bsObj.find("td", {"class": "t_f"}).ul
    if templist1 == None:   # the post body has no <ul>, so give up on this post
        print("The current post page has no <ul> section")
        return 1
    mycsv = ['NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL']   # initialise the row of extracted fields
    for templist2 in templist1.findAll("font", size=re.compile("^([0-9]+)*$")):
        tempText = templist2.get_text()
        if "宝贝回家编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "寻亲编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "登记编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "姓" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[1] = tempText
        if "性" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[2] = tempText
        if "出生日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[3] = tempText
        if "失踪时身高" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[4] = tempText
        if "失踪时间" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪地点" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[6] = tempText
        if "是否报案" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[7] = tempText
    try:
        writer.writerow((str(mycsv[0]), str(mycsv[1]), str(mycsv[2]), str(mycsv[3]),
                         str(mycsv[4]), str(mycsv[5]), str(mycsv[6]), str(mycsv[7])))   # write one CSV row
        csvfile.flush()   # flush this row to the CSV file immediately
    finally:
        print("Finished writing this post's information\n")
        time.sleep(5)   # pause after scraping each post
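The long if-chain above maps a fixed set of field labels onto CSV columns. As a design note, the same mapping could be expressed as a lookup table; the sketch below is only an illustration (the labels are copied from the code above, the helper name fill_row is made up) and behaves essentially like the explicit branches:

# Hypothetical compaction of the label-matching logic above.
LABEL_TO_COLUMN = {
    "宝贝回家编号": 0, "寻亲编号": 0, "登记编号": 0,
    "姓": 1, "性": 2, "出生日期": 3, "失踪时身高": 4,
    "失踪时间": 5, "失踪日期": 5, "失踪地点": 6, "是否报案": 7,
}

def fill_row(font_texts):
    row = ["NULL"] * 8
    for text in font_texts:                     # text of each <font> tag in the post body
        for label, col in LABEL_TO_COLUMN.items():
            if label in text[:6]:
                value = text[text.find(":") + 1:]
                row[col] = value if value else "NULL"
                break
    return row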

The complete code is attached below. It is for reference only and must not be used commercially. Web crawlers can easily place a heavy load on a site's servers, and I accept no legal responsibility for any consequences arising from anyone's use of this code. I am posting it so that readers can learn how the crawler works; please just study the site's structure and do not run the code in a way that adds to the site's load. My own careless use already got my IP banned, so let that be a warning. The missing-person data was scraped only so that the spatial pattern of disappearances could be analysed; I apologise for any inconvenience this caused the site.

The complete code:

# __author__ = 'Administrator'
# coding=utf-8
import io
import os
import sys
import math
import urllib
from urllib.request import urlopen
from urllib.request import urlretrieve
from urllib import request
from bs4 import BeautifulSoup
import re
import time
import socket
import csv
from selenium import webdriver

socket.setdefaulttimeout(5000)   # global socket timeout

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
# sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

# Several header sets, to masquerade as different browsers
headers1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
headers3 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
headers4 = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.2372.400 QQBrowser/9.5.10548.400'}
headers5 = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Connection': 'keep-alive',
            'Host': '',
            'Referer': '/forum-191-1.html',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

headers6 = {'Host': '',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
            }


# Extract the missing-person details from the current post page
# tieziUrl is the link of the post
def CurrentPageMissingPopulationInformation(tieziUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '128.199.169.17:80'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        # driver.implicitly_wait(30)
        try:
            driver.get(tieziUrl)   # load the page twice to get past the reload redirect
            driver.get(tieziUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", tieziUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", tieziUrl)
        except socket.timeout as e:
            print("-----socket timeout:", tieziUrl)

        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(tieziUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        time.sleep(0.5)

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Locate the information we want
    templist1 = bsObj.find("td", {"class": "t_f"}).ul
    if templist1 == None:   # the post body has no <ul>, so give up on this post
        print("The current post page has no <ul> section")
        return 1
    mycsv = ['NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL']   # initialise the row of extracted fields
    for templist2 in templist1.findAll("font", size=re.compile("^([0-9]+)*$")):
        tempText = templist2.get_text()
        if "宝贝回家编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "寻亲编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "登记编号" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            if len(tempText) == 0:
                tempText = "NULL"
            mycsv[0] = tempText
        if "姓" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[1] = tempText
        if "性" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[2] = tempText
        if "出生日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[3] = tempText
        if "失踪时身高" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[4] = tempText
        if "失踪时间" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪日期" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[5] = tempText
        if "失踪地点" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[6] = tempText
        if "是否报案" in tempText[0:6]:
            print(tempText)
            index = tempText.find(":")
            tempText = tempText[index+1:]
            mycsv[7] = tempText
    try:
        writer.writerow((str(mycsv[0]), str(mycsv[1]), str(mycsv[2]), str(mycsv[3]),
                         str(mycsv[4]), str(mycsv[5]), str(mycsv[6]), str(mycsv[7])))   # write one CSV row
        csvfile.flush()   # flush this row to the CSV file immediately
    finally:
        print("Finished writing this post's information\n")
        time.sleep(5)   # pause after scraping each post


# Get all the page links of the current board
# siteUrl is the URL of the board
def GetALLPageUrl(siteUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '123.207.143.51:8080'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        # driver.implicitly_wait(30)
        try:
            driver.get(siteUrl)   # load the page twice to get past the reload redirect
            driver.get(siteUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            bsObj = BeautifulSoup(html, "html.parser")
            # print(bsObj.find('title').get_text())
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", siteUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", siteUrl)
        except socket.timeout as e:
            print("-----socket timeout:", siteUrl)

        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(siteUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        # time.sleep()

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Split /forum-191-1.html into its parts so the page links can be reassembled
    siteindex = siteUrl.rfind("/")
    tempsiteurl = siteUrl[0:siteindex+1]        # everything up to and including the last "/"
    tempbianhaoqian = siteUrl[siteindex+1:-6]   # "forum-191-"

    # Collect the page numbers
    bianhao = []   # page numbers
    pageUrl = []   # page links

    templist1 = bsObj.find("div", {"class": "pg"})
    # if templist1 == None:
    #     return
    for templist2 in templist1.findAll("a", href=re.compile("forum-([0-9]+)-([0-9]+).html")):
        if templist2 == None:
            continue
        lianjie = templist2.attrs['href']
        # print(lianjie)
        index1 = lianjie.rfind("-")   # position of "-" in the link
        index2 = lianjie.rfind(".")   # position of "." in the link
        tempbianhao = lianjie[index1+1:index2]
        bianhao.append(int(tempbianhao))
    bianhaoMax = max(bianhao)   # largest page number

    for i in range(1, bianhaoMax+1):
        temppageUrl = tempsiteurl + tempbianhaoqian + str(i) + ".html"   # assemble the page link
        print(temppageUrl)
        pageUrl.append(temppageUrl)
    return pageUrl   # list of page links


# Get the links of all posts on the current board page
def GetCurrentPageTieziUrl(PageUrl):
    # Set up proxy access (proxy IPs can be obtained from /)
    proxy_handler = urllib.request.ProxyHandler({'http': '110.73.30.157:8123'})  # the key must be the protocol name
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)

    try:
        # Use the third-party package selenium to open a browser and load the page
        driver = webdriver.Chrome()        # open Chrome
        # driver = webdriver.PhantomJS()   # or the headless PhantomJS browser
        driver.set_page_load_timeout(10)
        try:
            driver.get(PageUrl)   # load the page twice to get past the reload redirect
            driver.get(PageUrl)
        except TimeoutError:
            driver.refresh()

        html = driver.page_source   # the page source after the browser has executed it
        # Parse the page, catching errors raised during parsing
        try:
            bsObj = BeautifulSoup(html, "html.parser")
        except UnicodeDecodeError as e:
            print("-----UnicodeDecodeError url", PageUrl)
        except urllib.error.URLError as e:
            print("-----urlError url:", PageUrl)
        except socket.timeout as e:
            print("-----socket timeout:", PageUrl)

        n = 0
        while bsObj.find('title').get_text() == "页面重载开启":
            print("This is still the pre-reload page; refreshing once more to reach the redirected page\n")
            driver.get(PageUrl)
            html = driver.page_source
            bsObj = BeautifulSoup(html, "html.parser")
            n = n + 1
            if n == 10:
                driver.close()   # close the current window
                driver.quit()    # quit the Chrome browser
                return 1

    except Exception as e:
        driver.close()   # close the current window
        driver.quit()    # quit the Chrome browser
        time.sleep(1)

    driver.close()   # close the current window
    driver.quit()    # quit the Chrome browser

    # Split /forum-191-1.html so the post links can be assembled
    siteindex = PageUrl.rfind("/")
    tempsiteurl = PageUrl[0:siteindex+1]   # everything up to and including the last "/"
    # print(tempsiteurl)
    TieziUrl = []
    # Collect the post links
    for templist1 in bsObj.findAll("tbody", id=re.compile("normalthread_([0-9]+)")):
        if templist1 == None:
            continue
        for templist2 in templist1.findAll("a", {"class": "s xst"}):
            if templist2 == None:
                continue
            tempteiziUrl = tempsiteurl + templist2.attrs['href']   # assemble the post link
            print(tempteiziUrl)
            TieziUrl.append(tempteiziUrl)
    return TieziUrl   # list of post links


# CurrentPageMissingPopulationInformation("/thread-213126-1-1.html")
# GetALLPageUrl("/forum-191-1.html")
# GetCurrentPageTieziUrl("/forum-191-1.html")

if __name__ == '__main__':
    csvfile = open("E:/MissingPeople.csv", "w+", newline="", encoding='gb18030')
    writer = csv.writer(csvfile)
    writer.writerow(('宝贝回家编号', '姓名', '性别', '出生日期', '失踪时身高', '失踪时间', '失踪地点', '是否报案'))
    pageurl = GetALLPageUrl("/forum-191-1.html")     # the "寻找失踪宝贝" (missing children) board
    # pageurl = GetALLPageUrl("/forum-189-1.html")   # the "被拐宝贝回家" (abducted children who came home) board
    time.sleep(5)
    print("All page links collected!\n")
    n = 0
    for templist1 in pageurl:
        # print(templist1)
        tieziurl = GetCurrentPageTieziUrl(templist1)
        time.sleep(5)
        print("All post links on page " + str(templist1) + " collected!\n")
        if tieziurl == 1:
            print("Could not get the posts on the current page!\n")
            continue
        else:
            for templist2 in tieziurl:
                # print(templist2)
                n = n + 1
                print("\nCollecting record " + str(n) + "!")
                time.sleep(5)
                tempzhi = CurrentPageMissingPopulationInformation(templist2)
                if tempzhi == 1:
                    print("\nRecord " + str(n) + " is empty!")
                    continue
        print('')
    print("Scraping finished! You can safely close the program.")
    csvfile.close()

Screenshot of the resulting CSV file:
