Python regular expressions: re.findall returns the matching strings

Date: 2020-12-25 02:13:30

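Python's re.findall method returns every substring that matches a regular expression, as a list.

re.findall(pattern, string[, flags]):

It searches string and returns all non-overlapping matches as a list. Here is a simple example first: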

import re

# compile the pattern once: one or more consecutive digits
p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

### output ###
# ['1', '2', '3', '4']
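The flags argument in the signature above is optional. As a minimal sketch (the sample text and variable names are made up for illustration), passing re.IGNORECASE to the module-level re.findall makes the match case-insensitive:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'Python python PYTHON java'

# Without flags the match is case-sensitive.
case_sensitive = re.findall(r'python', text)

# re.IGNORECASE makes the same pattern match regardless of case.
case_insensitive = re.findall(r'python', text, re.IGNORECASE)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'python', 'PYTHON']
```

Note that a compiled pattern's findall method does not take flags; with a compiled pattern, the flags are passed to re.compile instead.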

A slightly more complex example: given info = '<a href="">baidu</a>', suppose we want to extract the URL and the anchor text with a regular expression. findall() can do this:

import re

# group 1 captures the href value, group 2 captures the anchor text
relink = '<a href="(.*)">(.*)</a>'
info = '<a href="">baidu</a>'
cinfo = re.findall(relink, info)
print(cinfo)

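Output: [('', 'baidu')]. The result is a list; because the pattern contains two capture groups, each match is returned as a tuple holding the text captured by each group. If you need regex-based replacement instead, see Python's re.sub.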
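The shape of the list returned by re.findall depends on how many capture groups the pattern has. The following sketch, which uses a made-up sample string, shows the three cases side by side:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'x=1, y=22'

# No capture group: each element is the whole match.
no_group = re.findall(r'\w=\d+', text)

# One capture group: each element is the text of that group only.
one_group = re.findall(r'\w=(\d+)', text)

# Two capture groups: each element is a tuple with one entry per group.
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```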

Below is a sitemap crawler that uses re.findall:

import re
import urllib.error
import urllib.request

def download(url, user_agent='wswp', num_retries=2):
    print('downloading:', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        # pass the Request object so the User-agent header is actually sent;
        # decoding assumes the page is UTF-8 encoded
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5XX HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    if sitemap is None:
        # the download failed, so there is nothing to parse
        return
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
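The non-greedy quantifier in '<loc>(.*?)</loc>' matters when several <loc> elements sit on one line. The sketch below runs the same pattern on a hard-coded string instead of a downloaded file; the sitemap content and the example.com URLs are placeholders invented for illustration:

```python
import re

# Hypothetical single-line sitemap; the example.com URLs are placeholders.
sitemap = ('<urlset>'
           '<url><loc>http://example.com/page1</loc></url>'
           '<url><loc>http://example.com/page2</loc></url>'
           '</urlset>')

# Non-greedy (.*?) stops at the first closing </loc>, giving one URL per match.
links = re.findall(r'<loc>(.*?)</loc>', sitemap)

# Greedy (.*) runs on to the last </loc>, swallowing both entries in one match.
greedy = re.findall(r'<loc>(.*)</loc>', sitemap)

print(links)        # ['http://example.com/page1', 'http://example.com/page2']
print(len(greedy))  # 1
```

For real sitemaps an XML parser is usually more robust than a regular expression, but the regex approach is enough for simple, well-formed files.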
