Python web scraping tips

A brief introduction to each method

import re

secret_code = 'hadkflifexxIxxfasdjifja134xxlovexx23345sddfxxyouxx8dfse'

Example of using .

a = 'xy123'
# . is a placeholder for any single character: to match n characters after x, write n dots
b = re.findall('x.', a)
print(b)  # ['xy']
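A small sketch of the placeholder rule, reusing the sample string above: each extra dot consumes one more character, and the pattern fails if the string runs out of characters. The variable names are chosen only for illustration.

```python
import re

text = 'xy123'  # same sample string as in the example above

one_dot = re.findall('x.', text)          # x plus exactly one character
three_dots = re.findall('x...', text)     # x plus exactly three characters
too_many = re.findall('x.......', text)   # needs 8 characters, the string has only 5

print(one_dot)
print(three_dots)
print(too_many)
```

The last call returns an empty list, because a dot never matches "nothing".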

Example of using *

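a = 'xyxy123'
# * matches zero or more of the preceding character, and findall tries every position.
# x appears at the 1st and 3rd positions; every other position yields an empty match.
b = re.findall('x*', a)
print(b)  # ['x', '', 'x', '', '', '', '', '']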
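To make the "zero or more" meaning of * more concrete, here is a minimal sketch with an illustrative sample string that is not part of the original notes. It also shows one possible way to drop the empty matches that a pattern like x* produces.

```python
import re

text = 'x xy xyy'  # illustrative sample string

# y* means "zero or more y", so a lone x matches too
matches = re.findall('xy*', text)
print(matches)

# A pattern that can match nothing (such as x*) yields empty strings;
# filtering them out keeps only the real hits
non_empty = [m for m in re.findall('x*', 'xyxy123') if m]
print(non_empty)
```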

Example of using ?

a = 'xyx123'
# ? matches zero or one of the preceding character
b = re.findall('x?', a)
print(b)  # ['x', '', 'x', '', '', '', '']
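A typical use of ? is an optional character. The sketch below assumes an illustrative sample string with two spellings of one word; it is not taken from the original notes.

```python
import re

text = 'color colour colouur'  # illustrative sample string

# u? allows zero or one u, so "colouur" (two u's) does not match
matches = re.findall('colou?r', text)
print(matches)
```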

The most important pattern is (.*?)

Using .* (greedy matching)

b = re.findall('xx.*xx', secret_code)
print(b)  # ['xxIxxfasdjifja134xxlovexx23345sddfxxyouxx']

Using .*? (non-greedy matching)

c = re.findall('xx.*?xx', secret_code)
print(c)  # ['xxIxx', 'xxlovexx', 'xxyouxx']
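The same greedy versus non-greedy contrast shows up constantly when scraping HTML. A minimal sketch, assuming a made-up two-tag snippet:

```python
import re

html = '<b>one</b><b>two</b>'  # illustrative snippet, not from a real page

# Greedy: .* runs to the LAST </b>, swallowing both tags in one match
greedy = re.findall('<b>.*</b>', html)

# Non-greedy: .*? stops at the FIRST </b> it can, giving one match per tag
lazy = re.findall('<b>.*?</b>', html)

print(greedy)
print(lazy)
```

For scraping, the non-greedy form is usually the one you want, since each element is returned separately.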

The difference between using and not using parentheses

# With parentheses, findall returns only the captured group, not the whole match
d = re.findall('xx(.*?)xx', secret_code)
print(d)  # ['I', 'love', 'you']
for each in d:
    print(each)

Output:
I
love
you

s = '''sdfxxhello
xxfsdfxxworldxxasf'''
d = re.findall('xx(.*?)xx', s)  # ['fsdf']
k = re.findall('xx(.*?)xx', s, re.S)  # ['hello\n', 'world']; re.S makes . match \n as well
print(d)
print(k)
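A stripped-down sketch of what re.S changes, using an illustrative two-line string. re.S is the short name for re.DOTALL in the standard library.

```python
import re

text = 'xxline one\nline twoxx'  # illustrative string containing a newline

# Without re.S, . cannot cross the newline, so nothing matches
without_flag = re.findall('xx(.*?)xx', text)

# With re.S (alias of re.DOTALL), . also matches '\n'
with_flag = re.findall('xx(.*?)xx', text, re.S)

print(without_flag)
print(with_flag)
```

Since HTML source is almost always spread over many lines, scraping patterns that span tags generally need this flag.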

The difference between findall and search

findall keeps matching until the whole string has been scanned.
search returns only the first match and then stops.

s2 = 'asdfxxIxx123xxlovexxdef'
f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(2)
print(f)  # love
f2 = re.findall('xx(.*?)xx123xx(.*?)xx', s2)
print(f2[0][1])  # love
print(f2[0][0])  # I
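One practical difference worth sketching: search returns None when nothing matches, so calling .group() directly raises an AttributeError, while findall simply returns an empty list. The helper first_group below is a hypothetical name introduced only for this illustration.

```python
import re

def first_group(pattern, text):
    """Return group 1 of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

s2 = 'asdfxxIxx123xxlovexxdef'

found = first_group('xx(.*?)xx', s2)            # first match only
missing = first_group('xx(.*?)xx', 'no marks')  # no match, so None
all_pairs = re.findall('xx(.*?)xx123xx(.*?)xx', s2)  # list of tuples, one per match
nothing = re.findall('xx(.*?)xx', 'no marks')   # empty list, no exception

print(found, missing, all_pairs, nothing)
```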

Example of using sub

Substitution

s = '1324646'
o = re.sub('1(.*?)6', '55', s)  # '5546'
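sub replaces every non-overlapping match by default. The sketch below, on an illustrative string with three matches, also shows the count keyword and a \1 backreference to the captured group.

```python
import re

s = '16a126b1336'  # illustrative string with three matches of 1(.*?)6

replace_all = re.sub('1(.*?)6', '55', s)             # every match replaced
replace_first = re.sub('1(.*?)6', '55', s, count=1)  # only the first match
keep_inner = re.sub('1(.*?)6', r'[\1]', s)           # \1 reuses the captured text

print(replace_all)
print(replace_first)
print(keep_inner)
```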

Regular expression symbols and methods: common tips

Different import styles lead to different usage

1. import re (recommended)
2. from re import *
3. from re import findall, search, sub, S
4. There is no need to use compile.
5. Use \d+ to match runs of digits.
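The import styles and the optional compile step listed above can be compared side by side. This is only a sketch on an illustrative string; all three calls are expected to give the same result, since the module-level functions compile and cache the pattern internally.

```python
import re
from re import findall, S

text = 'xxhello\nxx'  # illustrative string

via_module = re.findall('xx(.*?)xx', text, re.S)  # style 1: names prefixed with re.
via_names = findall('xx(.*?)xx', text, S)         # style 3: names imported directly

# Explicit compile also works, but is optional for simple scripts
compiled = re.compile('xx(.*?)xx', re.S)
via_compiled = compiled.findall(text)

print(via_module, via_names, via_compiled)
```

Keeping the re. prefix makes it obvious where findall and S come from, which is one reason the plain import is the recommended style.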
Matching digits

a = 'djdd123456kkl'
b = re.findall(r'(\d+)', a)
print(b)  # ['123456']
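When a string holds several numbers, \d+ returns each run of digits as a separate string. A minimal sketch with an illustrative input, converting the results to int afterwards:

```python
import re

text = 'page 12 of 345, id=0078'  # illustrative string

# The raw string r'\d+' keeps the backslash from being treated as a Python escape
digits = re.findall(r'\d+', text)
numbers = [int(d) for d in digits]  # findall always returns strings

print(digits)
print(numbers)
```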

A small Python scraping script

import re

old_url = 'http://www.jikexueyuan.com/course/android/?pageNum=2'
total_page = 20
with open('text.html', 'r', encoding='utf-8') as f:
    Html = f.read()

# Scrape the title
title = re.search(r'<title>(.*?)</title>', Html, re.S).group(1)  # re.S lets . match newlines too
print(title)

# Scrape the links
Link = re.findall(r'href="(.*?)"', Html)
print(Link)

# Grab the larger block first, then the smaller pieces inside it
text = re.findall(r'<ul>(.*?)</ul>', Html, re.S)[0]
text_show = re.findall(r'>(.*?)</a>', text, re.S)
for i in text_show:
    print(i)

# Use sub to build the paginated URLs
for i in range(2, total_page + 1):
    # The fourth positional argument of re.sub is count, so re.S is not passed here
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)
