Python Web Scraping

Request Library (requests)

Installation

```
pip install requests
```

Usage

GET request with parameters

```python
import requests

response = requests.get("http://httpbin.org/get?name=Tom&age=22")
print(response.text)
```
```python
import requests

# pass query parameters as a dict instead of building the URL by hand
data = {
    'name': 'Tom',
    'age': '22'
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
```
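The two requests above produce the same URL: requests encodes the `params` dict into a query string for you. The encoding step can be reproduced with the standard library alone (a small illustration, not part of the original post):

```python
from urllib.parse import urlencode

# requests builds the query string from the params dict roughly like this
data = {'name': 'Tom', 'age': '22'}
query = urlencode(data)
print("http://httpbin.org/get?" + query)  # http://httpbin.org/get?name=Tom&age=22
```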

Fetching binary data (for saving images and videos)

```python
import requests

response = requests.get("https://github.com/favicon.ico")
print(response.content)  # raw bytes of the response body
```

Saving the image to a local file:

```python
import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)  # the with statement closes the file automatically
```

Adding headers

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
response = requests.get('https://zhihu.com/explore', headers=headers)
print(response.text)
```

Basic POST request

```python
import requests

data = {'name': 'Tom', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
response = requests.post('https://httpbin.org/post', data=data, headers=headers)
print(response.text)
```

File upload

```python
import requests

# open the file in binary mode; the with statement closes it afterwards
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post('https://httpbin.org/post', files=files)
print(response.text)
```

Session persistence

```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456')  # server sets a cookie
response = s.get('http://httpbin.org/cookies')          # the session sends it back
print(response.text)
```
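The cookie jar can also be inspected without a network round-trip. This small sketch (not from the original post) sets a cookie locally to show that a `Session` keeps state between calls:

```python
import requests

s = requests.Session()
# put a cookie in the session's jar by hand; a real server response would do the same
s.cookies.set('number', '123456')
# every later request made through s will carry this cookie automatically
print(s.cookies.get('number'))
```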

Certificate verification

```python
import requests

# verify=False skips TLS certificate verification (use only when necessary)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
```

Parsing Library (BeautifulSoup)

Installation

```
pip install beautifulsoup4
pip install lxml
```

Standard selectors

```python
from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
html = requests.get('https://zhihu.com/explore', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
s = soup.find_all('h2')  # select by tag name
# s = soup.find_all(attrs={'data-za-element-name': 'Title'})  # select by attribute
# s = soup.find_all(text=re.compile("考研"))  # match text content with a regular expression
for a in s:
    print(a.string)  # print the text content
```
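Since the Zhihu page changes over time, the selectors above are easier to verify on a fixed snippet. A minimal, self-contained sketch (using the stdlib `html.parser` backend so lxml is not required; the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <h2><a href="/question/1">First question</a></h2>
  <h2><a href="/question/2">Second question</a></h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]  # same tag selection as above
print(titles)  # ['First question', 'Second question']
```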

CSS selectors

Getting text content and attribute values

```python
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
html = requests.get('https://zhihu.com/explore', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
s = soup.select('h2 a')  # CSS selector: <a> tags nested under <h2>
for a in s:
    # get the stripped text and build an absolute URL from the href attribute
    print(a.get_text(strip=True), 'https://zhihu.com' + a.attrs['href'])
```

Searching by class name

```python
s = soup.select('.question_link')  # CSS selection by class name
```
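On a fixed snippet, class selection works the same way. A small sketch with a made-up `question_link` class, mirroring the selector above:

```python
from bs4 import BeautifulSoup

html = '<a class="question_link" href="/q/1">One</a><a class="other" href="/q/2">Two</a>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('.question_link')  # only elements with class="question_link" match
print([a.get_text() for a in links])  # ['One']
```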

Please credit the source when reposting. You are welcome to verify the citations in this article and to point out anything that is wrong or unclear, either in the comment section below or by email to jaytp@qq.com.

Article title: Python Web Scraping

Author: 子非鱼

Published: 2018-11-04, 11:19:29

Last updated: 2018-10-07, 09:07:30

Original link: https://Wangsr.cn/2018/11/04/2018-2018-01-30-python爬虫/

Copyright: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.
