Python Web Scraping

Request Library (requests)

Installation

```
pip install requests
```

Usage

GET request with parameters

```python
import requests

response = requests.get("http://httpbin.org/get?name=Tom&age=22")
print(response.text)
```
```python
import requests

# pass query parameters as a dict instead of building the URL by hand
data = {
    'name': 'Tom',
    'age': '22'
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
```
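The two requests above produce the same URL: requests encodes the `params` dict into a query string for you. The encoding step can be reproduced with the standard library alone (a small illustration, not part of the original post):

```python
from urllib.parse import urlencode

# requests builds the query string from the params dict roughly like this
data = {'name': 'Tom', 'age': '22'}
query = urlencode(data)
print("http://httpbin.org/get?" + query)  # http://httpbin.org/get?name=Tom&age=22
```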

Fetching binary data (for saving images and videos)

```python
import requests

response = requests.get("https://github.com/favicon.ico")
print(response.content)  # raw bytes of the response body
```

Saving the image to a local file:

```python
import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)  # the with statement closes the file automatically
```

Adding headers

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
response = requests.get('https://zhihu.com/explore', headers=headers)
print(response.text)
```

Basic POST request

```python
import requests

data = {'name': 'Tom', 'age': '22'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
response = requests.post('https://httpbin.org/post', data=data, headers=headers)
print(response.text)
```

File upload

```python
import requests

# open the file in binary mode; the with statement closes it afterwards
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post('https://httpbin.org/post', files=files)
print(response.text)
```

Session persistence

```python
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456')  # server sets a cookie
response = s.get('http://httpbin.org/cookies')          # the session sends it back
print(response.text)
```
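The cookie jar can also be inspected without a network round-trip. This small sketch (not from the original post) sets a cookie locally to show that a `Session` keeps state between calls:

```python
import requests

s = requests.Session()
# put a cookie in the session's jar by hand; a real server response would do the same
s.cookies.set('number', '123456')
# every later request made through s will carry this cookie automatically
print(s.cookies.get('number'))
```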

Certificate verification

```python
import requests

# verify=False skips TLS certificate verification (use only when necessary)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
```

Parsing Library (BeautifulSoup)

Installation

```
pip install beautifulsoup4
pip install lxml
```

Standard selectors

```python
from bs4 import BeautifulSoup
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
html = requests.get('https://zhihu.com/explore', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
s = soup.find_all('h2')  # select by tag name
# s = soup.find_all(attrs={'data-za-element-name': 'Title'})  # select by attribute
# s = soup.find_all(text=re.compile("考研"))  # match text content with a regular expression
for a in s:
    print(a.string)  # print the text content
```
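Since the Zhihu page changes over time, the selectors above are easier to verify on a fixed snippet. A minimal, self-contained sketch (using the stdlib `html.parser` backend so lxml is not required; the HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div>
  <h2><a href="/question/1">First question</a></h2>
  <h2><a href="/question/2">Second question</a></h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]  # same tag selection as above
print(titles)  # ['First question', 'Second question']
```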

CSS selectors

Getting text content and attribute values

```python
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
html = requests.get('https://zhihu.com/explore', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
s = soup.select('h2 a')  # CSS selector: <a> tags nested under <h2>
for a in s:
    # get the stripped text and build an absolute URL from the href attribute
    print(a.get_text(strip=True), 'https://zhihu.com' + a.attrs['href'])
```

Searching by class name

```python
s = soup.select('.question_link')  # CSS selection by class name
```
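On a fixed snippet, class selection works the same way. A small sketch with a made-up `question_link` class, mirroring the selector above:

```python
from bs4 import BeautifulSoup

html = '<a class="question_link" href="/q/1">One</a><a class="other" href="/q/2">Two</a>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('.question_link')  # only elements with class="question_link" match
print([a.get_text() for a in links])  # ['One']
```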

Please credit the source when reposting. You are welcome to verify the citations in this article and to point out anything that is wrong or unclear, either in the comment section below or by email to jaytp@qq.com.

Article title: Python Web Scraping

Author: 子非鱼

Published: 2018-11-04, 11:19:29

Last updated: 2018-10-07, 09:07:30

Original link: https://Wangsr.cn/2018/11/04/2018-2018-01-30-python爬虫/

Copyright: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.
