Web crawlers matter a great deal for gathering and analysing information, and I hope that learning to write them will make future research and day-to-day tasks easier, so here is a summary of what I know about crawling at this stage. Before scraping a page, first work out whether it is static or dynamic: static pages are easy to scrape, while for dynamic pages you have to analyse the requests behind the dynamically loaded content (find the URL those requests go to, and work out which request headers must be sent). The main tools are:
- Python
- urllib (Python standard library), requests, BeautifulSoup, selenium (+ webdriver), scrapy
- Chrome/Chromium
- Developer Tools (Elements, Network→XHR/JS, Console)
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

# Static pages: pass the URL straight to urlopen.
# Dynamic pages: build a Request carrying the right headers first, then call urlopen on it.
# -------------- Static page --------------
base_url = "https://baike.baidu.com"
url = base_url + "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
sub_urls = soup.find_all("a", {"target": "_blank"})
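# Illustrative check (a sketch, not required by the scrape itself): print the anchor text
# and href of the first few matches. baike's item links are relative paths, so base_url
# has to be prepended before they can be fetched.
for link in sub_urls[:5]:
    href = link.get('href', '')
    print(link.get_text(), base_url + href if href.startswith('/') else href)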
# -------------- Dynamic page --------------
url = 'https://modelzoo.co/api/models/0/'  # the dynamically loaded endpoint, found under Network→XHR
headers = {  # copied from Network→XHR→0/→Headers (Request Headers)
    "Host": "modelzoo.co",
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
    "Referer": "https://modelzoo.co/",
    # "Accept-Encoding": "gzip, deflate, br",
    # With gzip removed the body decodes as text; with gzip kept it arrives as compressed
    # bytes, because urlopen does not decompress responses automatically.
    "Accept-Encoding": "deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": "_ga=GA1.2.920681968.1558316088; _gid=GA1.2.2105351998.1558316088"
}
req = Request(url, None, headers)
response = urlopen(req)
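# Sketch of the remaining step: read and decode the body, assuming the server responds
# uncompressed (as the Accept-Encoding note above suggests) and returns JSON, which the
# Accept header asks for; json is in the standard library.
import json
data = json.loads(response.read().decode('utf-8'))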
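A rough alternative sketch with requests (listed among the tools above but not used in the snippet): requests negotiates and decompresses Content-Encoding such as gzip on its own, so the Accept-Encoding workaround is unnecessary. It reuses the same url and headers as the urllib version.

import requests

# Drop the hand-tuned Accept-Encoding entry and let requests negotiate compression itself.
resp = requests.get(url, headers={k: v for k, v in headers.items() if k != "Accept-Encoding"})
data = resp.json()  # decode and parse the JSON body in one step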
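When the dynamically loaded request is awkward to reproduce by hand, selenium plus a Chrome webdriver (both in the tool list) can let the browser execute the page's JavaScript and hand the rendered HTML to BeautifulSoup. This is only a minimal sketch, assuming a recent selenium release that can locate chromedriver on its own; it loads the same site's front page as an example.

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()          # start Chrome/Chromium through its webdriver
driver.get("https://modelzoo.co/")   # the browser runs the page's JavaScript
soup = BeautifulSoup(driver.page_source, features='lxml')  # parse the rendered HTML
driver.quit()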