Web Crawlers

Writing a web crawler requires installing the following libraries:

  • pip install requests
  • pip install beautifulsoup4
  • pip install lxml
  • pip install html5lib, or pip install requests-html

Fetch the HTML source text of a web page:

python
import requests
url = "https://jww.zjgsu.edu.cn/2021/1224/c1331a111803/page.html"
res = requests.get(url)
# Decode the body as UTF-8; without a charset header requests assumes ISO-8859-1, which garbles Chinese text
res.encoding = "utf-8"
print(res.text)

res = requests.get('http://www.baidu.com')
print(res)
# Prints <Response [200]> when the request succeeds; a 4xx status code means the request failed
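
Commonly used attributes of the response object:

  • response.encoding: the encoding used to decode the page
  • response.text: the body as decoded text
  • response.content: the body as raw bytes
  • response.status_code: the HTTP status code
  • response.url: the URL that was requested
  • response.headers: the HTTP response headers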
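
To see how response.content, response.encoding, and response.text relate, the sketch below works offline on a hand-made byte string instead of a real response. It assumes the wrong codec is ISO-8859-1, which is what requests falls back to for text responses that declare no charset; the sample string and variable names are illustrative only.

```python
# Stand-in for response.content: the UTF-8 bytes a server might send.
content = "网络爬虫".encode("utf-8")

# What response.text would hold if the encoding were wrongly taken as ISO-8859-1.
garbled = content.decode("ISO-8859-1")

# Repair 1: undo the wrong decode, then decode with the right codec.
repaired = garbled.encode("ISO-8859-1").decode("utf-8")

# Repair 2: decode the raw bytes directly, which is the effect of
# setting response.encoding = "utf-8" before reading response.text.
direct = content.decode("utf-8")

print(repaired == direct)
```

Both repairs give the same string, but the second is more robust: the round trip in repair 1 raises UnicodeEncodeError if the text was already decoded correctly.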

The full feature set of requests-html is only supported on Python 3.6 and later.

python
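from requests_html import HTMLSession  # the package is requests-html, the module is requests_html

session = HTMLSession()
url = 'https://www.dxsbb.com/news/7566.html'
r = session.get(url)
rows = r.html.find('tbody>tr')
# Open the output file once; the with block closes it even if an error occurs
with open(r"C:\Users\Asus\Desktop\111.txt", "a+", encoding='UTF-8') as f:
    for row in rows[:21]:  # first 21 table rows
        cells = row.text.split()
        # Center each cell in a 14-character column
        s = ''.join('{0:^14}'.format(cell) for cell in cells)
        print(s)
        f.write(s + '\n')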
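
The row formatting used above can be tried without a network connection. The following is a minimal sketch: the sample rows and the helper name format_row are made up for illustration, and a temporary file stands in for the desktop path.

```python
import os
import tempfile

def format_row(cells, width=14):
    """Center every cell in a fixed-width column and join them into one line."""
    return ''.join('{0:^{1}}'.format(cell, width) for cell in cells)

# Hypothetical sample data in place of the scraped table rows.
rows = [["1", "Alpha", "98.5"], ["2", "Beta", "97.0"]]
lines = [format_row(cells) for cells in rows]

# Open the file once and let the with block close it.
path = os.path.join(tempfile.gettempdir(), "rows_demo.txt")
with open(path, "w", encoding="UTF-8") as f:
    for line in lines:
        f.write(line + "\n")

with open(path, encoding="UTF-8") as f:
    print(f.read())
```

Note that str.format pads by character count, not display width, so columns that contain full-width Chinese characters may not line up visually.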

Scraping the Douban book rankings:

python
import requests
from lxml import etree
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://book.douban.com/top250', headers = headers).text
res = etree.HTML(html)
names = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/text()')
for name in names:
    print(name.strip())
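
The XPath approach can be checked offline against a small inline document. In this sketch the HTML snippet and the book titles are invented and are not Douban's real markup. It also suggests why the path above uses table/tr rather than table/tbody/tr: browser developer tools usually show a tbody element that the browser inserted itself, whereas lxml parses the markup as written.

```python
from lxml import etree

# Invented markup that loosely mimics a ranking table; not Douban's real page.
html = """
<div id="content">
  <table><tr><td>1</td><td><div><a href="#">
      Book A
  </a></div></td></tr></table>
  <table><tr><td>2</td><td><div><a href="#"> Book B </a></div></td></tr></table>
</div>
"""

tree = etree.HTML(html)
# text() returns the raw text nodes, including surrounding whitespace.
raw_titles = tree.xpath('//*[@id="content"]/table/tr/td[2]/div[1]/a/text()')
titles = [t.strip() for t in raw_titles]
# No tbody exists in the source, so a path that demands one finds nothing.
with_tbody = tree.xpath('//*[@id="content"]/table/tbody/tr/td[2]/div[1]/a/text()')

print(titles)
print(with_tbody)
```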

Another scraping example (fetching a classical poem, split into paragraphs):

python
import requests
from lxml import etree
#headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://so.gushiwen.cn/shiwenv_d16797ee39e4.aspx').text
res = etree.HTML(html)
names = res.xpath('//*[@id="contsond16797ee39e4"]/text()')
print(names)
name='\n'.join(names)
print(name)
# The with block closes the file automatically
with open(r"C:\Users\Asus\Desktop\111.txt", "a+", encoding='UTF-8') as f:
    f.write(name)
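
The list returned by text() often contains leading or trailing whitespace and empty entries, since each text node between tags becomes one item. Below is a minimal sketch of tidying such a list before joining it, using an invented sample list and a hypothetical helper join_lines:

```python
def join_lines(nodes):
    """Strip each text node, drop empty ones, and join the rest with newlines."""
    cleaned = [node.strip() for node in nodes]
    return '\n'.join(line for line in cleaned if line)

# Invented stand-in for the list an XPath text() query might return.
nodes = ['\n  first line of the poem', '  second line ', '\n', 'third line\n']
text = join_lines(nodes)
print(text)
```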

Downloading an image:

python
import requests
response = requests.get('https://github.com/favicon.ico')    # URL of the image
with open('favicon.ico', 'wb') as f:         # favicon.ico is the output filename; change it as needed
    f.write(response.content)
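
Since the output filename is free to choose, one option is to derive it from the URL. The helper below is a hypothetical sketch using only the standard library; the name filename_from_url and the fallback name download.bin are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, or a default when there is none."""
    name = posixpath.basename(urlparse(url).path)
    return name or default

print(filename_from_url('https://github.com/favicon.ico'))
print(filename_from_url('https://github.com/'))
```

The result could be passed to open(name, 'wb') in place of the hard-coded 'favicon.ico'.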

Released under the MIT License.