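Getting Started with Python Web Crawlers

Basic Python Crawler Exercises

Preparation

requests

Compared with Python's built-in urllib library, requests is a more convenient HTTP library, and it is what most people choose when they start writing crawlers.

Usage example: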

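# Install
pip install requests

-------------------------------
# Most basic usage (see the official documentation for the other methods)
import requests

if __name__ == '__main__':
    target = 'URL'  # placeholder: replace with the address of the page to fetch
    req = requests.get(url=target)
    html = req.text
    print(html)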

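The printed result shows that we now have the page's HTML, but it contains a lot of information we do not need. The next step is to extract the parts we care about (filtering). Regular expressions can do this, and so can the BeautifulSoup library (flexible, convenient, efficient, and able to work with several parsers).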
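As a rough illustration of the regular-expression route, here is a minimal sketch using only the standard library. The HTML string is made up for the example, and the patterns are assumptions that fit this tidy markup only; real pages are usually messier, which is one reason a parser is often preferred.

```python
import re

# Hypothetical page content, standing in for req.text
html = (
    '<html><head><title>Demo page</title></head>'
    '<body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'
)

# Non-greedy match of whatever sits between <title> and </title>
title_match = re.search(r"<title>(.*?)</title>", html, re.S)
title = title_match.group(1) if title_match else None

# Capture the href value of every <a> tag
links = re.findall(r'<a\s+[^>]*href="([^"]*)"', html)

print(title)
print(links)
```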

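Addendum:

I have since come across another powerful parsing library, pyQuery. Because its API mirrors jQuery, people with a solid front-end background find it more convenient and quicker to pick up.

If you want to know more, see my next note.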

BeautifulSoup

Sample data

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

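Basic usage

# All the code that follows builds on these two code blocks

soup = BeautifulSoup(html, 'lxml')
# Pretty-print the document; the parser has already completed missing tags and tolerated errors
print(soup.prettify())
# Get the text of the title tag
print(soup.title.string)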

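Node selectors (tag)

Usage
Access a tag directly by its name
The result is a Tag object

Selecting elements

>>>print(soup.head)  # get the head tag of the HTML
<head><title>The Dormouse's story</title></head>

>>>print(soup.p)  # get the p tag of the HTML
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

As the examples show, selecting by tag name returns only the first matching tag.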

Getting the name

>>>print(soup.head.name)
head

Getting attributes

# Two ways
# Both use the first matching tag
>>>print(soup.p.attrs['name'])
>>>print(soup.p['name'])
dromouse
dromouse

Getting the content

>>>print(soup.head.string)
The Dormouse's story

Nested access

>>>print(soup.head.title)
<title>The Dormouse's story</title>

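Method selector find_all(name, attrs, recursive, text, **kwargs)
The selectors above are of limited use in a real crawler; the method selectors are used far more often.

The name parameter finds every tag with that name; it can also be a filter, a regular expression, a list, or True.
attrs takes the attributes to match as a dictionary, for example attrs={'id': '123'} for the common id attribute. Because class is a reserved word in Python, add a trailing underscore when querying by class: class_='element'. The result is a list of Tag objects.
The text parameter matches the text of a node; it accepts a string or a compiled regular expression (newer BeautifulSoup releases call this parameter string).
recursive controls the search depth: to search only direct children, pass recursive=False.

The limit parameter caps the number of results returned, much like the LIMIT keyword in SQL.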
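The parameters described above can be tried against the sample data from earlier. This is a minimal sketch, not a full tour of find_all: it uses the built-in 'html.parser' instead of 'lxml' so that it runs without an extra install (an assumption made for portability; parsers can differ on broken markup), and it passes string rather than text, which is the name newer BeautifulSoup releases use for the same filter.

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, "html.parser")

# name: every tag called "a"
all_links = soup.find_all("a")

# name as a list: tags called "title" or "b"
title_or_bold = soup.find_all(["title", "b"])

# attrs: match on an attribute dictionary
by_id = soup.find_all(attrs={"id": "link2"})

# class_: trailing underscore because class is reserved in Python
stories = soup.find_all("p", class_="story")

# string: exact text, or a compiled regular expression
exact = soup.find_all("a", string="Lacie")
by_regex = soup.find_all("a", string=re.compile("^Till"))

# recursive=False: only direct children of <html> are searched,
# and <title> sits one level deeper inside <head>
direct_titles = soup.html.find_all("title", recursive=False)

# limit: stop after the first two matches
first_two = soup.find_all("a", limit=2)

print([a["id"] for a in all_links])
print(len(stories), len(direct_titles), len(first_two))
```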

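# Example
# Fragment from inside a class method: self.target, self.server, self.names and self.urls are defined elsewhere
req = requests.get(url=self.target)
html = req.text
div_bf = BeautifulSoup(html, 'lxml')
div = div_bf.find_all('div', class_='listmain')
a_bf = BeautifulSoup(str(div[0]), 'lxml')
a = a_bf.find_all('a')
self.nums = len(a[15:])  # drop the unwanted leading chapters and count the rest
for each in a[15:]:
    self.names.append(each.string)
    self.urls.append(self.server + each.get('href'))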

Getting started with BeautifulSoup
 Official BeautifulSoup documentation (Chinese)

Problems encountered:

Calling BeautifulSoup(markup) produces a warning

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 4 of the file C:/Users/excalibur/PycharmProjects/learn/getMyIP.py. To get rid of this warning, change code that looks like this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

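Replacing BeautifulSoup(markup) with BeautifulSoup(markup, 'lxml') fixes it.

The markup parameter accepts either a string or a file handle.
In the static-site crawling exercises it is a string. Note that what BeautifulSoup returns is an object, so it has to be converted to a string before it can be passed as markup again.

By filtering with BeautifulSoup several times, you can reach any child tag under any tag.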
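Repeated filtering can be done in two ways, sketched here side by side: converting a matched tag to a string and parsing it again, or calling find_all on the Tag object itself, which Tag objects also support. The page fragment and its chapter links are invented for the example, and 'html.parser' is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a navigation block and a chapter list
page = """
<div class="nav"><a href="/home">Home</a></div>
<div class="listmain">
  <a href="/1_1094/1.html">Chapter 1</a>
  <a href="/1_1094/2.html">Chapter 2</a>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
listmain = soup.find_all("div", class_="listmain")[0]

# Way 1: turn the Tag into a string and parse it again
reparsed = BeautifulSoup(str(listmain), "html.parser").find_all("a")

# Way 2: search inside the Tag directly, with no second parse
direct = listmain.find_all("a")

hrefs_reparsed = [a.get("href") for a in reparsed]
hrefs_direct = [a.get("href") for a in direct]
print(hrefs_direct)
```

Both lists hold the same links here; the second way skips the extra parse.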

# For example
from bs4 import BeautifulSoup
import requests

if __name__ == "__main__":
    target = 'http://www.biqukan.com/1_1094/'
    req = requests.get(url=target)
    html = req.text
    div_bf = BeautifulSoup(html, 'lxml')  # name the parser explicitly to avoid the warning
    div = div_bf.find_all('div', class_='listmain')
    print(div[0])

Basics: downloading files

Small files
A small file can be fetched in one go and written straight to a file
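A minimal sketch of that idea follows. The function name download_small_file and its get parameter are hypothetical: get defaults to requests.get and exists only so the function can be exercised without network access. The whole body is held in memory as resp.content, which is why this suits small files only.

```python
import requests


def download_small_file(url, path, get=requests.get, timeout=10):
    """Fetch url in a single request and write the whole body to path.

    Returns the number of bytes written.
    """
    resp = get(url, timeout=timeout)
    resp.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    with open(path, "wb") as f:
        f.write(resp.content)
    return len(resp.content)
```

A call might look like download_small_file(image_url, 'picture.jpg'), where image_url is whatever address you want to save.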

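Large files

For large files such as images, holding the whole download in memory before saving puts pressure on memory. To avoid running out, write the download to disk in chunks.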

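import requests
from contextlib import closing

url = 'https://www.cayani.cn/usr/uploads/2018/10/2355989267.jpg'
filename = 'tour'

# stream=True defers downloading the body until it is iterated
# verify=False turns off TLS certificate verification; use it only for hosts you trust
with closing(requests.get(url=url, stream=True, verify=False)) as r:
  # 'wb' starts from an empty file, so a rerun does not append to an earlier download
  with open('%s.jpg' % filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
      if chunk:
        f.write(chunk)
        f.flush()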

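Notes

closing() wraps an object that has a close() method in a context manager, so that it can be used in a with statement

flush() writes the contents of the buffer out to the file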
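Both notes can be checked offline. In the sketch below, FakeStream is a hypothetical stand-in for a streamed response: it offers iter_content() and close() and nothing else. save_stream is likewise an illustrative name. The point is to show that closing() calls close() when the with block ends, and that the file is built up chunk by chunk.

```python
import os
import tempfile
from contextlib import closing


class FakeStream:
    """Hypothetical stand-in for a streamed response."""

    def __init__(self, data):
        self.data = data
        self.closed = False

    def iter_content(self, chunk_size=1024):
        # Yield the data in pieces of at most chunk_size bytes
        for i in range(0, len(self.data), chunk_size):
            yield self.data[i:i + chunk_size]

    def close(self):
        self.closed = True


def save_stream(stream, path, chunk_size=1024):
    """Write the stream to path chunk by chunk; return the bytes written."""
    written = 0
    with closing(stream) as r:  # close() is called when this block exits
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
                    written += len(chunk)
    return written


stream = FakeStream(b"x" * 2500)
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_stream(stream, path)
print(total, stream.closed)
```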

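Batch downloads

A batch download is just many small or large file downloads; once the steps are put in order, the big problem breaks down into small ones.

Step 1: read the page content
Step 2: extract the links and keep only the ones we need
Step 3: download the files we want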
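Step 2 can be tried without network access. The sketch below swaps in the standard library's HTMLParser for link extraction, purely so it has no third-party dependency; LinkCollector, filter_links, and the file names in the listing are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def filter_links(html, base_url, suffix):
    """Return absolute URLs for the links in html that end with suffix."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, h) for h in collector.hrefs if h.endswith(suffix)]


# Hypothetical directory listing
listing = (
    '<body><pre><a href="intro.mp4">intro.mp4</a>\n'
    '<a href="notes.txt">notes.txt</a>\n'
    '<a href="loops.mp4">loops.mp4</a></pre></body>'
)
base = "http://www-personal.umich.edu/~csev/books/py4inf/media/"
links = filter_links(listing, base, ".mp4")
print(links)
```

urljoin is used instead of plain string concatenation so that absolute or root-relative hrefs are also resolved correctly.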

import requests, sys
from pyquery import PyQuery as pq
from contextlib import closing

target = 'http://www-personal.umich.edu/~csev/books/py4inf/media/'

end = '.mp4'

def get_links(url):
  html = requests.get(url).text
  doc = pq(html)
  # Collect the links on the page
  links = doc('body > pre > a').items()
  # Keep only the links that end with the wanted suffix, as absolute URLs
  target_links = [url + str(link.attr('href')) for link in links if str(link.attr('href')).endswith(end)]
  return target_links

def download_mp4(download_url):
  file_name = download_url.split('/')[-1]
  print('Downloading file: %s' % file_name)

  with closing(requests.get(download_url, stream=True)) as r:
    content_size = int(r.headers['content-length'])
    chunk_process = 0
    with open(file_name, 'wb') as f:
      for chunk in r.iter_content(chunk_size=1024):
        if chunk:
          f.write(chunk)
          f.flush()
          chunk_process += len(chunk)
          # Show the download progress as a percentage
          sys.stdout.write("  Downloaded: %.3f%%" % (chunk_process / content_size * 100) + '\r')
          sys.stdout.flush()

  print('Download is done!')

if __name__ == '__main__':
  target_links = get_links(target)
  target_links = target_links[:2]  # only the first two files, to keep the exercise short
  print(target_links)
  for download_url in target_links:
    download_mp4(download_url)
  print('All mp4 files are downloaded!')
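The progress line depends on the Content-Length header, which a server is not obliged to send. One possible way to cope, sketched under that assumption, is a small formatter that falls back to a byte count when the total is unknown. The name format_progress is hypothetical.

```python
def format_progress(done, total):
    """Build a one-line progress message.

    done  -- bytes written so far
    total -- expected size in bytes, or None when Content-Length is missing
    """
    if not total:
        return "  downloaded: %d bytes" % done
    # Cap at 100 in case more bytes arrive than the header announced
    percent = min(done / total * 100, 100.0)
    return "  downloaded: %.1f%%" % percent


print(format_progress(512, 2048))
print(format_progress(10, None))
```

In the download loop, total could come from r.headers.get('content-length'), converted with int() when it is present.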

Thanks:
Downloading Files with Python
 Python3 Web Crawler Quick Start: A Hands-On Walkthrough
