BS4 Notes

Official documentation

1. Parsing HTML files

1.1 Parsing HTML fetched from the web with requests
# Fetch the HTML
import requests
from bs4 import BeautifulSoup

req = requests.get("https://www.baidu.com")
req.encoding = 'utf-8'  # set the encoding before reading req.text
soup = BeautifulSoup(req.text, 'lxml')
1.2 Parsing a local file
# Open directly (recommended)
soup = BeautifulSoup(open('sanguo.html', encoding='utf-8'), 'lxml')
# The argument to open() is the path to the local HTML file

# Open the file first, then parse its contents
path = 'baidu.html'
htmlfile = open(path, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
htmlfile.close()
soup = BeautifulSoup(htmlhandle, 'lxml')
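A with block closes the file automatically, which the one-line open() form above does not. The following is only a minimal sketch: the file name sample.html and its contents are made up for illustration, and it uses Python's built-in "html.parser" instead of lxml so that it runs without an extra dependency.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small HTML file to a temporary directory so the example is self-contained.
# The file name and its contents are hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Local file</title></head>"
            "<body><p>hello</p></body></html>")

# BeautifulSoup accepts an open file object; the with block closes it afterwards.
with open(path, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

title_text = soup.title.string
body_text = soup.p.get_text()
print(title_text, body_text)
```

Parsing happens inside the with block, so the soup object remains usable after the file has been closed.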
1.3 Parsing an HTML string created on the spot
soup = BeautifulSoup("<html>data</html>",'lxml')
1.4 The HTML used in these notes
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
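To show how this document is loaded into a soup object, here is a small sketch. It uses a trimmed copy of the HTML above (the id attributes are left out and the whitespace is compacted), and it assumes the built-in "html.parser" rather than lxml. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document used in these notes.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie">Elsie</a>,
<a class="sister" href="http://example.com/lacie">Lacie</a> and
<a class="sister" href="http://example.com/tillie">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# soup.<name> returns the first tag with that name.
title_text = soup.title.string
first_href = soup.a["href"]

# find_all() returns every matching tag.
hrefs = [a["href"] for a in soup.find_all("a")]
print(title_text, first_href, hrefs)
```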

2. Basic attributes in BeautifulSoup

2.1 Tag

A Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

2.2 Name

Every tag has a name that identifies it, available as .name.

tag.name
# 'b'
# The name can be changed directly
tag.name = "c"
tag
# <c class="boldest">Extremely bold</c>
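Renaming a tag changes the parsed tree itself, not just the Tag object in hand. A minimal sketch, assuming the built-in "html.parser" (which does not wrap the fragment in html and body tags) and an arbitrary new name, blockquote:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
old_name = tag.name

# Assigning to .name renames the tag in place.
tag.name = "blockquote"

# The change shows up when the whole soup is rendered,
# and the old name no longer matches anything.
rendered = str(soup)
old_lookup = soup.b
new_lookup = soup.blockquote
print(old_name, rendered, old_lookup)
```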

2.3 Attributes

A tag may have any number of attributes. They are read and modified the same way as entries in a Python dictionary.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
soup2 = BeautifulSoup('<p class="body strikeout" id="my id"></p>', 'lxml')
tag = soup.b
tag2 = soup2.p

# Reading values (whether an attribute is multi-valued is defined by the HTML specification)
tag2['class']  # HTML 4 defines class as a multi-valued attribute
# ['body', 'strikeout']  # returned as a list
tag2['id']  # no version of HTML defines id as a multi-valued attribute
# 'my id'  # returned as a string

# Adding a value
tag['id'] = 1
# <b class="boldest" id="1">Extremely bold</b>

# Changing a value
tag['class'] = 'verybold'
# <b class="verybold" id="1">Extremely bold</b>
# A multi-valued attribute accepts a list
tag['class'] = ['2', '3']
# <b class="2 3" id="1">Extremely bold</b>
# A list assigned to a single-valued attribute is joined into one string on output
tag['id'] = ['2', '3']
# <b class="2 3" id="2 3">Extremely bold</b>

# Deleting a value
del tag['id']
# <b class="2 3">Extremely bold</b>

# Reading all attributes through .attrs
tag.attrs
# {'class': ['2', '3']}
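The dictionary-style access described above can be put together in one short, runnable sketch. It assumes the built-in "html.parser"; the title attribute and its value are made up for illustration. It also shows .get(), which returns None for a missing attribute where square brackets would raise KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]      # multi-valued attribute: a list of values
id_value = p["id"]        # single-valued attribute: one string, space included
missing = p.get("title")  # .get() avoids KeyError for absent attributes

# Add one attribute and delete another, exactly as with a dict.
p["title"] = "note"
del p["id"]

attrs = dict(p.attrs)
rendered = str(p)
print(classes, id_value, missing, attrs, rendered)
```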