【Python】第三章 网页数据的解析提取
1. XPath 的使用
XPath(XML Path Language)
html 文档也适用
XML 路径语言
用于在 XML 文档中查找信息
1.1 XPath 常用规则
表达式 | 描述 |
---|---|
nodename | 选取此节点的所有子节点 |
/ | 从当前节点选取直接子节点 |
// | 从当前节点选取子孙节点 |
. | 选取当前节点 |
… | 选取当前节点的父节点 |
@ | 选取属性 |
示例
//title[@lang="eng"]1
1.2 安装
pip3 install lxml1
1.3 实例引入
from lxml import etree text = """ <div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div> """ html = etree.HTML(text) result = etree.tostring(html) print(result.decode('utf-8'))
etree 模块可以自动修正 HTML 文本
直接读取文本
test.html
<div> <ul> <li class="item-0"><a href="link1.html">first item</a></li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-inactive"><a href="link3.html">third item</a></li> <li class="item-1"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a> </ul> </div>
test.py
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = etree.tostring(html) print(result.decode('utf-8'))
1.4 所有节点
获取所有节点(//*)
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//*") print(result)
获取所有 li 节点(//li)
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li") print(result)
使用 // + name 来获取所有名称为 name 的节点
1.5 子节点
获取 li 节点的直接子节点 a
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li/a") print(result)
获取 ul 节点的子孙节点 a
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//ul//a") print(result)
1.6 父节点
获取 href 属性为 link4.html 的 a 节点,然后获取其父节点,再获取父节点的 class 属性
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//a[@href=\"link4.html\"]/../@class") print(result)
通过 parent:: 获取父节点
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//a[@href=\"link4.html\"]/parent::*/@class") print(result)
1.7 属性匹配
使用 @ 过滤
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li[@class=\"item-0\"]") print(result)
1.8 文本获取
获取 class 属性为 item-0 的 li 节点,并获取其直接子节点的文本
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li[@class=\"item-0\"]/text()") print(result)
获取 class 属性为 item-0 的 li 节点,并获取其内部的文本
先选取 a 节点再获取文本
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li[@class=\"item-0\"]/a/text()") print(result)
使用 //
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li[@class=\"item-0\"]//text()") print(result)
获取子孙节点下的所有文本://
获取特定子孙节点下的所有文本:/特定节点/
1.9 属性获取
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) result = html.xpath("//li/a/@href") print(result)
1.10 属性多值匹配
li 属性有两个值
li
li-first
使用之前的属性匹配将无法获取
from lxml import etree text = """ <li class="li li-first"><a href="link.html">first item</a></li> """ html = etree.HTML(text) result = html.xpath("//li[@class=\"li\"]/a/text()") print(result)
使用 contains 进行属性的多值匹配
from lxml import etree text = """ <li class="li li-first"><a href="link.html">first item</a></li> """ html = etree.HTML(text) result = html.xpath("//li[contains(@class, \"li\")]/a/text()") print(result)
1.11 多属性匹配
from lxml import etree text = """ <li class="li li-first" name="item"><a href="link.html">first item</a></li> """ html = etree.HTML(text) result = html.xpath( "//li[contains(@class, \"li\") and @name=\"item\"]/a/text()") print(result)
运算符
运算符 | 描述 |
---|---|
or | 或 |
and | 与 |
mod | 求余 |
| | 求两个节点集 |
+ | 加 |
- | 减 |
* | 乘 |
div | 除 |
= | 等于 |
!= | 不等于 |
< | 小于 |
<= | 小于等于 |
> | 大于 |
>= | 大于等于 |
1.12 按序选择
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) # 选取第一个节点 result = html.xpath("//li[1]/a/text()") print(result) # 选取最后一个节点 result = html.xpath("//li[last()]/a/text()") print(result) # 获取限定位置的节点 result = html.xpath("//li[position() < 3]/a/text()") print(result) # 获取倒数第三个节点 result = html.xpath("//li[last() - 2]/a/text()") print(result)
1.13 节点轴选择
from lxml import etree html = etree.parse("./test.html", etree.HTMLParser()) # ancestor轴: 获取所有祖先节点 result = html.xpath("//li[1]/ancestor::*") print(result) result = html.xpath("//li[1]/ancestor::div") print(result) # attribute轴: 获取所有属性值 result = html.xpath("//li[1]/attribute::*") print(result) # child轴: 获取所有直接子节点 result = html.xpath("//li[1]/child::a[@href=\"link1.html\"]") print(result) # descendant轴: 获取所有子孙节点 result = html.xpath("//li[1]/descendant::span") print(result) # following轴: 获取当前节点之后的所有节点 result = html.xpath("//li[1]/following::*[2]") print(result) # following-sibling轴: 获取当前节点之后的所有同级节点 result = html.xpath("//li[1]/following-sibling::*") print(result)
2. Beautiful Soup 的使用
借助网页的结构和属性等特性来解析网页
2.1 解析器
Beautiful Soup 在解析时是依赖解析器的
Beautiful Soup 支持的解析器
解析器 | 使用方法 | 优势 | 劣势 |
---|---|---|---|
Python 标准库 | BeautifulSoup(markup, “html.parser”) | Python 的内置标准库、执行速度适中、文档容错能力强 | Python 2.7.3 或 Python 3.2.2 前的版本中文容错能力差 |
LXML HTML 解析器 | BeautifulSoup(markup, “lxml”) | 速度快、文档容错能力强 | 需要安装 C 语言库 |
LXML XML 解析器 | BeautifulSoup(markup, “xml”) | 速度快、唯一支持 XML 的解析器 | 需要安装 C 语言库 |
html5lib | BeautifulSoup(markup, “html5lib”) | 提供最好的容错性、以浏览器的方式解析文档、生成HTML5格式的文档 | 速度慢、不依赖外部扩展 |
2.2 安装
pip3 install beautifulsoup4 pip3 install lxml
2.3 基本使用
from bs4 import BeautifulSoup html = """ <html> <head> <title>Test</title> </head> <body> <p class="title" name="name"><b>Test</b></p> <p class="story">A B C <a href="http://example.com/A" class="abc" id="link1"><!--A--></a>, <a href="http://example.com/B" class="abc" id="link2">B</a> and <a href="http://example.com/C" class="abc" id="link3">C </a>; </p> <p class="story">···</p> """ soup = BeautifulSoup(html, "lxml") # 以标准的缩进格式输出 print(soup.prettify()) print(soup.title.string)
Beautiful Soup 在初始化时会自动修正格式
2.4 节点选择器
from bs4 import BeautifulSoup html = """ <html> <head><title>Test</title></head> <body> <p class="title" name="name"><b>Test</b></p> <p class="story">A B C <a href="http://example.com/A" class="abc" id="link1"><!--A--></a>, <a href="http://example.com/B" class="abc" id="link2">B</a> and <a href="http://example.com/C" class="abc" id="link3">C </a>; </p> <p class="story">···</p> """ soup = BeautifulSoup(html, "lxml") print(soup.title) print(type(soup.title)) print(soup.title.string) print(soup.head) print(soup.p)
此种选择方式只能选择到第一个匹配的节点,后面的其他节点都会忽略
2.5 提取信息
获取节点名称(name)
print(soup.title.name)
获取属性(attrs)
print(soup.p.attrs)print(soup.p.attrs["name"])
获取内容(string)
print(soup.p.string)
嵌套选择
print(soup.head.title)print(soup.head.title.string)
2.6 关联选择
子节点和子孙节点
直接子节点
contents
from bs4 import BeautifulSoup html = """ <html> <head><title>Test</title></head> <body> <p class="story">A B C <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>, <a href="http://example.com/B" class="abc" id="link2">B</a> and <a href="http://example.com/C" class="abc" id="link3">C </a>; </p> <p class="story">···</p> """ soup = BeautifulSoup(html, "lxml") print(soup.p.contents)
children
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.p.children)
for i, child in enumerate(soup.p.children):
print(i, child)
子孙节点(descendants)
from bs4 import BeautifulSoup soup = BeautifulSoup(html, "lxml") print(soup.p.descendants) for i, child in enumerate(soup.p.descendants): print(i, child)
父节点和祖先节点
父节点(parent)
from bs4 import BeautifulSoup html = """ <html> <body> <p class="story"> <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>, <a href="http://example.com/B" class="abc" id="link2">B</a> and <a href="http://example.com/C" class="abc" id="link3">C </a>; </p> """ soup = BeautifulSoup(html, "lxml") print(soup.a.parent)
祖先节点(parents)
from bs4 import BeautifulSoup html = """ <html> <body> <p class="story"> <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>, <a href="http://example.com/B" class="abc" id="link2">B</a> and <a href="http://example.com/C" class="abc" id="link3">C </a>; </p> """ soup = BeautifulSoup(html, "lxml") print(type(soup.a.parents)) print(list(enumerate(soup.a.parents)))
兄弟节点
from bs4 import BeautifulSoup html = """ <html> <body> <p class="story"> Test <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a> ABC <a href="http://example.com/B" class="abc" id="link2">B</a> abc <a href="http://example.com/C" class="abc" id="link3">C </a> 123 </p> """ soup = BeautifulSoup(html, "lxml") print("next sibling", soup.a.next_sibling) print("prev sibling", soup.a.previous_sibling) print("next siblings", list(enumerate(soup.a.next_siblings))) print("prev siblings", list(enumerate(soup.a.previous_siblings)))
提取信息
from bs4 import BeautifulSoup html = """ <html> <body> <p class="story"> Test <a href="http://example.com/A" class="abc" id="link1"><span>A</span></a><a href="http://example.com/B" class="abc" id="link2">B</a><a href="http://example.com/C" class="abc" id="link3">C </a> </p> """ soup = BeautifulSoup(html, "lxml") print(soup.a.next_sibling) print(soup.a.next_sibling.string) print(soup.a.parents) print(list(soup.a.parents)[0]) print(list(soup.a.parents)[0].attrs["class"])
2.7 方法选择器
find_all
findall(name, attrs, recursive, text, **kwargs)
name
from bs4 import BeautifulSoup html = """ <div class="panel"> <div class="panel-heading"> <h4> Hello </h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">A</li> <li class="element">B</li> <li class="element">C</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">a</li> <li class="element">b</li> <li class="element">c</li> </ul> </div> </div> """ soup = BeautifulSoup(html, "lxml") print(soup.find_all(name="ul")) print(type(soup.find_all(name="ul")[0])) for ul in soup.find_all(name="ul"): print(ul.find_all(name="li")) for li in ul.find_all(name="li"): print(li.string)
attrs
from bs4 import BeautifulSoup html = """ <div class="panel"> <div class="panel-heading"> <h4> Hello </h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">A</li> <li class="element">B</li> <li class="element">C</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">a</li> <li class="element">b</li> <li class="element">c</li> </ul> </div> </div> """ soup = BeautifulSoup(html, "lxml") print(soup.find_all(attrs={"id": "list-1"})) print(soup.find_all(attrs={"class": "element"})) # 等效于 print(soup.find_all(id="list-1")) print(soup.find_all(class_="element"))
text
该参数应为正则表达式对象
返回结果是由所有与该正则表达式相匹配的节点文本组成的列表
import re from bs4 import BeautifulSoup html = """ <div class="panel"> <div class="panel-body"> <a>link1</a> <a>link2</a> </div> </div> """ soup = BeautifulSoup(html, "lxml") print(soup.find_all(text=re.compile("link")))
find
只能获取第一个匹配的节点元素
from bs4 import BeautifulSoup html = """ <div class="panel"> <div class="panel-heading"> <h4> Hello </h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">A</li> <li class="element">B</li> <li class="element">C</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">a</li> <li class="element">b</li> <li class="element">c</li> </ul> </div> </div> """ soup = BeautifulSoup(html, "lxml") print(soup.find(name="ul")) print(type(soup.find(name="ul"))) print(soup.find(class_="list"))
find_parents
获取所有的祖先节点
find_parent
获取直接父节点
find_next_siblings
获取后面的所有兄弟节点
find_next_sibling
获取后面的第一个兄弟节点
find_previous_siblings
获取前面的所有兄弟节点
find_previous_sibling
获取前面第一个兄弟节点
find_all_next
获取节点后面所有符合条件的节点
find_next
获取节点后面第一个符合条件的节点
find_all_previous
获取节点前面所有符合条件的节点
find_previous
获取节点前面第一个符合条件的节点
2.8 CSS 选择器
实例
select(节点1,节点2,节点3 ···)
获取所有 节点1 下的所有 节点2 下的所有 节点3 ···
节点可以换为
.class名
#id名
from bs4 import BeautifulSoup html = """ <div class="panel"> <div class="panel-heading"> <h4>Hello</h4> </div> <div class="panel-body"> <ul class="list" id="list-1"> <li class="element">A</li> <li class="element">B</li> <li class="element">C</li> </ul> <ul class="list list-small" id="list-2"> <li class="element">a</li> <li class="element">b</li> <li class="element">c</li> </ul> </div> </div> """ soup = BeautifulSoup(html, "lxml") print(soup.select(".panel .panel-body .list")) print(soup.select("ul li")) print(soup.select("#list-2 .element")) print(type(soup.select("ul")[0]))
嵌套选择
soup = BeautifulSoup(html, "lxml") for ul in soup.select("ul"): print(ul.select("li"))
获取属性
方括号获取
.attrs[]
soup = BeautifulSoup(html, "lxml") for ul in soup.select("ul"): print(ul["id"]) # 等效于 print(ul.attrs["id"])
获取文本
.get_text()
.string
soup = BeautifulSoup(html, "lxml") for li in soup.select("li"): print(f"text: {li.get_text()}") # 等效于 print(f"string: {li.string}")
3. pyquery 的使用
3.1 安装
pip3 install pyquery
3.2 初始化
字符串初始化
from pyquery import PyQuery html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ doc = PyQuery(html) print(doc("li"))
URL初始化
from pyquery import PyQuery doc = PyQuery(url="https://www.bilibili.com/") # 等效于 doc = PyQuery(requests.get(url).text) print(doc("title"))
文件初始化
from pyquery import PyQuery doc = PyQuery(filename="test.html") print(doc("li"))
3.3 基本CSS选择器
from pyquery import PyQuery html = """ <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ doc = PyQuery(html) print(doc("#container .list li")) print(type(doc("#container .list li"))) for item in doc("#container .list li").items(): print(item.text())
3.4 查找节点
直接子节点
from pyquery import PyQuery doc = PyQuery(html) items = doc(".list") lis = items.children() print(type(lis)) print(lis)
子孙节点
from pyquery import PyQuery doc = PyQuery(html) items = doc(".list") print(type(items)) print(items) lis = items.find("li") print(type(lis)) print(lis)
直接父节点
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) items = doc(".list") container = items.parent() print(type(container)) print(container)
祖先节点
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) items = doc(".list") parents = items.parents() print(type(parents)) print(parents)
兄弟节点(有问题)
输出优先级
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) lis = doc(".list .item-0.active") for item in lis.siblings().items(): print(item.text())
3.5 遍历节点
获取单个节点
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) li = doc(".item-0.active") print(li) print(str(li))
获取多个节点
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) lis = doc("li").items() print(type(lis)) for li in lis: print(li, type(li))
获取信息
信息
属性
文本
获取属性(attr())
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) a = doc(".item-0.active a") print(a, type(a)) print(a.attr("href")) # 等效于 print(a.attr.href) a = doc("a") print(a, type(a)) print(a.attr("href")) for item in a.items(): print(item.attr("href"))
获取文本
获取内部文本(.text())
贪婪
如果拥有多个文本内容均会输出,且以空格分隔
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) a = doc(".item-0.active a") print(a) print(a.text())
获取内部的HTML文本(.html())
非贪婪
只会输出第一个匹配的文本
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) li = doc(".item-0.active") print(li) print(li.html())
.html()的相关误区
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) li = doc(".item-0.active") a = doc(".item-0.active a") # 输出内容相同,但内容并不相等 print(li.html()) print(a) # 类型为str print(type(li.html())) # 类型为pyquery.pyquery.PyQuery print(type(a))
3.6 节点操作
addClass和removeClass
动态改变节点的Class属性
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) li = doc(".item-0.active") print(li) li.remove_class("active") print(li) li.add_class("active") print(li)
attr、text和html
from pyquery import PyQuery html = """ <ul class="warp"> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> </ul> """ doc = PyQuery(html) li = doc(".item-0.active") print(li) li.attr("name", "link") print(li) li.text("changed item") print(li) li.html("<span>changed item</span>") print(li)
remove
from pyquery import PyQuery html = """ <ul class="warp"> ABC <p>abc<p> </ul> """ doc = PyQuery(html) li = doc(".item-0.active") warp = doc(".warp") warp.find("p").remove() print(warp.text())
3.7 伪类选择器
from pyquery import PyQuery html = """ <div class="wrap"> <div id="container"> <ul class="list"> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> </div> """ doc = PyQuery(html) # 第一个li节点 li = doc("li:first-child") print(li) # 最后一个li节点 li = doc("li:last-child") print(li) # 第二个li节点 li = doc("li:nth-child(2)") print(li) # 下标大于2的节点 li = doc("li:gt(2)") print(li) # 以2为倍数位置的节点 li = doc("li:nth-child(2n)") print(li) # 包含second文本的节点 li = doc("li:contains(second)") print(li)
4. parsel的使用
可以接卸HTML和XML
支持使用XPath和CSS选择器对内容进行提取和修改
融合了正则表达式的提取功能
4.1 安装
pip3 install parsel
4.2 初始化
from parsel import Selector html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ selector = Selector(text=html) items1 = selector.css(".item-0") print(len(items1), type(items1), items1) items2 = selector.xpath("//li[contains(@class, 'item-0')]") print(len(items2), type(items2), items2)
.css()方法提取的结果是xpath属性而不是css属性
由底层cssselect库实现
CSS选择器被转化成XPath
真正用于节点提取的是XPath
4.3 提取文本
提取单个文本(get())
只能获取第一个匹配的对象
from parsel import Selector html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ selector = Selector(text=html) items = selector.css(".item-0") print(f"type of items: {type(items)}") for item in items: print(f"type of item: {type(item)}") text = item.xpath(".//text()").get() print(text)
提取多个文本(getall())
from parsel import Selector html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ selector = Selector(text=html) result = selector.xpath("//li[contains(@class, \"item-0\")]//text()").getall() print(result) result = selector.css(".item-0 *::text").getall() print(result)
4.4 提取属性
from parsel import Selector html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ selector = Selector(text=html) result = selector.css(".item-0.active a::attr(href)").get() print(result) result = selector.xpath("//li[contains(@class, \"item-0\") and contains(@class, \"active\")]/a/@href").get() print(result)
4.5 正则提取
from parsel import Selector html = """ <div> <ul> <li class="item-0">first item</li> <li class="item-1"><a href="link2.html">second item</a></li> <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li> <li class="item-1 active"><a href="link4.html">fourth item</a></li> <li class="item-0"><a href="link5.html">fifth item</a></li> </ul> </div> """ selector = Selector(text=html) result = selector.css(".item-0").re("link(.*?)\"") print(result) result = selector.css(".item-0 *::text").re(".*item") print(result) # 获取第一个符合 result = selector.css(".item-0").re_first(">(.*?item)") print(result)
猜你喜欢
- 【Python】Python中的装饰器和上下文管理器的原理和使用场景是什么?
- Python中的装饰器和上下文管理器是两个非常有用的特性,它们可以帮助我们更好地组织和管理代码,并提高代码的可复用性。本文将分别介绍装饰器和上下文管理器的原理和使用场景,并给出具体的代码示例。一、装饰器的原理和使用场景原理:装饰器是一种在不改变原函数定义的情况下,为函数添加额外功能的方式。它实际上是一个函数,接受被装饰的函数作为输入,并返回包装后的函数。装饰器通过在被装饰函数的前后添加代码,来实现一些额外的功能,比如日志记录、性能分析、权限控制等。使用场景:装饰器适用于以下场景:日志记录:通过在
- 【Python】学习使用matplotlib绘制不同类型图表的示例
- 使用Matplotlib绘制各类图表的实例学习引言:在数据分析和数据可视化领域,Matplotlib是一个非常强大的Python库。它提供了各种类型的图表和绘图功能,可以帮助我们更好地理解和呈现数据。本文将通过实例学习如何使用Matplotlib绘制各类图表,并提供相应的代码示例。一、折线图(Line Plot):折线图是一种常见的数据可视化方式,用于显示随时间或其他连续变量的数据趋势。下面是一个绘制折线图的简单示例:import matplotlib.pyplot as&n
- 【Python】Python中的内存管理的原理是什么?
- Python中的内存管理的原理是什么?Python是一种高级的、动态类型的编程语言,具有自动垃圾回收功能。Python内存管理的原理基于引用计数机制和垃圾回收机制。引用计数机制是Python内存管理的基础。每个对象都会有一个引用计数器,用于记录对象被引用的次数。当一个对象被创建时,它的引用计数器被初始化为1。当一个对象被引用时,它的引用计数器就增加1。相反,当一个对象的引用失效时,它的引用计数器就减少1。当一个对象的引用计数器变为0时,说明该对象没有被引用,Python会自动将其回收,释放内存。
- 【Python】第五章 Ajax数据爬取
- 目录1. 什么是Ajax发送请求解析内容渲染网页1.1 实例引入1.2 基本原理2. Ajax分析方法2.1 分析案例2.2 过滤请求3. Ajax分析与爬取实战分析实现合并爬取详情页串联调用分析实现合并基础配置爬取页面内容(获取页面的JSON内容)爬取列表页(爬取指定列表页)3.1 爬取目标3.2 初步探索3.3 爬取列表页3.4 爬取详情页3.5 保存数据(MongoDB)(后期补充)使用requests获取的是原始HTML文档浏览器中的页面是JavaScript处理数据后生成的结果数据的来
- 【Python】快速入门Flask框架:构建简单而灵活的Web应用
- 快速入门Flask框架:构建简单而灵活的Web应用Flask是一个基于Python编程语言的轻量级Web应用框架。它简单而灵活,使得开发者可以快速构建Web应用。Flask提供了核心功能,同时也是一个扩展性强大的框架,通过插件可以实现更多的功能。本篇文章将介绍Flask框架的快速入门,并通过具体的代码示例让读者更加深入理解。一、Flask的安装与环境配置首先,我们需要安装Flask。使用pip命令可以方便地进行安装,打开命令行窗口并输入以下命令:pip install fla
- 【Python】Pandas轻松读取SQL数据库中的数据
- 数据处理利器:Pandas读取SQL数据库中的数据,需要具体代码示例随着数据量的不断增长和复杂性的提高,数据处理成为了现代社会中一个重要的环节。在数据处理过程中,Pandas成为了许多数据分析师和科学家们的首选工具之一。本文将介绍如何使用Pandas库来读取SQL数据库中的数据,并提供一些具体的代码示例。Pandas是基于Python的一个强大的数据处理和分析工具。它提供了丰富的数据结构,如Series和DataFrame,以及各种各样的功能,例如数据清洗、过滤、统计、可视化等。同时,Panda
- 【Python】如何在系统中安装pandas库
- 快速入门:Python安装pandas库的方法,需要具体代码示例一、概述Python是一种广泛使用的编程语言,它拥有强大的开发生态系统,其中包括许多实用的库。而pandas是其中一款非常受欢迎的数据分析库,它提供了高效的数据结构和数据分析工具,使得数据处理和分析变得更加简单。本文将介绍如何在Python中安装pandas库,并提供相应的代码示例。二、安装Python在安装pandas库之前,首先需要安装Python。Python官方网站提供了最新版本的Python的安装包,可以根据自己的操作系统
- 【Python】用matplotlib实现数据集散点图的实际应用
- 实战演练:利用Matplotlib绘制数据集的散点图Matplotlib是Python中常用的绘图库之一,它提供了丰富的功能,可以绘制各种类型的图表。其中,散点图是一种常用的数据可视化方式,用于展示两个变量之间的关系。本文将介绍如何利用Matplotlib绘制数据集的散点图,并附上具体的代码示例。首先,我们需要安装Matplotlib库。可以使用pip命令执行以下语句安装:pip install matplotlib安装完成后,我们可以导入Matplotlib库并开始绘制散点