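[Python] Chapter 3: Parsing and Extracting Web Page Data
1. Using XPath
XPath (XML Path Language) is a language for locating information in XML documents.
It works equally well on HTML documents.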
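1.1 Common XPath rules
Expression | Description |
---|---|
nodename | Selects all child nodes of the named node |
/ | Selects direct children of the current node |
// | Selects descendants of the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
Example
//title[@lang="eng"]
This selects every title node whose lang attribute equals eng.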
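The rules in the table can be tried without installing anything. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of lxml. The bookstore document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
xml_text = """
<bookstore>
    <book><title lang="eng">Harry Potter</title></book>
    <book><title lang="fr">Le Petit Prince</title></book>
</bookstore>
"""
root = ET.fromstring(xml_text)

# //title[@lang="eng"]: any title node whose lang attribute is "eng".
# ElementTree needs a leading "." so the search starts at the root element.
eng_titles = root.findall(".//title[@lang='eng']")
eng_texts = [t.text for t in eng_titles]
print(eng_texts)

# "..": step from each matching title up to its parent node.
parents = root.findall(".//title[@lang='eng']/..")
parent_tags = [p.tag for p in parents]
print(parent_tags)

# "@": ElementTree reads attribute values with .get() rather than @lang.
langs = [t.get("lang") for t in root.findall(".//title")]
print(langs)
```

With lxml the same expressions can be passed to xpath() unchanged, including forms such as //title/@lang that ElementTree does not accept.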
1.2 Installation
pip3 install lxml
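1.3 A first example
from lxml import etree

text = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
"""
# etree.HTML parses the string into a tree (note the unclosed last li)
html = etree.HTML(text)
# serialize the repaired tree back to bytes, then decode for printing
result = etree.tostring(html)
print(result.decode('utf-8'))
The etree module automatically repairs malformed HTML: here it closes the last li node and adds the missing html and body nodes.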
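To see the repair in isolation, here is a minimal sketch on a deliberately broken fragment. The fragment is made up for this example; encoding="unicode" asks tostring for a str instead of bytes.

```python
from lxml import etree

# Hypothetical fragment: the <li> is never closed and <html>/<body> are missing.
broken = '<ul><li class="item-0"><a href="link1.html">first item</a></ul>'

root = etree.HTML(broken)
fixed = etree.tostring(root, encoding="unicode")

print(root.tag)  # the parser wraps the fragment in an html root
print(fixed)     # the serialized tree contains the closing </li>
```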
Reading directly from a file
test.html
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
test.py
from lxml import etree

# etree.parse reads the file; HTMLParser makes the parser tolerate broken HTML
html = etree.parse("./test.html", etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
1.4 All nodes
Get all nodes (//*)
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//*")
print(result)
Get all li nodes (//li)
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li")
print(result)
Use // followed by a node name to get every node with that name.
1.5 Child nodes
Get the a nodes that are direct children of li nodes
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//li/a")
print(result)
Get the a nodes that are descendants of the ul node
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath("//ul//a")
print(result)
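The difference between / and // shows up when the target is not a direct child. A small sketch on a made-up list, where a sits inside li rather than directly inside ul:

```python
from lxml import etree

# Hypothetical sample: each a is a grandchild of ul, not a direct child.
sample = """
<ul>
    <li><a href="link1.html">first item</a></li>
    <li><a href="link2.html">second item</a></li>
</ul>
"""
html = etree.HTML(sample)

direct = html.xpath("//ul/a")    # a must be a direct child of ul
nested = html.xpath("//ul//a")   # a may sit at any depth below ul
print(len(direct), len(nested))
```

The first query comes back empty because no a is a direct child of ul, while the second finds both links.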
1.6 Parent nodes
Select the a node whose href attribute is link4.html, move to its parent, then read the parent's class attribute
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
Get the parent through the parent:: axis
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
1.7 Attribute matching
Filter with @
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)
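1.8 Getting text
Select the li nodes whose class attribute is item-0 and read the text that is a direct child of li
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
# the link text sits inside the a nodes, so this yields at most stray whitespace
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
Select the li nodes whose class attribute is item-0 and read the text inside them
Select the a node first, then read its text
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
Use //
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)
To get all text under every descendant, use //text().
To get the text under one specific child, select that node first, for example /a/text().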
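The three forms can be compared side by side. In this sketch the li node (made up for illustration) holds both a link and some text of its own, so each query returns something different:

```python
from lxml import etree

# Hypothetical sample: li contains an a node followed by its own text.
sample = '<ul><li class="item-0"><a href="link1.html">first item</a> (new)</li></ul>'
html = etree.HTML(sample)

own_text = html.xpath('//li[@class="item-0"]/text()')     # text directly inside li
link_text = html.xpath('//li[@class="item-0"]/a/text()')  # text inside the a child
all_text = html.xpath('//li[@class="item-0"]//text()')    # text at any depth

print(own_text)
print(link_text)
print(all_text)
```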
1.9 Getting attributes
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
# @href returns the attribute values themselves, not the nodes
result = html.xpath("//li/a/@href")
print(result)
1.10 Matching multi-valued attributes
Here the class attribute of li has two values: li and li-first.
The exact attribute match used earlier no longer finds the node:
from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
# exact comparison against "li" fails because the value is "li li-first"
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
Use contains to match one value of a multi-valued attribute
from lxml import etree

text = """
<li class="li li-first"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
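One caveat: contains does a plain substring test, so contains(@class, "li") also matches a class such as list-item. A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below contrasts the two on made-up markup; it is one possible workaround, not something the original text prescribes.

```python
from lxml import etree

# Hypothetical sample: the second li has a class that merely contains "li".
sample = """
<ul>
    <li class="li li-first"><a href="link.html">first item</a></li>
    <li class="list-item"><a href="other.html">other item</a></li>
</ul>
"""
html = etree.HTML(sample)

# Substring test: matches both nodes.
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: pad with spaces so only the whole word "li" matches.
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```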
1.11 Matching on several attributes
from lxml import etree

text = """
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
"""
html = etree.HTML(text)
# both conditions must hold: class contains "li" and name equals "item"
result = html.xpath(
    '//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
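Operators
Operator | Description |
---|---|
or | Logical or |
and | Logical and |
mod | Remainder of a division |
\| | Union of two node sets |
+ | Addition |
- | Subtraction |
* | Multiplication |
div | Division |
= | Equal to |
!= | Not equal to |
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |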
1.12 Selecting by position
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
# select the first node
result = html.xpath("//li[1]/a/text()")
print(result)
# select the last node
result = html.xpath("//li[last()]/a/text()")
print(result)
# select nodes by position (here the first two)
result = html.xpath("//li[position() < 3]/a/text()")
print(result)
# select the third node from the end
result = html.xpath("//li[last() - 2]/a/text()")
print(result)
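Position predicates count among siblings, not across the whole document. With a single list this makes no difference, but with two lists //li[1] returns the first li of each one. A sketch on made-up markup; wrapping the path in parentheses applies the index to the combined result instead:

```python
from lxml import etree

# Hypothetical sample with two separate lists.
sample = """
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
"""
html = etree.HTML(sample)

first_in_each = html.xpath("//li[1]/text()")       # first li under each parent
first_overall = html.xpath("(//li)[1]/text()")     # first li in the document
last_overall = html.xpath("(//li)[last()]/text()")  # last li in the document

print(first_in_each)
print(first_overall)
print(last_overall)
```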
1.13 Selecting along node axes
from lxml import etree

html = etree.parse("./test.html", etree.HTMLParser())
# ancestor axis: all ancestor nodes
result = html.xpath("//li[1]/ancestor::*")
print(result)
# ancestor axis with a node test: only the div ancestor
result = html.xpath("//li[1]/ancestor::div")
print(result)
# attribute axis: all attribute values
result = html.xpath("//li[1]/attribute::*")
print(result)
# child axis: direct children, here the a node whose href is link1.html
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
# descendant axis: descendants, here restricted to span nodes
# (test.html contains no span, so the result is empty)
result = html.xpath("//li[1]/descendant::span")
print(result)
# following axis: nodes after the current node; [2] keeps only the second one
result = html.xpath("//li[1]/following::*[2]")
print(result)
# following-sibling axis: all later siblings of the current node
result = html.xpath("//li[1]/following-sibling::*")
print(result)
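2. Using Beautiful Soup
Beautiful Soup parses a page by relying on its structure and attributes.
2.1 Parsers
Beautiful Soup depends on an underlying parser to do the parsing.
Parsers supported by Beautiful Soup
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or Python 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library |
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces HTML5 documents | Slow; requires an external Python package |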
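Because each parser repairs broken markup in its own way, the same input can produce different trees. A short sketch comparing two of the parsers from the table on an invalid fragment, assuming both beautifulsoup4 and lxml are installed:

```python
from bs4 import BeautifulSoup

# Invalid markup: a stray </p> with no matching <p>.
markup = "<a></p>"

builtin = str(BeautifulSoup(markup, "html.parser"))
with_lxml = str(BeautifulSoup(markup, "lxml"))

print(builtin)    # html.parser drops the stray tag and adds nothing
print(with_lxml)  # lxml also wraps the result in html and body
```

This is why it helps to name the parser explicitly: results then stay the same across machines that have different parsers installed.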
2.2 Installation
pip3 install beautifulsoup4
pip3 install lxml
2.3 Basic usage
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>Test</title>
</head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
# print the tree with standard indentation
print(soup.prettify())
print(soup.title.string)
Beautiful Soup repairs the markup automatically when the object is created; here it adds the missing closing body and html tags.
2.4 Node selectors
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Test</title></head>
<body>
<p class="title" name="name"><b>Test</b></p>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><!--A--></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)
Selecting by node name this way returns only the first matching node; all later matches are ignored.
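A quick sketch of that limitation, on made-up markup and using the built-in html.parser so that nothing beyond beautifulsoup4 is required. find_all is the usual way to reach the remaining matches.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with three p nodes.
html = '<p class="title">one</p><p class="story">two</p><p class="story">three</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute-style access: first match only
every = soup.find_all("p")   # every match, as a list

first_text = first.string
every_text = [p.string for p in every]
print(first_text)
print(every_text)
```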
2.5 Extracting information
Get the node name (name)
print(soup.title.name)
Get attributes (attrs)
print(soup.p.attrs)
print(soup.p.attrs["name"])
Get the content (string)
print(soup.p.string)
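The string attribute only yields text when a node has a single child to take it from. If the node mixes text and tags, it returns None, and get_text() is the safer choice. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first p has one child, the second has two.
html = "<p><b>Test</b></p><p>A <a>B</a></p>"
soup = BeautifulSoup(html, "html.parser")
single, mixed = soup.find_all("p")

single_string = single.string   # falls through to the only child, <b>
mixed_string = mixed.string     # ambiguous, so None
mixed_text = mixed.get_text()   # concatenates every piece of text

print(single_string)
print(mixed_string)
print(mixed_text)
```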
Nested selection
print(soup.head.title)
print(soup.head.title.string)
2.6 Selecting related nodes
Children and descendants
Direct children
contents (returns a list)
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Test</title></head>
<body>
<p class="story">A B C
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
<p class="story">···</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.p.contents)
children (returns an iterator)
from bs4 import BeautifulSoup

# html is the document defined in the contents example
soup = BeautifulSoup(html, "lxml")
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Descendants (descendants, returns a generator)
from bs4 import BeautifulSoup

# html is the document defined in the contents example
soup = BeautifulSoup(html, "lxml")
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Parents and ancestors
Parent (parent)
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.parent)
Ancestors (parents)
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>,
<a href="http://example.com/B" class="abc" id="link2">B</a> and
<a href="http://example.com/C" class="abc" id="link3">C </a>;
</p>
"""
soup = BeautifulSoup(html, "lxml")
# parents is a generator, so convert it to a list before printing
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))
Siblings
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a>
ABC
<a href="http://example.com/B" class="abc" id="link2">B</a>
abc
<a href="http://example.com/C" class="abc" id="link3">C </a>
123
</p>
"""
soup = BeautifulSoup(html, "lxml")
print("next sibling", soup.a.next_sibling)
print("prev sibling", soup.a.previous_sibling)
print("next siblings", list(enumerate(soup.a.next_siblings)))
print("prev siblings", list(enumerate(soup.a.previous_siblings)))
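Note that the sibling attributes also return plain text nodes, not only tags. When the next tag is what you want, find_next_sibling skips the text in between. A small sketch on made-up markup, using html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: text sits between the two a nodes.
html = '<p>Test <a id="link1">A</a> ABC <a id="link2">B</a></p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

raw_next = first_link.next_sibling            # the text node between the links
next_tag = first_link.find_next_sibling("a")  # the next a tag, text skipped

print(repr(raw_next))
print(next_tag["id"])
```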
Extracting information
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
Test
<a href="http://example.com/A" class="abc" id="link1"><span>A</span></a><a href="http://example.com/B" class="abc" id="link2">B</a><a href="http://example.com/C" class="abc" id="link3">C </a>
</p>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print(soup.a.parents)
# the first ancestor is the enclosing p node
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs["class"])
2.7 Method selectors
find_all
find_all(name, attrs, recursive, text, **kwargs)
name
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4> Hello </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="ul"))
print(type(soup.find_all(name="ul")[0]))
# each result is a Tag, so find_all can be called on it again
for ul in soup.find_all(name="ul"):
    print(ul.find_all(name="li"))
    for li in ul.find_all(name="li"):
        print(li.string)
attrs
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4> Hello </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(attrs={"id": "list-1"}))
print(soup.find_all(attrs={"class": "element"}))
# equivalent to (class is a Python keyword, hence class_)
print(soup.find_all(id="list-1"))
print(soup.find_all(class_="element"))
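text
This argument matches against node text. It can be a plain string or a compiled regular expression.
With a regular expression, the result is a list of all node texts that match it.
import re
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-body">
        <a>link1</a>
        <a>link2</a>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(text=re.compile("link")))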
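Newer Beautiful Soup releases (4.4.0 and later) name this argument string; text is the older spelling. The sketch below uses string on made-up markup. It also shows that passing a tag name next to it returns the tags rather than the bare text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample with three links.
html = "<div><a>link1</a><a>link2</a><a>other</a></div>"
soup = BeautifulSoup(html, "html.parser")

texts = soup.find_all(string=re.compile("link"))      # matching text nodes
tags = soup.find_all("a", string=re.compile("link"))  # a tags whose text matches
exact = soup.find_all(string="other")                 # exact string comparison

tag_names = [t.name for t in tags]
print(texts)
print(tags)
print(exact)
```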
find
Returns only the first matching node.
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4> Hello </h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find(name="ul"))
print(type(soup.find(name="ul")))
print(soup.find(class_="list"))
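find_parents
Returns all ancestor nodes.
find_parent
Returns the direct parent node.
find_next_siblings
Returns all following siblings.
find_next_sibling
Returns the first following sibling.
find_previous_siblings
Returns all preceding siblings.
find_previous_sibling
Returns the first preceding sibling.
find_all_next
Returns all matching nodes after the current node.
find_next
Returns the first matching node after the current node.
find_all_previous
Returns all matching nodes before the current node.
find_previous
Returns the first matching node before the current node.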
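A few of these methods in action, starting from the middle item of a short made-up list. This is only a sketch, using html.parser; each method accepts the same filters as find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample list.
html = """
<ul id="list-1">
<li class="element">A</li>
<li class="element">B</li>
<li class="element">C</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
middle = soup.find_all("li")[1]  # the li holding "B"

parent_id = middle.find_parent("ul")["id"]
next_one = middle.find_next_sibling("li").string
previous_one = middle.find_previous_sibling("li").string
later = [li.string for li in middle.find_all_next("li")]
earlier = [li.string for li in middle.find_all_previous("li")]

print(parent_id, next_one, previous_one, later, earlier)
```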
2.8 CSS selectors
Example
select("node1 node2 node3 ...")
Returns every node3 under every node2 under every node1, and so on.
Each node can be replaced by
.class-name
#id-name
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">A</li>
            <li class="element">B</li>
            <li class="element">C</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">a</li>
            <li class="element">b</li>
            <li class="element">c</li>
        </ul>
    </div>
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.select(".panel .panel-body .list"))
print(soup.select("ul li"))
print(soup.select("#list-2 .element"))
print(type(soup.select("ul")[0]))
Nested selection
# html is the document defined in the CSS selector example
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul.select("li"))
Getting attributes
With square brackets
With .attrs[]
soup = BeautifulSoup(html, "lxml")
for ul in soup.select("ul"):
    print(ul["id"])
    # equivalent to
    print(ul.attrs["id"])
Getting text
.get_text()
.string
soup = BeautifulSoup(html, "lxml")
for li in soup.select("li"):
    print(f"text: {li.get_text()}")
    # equivalent here, because each li holds a single text node
    print(f"string: {li.string}")
3. Using pyquery
3.1 Installation
pip3 install pyquery
3.2 Initialization
From a string
from pyquery import PyQuery

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
doc = PyQuery(html)
print(doc("li"))
From a URL
from pyquery import PyQuery

doc = PyQuery(url="https://www.bilibili.com/")
# equivalent to doc = PyQuery(requests.get(url).text)
print(doc("title"))
From a file
from pyquery import PyQuery

doc = PyQuery(filename="test.html")
print(doc("li"))
3.3 Basic CSS selectors
from pyquery import PyQuery

html = """
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
doc = PyQuery(html)
print(doc("#container .list li"))
print(type(doc("#container .list li")))
# items() yields each matched node as its own PyQuery object
for item in doc("#container .list li").items():
    print(item.text())
3.4 Finding nodes
Direct children
from pyquery import PyQuery

# html is the document defined in section 3.3
doc = PyQuery(html)
items = doc(".list")
lis = items.children()
print(type(lis))
print(lis)
Descendants
from pyquery import PyQuery

# html is the document defined in section 3.3
doc = PyQuery(html)
items = doc(".list")
print(type(items))
print(items)
lis = items.find("li")
print(type(lis))
print(lis)
Direct parent
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
container = items.parent()
print(type(container))
print(container)
Ancestors
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
items = doc(".list")
parents = items.parents()
print(type(parents))
print(parents)
Siblings (with a caveat)
Watch the output order: the siblings are not necessarily printed in document order.
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
lis = doc(".list .item-0.active")
for item in lis.siblings().items():
    print(item.text())
3.5 Traversing nodes
A single node
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(str(li))
Several nodes
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
# items() returns a generator of PyQuery objects, one per node
lis = doc("li").items()
print(type(lis))
for li in lis:
    print(li, type(li))
Getting information
The information to extract is either an attribute or text.
Getting attributes (attr())
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a, type(a))
print(a.attr("href"))
# equivalent to
print(a.attr.href)
a = doc("a")
print(a, type(a))
# with several nodes selected, attr() returns only the first node's attribute
print(a.attr("href"))
# iterate with items() to read the attribute of every node
for item in a.items():
    print(item.attr("href"))
Getting text
Inner text (.text())
Greedy: if several nodes are selected, the text of all of them is returned, separated by spaces.
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
a = doc(".item-0.active a")
print(a)
print(a.text())
Inner HTML (.html())
Non-greedy: only the inner HTML of the first matched node is returned.
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
print(li.html())
A common pitfall with .html()
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
a = doc(".item-0.active a")
# the printed output looks the same, but the two values are not equal
print(li.html())
print(a)
# this one is a str
print(type(li.html()))
# this one is a pyquery.pyquery.PyQuery
print(type(a))
3.6 Modifying nodes
addClass and removeClass
These change a node's class attribute dynamically. The example uses the equivalent snake_case names add_class and remove_class.
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
li.remove_class("active")
print(li)
li.add_class("active")
print(li)
attr, text and html
from pyquery import PyQuery

html = """
<ul class="warp">
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
"""
doc = PyQuery(html)
li = doc(".item-0.active")
print(li)
# with two arguments, attr() sets the attribute
li.attr("name", "link")
print(li)
# with an argument, text() and html() replace the node's content
li.text("changed item")
print(li)
li.html("<span>changed item</span>")
print(li)
remove
from pyquery import PyQuery

html = """
<ul class="warp">
    ABC
    <p>abc</p>
</ul>
"""
doc = PyQuery(html)
warp = doc(".warp")
# remove the p node, then read the text that is left
warp.find("p").remove()
print(warp.text())
3.7 Pseudo-class selectors
from pyquery import PyQuery

html = """
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
"""
doc = PyQuery(html)
# the first li node
li = doc("li:first-child")
print(li)
# the last li node
li = doc("li:last-child")
print(li)
# the second li node
li = doc("li:nth-child(2)")
print(li)
# li nodes with an index greater than 2 (zero-based)
li = doc("li:gt(2)")
print(li)
# li nodes at even positions
li = doc("li:nth-child(2n)")
print(li)
# li nodes containing the text "second"
li = doc("li:contains(second)")
print(li)
4. Using parsel
parsel can parse both HTML and XML.
It supports extracting and modifying content with XPath and CSS selectors.
It also integrates regular-expression extraction.
4.1 Installation
pip3 install parsel
4.2 Initialization
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
selector = Selector(text=html)
items1 = selector.css(".item-0")
print(len(items1), type(items1), items1)
items2 = selector.xpath("//li[contains(@class, 'item-0')]")
print(len(items2), type(items2), items2)
In the printed output, the results of .css() show an XPath expression rather than the CSS selector.
This comes from the underlying cssselect library.
It converts each CSS selector to XPath.
The actual node extraction is then done with XPath.
4.3 Extracting text
A single text value (get())
Returns only the first matching result.
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
selector = Selector(text=html)
items = selector.css(".item-0")
print(f"type of items: {type(items)}")
for item in items:
    print(f"type of item: {type(item)}")
    # .//text() searches below the current item; get() keeps the first hit
    text = item.xpath(".//text()").get()
    print(text)
Several text values (getall())
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
selector = Selector(text=html)
result = selector.xpath('//li[contains(@class, "item-0")]//text()').getall()
print(result)
result = selector.css(".item-0 *::text").getall()
print(result)
4.4 Extracting attributes
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
selector = Selector(text=html)
# CSS: the ::attr(name) pseudo-element selects an attribute value
result = selector.css(".item-0.active a::attr(href)").get()
print(result)
# XPath: @href selects the same value
result = selector.xpath('//li[contains(@class, "item-0") and contains(@class, "active")]/a/@href').get()
print(result)
4.5 Extracting with regular expressions
from parsel import Selector

html = """
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
selector = Selector(text=html)
# with a capture group, re() returns the captured part of each match
result = selector.css(".item-0").re('link(.*?)"')
print(result)
result = selector.css(".item-0 *::text").re(".*item")
print(result)
# re_first() returns only the first match
result = selector.css(".item-0").re_first(">(.*?item)")
print(result)