Python Web Scraping

Fetching Data

The `requests` Library

requests is a third-party library that does not ship with Python, so it must be installed first:

`pip install requests`

or

`pip3 install requests`

GET Requests

To send a GET request to a page, use the get function of the requests library:

import requests
http = requests.get('http://httpbin.org/get')
print(http.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deb3be-6e9601872fd67f8e396d9e7e"
  }, 
  "origin": "138.128.221.178", 
  "url": "http://httpbin.org/get"
}

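The http://httpbin.org/get endpoint echoes back the details of the GET request it received.

Parameters of requests.get:

params: the query string, i.e. the data after the ? in the URL, for example http://example.com?a=1&b=2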

import requests
data = {
    'name':'abc',
    'age':18
}
http = requests.get('http://httpbin.org/get', params = data)
print(http.text)

Response:

{
  "args": {
    "age": "18", 
    "name": "abc"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deb824-3fee48163eb300d21574b23d"
  }, 
  "origin": "138.128.221.178", 
  "url": "http://httpbin.org/get?name=abc&age=18"
}
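
To see how params ends up in the URL without sending a request, the sketch below rebuilds the query string with the standard library's urllib.parse.urlencode. It only approximates what requests does internally, and the variable names are illustrative.

```python
from urllib.parse import urlencode

base_url = 'http://httpbin.org/get'
data = {'name': 'abc', 'age': 18}

# urlencode converts each value to a string and joins the pairs with '&'
query = urlencode(data)
full_url = base_url + '?' + query
print(full_url)
```

The integer 18 is sent as the text "18", which is why httpbin reports every value in args as a string.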

POST Requests

To send a POST request to a page, use the post function of the requests library:

import requests
http = requests.post('http://httpbin.org/post')
print(http.text)

Response:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deb90a-35df41cb25e71332407a0261"
  }, 
  "json": null, 
  "origin": "138.128.221.178", 
  "url": "http://httpbin.org/post"
}

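Parameters of requests.post:

data: the form fields, sent in the request body (not in the headers). In a POST request, parameters are usually submitted as form data: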

import requests
data = {
    'name':'abc',
    'age':18
}
http = requests.post('http://httpbin.org/post', data = data)
print(http.text)

Response:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "18", 
    "name": "abc"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "15", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deba67-65e0fe9433e8d1f661b0de58"
  }, 
  "json": null, 
  "origin": "138.128.221.178", 
  "url": "http://httpbin.org/post"
}

json: to send JSON data, use the json parameter:

import requests
jsons = {
    'name':'abc',
    'age':18
}
http = requests.post('http://httpbin.org/post', json = jsons)
print(http.text)

Response:

{
  "args": {}, 
  "data": "{\"name\": \"abc\", \"age\": 18}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "26", 
    "Content-Type": "application/json", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66debaff-503cfec77494caf9719b6592"
  }, 
  "json": {
    "age": 18, 
    "name": "abc"
  }, 
  "origin": "138.128.221.178", 
  "url": "http://httpbin.org/post"
}
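
The difference between data and json comes down to how the same dictionary is serialized into the request body. The following sketch reproduces both encodings with the standard library only; it is an approximation of what requests sends, not its actual implementation.

```python
import json
from urllib.parse import urlencode

payload = {'name': 'abc', 'age': 18}

# data=payload -> form-encoded body, sent as application/x-www-form-urlencoded
form_body = urlencode(payload)

# json=payload -> JSON text body, sent as application/json
json_body = json.dumps(payload)

print(form_body, len(form_body))
print(json_body, len(json_body))
```

The two lengths correspond to the Content-Length values (15 and 26) in the httpbin responses above.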

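files: to upload a file, use the files parameter:

import requests

# Open the file in binary mode; the with block closes it once the upload is done
with open('abc.jpg', 'rb') as f:
    files = {'file': f}
    http = requests.post('http://httpbin.org/post', files = files)
print(http.text)

Result: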

{
  "args": {}, 
  "data": "", 
  "files": {
    "file": "data:application/octet-stream;base64,/9j/4AAQSkZJRgABAQEAAAAAAAD/4QAuRXhpZgAATU0AKgAAAAgAAkAAAAMAAAABAHwAAEABAAEAAAABAAAAAAAAAAD/2wBDAAoHBwkHBgoJCAkLCwoMDxkQDw4ODx4WFxIZJCAmJSMgIyIoLTkwKCo2KyIjMkQyNjs9QEBAJjBGS0U+Sjk/QD3/2wBDAQsL......"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Content-Length": "27984", 
    "Content-Type": "multipart/form-data; boundary=a8c2a00ef69d52ac081743800b2ddb96", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66def532-5091c5d5409d13647fbfff7a"
  }, 
  "json": null, 
  "origin": "59.45.22.140", 
  "url": "http://httpbin.org/post"
}

Setting Request Headers

Use the headers parameter to set request headers:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0 abcabcabc'
}
http = requests.get('http://httpbin.org/get', headers = headers)
print(http.text)

Response:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0 abcabcabc", 
    "X-Amzn-Trace-Id": "Root=1-66dec2e7-770d878920f3da202c694075"
  }, 
  "origin": "223.100.114.95", 
  "url": "http://httpbin.org/get"
}

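Headers passed through the headers parameter override only the matching default headers; the remaining defaults are sent unchanged.

Some sites impose requirements on request headers. For example, a site may refuse requests that lack a browser-like User-Agent (a simple anti-scraping measure). Setting the request headers gets around this kind of restriction.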
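
The override behavior can be pictured as a dictionary merge in which custom headers win over defaults. This is a simplified sketch using plain dicts; the default values are copied from the httpbin responses above, and requests additionally treats header names case-insensitively, which the sketch ignores.

```python
# Default headers as reported by httpbin in the responses above
default_headers = {
    'User-Agent': 'python-requests/2.28.1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': '*/*',
}
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}

# Later keys win, so only User-Agent is replaced
merged = {**default_headers, **custom_headers}
for name, value in merged.items():
    print(f'{name}: {value}')
```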

Responses

After a request is sent, the server returns a response.

Getting the Status Code

Use status_code to get the response status code:

import requests
http = requests.get('http://httpbin.org/get')
print(http.status_code)

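A result of 200 means the server responded normally.

Status codes are commonly used to check whether a request succeeded. Besides comparing against the numeric HTTP status codes directly, you can use requests.codes, a built-in lookup object provided by the requests library: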

import requests
http = requests.get('http://httpbin.org/get')
if http.status_code == requests.codes.ok:
    print('Success')
else:
    print('Fail')

requests.codes provides the following names:

# Informational
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long','request_uri_too_long'),

# Success
200: ('ok','okay','all_ok','all_okay','all_good','\\o/','✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info','non_authoritative_information'),
204: ('no_content',),
205: ('reset_content','reset'),
206: ('partial_content','partial'),
207: ('multi_status','multiple_status','multi_stati','multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection
300: ('multiple_choices',),
301: ('moved_permanently','moved','\\o-'),
302: ('found',),
303: ('see_other','other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect','temporary_moved','temporary'),
308: ('permanent_redirect',),

# Client errors
400: ('bad_request','bad'),
401: ('unauthorized',),
402: ('payment_required','payment'),
403: ('forbidden',),
404: ('not_found','-o-'),
405: ('method_not_allowed','not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required','proxy_auth','proxy_authentication'),
408: ('request_timeout','timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed','precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type','unsupported_media','media_type'),
416: ('requested_range_not_satisfiable','requested_range','range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot','teapot','i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity','unprocessable'),
423: ('locked',),
424: ('failed_dependency','dependency'),
425: ('unordered_collection','unordered'),
426: ('upgrade_required','upgrade'),
428: ('precondition_required','precondition'),
429: ('too_many_requests','too_many'),
431: ('header_fields_too_large','fields_too_large'),
444: ('no_response','none'),
449: ('retry_with','retry'),
450: ('blocked_by_windows_parental_controls','parental_controls'),
451: ('unavailable_for_legal_reasons','legal_reasons'),
499: ('client_closed_request',),

# Server errors
500: ('internal_server_error','server_error','/o\\','✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable','unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported','http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded','bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required','network_auth','network_authentication')
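
The five groups in the list follow the first digit of the code. As a small illustration, the helper below classifies a code by that digit and looks up its standard reason phrase with http.HTTPStatus from the standard library. describe_status is a hypothetical helper written for this example, not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a (category, phrase) pair for an HTTP status code."""
    categories = {1: 'informational', 2: 'success', 3: 'redirection',
                  4: 'client error', 5: 'server error'}
    category = categories.get(code // 100, 'unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Non-standard codes are not defined in HTTPStatus
        phrase = ''
    return category, phrase

for code in (200, 301, 404, 503):
    print(code, describe_status(code))
```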

Getting the Response Headers

Use headers to get the response headers:

import requests
http = requests.get('http://httpbin.org/get')
print(dict(http.headers))

Output:

{'Date': 'Mon, 09 Sep 2024 09:54:46 GMT',
 'Content-Type': 'application/json',
 'Content-Length': '311',
 'Connection': 'keep-alive',
 'Server': 'gunicorn/19.9.0',
 'Access-Control-Allow-Origin': '*',
 'Access-Control-Allow-Credentials': 'true'}

Getting Cookies

Use cookies to get the cookies:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}
http = requests.get('https://www.baidu.com', headers = headers)
print(http.cookies)

The printed object is of type <class 'requests.cookies.RequestsCookieJar'>. To read its contents, iterate over its items:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}
http = requests.get('https://www.baidu.com', headers = headers)
for key, value in http.cookies.items():
    print(key+'='+value)

Result:

BAIDUID=01F7C48999EF3C0B11FCB8649BEA416E:FG=1
BIDUPSID=01F7C48999EF3C0BBF7B44D700D08F1A
PSTM=1725883093
BDSVRTM=3
BD_HOME=1

The items() method of http.cookies returns the cookies as a list of (name, value) tuples:

`print(http.cookies.items())`

Result:

`[('BAIDUID', '01F7C48999EF3C0B11FCB8649BEA416E:FG=1'), ('BIDUPSID', '01F7C48999EF3C0BBF7B44D700D08F1A'), ('PSTM', '1725883093'), ('BDSVRTM', '3'), ('BD_HOME', '1')]`

Use http.cookies.get(key) to get the value stored under a given name:

`print(http.cookies.get('BAIDUID'))`

Output:

`01F7C48999EF3C0B11FCB8649BEA416E:FG=1`

Getting the URL

Use url to get the URL:

import requests
http = requests.get('http://httpbin.org/get')
print(http.url)

Output:

`http://httpbin.org/get`

Getting the Text (str)

Use text to get the response body as a str:

import requests
http = requests.get('http://httpbin.org/get')
print(http.text)

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66decc33-1d284ece4d191c555bf39f6c"
  }, 
  "origin": "223.100.114.95", 
  "url": "http://httpbin.org/get"
}

Getting Binary Data

Use content to get the response body as raw bytes:

import requests
http = requests.get('http://httpbin.org/get')
print(http.content)

Output:

b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.28.1", \n    "X-Amzn-Trace-Id": "Root=1-66ded939-4b946c036a13f8176d10afed"\n  }, \n  "origin": "223.100.114.95", \n  "url": "http://httpbin.org/get"\n}\n'

If the bytes are encoded text rather than a file, decode them with the decode method, for example as UTF-8:

import requests
http = requests.get('http://httpbin.org/get')
print(http.content.decode('utf-8'))

Output:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deda00-1b59848b5409550e044f0b07"
  }, 
  "origin": "223.100.114.95", 
  "url": "http://httpbin.org/get"
}

If the scraped content is an image, audio, or video file, write the bytes to a local file:

import requests
http = requests.get('https://th.bing.com/th/id/OIP.TNE2gqvkCP38r1nda89iagHaEo?rs=1&pid=ImgDetMain')
with open('abc.jpg', 'wb') as f:
    f.write(http.content)

Open abc.jpg to view the downloaded image.

Setting Cookies

Some sites use cookies to maintain the login state.

Cookies can be set directly in the request headers:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
    'Cookie': 'a=123; b=456'
}
http = requests.get('http://httpbin.org/get', headers = headers)
print(http.text)

Result:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "a=123; b=456", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0", 
    "X-Amzn-Trace-Id": "Root=1-66deee1a-68b3e32b522e107507e8222c"
  }, 
  "origin": "59.45.22.140", 
  "url": "http://httpbin.org/get"
}

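Alternatively, pass them through the cookies parameter, here by building a RequestsCookieJar (a plain dict is also accepted):

import requests
import requests.cookies

cookies = 'a=123; b=456'
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split(';'):
    # Strip the space that follows each ';' and split on the first '=' only
    key, value = cookie.strip().split('=', 1)
    jar.set(key, value)
http = requests.get('http://httpbin.org/get', cookies = jar)
print(http.text)

Result: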

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "b=456; a=123", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66deefd8-5d990c43143813315a20760d"
  }, 
  "origin": "59.45.22.140", 
  "url": "http://httpbin.org/get"
}
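
Splitting a cookie string by hand is easy to get subtly wrong (stray spaces, values containing "="). One alternative, sketched here with the standard library's http.cookies.SimpleCookie, parses the string into a dictionary; the variable names are illustrative.

```python
from http.cookies import SimpleCookie

cookie_string = 'a=123; b=456'

parsed = SimpleCookie()
parsed.load(cookie_string)

# Each entry is a Morsel; .value holds the cookie value
cookie_dict = {name: morsel.value for name, morsel in parsed.items()}
print(cookie_dict)
```

The resulting dictionary could then be passed as cookies=cookie_dict, since the cookies parameter also accepts a plain dict.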

If cookies are set through both headers and the cookies parameter, only the Cookie header given in headers takes effect:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
    'Cookie': 'c=456'
}
jar = requests.cookies.RequestsCookieJar()
cookies = 'a=123; b=456'
for cookie in cookies.split(';'):
    # Strip the space that follows each ';' and split on the first '=' only
    key, value = cookie.strip().split('=', 1)
    jar.set(key, value)
http = requests.get('http://httpbin.org/get', headers = headers, cookies = jar)
print(http.text)

Result:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "c=456", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0", 
    "X-Amzn-Trace-Id": "Root=1-66def0b2-554682a5736248352cdd2109"
  }, 
  "origin": "59.45.22.140", 
  "url": "http://httpbin.org/get"
}

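Using a Session

When cookies are used to stay logged in, they have to be set again on every request. A Session maintains the conversation instead.

Create a Session object with requests.Session(). It offers the same get() and post() methods, and cookies set by the server are stored and sent automatically on later requests: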

import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/a/123')
http = s.get('http://httpbin.org/get')
print(http.text)

Result:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "a=123", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66def7a6-2c6da17e4b19f19c336741c3"
  }, 
  "origin": "59.45.22.140", 
  "url": "http://httpbin.org/get"
}

import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/a/123')
s.get('http://httpbin.org/cookies/set/b/456')
http = s.get('http://httpbin.org/get')
print(http.text)

Result:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Cookie": "a=123; b=456", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-67447d03-26fc80215e163df61aa8a44e"
  }, 
  "origin": "223.100.114.95", 
  "url": "http://httpbin.org/get"
}

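SSL Certificate Verification

When sending an HTTPS request, requests verifies the server's SSL certificate. If the certificate is not trusted by a CA, the request fails with requests.exceptions.SSLError. To skip verification, set the verify parameter to False: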

import requests
http = requests.get('https://httpbin.org/get', verify=False)
print(http.text)

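A local certificate can also be supplied as the client certificate, either as a single file containing both the private key and the certificate, or as a tuple of the two file paths:

import requests
# The crt and key files must exist locally; the key must be unencrypted, since encrypted keys are not supported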
http = requests.get('https://httpbin.org/get', cert = ('/path/server.crt', '/path/key'))
print(http.status_code)

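Proxy Settings

A simple anti-scraping strategy is to block frequent requests from the same IP. Routing requests through a proxy works around this. Set proxies with the proxies parameter:

import requests

proxies = {
    'http': 'http://127.0.0.1:8080', # proxy for http URLs
    'https': 'http://127.0.0.1:8080' # proxy for https URLs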
}

http = requests.get('http://httpbin.org/get', proxies = proxies)
print(http.text)

Response:

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-66df09c6-6c8ab5b1553a7d15017fd99d"
  }, 
  "origin": "xxx.xxx.xxx.xxx", // the proxy's IP 
  "url": "http://httpbin.org/get"
}

If the proxy requires a username and password (HTTP Basic Auth), use the http://user:password@host:port syntax:

import requests

proxies = {
    'http': 'http://abc:123456@127.0.0.1:8080',
    'https': 'http://abc:123456@127.0.0.1:8080'
}

http = requests.get('https://httpbin.org/get', proxies = proxies)
print(http.text)
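
When the username or password contains characters such as @ or :, they generally need to be percent-encoded before being embedded in the proxy URL. The sketch below builds such a URL with urllib.parse.quote; build_proxy_url and the sample password are hypothetical and serve only as illustration.

```python
from urllib.parse import quote

def build_proxy_url(user, password, host, port, scheme='http'):
    """Build a proxy URL, percent-encoding the credentials."""
    return f'{scheme}://{quote(user, safe="")}:{quote(password, safe="")}@{host}:{port}'

proxy_url = build_proxy_url('abc', 'p@ss:word', '127.0.0.1', 8080)
proxies = {'http': proxy_url, 'https': proxy_url}
print(proxies)
```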

Besides HTTP and HTTPS proxies, requests also supports SOCKS proxies. This requires installing the SOCKS extra first:

`pip install 'requests[socks]'`

or

`pip3 install 'requests[socks]'`

Usage:

import requests

proxies = {
    'http': 'socks5://user:password@host:port',
    'https': 'socks5://user:password@host:port'
}

http = requests.get('https://httpbin.org/get', proxies = proxies)
print(http.text)

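Timeouts

When the local network or the server is slow or unreachable, a request may wait a long time for a response, or never receive one. Setting a timeout makes requests raise an exception if no response arrives within the given time.

Use the timeout parameter to set it.

A single timeout value

import requests
# Raise an exception if no response arrives within 1 s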
http = requests.get('https://httpbin.org/get', timeout = 1)
print(http.text)

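Separate timeouts: a request has a connect phase and a read phase. If timeout is a single number, that value is applied to each of the two phases separately; it is not a limit on their sum. To set them individually, pass a (connect, read) tuple:

import requests
# Raise an exception if connecting takes more than 2 s, or if the server sends no data for more than 10 s
http = requests.get('https://httpbin.org/get', timeout = (2, 10))
print(http.text)

No timeout: set timeout to None, or omit the parameter, since None is its default value.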
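
The three forms of the timeout argument can be summarized by expanding each one into a (connect, read) pair. normalize_timeout is a hypothetical helper that only mirrors the semantics described above; it is not a function provided by requests.

```python
def normalize_timeout(timeout):
    """Expand a timeout argument into a (connect, read) pair."""
    if isinstance(timeout, tuple):
        connect, read = timeout
        return connect, read
    # A single number (or None) applies to both phases
    return timeout, timeout

print(normalize_timeout(1))
print(normalize_timeout((2, 10)))
print(normalize_timeout(None))
```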

Authentication

For pages protected by simple username/password authentication, i.e. HTTP Basic Auth (the kind that also accepts the http://user:password@host:port form), use the HTTPBasicAuth class from requests:

import requests
from requests.auth import HTTPBasicAuth

url = 'http://127.0.0.1:8080'
http = requests.get(url, auth = HTTPBasicAuth('abc', '123456'))
print(http.status_code)
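
HTTP Basic Auth works by sending an Authorization header that contains the Base64 encoding of "username:password". The sketch below builds that header value with the standard library to show what goes over the wire; basic_auth_header is a hypothetical helper, and the UTF-8 encoding of the credentials is an assumption of this example.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic Auth."""
    credentials = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('abc', '123456')
print(header)
```

Base64 is an encoding, not encryption, so Basic Auth should only be used over HTTPS.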

requests also supports other authentication schemes, such as OAuth, through the requests_oauthlib package:

`pip install requests_oauthlib`

or

`pip3 install requests_oauthlib`

Usage:

import requests
from requests_oauthlib import OAuth1
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1("YOUR_APP_KEY","YOUR_APP_SECRET","USER_OAUTH_TOKEN","USER_OAUTH_TOKEN_SECRET")
requests.get(url,auth=auth)

Prepared Request

from requests import Request, Session

url = 'http://httpbin.org/post'

data = {
    'name': 'abc',
    'age': 18
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}

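# Create a Session object
s = Session()

# Create a Request object; it is not yet prepared and cannot be sent directly
req = Request('POST', url, data = data, headers = headers)

# Turn req into a prepared request: this adds required headers (such as Content-Length) and encodes the request body, if any, into the appropriate format
prepared = s.prepare_request(req)

# Send the request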
http = s.send(prepared)

print(http.text)

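The advantage of a Prepared Request is that each request becomes a standalone object. This is convenient for scheduling: requests can be built ahead of time and placed in a queue before being sent.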

Selenium

Using Selenium requires installing the selenium library and a WebDriver that matches the browser.

Install Selenium:

`pip install selenium`

After installation, check the installed version with:

`pip show selenium`

The version can also be checked from Python:

import selenium
print(selenium.__version__)

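Downloading a WebDriver:

Since Selenium 4.6, the bundled Selenium Manager tries to detect the installed browser version and download a matching driver automatically:

from selenium import webdriver
driver = webdriver.Chrome() # a matching driver is located or downloaded automatically

If the driver cannot be installed automatically, download it manually and specify its path: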

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService

service = ChromeService(executable_path='PATH_TO_DRIVER') # path to the WebDriver executable
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

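Each browser needs its own WebDriver: Chrome uses ChromeDriver, Firefox uses GeckoDriver, Edge uses EdgeDriver, Safari uses SafariDriver, and so on.

Chrome and ChromeDriver releases are listed on the official download page. Enter chrome://version/ in Chrome's address bar to see the browser version, then download the matching ChromeDriver.

The command that starts the WebDriver depends on the browser: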

from selenium import webdriver

# Chrome
driver = webdriver.Chrome()

# Firefox
# driver = webdriver.Firefox()

# Edge
# driver = webdriver.Edge()

Opening a Page

Use the get() method to open a page:

from selenium import webdriver

driver = webdriver.Chrome()

# Open the page
driver.get('http://example.com')

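Finding Page Elements

Use find_element() and find_elements() to locate elements on the page:

`driver.find_element(by, value) # returns the first matching element`

by: the locator strategy, e.g. By.ID to search by id or By.CLASS_NAME to search by class
value: the value to match, e.g. driver.find_element(By.ID, 'abc') finds the element whose id is abc

`driver.find_elements(by, value) # returns all matching elements as a list`

Locator strategies:

`from selenium.webdriver.common.by import By # import the By class`

By.CLASS_NAME: match by class name
By.ID: match by id
By.NAME: match by the name attribute
By.TAG_NAME: match by tag name
By.XPATH: match with an XPath expression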

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://example.com')
link = driver.find_element(By.TAG_NAME, 'a')
link.click()
driver.quit()

Simulating User Actions

Click

`element.click()`

For example:

from selenium import webdriver

driver = webdriver.Chrome()

driver.get('https://example.com')
button = driver.find_element('xpath', '//div/a')
button.click()

Entering Text

`element.send_keys(value)`

For example:

from selenium import webdriver

driver = webdriver.Chrome()

driver.get('https://www.baidu.com')
search_box = driver.find_element('xpath', '//span/input[@id="kw"]')
search_box.send_keys('example')

To send special keys:

`from selenium.webdriver.common.keys import Keys # import the Keys class`

Enter: Keys.ENTER
Delete: Keys.DELETE
Tab: Keys.TAB
F1: Keys.F1
Other keys follow the same pattern.

For example:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get('https://www.baidu.com')
search_box = driver.find_element('xpath', '//span/input[@id="kw"]')
search_box.send_keys('example')
search_box.send_keys(Keys.ENTER)

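Browser Back

`driver.back()`

Browser Forward

`driver.forward()`

Refresh

`driver.refresh()`

Executing JavaScript

`driver.execute_script(script, *args)`

For example, to open a new window:

`driver.execute_script('window.open("https://www.baidu.com")')`

Switching to a Given Window

`driver.switch_to.window(driver.window_handles[-1]) # pass the handle of the target window, here the last entry in driver.window_handles`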

Double Click

Compound actions such as double-clicking and hovering are performed through the ActionChains class.

Double-click with double_click():

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get('https://www.example.com')

# Create an ActionChains object
actions = ActionChains(driver)

element = driver.find_elements(By.CLASS_NAME, 'example')[0]

# Perform the double click
actions.double_click(element).perform()

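Note: nothing is executed until perform() is called.

Key Combinations

Take Ctrl+A (select all) as an example. With .send_keys(Keys.CONTROL + 'a'), the modifier and the letter are passed as one string, so the press-and-release order is left implicit and may not be handled as a simultaneous key press in every situation.

ActionChains makes the sequence explicit with key_down() and key_up(): hold Ctrl, send A, then release Ctrl: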

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.example.com')

# Create an ActionChains object
actions = ActionChains(driver)

element = driver.find_elements(By.CLASS_NAME, 'example')[0]

actions.move_to_element(element).click().key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()

Right Click

Performed through ActionChains with context_click():

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.example.com')

actions = ActionChains(driver)

element = driver.find_element(By.ID, 'example')

actions.context_click(element).perform()

Hover

Simulates hovering the mouse over an element, commonly used to reveal drop-down lists or submenus.

Performed through ActionChains with move_to_element():

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.example.com')

actions = ActionChains(driver)

element = driver.find_element(By.ID, 'example')

actions.move_to_element(element).perform()

Moving to Precise Coordinates

Targets part of an element in a complex layout, such as a specific point in a chart.

Performed through ActionChains with move_to_element_with_offset(element, x, y):

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.example.com')

actions = ActionChains(driver)

element = driver.find_element(By.ID, 'example')

actions.move_to_element_with_offset(element, 50, 30).perform()

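Mouse Wheel

ActionChains has two main scrolling methods:

scroll_by_amount(delta_x, delta_y)
- Purpose: scrolls the page by the given offsets, with the top-left corner of the current viewport as the origin
- Parameters:
  - delta_x: horizontal scroll amount (positive scrolls right, negative scrolls left)
  - delta_y: vertical scroll amount (positive scrolls down, negative scrolls up)

actions.scroll_by_amount(100, 200).perform() # scroll right 100 pixels and down 200 pixels

scroll_from_origin(scroll_origin, delta_x, delta_y)
- Purpose: scrolls by the given offsets starting from a specified origin (the viewport or the center of an element)
- Parameters:
  - scroll_origin: the scroll origin, a ScrollOrigin object built from the viewport or from a page element
  - delta_x / delta_y: the scroll offsets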

from selenium.webdriver.common.actions.wheel_input import ScrollOrigin

element = driver.find_element(By.ID, "canvas")
origin = ScrollOrigin.from_element(element)  # start from the center of the element
actions.scroll_from_origin(origin, 0, 50).perform()

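Targeted Input

Sends keys to a specific element. Performed through ActionChains with send_keys_to_element(element, keys).

This types into a particular input box regardless of which element currently has focus.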

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get('https://www.example.com')

actions = ActionChains(driver)

element = driver.find_element(By.ID, 'example')

actions.send_keys_to_element(element, 'text').perform()

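Advanced Operations

Adding a Delay

Helps with dynamically loaded elements and with actions that fail when performed too quickly. Done through ActionChains with pause(seconds):

(ActionChains(driver)
 .click(button)
 .pause(1)  # wait for the dialog to load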
 .send_keys("data")
 .perform())

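Explicitly Waiting for an Element

Use WebDriverWait:

from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None)

driver: the WebDriver instance
timeout: maximum time to wait, in seconds
poll_frequency: polling interval (0.5 seconds by default)
ignored_exceptions: exception types to ignore while polling (NoSuchElementException by default)

Core methods:

until(method): polls until method returns a truthy value, then returns that value; raises TimeoutException if the timeout expires first
until_not(method): polls until method returns a falsy value; raises TimeoutException if the timeout expires first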

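Common wait conditions:

The expected_conditions module predefines the following conditions:

Element presence:
- presence_of_element_located(locator): the element exists in the DOM (not necessarily visible)
- presence_of_all_elements_located(locator): at least one matching element exists
Element visibility:
- visibility_of_element_located(locator): the element is displayed, with width and height greater than 0
- invisibility_of_element_located(locator): the element is invisible or absent from the DOM
Interaction state:
- element_to_be_clickable(locator): the element is visible and enabled, so it can be clicked (e.g. a button)
- text_to_be_present_in_element(locator, text): the element contains the given text
Page state:
- title_is(title): the page title exactly matches the given string
- url_contains(substring): the URL contains the given substring

For example: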

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

driver = webdriver.Chrome()

element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "my_element"))
)

Another example: after submitting a form, wait for the new page to load:

`WebDriverWait(driver, 10).until(EC.title_contains("提交成功"))`

Wait for an alert to appear and accept it:

alert = WebDriverWait(driver, 5).until(EC.alert_is_present())
alert.accept()
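
Conceptually, an explicit wait is a polling loop: call the condition, return its result once it is truthy, and give up when the deadline passes. The following standalone sketch illustrates that idea without a browser. wait_until and ready are hypothetical names for this example, and the code is not Selenium's actual implementation.

```python
import time

def wait_until(condition, timeout, poll_frequency=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met in time')
        time.sleep(poll_frequency)

# Hypothetical condition: becomes truthy on the third call
calls = {'count': 0}

def ready():
    calls['count'] += 1
    return 'done' if calls['count'] >= 3 else False

outcome = wait_until(ready, timeout=2, poll_frequency=0.01)
print(outcome, calls['count'])
```

As with until(), the value returned is whatever the condition produced, which is why a wait for an element can hand back the element itself.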

Getting Content

Getting an element's text

`element.text`

Getting the URL of the current page

`driver.current_url`

Getting the value of an element's attribute

`element.get_attribute(name)`

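Filtering and Processing Data

Beautiful Soup

Installing Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML.

Install:

`pip install beautifulsoup4`
`pip install lxml`

Basic Usage

Beautiful Soup relies on a parser. Besides the HTML parser in the Python standard library, it supports several third-party parsers, such as lxml: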

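Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, 'html.parser')`	Built into Python, decent speed	Poor tolerance of malformed markup before Python 3.2.2
lxml HTML parser	`BeautifulSoup(markup, 'lxml')`	Fast, tolerant of malformed markup	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, 'xml')`	Fast, the only supported XML parser	Requires an external C library
html5lib	`BeautifulSoup(markup, 'html5lib')`	Most tolerant, parses documents the way a browser does, produces valid HTML5	Slow, requires an external Python package

The lxml parser handles both HTML and XML, is fast, and tolerates malformed markup. To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object: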

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello world!</p>', 'lxml')

print(soup.p)

Output:

`<p>Hello world!</p>`

from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

soup = BeautifulSoup(html, 'lxml')

print(soup)

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>

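As shown, creating the BeautifulSoup object soup automatically repairs the markup: here the missing closing </body> and </html> tags are added.

Adjust the indentation with soup.prettify() (note: the result is a str):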

from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

soup = BeautifulSoup(html, 'lxml')

print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Note: prettify() returns a str, not a BeautifulSoup object:

soup = BeautifulSoup(html, 'lxml')
print(type(soup))
soup = soup.prettify()
print(type(soup))

Output:

<class 'bs4.BeautifulSoup'>
<class 'str'>

Node Selectors

Selecting Elements

Direct Selection

A node can be selected by its tag name, using soup.tagname:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.p)
print(type(soup.p))

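Output:

<p class="title"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>

soup.tagname returns a Tag object, which has many methods and attributes.

Note that when several nodes share the tag name, this form of selection matches only the first one.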

Nested Selection

If a tag contains other tags, chain the names as soup.tagname.tagname:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
The Dormouse's story

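Extracting Node Information

Node information is read through attributes and methods of the Tag object (tag below stands for a selected node such as soup.p):

Information	How to get it
Node name	`tag.name`
id	`tag['id']`
Text content	`tag.string`
Text content	`tag.text`
Text content	`tag.get_text()`
All attributes	`tag.attrs`
A given attribute	`tag.attrs[name]`
A given attribute	`tag[name]`
A given attribute	`tag.get(name)`

For example: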

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.name)
print(soup.p.attrs)
print(soup.p['class'])

Output:

p
{'class': ['title']}
['title']

Relational Selection

Selecting Children and Descendants

`contents`

The contents attribute returns a list of a node's direct children:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:

`[<b>The Dormouse's story</b>, '\n', <b>123</b>, '\n']`

The children include the newline characters \n.

`children`

Besides contents, there is the children attribute. It returns an iterator, which can be converted to a list or consumed with a for loop:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(type(soup.p.children))
print(list(soup.p.children))
for child in soup.p.children:
    print(child)

Output:

<class 'list_iterator'>
[<b>The Dormouse's story</b>, '\n', <b>123</b>, '\n']
<b>The Dormouse's story</b>


<b>123</b>

`descendants`

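The descendants attribute walks all descendants recursively: it goes down through a child's subtree until it reaches a leaf, then moves on to the next child (pre-order traversal).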

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">
<p>123</p>
<p>456</p></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for child in soup.p.descendants:
    print(child)

Output:


<a href="http://abc.com">
<p>123</p>
<p>456</p></a>


<p>123</p>
123


<p>456</p>
456


<b>The Dormouse's story</b>
The Dormouse's story


<b>123</b>
123

Selecting parents and ancestors

`parent`

To get a node's direct parent, use the `parent` attribute:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">
<p>123</p>
<p>456</p></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output:

<p class="title">
<a href="http://abc.com">
<p>123</p>
<p>456</p></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>

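The output is the entire parent node, including everything inside it.

`parents`

To get all ancestors of the current node rather than only its direct parent, use `parents`.

`parents` returns a generator, which you can convert to a list or walk with a `for` loop: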

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for i in soup.a.parents:
    print(i, end = '\n\n')

Output:

<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>

<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>

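The loop visits the parent, the grandparent, and so on, up to the root. The last two entries look identical because the final one is the `BeautifulSoup` object itself, which prints as the whole document.

Selecting siblings

`next_sibling`

As the name suggests, this returns the next sibling of the current node (the next node at the same level):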

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.next_sibling)
print(soup.p.next_sibling.next_sibling)

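Output:



<p class="text"><b>abcdefg</b></p>

Note that the newline `\n` also counts as a sibling, so the first call, `soup.p.next_sibling`, prints what looks like an empty result (it is actually a newline). `next_sibling` can be chained: `soup.p.next_sibling.next_sibling` returns the second sibling after the current node.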

`next_siblings`

Returns all siblings that follow the current node, as a generator:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
for i in soup.p.next_siblings:
    print(i)

Output:


<p class="text"><b>abcdefg</b></p>


<a href="http://cde.com">123</a>

`previous_sibling`

Returns the sibling immediately before the current node:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.p.next_sibling.previous_sibling)

Output:

<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>

`previous_siblings`

Returns all siblings that precede the current node, nearest first:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
n = soup.p.next_sibling.next_sibling.next_sibling.next_sibling
print(f'n is {n}')
for i in n.previous_siblings:
    print(i)

Output:

n is <a href="http://cde.com">123</a>


<p class="text"><b>abcdefg</b></p>


<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>

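Method selectors

`find_all()`

soup.find_all() returns every element that matches the given conditions:

find_all(name, attrs, recursive, text, **kwargs)

name: select by node (tag) name
attrs: select by attributes, passed as a dict
text: select by text content (newer Beautiful Soup versions call this argument `string` and keep `text` as an older alias)

These arguments can be used on their own or combined, depending on what distinguishes the elements you are looking for.

find_all() is available on both bs4.BeautifulSoup and bs4.element.Tag objects, so calls can be nested (when searching by name or attributes, the elements of the returned list are of type bs4.element.Tag).

Searching by node name

soup.find_all(tag_name) returns all matching nodes as a list: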

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('p'))
print(len(soup.find_all('p')))

Output:

[<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>, <p class="text"><b>abcdefg</b></p>]
2

Searching by attributes

soup.find_all(attrs=attribute_dict) returns all matching nodes. The argument is a dict, and the result is a list:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs = {'class': 'title'}))

Output:

[<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b>123</b>
</p>]

Common attributes such as class and id can be passed directly as keyword arguments. Because `class` is a reserved word in Python, append an underscore and write `class_`:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b id='abc'>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(class_='text'))
print(soup.find_all(id='abc'))

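Output:

[<p class="text"><b>abcdefg</b></p>]
[<b id="abc">123</b>]

Searching by text content

soup.find_all(text=content) returns all text nodes equal to the given content. Note that the list contains only the text itself, not the enclosing tags; use `.parent` to reach the enclosing tag: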

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b id='abc'>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='abcdefg'))
print(soup.find_all(text='abcdefg')[0].parent)

Output:

['abcdefg']
<b>abcdefg</b>
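
The `find_all()` signature above also lists `recursive`, which the examples so far do not exercise. The sketch below shows it together with `limit`, a list of tag names, and a compiled regular expression as the text filter. The HTML is a made-up snippet, `html.parser` is used only to avoid the `lxml` dependency, and the `string` keyword assumes a reasonably recent Beautiful Soup release (older code passes the same value as `text`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = ('<div><p class="title"><b>one</b></p>'
        '<p class="text"><b>abcdefg</b></p>'
        '<a href="#">abc</a></div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# recursive=False only looks at direct children; the <b> tags are grandchildren.
direct_b = div.find_all('b', recursive=False)

# limit stops the search after the given number of matches.
first_p = div.find_all('p', limit=1)

# A list of names matches any of them, in document order.
p_or_a = [tag.name for tag in div.find_all(['p', 'a'])]

# A compiled pattern matches text nodes that contain 'abc' anywhere.
partial = [str(s) for s in soup.find_all(string=re.compile('abc'))]

print(direct_b)
print(first_p)
print(p_or_a)
print(partial)
```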

`find()`

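Unlike find_all(), which returns a list of every matching node, find() returns only the first matching element (or `None` if nothing matches).

find() is available on both bs4.BeautifulSoup and bs4.element.Tag objects, so calls can be nested (when searching by name or attributes, the returned element is of type bs4.element.Tag).

find() takes the same arguments as find_all():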

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b id='abc'>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='p'), end='\n\n')
print(soup.find(name='p', class_='text'))

Output:

<p class="title">
<a href="http://abc.com"></a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>

<p class="text"><b>abcdefg</b></p>

`find_parent()` and `find_parents()`

Any bs4.element.Tag object has a find_parent() method that returns its direct parent. Call it as variable.find_parent():

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id='abc'>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
a1 = soup.find(name='a', attrs={'href': 'http://abc.com'})
print(a1, end='\n\n')
print(a1.find_parent())

Output:

<a href="http://abc.com">123456</a>

<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>

find_parents() returns the element's ancestors level by level, as a list:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id='abc'>123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml')
a1 = soup.find(name='a', attrs={'href': 'http://abc.com'})
print(a1.find_parents())

Output:

[<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>, <body>
<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
<a href="http://abc.com">123456</a>
<b>The Dormouse's story</b>
<b id="abc">123</b>
</p>
<p class="text"><b>abcdefg</b></p>
<a href="http://cde.com">123</a>
</body>
</html>
]

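`find_next_sibling()` and `find_next_siblings()`

These are used the same way as find_parent() and find_parents() above.

find_next_sibling() returns the first matching sibling after the current node.

find_next_siblings() returns all matching siblings after the current node, as a list.

`find_previous_sibling()` and `find_previous_siblings()`

These are used the same way as find_parent() and find_parents() above.

find_previous_sibling() returns the first matching sibling before the current node.

find_previous_siblings() returns all matching siblings before the current node, as a list.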
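
The sibling methods above have no example of their own, so here is a minimal sketch. The HTML is a made-up snippet, `html.parser` is used only to avoid the `lxml` dependency, and a tag name is passed to each call so that the newline text nodes between tags are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = '''<body>
<p class="title">first</p>
<p class="text">second</p>
<a href="http://cde.com">123</a>
</body>'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p', class_='title')
link = soup.find('a')

next_p = first.find_next_sibling('p')             # first <p> after `first`
later = first.find_next_siblings(['p', 'a'])      # every later <p> or <a>
prev_p = link.find_previous_sibling('p')          # nearest <p> before the link
all_prev = link.find_previous_siblings('p')       # all earlier <p>, nearest first

print(next_p)
print([tag.name for tag in later])
print(prev_p)
print([tag.get_text() for tag in all_prev])
```

Unlike the `next_sibling` and `previous_sibling` attributes, these methods filter by the conditions you pass, so there is no need to step over whitespace nodes by hand.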

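CSS selectors

To use CSS selectors, call the select() method and pass it a selector string built from node names, class names, ids, or attribute values.

Basic selectors:

Selector	Syntax
Element selector	`element name`, e.g. `p`
Class selector	`.class-name`, e.g. `.classselect` selects elements whose `class` attribute includes `classselect`
ID selector	`#id`, e.g. `#idselect` selects the element whose `id` attribute is `idselect`

Combinators:

Selector	Purpose	Syntax
Descendant selector	Selects all matching descendants (children, grandchildren, and so on) inside the given element	`element element` (separated by a space), e.g. `div p` selects every `<p>` inside a `<div>`
Child selector	Selects direct children of the given element	`element > element`, e.g. `div > p` selects every `<p>` that is a direct child of a `<div>`
Adjacent sibling selector	Selects the element at the same level that immediately follows the given element	`element + element`, e.g. `h2 + p` selects the `<p>` that immediately follows an `<h2>`
General sibling selector	Selects all later siblings at the same level as the given element	`element ~ element`, e.g. `h2 ~ p` selects every `<p>` that follows an `<h2>` at the same level

Attribute selectors:

Selector	Purpose	Syntax
Equals	Selects elements whose attribute value is exactly the given value	`element[attr=value]`, e.g. `a[target="_blank"]` selects every `<a>` whose `target` is `_blank`
Contains word	Selects elements whose attribute value contains the given whole word	`element[attr~=value]`, e.g. `a[title~="example"]` selects every `<a>` whose `title` contains the word `example`; it must appear as a whitespace-separated word, not just anywhere in the value
Starts with	Selects elements whose attribute value starts with the given substring	`element[attr^=value]`, e.g. `a[href^="https"]` selects every `<a>` whose `href` starts with `https`
Ends with	Selects elements whose attribute value ends with the given substring	`element[attr$=value]`, e.g. `a[href$=".pdf"]` selects every `<a>` whose `href` ends with `.pdf`
Contains substring	Selects elements whose attribute value contains the given substring	`element[attr*=value]`, e.g. `a[href*="example"]` selects every `<a>` whose `href` contains the substring `example`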

from bs4 import BeautifulSoup
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello World</h4>   
    </div>
    
    <div class="panel-body">
        <ul class="list" id="list-1">
           <li class="element">Foo</li>
           <li class="element">Bar</li>
           <li class="element">Jay</li>
        </ul>
        
        <ul class="list list-small" id="list-2">
           <li class="element">Foo</li>
           <li class="element">Bar</li>
           <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'), end='\n\n')  # elements with class panel-heading inside an element with class panel
print(soup.select('ul li'), end='\n\n')  # li nodes inside ul
print(soup.select('#list-2 li'), end='\n\n')  # li nodes inside the element with id list-2

Output:

[<div class="panel-heading">
<h4>Hello World</h4>
</div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

The elements in the list returned by select() are of type bs4.element.Tag, so selections can be nested as well.
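
The example above only uses descendant selectors. The sketch below tries the child, sibling, and attribute selectors from the tables on a made-up HTML snippet (the URLs and class names are illustrative), again with `html.parser` so it runs without `lxml`. It also uses `select_one()`, which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = '''
<div id="main">
  <h2>Docs</h2>
  <p class="intro">first</p>
  <p>second</p>
  <a href="https://example.com/a.pdf" title="example file">pdf</a>
  <a href="http://example.com/page">page</a>
  <section><p>nested</p></section>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.get_text() for tag in soup.select(selector)]

direct = texts('div > p')              # direct children only, not the nested <p>
adjacent = texts('h2 + p')             # the <p> right after <h2>
general = texts('h2 ~ p')              # every later sibling <p> of <h2>
https_links = texts('a[href^="https"]')
pdf_links = texts('a[href$=".pdf"]')
substring = texts('a[href*="example"]')
word = texts('a[title~="example"]')
intro = soup.select_one('p.intro').get_text()

print(direct, adjacent, general)
print(https_links, pdf_links, substring, word)
print(intro)
```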

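Regular expressions

Regular expression rules

Matching ordinary characters: letters, digits, Chinese characters, underscores, and punctuation with no special meaning are ordinary characters. To match one, write the character itself.

Matching special characters: escape them with `\`.

Expression	Matches	Expression	Matches
`\n`	A newline	`\?`	A literal `?`
`\t`	A tab	`\*`	A literal `*`
`\\`	A literal `\`	`\+`	A literal `+`
`\^`	A literal `^`	`\{`, `\}`	Curly braces
`\$`	A literal `$`	`\[`, `\]`	Square brackets
`\.`	A literal `.`	`\(`, `\)`	Parentheses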
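
To see why escaping matters, the short sketch below compares an unescaped `.` with an escaped one and uses `re.escape()` from the standard library to escape a literal string automatically. The sample strings are made up for illustration.

```python
import re

text = 'Price: $5.00 (approx.) for a+b?'

# An unescaped '.' matches any character, so '5.00' also matches '5x00'.
loose = re.findall(r'5.00', '5.00 5x00')
strict = re.findall(r'5\.00', '5.00 5x00')

# '$', '(' and ')' must be escaped to be matched literally.
dollar = re.search(r'\$5\.00', text).group()
parens = re.search(r'\(approx\.\)', text).group()

# re.escape() adds the backslashes for you.
escaped = re.escape('a+b?')
found = re.search(escaped, text).group()

print(loose, strict)
print(dollar, parens)
print(escaped, found)
```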

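Matching classes of characters:

Expression	Matches
`\d`	Any single digit (0~9)
`\w`	Any single digit, letter, or underscore (0~9, A~Z, a~z, _); for `str` patterns in Python 3 it also matches other Unicode word characters, such as Chinese characters
`\s`	Any single whitespace character, such as a space, tab, or form feed
`.`	Any single character except the newline `\n`
`\D`	The opposite of `\d`
`\W`	The opposite of `\w`
`\b`	A word boundary, that is, the position between a word character and a non-word character (or the start or end of the string). For example, `er\b` matches the `er` in `never` but not the `er` in `verb`
`\B`	The opposite of `\b`: a position that is not a word boundary. For example, `er\B` matches the `er` in `verb` but not the `er` in `never`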
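
The classes in the table can be checked directly with `re.findall()` and `re.search()`. This is a small sketch with made-up inputs; the `never`/`verb` strings are the ones used in the table.

```python
import re

digits = re.findall(r'\d', 'a1b22')          # every single digit
non_digits = re.findall(r'\D', 'a1b22')      # everything that is not a digit
words = re.findall(r'\w+', 'foo_bar 42!')    # runs of word characters
spaces = re.findall(r'\s', 'a b\tc\n')       # space, tab, newline
no_newline = re.findall(r'.', 'a\nb')        # '.' skips the newline

# \b needs a word boundary after 'er'; \B needs the opposite.
boundary_never = re.search(r'er\b', 'never') is not None
boundary_verb = re.search(r'er\b', 'verb') is not None
inner_verb = re.search(r'er\B', 'verb') is not None
inner_never = re.search(r'er\B', 'never') is not None

print(digits, non_digits, words, spaces, no_newline)
print(boundary_never, boundary_verb, inner_verb, inner_never)
```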

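Custom character sets: square brackets `[ ]` enclose a set of characters and match any one of them; `[^ ]` matches any one character that is not in the set; `-` abbreviates a range of characters.

Expression	Matches
`[ab@6]`	Any one of `a`, `b`, `@`, `6`
`[^abc]`	Any one character other than `a`, `b`, `c`
`[c-f]`	Any one letter from `c` to `f`
`[^A-F0-3]`	Any one character outside `A` to `F` and `0` to `3`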
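
Each row of the table can be tried as-is. The input strings in this sketch are made up for illustration.

```python
import re

any_of = re.findall(r'[ab@6]', 'cab@123456')     # only a, b, @ and 6 survive
none_of = re.findall(r'[^abc]', 'abcdef')        # everything except a, b, c
in_range = re.findall(r'[c-f]', 'abcdefg')       # letters from c to f
outside = re.findall(r'[^A-F0-3]', 'AG05z')      # not A-F and not 0-3

print(any_of)
print(none_of)
print(in_range)
print(outside)
```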

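Quantifiers (number of repetitions):

Expression	Matches
`{n}`	The preceding expression repeated $n$ times, e.g. `\w{2}` is the same as `\w\w` and `a{5}` is the same as `aaaaa`
`{m,n}`	The preceding expression repeated at least $m$ and at most $n$ times, e.g. `ba{1,3}` matches `ba`, `baa`, or `baaa`
`{m,}`	The preceding expression repeated at least $m$ times, e.g. `\w\d{2,}` matches `a12`, `_456`, `M12345`, and so on
`?`	The preceding expression 0 or 1 times, same as `{0,1}`, e.g. `a[cd]?` matches `a`, `ac`, `ad`
`+`	The preceding expression at least once, same as `{1,}`, e.g. `a+b` matches `ab`, `aab`, `aaab`, ...
`*`	The preceding expression zero or more times, same as `{0,}`, e.g. `\^*b` matches `b`, `^b`, `^^b`, ...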
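
A quick way to test a quantifier is `re.fullmatch()`, which succeeds only if the whole string matches the pattern. The helper name `full` below is made up for this sketch; the patterns come from the table.

```python
import re

def full(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    'w2': full(r'\w{2}', 'ab'),
    'baa': full(r'ba{1,3}', 'baa'),
    'baaaa': full(r'ba{1,3}', 'baaaa'),     # four a's exceed the upper bound
    'M12345': full(r'\w\d{2,}', 'M12345'),
    'a1': full(r'\w\d{2,}', 'a1'),          # only one digit, needs at least two
    'a': full(r'a[cd]?', 'a'),
    'acd': full(r'a[cd]?', 'acd'),          # '?' allows at most one of c or d
    'aaab': full(r'a+b', 'aaab'),
    'b_plus': full(r'a+b', 'b'),            # '+' needs at least one a
    'b_star': full(r'\^*b', 'b'),
    'caret': full(r'\^*b', '^^b'),
}
print(results)
```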

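Special symbols (anchors, alternation, and grouping):

Expression	Matches
`^`	The start of the string; it does not consume any character
`$`	The end of the string; it does not consume any character
`|`	Logical "or", e.g. `cat|dog` matches `cat` or `dog`; `|` can also be combined with `()`, e.g. `(cat|dog)(food|toy)` matches `catfood`, `cattoy`, `dogfood`, `dogtoy`
`()`	Groups the enclosed expression and captures what it matched; usually used together with `.group()` or `.groups()`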
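
The anchors and the grouped alternation from the table can be sketched as follows; the input strings are made up for illustration.

```python
import re

url = 'www.example.com'
starts_with_www = re.search(r'^www', url) is not None
starts_with_com = re.search(r'^com', url) is not None   # 'com' is not at the start
ends_with_com = re.search(r'com$', url) is not None

# Alternation inside groups: each group picks one of its alternatives.
pattern = r'(cat|dog)(food|toy)'
pairs = [m.group() for m in re.finditer(pattern, 'catfood dogtoy catnip')]
parts = re.search(pattern, 'my dogfood').groups()

print(starts_with_www, starts_with_com, ends_with_com)
print(pairs)
print(parts)
```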

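Greedy and non-greedy quantifiers: quantifiers such as `?`, `+`, and `*` let the same expression match a varying number of times, depending on the string being matched. By default such an expression matches as many characters as possible; this is called "greedy" matching. For example, against the text `dxxdxxd`, the pattern `d.*d` matches the whole string `dxxdxxd`.
Appending `?` to a quantifier makes it non-greedy, so it matches as few characters as possible (against the same text, `d.*?d` matches only `dxxd`):
- `*?`: matches 0 or more times, but as few as possible
- `+?`: matches 1 or more times, but as few as possible
- `??`: matches 0 or 1 times, but as few as possible
- `{n,m}?`: matches at least n and at most m times, but as few as possible

For example: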

import re
string = r'<div>Hello, <span>world</span></div><div>456</div>'
pattern1 = r'<div>.*</div>'
pattern2 = r'<div>.*?</div>'

# Greedy
print('Greedy:')
match1 = re.search(pattern1, string)
if match1:
    print(match1.group())

# Non-greedy
print('\nNon-greedy:')
match2 = re.search(pattern2, string)
if match2:
    print(match2.group())

Output:

Greedy:
<div>Hello, <span>world</span></div><div>456</div>

Non-greedy:
<div>Hello, <span>world</span></div>

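Using regular expressions

To use regular expressions in Python, import the re module:

import re

`re.match()` function

re.match() tries to match a pattern at the start of the string. If the pattern does not match at the start, it returns None.

re.match(pattern, string, flags=0)

pattern: the regular expression to match
string: the string to match against
flags: flags that control how the pattern is matched, such as case sensitivity or multi-line matching
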
import re
print(re.match('www', 'www.example.com'))
print(re.match('com', 'www.example.com'))

Output:

<re.Match object; span=(0, 3), match='www'>
None
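
Because re.match() returns None on failure, calling `.group()` on the result without checking it first raises an AttributeError. The sketch below wraps the check in a helper; the function name `leading_scheme` and the URLs are made up for illustration, and `re.I` is passed through the `flags` parameter described above.

```python
import re

def leading_scheme(url):
    """Return the lowercase URL scheme if the string starts with one, else None."""
    m = re.match(r'([a-z]+)://', url, flags=re.I)  # re.I: ignore case
    return m.group(1).lower() if m else None

with_scheme = leading_scheme('HTTPS://example.com')
without_scheme = leading_scheme('www.example.com')

print(with_scheme)
print(without_scheme)
```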

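On success, re.match() returns an re.Match object. Its group(num) and groups() methods return the matched text, with each parenthesized part of the pattern available as a separate group.

Match object method	Description
`group(num=0)`	Returns the matched text. `group(0)` is the entire matched substring; `group(num)` is the content of the `num`-th pair of parentheses; `group()` also accepts several group numbers at once, in which case it returns a tuple of the corresponding groups
`groups()`	Returns a tuple containing the text of every group, from group 1 up to the last group

For example: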

import re
html = """
<html>
<body>
<p>Hello!</p>
<p>Hi!</p>
</body>
</html>
"""
a = re.match(r'\s.+\s.+\s<p>(.*)</p>\s<p>(.*)</p>', html)
print(f'a.groups() = {a.groups()}')
print(f'a.group(0) = {a.group()}')
print(f'a.group(1) = {a.group(1)}')
print(f'a.group(2) = {a.group(2)}')

Output:

a.groups() = ('Hello!', 'Hi!')
a.group(0) = 
<html>
<body>
<p>Hello!</p>
<p>Hi!</p>
a.group(1) = Hello!
a.group(2) = Hi!
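
As noted in the table, group() can take several group numbers at once. A minimal sketch with a made-up date string:

```python
import re

m = re.match(r'(\d{4})-(\d{2})-(\d{2})', '2025-03-17 release')

whole = m.group()            # same as m.group(0): the entire match
year_and_day = m.group(1, 3) # several numbers return a tuple
all_groups = m.groups()      # every group, starting from group 1

print(whole)
print(year_and_day)
print(all_groups)
```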

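Besides group() and groups(), a match object also provides:

start(): the start position of a group's match within the whole string (the index of its first character); the group argument defaults to 0
end(): the end position of a group's match within the whole string (the index of its last character plus 1); the group argument defaults to 0
span(): returns (start(group), end(group))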

import re
html = """
<html>
<body>
<p>Hello!</p>
<p>Hi!</p>
</body>
</html>
"""
a = re.match(r'\s.+\s.+\s<p>(.*)</p>\s<p>(.*)</p>', html)
print(f'len(html) = {len(html)}')
print(f'a.span() = {a.span()}')
print(f'a.span(1) = {a.span(1)}')
m, n = a.span(1)
for i in range(m, n):
    print(html[i], end='')
print()
print(f'a.span(2) = {a.span(2)}')
print(f'a.start(2) = {a.start(2)}')
print(f'a.end(2) = {a.end(2)}')

Output:

len(html) = 56
a.span() = (0, 39)
a.span(1) = (18, 24)
Hello!
a.span(2) = (32, 35)
a.start(2) = 32
a.end(2) = 35

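`re.search()` function

re.match() only matches at the start of the string: if the start does not fit the pattern, the match fails and the function returns None. re.search(), in contrast, scans the whole string and returns the first match it finds.

re.search() is called the same way as re.match():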

import re
html = """
<html>
<body>
<p>Hello!</p>
<p>Hi!</p>
</body>
</html>
"""
a = re.search('<p>(.*)</p>', html)
print(a.groups())

Output:

('Hello!',)
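
The difference between the two functions is easiest to see on the same input. In this sketch (inputs made up for illustration), re.match() fails because `com` is not at the start, while re.search() finds it further along; re.search() also stops at the first match even when more exist.

```python
import re

text = 'www.example.com'

match_result = re.match(r'com', text)          # None: 'com' is not at index 0
search_span = re.search(r'com', text).span()   # found later in the string

# search() returns only the first match, even though '22' also matches.
first_number = re.search(r'\d+', 'a1b22').group()

print(match_result)
print(search_span)
print(first_number)
```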

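`re.sub()` function

re.sub() replaces matches in a string.

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: the regular expression pattern string
repl: the replacement string; it can also be a function
string: the original string to search and replace in
count: the maximum number of replacements; the default 0 replaces every match

re.sub() does not need the match to start at the beginning of the string: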

import re
html = """
<html>
<body>
<p>Hello!</p>
<p>Hi!</p>
</body>
</html>
"""
a = re.sub('<p>.*</p>', '', html)
print(f'a:\n{a}')

b = re.sub('<p>.*</p>', '', html, count=1)  # replace only the first match
print(f'b:\n{b}')

Output:

a:

<html>
<body>


</body>
</html>

b:

<html>
<body>

<p>Hi!</p>
</body>
</html>
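
The parameter list mentions that `repl` can be a function, which the example above does not show. In the sketch below the function receives each match object and returns the replacement text; the helper name `double` and the input strings are made up for illustration. The second call uses the backreferences `\1` and `\2` in a replacement string to swap two captured groups.

```python
import re

def double(match):
    """Replace a matched number with twice its value."""
    return str(int(match.group()) * 2)

text = 'width=20 height=10'

doubled = re.sub(r'\d+', double, text)               # repl is a function
swapped = re.sub(r'(\w+)=(\d+)', r'\2=\1', text)     # repl uses backreferences
once = re.sub(r'\s+', ' ', 'a  b \n c', count=1)     # only the first run is replaced

print(doubled)
print(swapped)
print(repr(once))
```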

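`re.compile()` function

re.compile() compiles a regular expression into a pattern (Pattern) object, whose match(), search(), and other methods can then be used.

re.compile(pattern[, flags])

pattern: a regular expression in string form
flags: optional matching mode, such as ignoring case or multi-line mode. To set several flags, combine them with the bitwise OR operator `|`. The available flags are:
- re.I: value 2, ignore case
- re.L: value 4, makes \w, \W, \b, \B, \s, \S depend on the current locale (in Python 3 it can only be used with bytes patterns)
- re.M: value 8, multi-line mode, affects ^ and $
- re.S: value 16, makes . match every character, including the newline
- re.U: value 32, makes \w, \W, \b, \B, \d, \D, \s, \S follow the Unicode character database (already the default for str patterns in Python 3)
- re.X: value 64, ignores whitespace and comments after # in the pattern, to improve readability

The pattern object returned by re.compile() provides match(), search(), findall(), and other methods, each of which can also be given the positions where the search starts and ends:

pattern.match(string[, pos[, endpos]])
pattern.search(string[, pos[, endpos]])
pattern.findall(string[, pos[, endpos]])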

import re
pattern = re.compile(r'\d+')  # one or more digits
m = pattern.match('one12twothree34four')
print(m)
n = pattern.match('one12twothree34four', 3, 10)  # match from index 3 up to index 10; the first character is at index 0
print(n)

Output:

None
<re.Match object; span=(3, 5), match='12'>

Setting flags:

import re
pattern = re.compile('([a-z]+) ([a-z]+)', re.I)  # ignore case
m = pattern.match('Hello World Wide Web')
print(m.group())

Output:

Hello World
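
Flags can be combined with `|`, as described in the flag list above. The sketch below, using a made-up multi-line string, shows `re.I | re.M` together, the effect of `re.S` on `.`, and a commented pattern compiled with `re.X`.

```python
import re

text = 'Hello\nworld\nHELLO again'

# re.M lets ^ match at the start of every line; re.I ignores case.
both = re.findall(r'^hello', text, re.I | re.M)
only_ignorecase = re.findall(r'^hello', text, re.I)

# Without re.S, '.' does not match the newline between the two words.
no_dotall = re.search(r'Hello.world', text)
dotall = re.search(r'Hello.world', text, re.S)

# re.X ignores whitespace and '#' comments inside the pattern.
verbose = re.compile(r'''
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
''', re.X)
year_month = verbose.match('2025-03').groups()

print(both, only_ignorecase)
print(no_dotall, dotall.group() == 'Hello\nworld')
print(year_month)
```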

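`re.findall()` function

Finds every substring that matches the regular expression and returns them as a list. If the pattern contains more than one capturing group, it returns a list of tuples. If nothing matches, it returns an empty list.

re.findall(pattern, string, flags=0)

pattern: required, the regular expression pattern string
string: required, the string to search
flags: optional, the matching mode

For example: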

import re
result = re.findall(r'\d+', 'runoob 123 google 456')
print(result)

Output:

['123', '456']

Capturing specific groups with parentheses: use parentheses to extract particular parts of each match.

import re
result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)

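Output:

[('width', '20'), ('height', '10')]

findall() is also available as a method of a pattern object created by re.compile():

findall(string[, pos[, endpos]])

string: the string to search
pos: optional, the position in the string where the search starts; defaults to 0
endpos: optional, the position in the string where the search ends; defaults to the length of the string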

import re
pattern = re.compile(r'\d+')

result1 = pattern.findall('runoob 123 google 456')
result2 = pattern.findall('run88oob123google456', 0, 10)  # index 10 is excluded, so the range is indexes 0~9

print(result1)
print(result2)

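Output:

['123', '456']
['88', '12']

`re.finditer()` function

Like re.findall(), this finds every substring that matches the regular expression, but it returns them as an iterator of match objects.

re.finditer(pattern, string, flags=0)

pattern: required, the regular expression pattern string
string: required, the string to search
flags: optional, the matching mode

For example: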

import re

it = re.finditer(r'\d+', '12a32bc43jf3')

for match in it:
    print(match.group())

Output:

12
32
43
3
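
Since re.finditer() yields match objects rather than plain strings, each result also carries its position. A minimal sketch on the same input string, collecting the text and span of every match:

```python
import re

# Each item is a match object, so span() is available alongside group().
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '12a32bc43jf3')]

for text, span in found:
    print(text, span)
```

This is the main reason to prefer finditer() over findall() when the location of each match matters.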

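`re.split()` function

re.split() splits the string at every substring that matches the pattern and returns the pieces as a list.

re.split(pattern, string[, maxsplit=0, flags=0])

pattern: the regular expression to match
string: the string to split
maxsplit: the maximum number of splits; maxsplit=1 splits once, and the default 0 means no limit
flags: flags that control how the pattern is matched

For example: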

import re

result1 = re.split(r'\W+', 'ab12, cd34. ef56!')
print(result1)

Output:

['ab12', 'cd34', 'ef56', '']

With a capturing group, the separators are included in the result:

import re
result2 = re.split(r'(\W+)', 'ab12, cd34. ef56!')
print(result2)

Output:

['ab12', ', ', 'cd34', '. ', 'ef56', '!', '']

Setting maxsplit:

import re

result3 = re.split(r'\W+', 'ab12, cd34. ef56!', maxsplit=1)  # split only once
print(result3)

Output:

['ab12', 'cd34. ef56!']

Python Web Scraping

https://blog.shinebook.net/2025/03/17/编程/Python/爬虫/python爬虫/

Published on March 17, 2025