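Python selenium + chromedriver: installation and scraping examples

Some materials to read

Scraping Sci-Hub with Requests + BS4

Examples

Xilock wrote an example; see here.

Erfec wrote a CNKI (China National Knowledge Infrastructure) crawler that batch-downloads papers in PDF format and implements the PDF download feature. Xilock did not test it because off-campus VPN access to CNKI is unstable, but it is worth a look.

References

Scraping CNKI with Selenium + Chromedriver

Installation

Installing selenium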

pip install selenium

pip sometimes misbehaves; if it fails, install with conda instead:

conda install selenium

Installing chromedriver

The chromedriver version must match the Chrome version, otherwise it will not work (for example, “92.0.4515.107_chrome32_stable_windows_installer.exe” matches “chromedriver_win32_92.0.4515.107”; download link and extraction code: ktwj).
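The version rule above can be checked in code before launching anything. The following is a minimal sketch, assuming that two version strings "match" when their first three dot-separated components are equal; the helper names `build_prefix` and `versions_match` are made up for illustration.

```python
def build_prefix(version):
    """Return the first three numeric components of a dotted version string."""
    parts = version.strip().split(".")
    return tuple(int(p) for p in parts[:3])


def versions_match(chrome_version, driver_version):
    """True when Chrome and chromedriver agree on major, minor and build numbers."""
    return build_prefix(chrome_version) == build_prefix(driver_version)


if __name__ == "__main__":
    # The pair quoted in the text: both are 92.0.4515.107.
    print(versions_match("92.0.4515.107", "92.0.4515.107"))
```

If the check returns False, downloading a chromedriver whose version prefix equals that of the installed Chrome is the likely fix.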

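There are two download sites: (1) http://chromedriver.storage.googleapis.com/index.html ; (2) https://npm.taobao.org/mirrors/chromedriver/
After downloading, unzip the archive, copy chromedriver.exe into the Chrome installation directory (C:\Program Files (x86)\Google\Chrome\Application\), and add that directory (C:\Program Files (x86)\Google\Chrome\Application\) to the PATH environment variable.
To check the installation, run chromedriver in cmd, or run the code below and see whether a browser window opens automatically: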

from selenium import webdriver
import time

def main():
    b = webdriver.Chrome()
    b.get('https://www.baidu.com')
    time.sleep(5)
    b.quit()

if __name__ == '__main__':
    main()
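The command-line check can also be scripted, which is handy when no browser window should open. This is a minimal sketch using only the standard library; the function names are illustrative. It looks for chromedriver on PATH and, if found, asks it for its version.

```python
import shutil
import subprocess


def find_chromedriver():
    """Return the full path of chromedriver on PATH, or None if it is absent."""
    return shutil.which("chromedriver")


def chromedriver_version(path):
    """Run `chromedriver --version` and return its output as one stripped string."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(path)
        print(chromedriver_version(path))
```

A result of None usually means the directory holding chromedriver.exe has not been added to PATH, or that the terminal was opened before PATH was changed.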

Running this test does, however, print a warning that has no practical effect:

[27280:10064:0513/222542.826:ERROR:device_event_log_impl.cc(214)] [22:25:42.825] Bluetooth: bluetooth_adapter_winrt.cc:1204 Getting Radio failed. Chrome will be unable to change the power state by itself.
[27280:10064:0513/222542.847:ERROR:device_event_log_impl.cc(214)] [22:25:42.847] Bluetooth: bluetooth_adapter_winrt.cc:1282 OnPoweredRadioAdded(), Number of Powered Radios: 1
[27280:10064:0513/222542.848:ERROR:device_event_log_impl.cc(214)] [22:25:42.848] Bluetooth: bluetooth_adapter_winrt.cc:1297 OnPoweredRadiosEnumerated(), Number of Powered Radios: 1

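The main reason, quoting the source: “These errors are the direct impact of the changes incorporated with google-chrome”. For the fix, see here.

The explanation is in “Selenium Chrome Driver: Resolve Error Messages Regarding Registry Keys and Experimental Options”.

In short: add the experimental option excludeSwitches: ['enable-logging'].

Since the warning does not affect usage, you can simply ignore it, or try updating things (Xilock has not tried that). Being a bit obsessive, Xilock decided to suppress the error, which led to this code: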

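from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

def main():
    options = webdriver.ChromeOptions()
    # Exclude the 'enable-logging' switch so Chrome stops printing the log lines shown above.
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    # Selenium 4 takes the driver path through a Service object;
    # on Selenium 3, pass executable_path=... to webdriver.Chrome instead.
    service = Service(executable_path=r'C:\Users\Xilock\AppData\Local\Google\Chrome\Application\chromedriver.exe')
    b = webdriver.Chrome(service=service, options=options)

    b.get('https://www.baidu.com')
    time.sleep(5)  # keep the window open long enough to see the page
    b.quit()

if __name__ == '__main__':
    main()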

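CNKI scraping examples

I wrote several cases for scraping CNKI literature, but the XPaths change fairly often, so you may need to update them by hand regularly. I also recorded a demo of scraping CNKI with selenium.

Note: XPaths can be obtained with the Edge extensions “SelectorsHub - XPath Plugin” or “Xpath finder”.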
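Because the XPaths need frequent manual updates, it may help to keep them in one place rather than scattered through the script. The sketch below is one possible layout, not the code used in the cases above. Every XPath in the table is a hypothetical placeholder, not CNKI's real page structure, and `xpath_for` and `wait_for` are illustrative names.

```python
# Hypothetical XPaths: placeholders only, to be replaced with values
# copied from the browser extension for the page being scraped.
XPATHS = {
    "search_box": "//input[@id='search-box']",
    "search_button": "//button[@id='search-button']",
    "result_links": "//table[@id='results']//a",
}


def xpath_for(name):
    """Look up an XPath by logical name, failing loudly on unknown names."""
    try:
        return XPATHS[name]
    except KeyError:
        raise KeyError(f"no XPath registered for {name!r}") from None


def wait_for(driver, name, timeout=10):
    """Wait until the element registered under `name` is present, then return it."""
    # Imported here so the lookup table above can be used without selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, xpath_for(name))
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )
```

With this layout, a page change only requires editing the `XPATHS` table. A call such as `wait_for(b, "search_box").send_keys("keyword")` stays the same.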

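代码爱小菜鸡 wrote “Automated Scraping of CNKI Data with Python (Batch-Downloading PDF Papers)” (《用Python自动化爬取CNKI知网数据（批量下载PDF论文）》), but Xilock's off-campus IP makes it inconvenient to test.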

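Scraping exercises to practice

(1) requests + regular expressions: scrape a static web page (preferably one that takes a search keyword), and add multiprocessing, database storage, and file download (images and text)
(2) requests + lxml + xpath: scrape a static web page; otherwise the same as (1)
(3) requests + bs4 + css/xpath: scrape a static web page; otherwise the same as (1)
(4) requests + pyquery + css: scrape a static web page; otherwise the same as (1)
(5) selenium + PhantomJS: scrape a static web page; otherwise the same as (1)
(6) pyspider + selenium + PhantomJS: scrape a static web page; otherwise the same as (1) (using pyspider on a static page feels like overkill)
(7) scrapy: scrape a dynamic web page; otherwise the same as (1)
(8) Find a site that blocks by IP and cookies (Weibo, for example), scrape it with scrapy, put several pipelines to use, then add distributed crawling (three cloud servers are enough: one to dispatch tasks, two to crawl); otherwise the same as (1)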
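As a starting point for exercise (1), here is a minimal sketch of the fetch-and-parse part only; multiprocessing, database storage and file download are left out. The query parameter name `q`, the example.com URLs and the helper names are assumptions for illustration, and the regular expression is only meant for simple static pages.

```python
import re
from urllib.parse import urlencode, urljoin

# Matches the href value of <a> tags; adequate for simple static pages only.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def build_search_url(base_url, keyword):
    """Append the keyword as a query parameter (the name 'q' is an assumption)."""
    return f"{base_url}?{urlencode({'q': keyword})}"


def extract_links(html, page_url):
    """Return an absolute URL for every link found in the HTML text."""
    return [urljoin(page_url, href) for href in LINK_RE.findall(html)]


def fetch(url):
    """Download a page with requests and return its text."""
    import requests  # third-party; imported lazily so the helpers above work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Offline demonstration on an inline sample instead of a live site.
    sample = '<a href="/paper/1.html">Paper 1</a>'
    print(build_search_url("https://example.com/search", "selenium"))
    print(extract_links(sample, "https://example.com/list"))
```

For a real page, `extract_links(fetch(url), url)` would chain the two steps. Exercises (2) to (4) replace the regular expression with lxml, bs4 or pyquery while keeping the same shape.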

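References

Scraping files

I wrote a small Python crawler that handles two-level sites (index page, then click through to a detail page, then click to download); it is on GitHub.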
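The two-level idea can be sketched with the standard library alone. This is not the crawler from the GitHub repository, only an illustration of the traversal. `LinkCollector`, `collect_links` and `find_downloads` are illustrative names, and the file extensions are assumptions. `get_html` is any callable that maps a URL to HTML text, for example a thin wrapper around `requests.get`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Parse the HTML text and return all links as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def find_downloads(index_url, get_html, extensions=(".pdf", ".zip")):
    """Walk index page -> detail pages -> links that look like downloadable files."""
    downloads = []
    for detail_url in collect_links(get_html(index_url), index_url):
        for link in collect_links(get_html(detail_url), detail_url):
            if link.lower().endswith(extensions):
                downloads.append(link)
    return downloads
```

Passing the page loader in as a parameter keeps the traversal testable with a plain dictionary of pages. The actual download step would then loop over the returned URLs.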

References:
