Using PycURL Instead of requests in Python for Faster Web Scraping

When writing crawlers, the requests library can sometimes feel slow. For large-scale crawling, using PycURL can noticeably improve efficiency.

A Quick Introduction to PycURL

The official example below works with Python 3:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()

body = buffer.getvalue()
# Body is a byte string.
# We have to know the encoding in order to print it to a text file
# such as standard output.
print(body.decode('iso-8859-1'))

Source: http://pycurl.io/docs/latest/quickstart.html#retrieving-a-network-resource
You can also set many other options, such as a proxy or skipping SSL certificate verification. The full list of options is documented at
https://curl.haxx.se/libcurl/c/curl_easy_setopt.html
When setting an option in PycURL, drop the CURLOPT_ prefix: CURLOPT_URL, for example, is written as URL. A small sketch of a few common options follows below.
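
Here is a minimal sketch of setting a few of those options; the proxy address 127.0.0.1:8080 is only a placeholder, not a real proxy:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, buffer)
# CURLOPT_PROXY becomes PROXY; substitute your own proxy address here
c.setopt(c.PROXY, 'http://127.0.0.1:8080')
# CURLOPT_SSL_VERIFYPEER / CURLOPT_SSL_VERIFYHOST become SSL_VERIFYPEER / SSL_VERIFYHOST;
# setting both to 0 skips SSL certificate verification
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
# CURLOPT_FOLLOWLOCATION becomes FOLLOWLOCATION; follow HTTP redirects
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()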

Next, let's benchmark PycURL against requests. The test fetches the Baidu homepage 1000 times; the code is as follows:

import requests
import time
import pycurl
from io import BytesIO

url = 'http://www.baidu.com/'

# Reuse a single Curl handle for all PycURL requests.
c = pycurl.Curl()
data = BytesIO()
c.setopt(c.DNS_USE_GLOBAL_CACHE, True)  # legacy DNS-cache option; kept from the original test
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, data.write)

def pycurl_crawl():
    c.perform()

def requests_crawl():
    r = requests.get(url)

t1 = time.time()
for i in range(1000):
    requests_crawl()
print("Using requests to crawl Baidu 1000 times took", time.time() - t1)

t2 = time.time()
for i in range(1000):
    pycurl_crawl()
print("Using PycURL to crawl Baidu 1000 times took", time.time() - t2)

c.close()

The test results are as follows (screenshot of the timing output).
The speedup is considerable. While crawling with PycURL, my 50 Mbps connection was essentially saturated; on a 100 Mbps connection you might be able to squeeze out a bit more.
There is a more detailed discussion of this comparison on Stack Overflow:
https://stackoverflow.com/questions/15461995/python-requests-vs-pycurl-performance
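
Note that the benchmark above reuses a single pycurl.Curl handle while creating a fresh connection for every requests.get call. If you want a fairer comparison on the requests side, reusing one Session (so connections are kept alive) is worth testing; a minimal sketch, assuming the same url as in the benchmark code:

import requests

url = 'http://www.baidu.com/'

# A single Session reuses the underlying TCP connection (keep-alive),
# which is the closer analogue of reusing one pycurl.Curl handle.
session = requests.Session()

def requests_session_crawl():
    r = session.get(url)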