Using PycURL Instead of requests in Python for Faster Web Scraping

When writing crawlers, the requests library can sometimes feel slow. For large-scale crawling, using PycURL can noticeably improve efficiency.

A Quick Introduction to PycURL

The official example below works with Python 3:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, buffer)
c.perform()
c.close()

body = buffer.getvalue()
# Body is a byte string.
# We have to know the encoding in order to print it to a text file
# such as standard output.
print(body.decode('iso-8859-1'))

Source: http://pycurl.io/docs/latest/quickstart.html#retrieving-a-network-resource
You can also set many other options, such as a proxy or skipping SSL certificate verification. The full list of options is documented at
https://curl.haxx.se/libcurl/c/curl_easy_setopt.html
When setting an option in PycURL, drop the CURLOPT_ prefix: CURLOPT_URL, for example, is written as URL. A small sketch of a few common options follows below.
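
Here is a minimal sketch of setting a few of those options; the proxy address 127.0.0.1:8080 is only a placeholder, not a real proxy:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'http://pycurl.io/')
c.setopt(c.WRITEDATA, buffer)
# CURLOPT_PROXY becomes PROXY; substitute your own proxy address here
c.setopt(c.PROXY, 'http://127.0.0.1:8080')
# CURLOPT_SSL_VERIFYPEER / CURLOPT_SSL_VERIFYHOST become SSL_VERIFYPEER / SSL_VERIFYHOST;
# setting both to 0 skips SSL certificate verification
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
# CURLOPT_FOLLOWLOCATION becomes FOLLOWLOCATION; follow HTTP redirects
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()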

Next, let's benchmark PycURL against requests. The test fetches the Baidu homepage 1000 times; the code is as follows:

import requests
import time
import pycurl
from io import BytesIO

url = 'http://www.baidu.com/'

# Reuse a single Curl handle for all PycURL requests.
c = pycurl.Curl()
data = BytesIO()
c.setopt(c.DNS_USE_GLOBAL_CACHE, True)  # legacy DNS-cache option; kept from the original test
c.setopt(c.URL, url)
c.setopt(c.WRITEFUNCTION, data.write)

def pycurl_crawl():
    c.perform()

def requests_crawl():
    r = requests.get(url)

t1 = time.time()
for i in range(1000):
    requests_crawl()
print("Using requests to crawl Baidu 1000 times took", time.time() - t1)

t2 = time.time()
for i in range(1000):
    pycurl_crawl()
print("Using PycURL to crawl Baidu 1000 times took", time.time() - t2)

c.close()

The test results are as follows (screenshot of the timing output).
The speedup is considerable. While crawling with PycURL, my 50 Mbps connection was essentially saturated; on a 100 Mbps connection you might be able to squeeze out a bit more.
There is a more detailed discussion of this comparison on Stack Overflow:
https://stackoverflow.com/questions/15461995/python-requests-vs-pycurl-performance
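
Note that the benchmark above reuses a single pycurl.Curl handle while creating a fresh connection for every requests.get call. If you want a fairer comparison on the requests side, reusing one Session (so connections are kept alive) is worth testing; a minimal sketch, assuming the same url as in the benchmark code:

import requests

url = 'http://www.baidu.com/'

# A single Session reuses the underlying TCP connection (keep-alive),
# which is the closer analogue of reusing one pycurl.Curl handle.
session = requests.Session()

def requests_session_crawl():
    r = session.get(url)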