Writing an image scraper in Python: a web crawler that downloads a website's images

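Version 1.5

This release simply adds four-way multithreading (my CPU has four cores), and the speed shot up. I had planned to try XPath, but it turned out to be more trouble than it was worth, so I'm not using it for now.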
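As a rough illustration of the four-way threading idea, here is a minimal sketch in Python 3 (the script itself targets Python 2). `fake_download` and `run_all` are hypothetical names invented for this example; the sleep stands in for network I/O so the snippet runs without touching any site.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API
import time


def fake_download(name):
    """Hypothetical stand-in for a download: sleeps instead of doing network I/O."""
    time.sleep(0.05)
    return "done: " + name


def run_all(names, workers=4):
    # Pool(workers) starts that many worker threads; map blocks until every item is done
    pool = Pool(workers)
    try:
        return pool.map(fake_download, names)  # results keep the input order
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run_all(["set-a", "set-b", "set-c", "set-d"]))
```

Because downloading is mostly waiting on the network, threads can overlap that waiting, which is probably where most of the speedup comes from.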

# -*- coding: utf-8 -*-
# Python 2 script (urllib2, print statements, unicode()).
# NOTE: the regex patterns, backslashes in Windows paths and "\n" escapes were
# damaged in the published text; the forms below are reconstructed from context.

import os
import re
import urllib2

import chardet
import requests
from multiprocessing.dummy import Pool  # thread-based Pool


def urllink(link):  # fetch a page's HTML and normalize it to UTF-8
    html_1 = urllib2.urlopen(link, timeout=120).read()
    encoding_dict = chardet.detect(html_1)
    web_encoding = encoding_dict['encoding']
    if web_encoding == 'utf-8' or web_encoding == 'UTF-8':
        html = html_1
    else:
        html = html_1.decode('gbk', 'ignore').encode('utf-8')
    return html


def downloadpic(j):
    href = 'http://www.dazui88.com' + re.search(r'href="(.*?)\.html', j).group(1)  # base URL of the set
    label = (re.findall('alt="(.*?)"', j, re.S))[0].strip()  # name of the set
    path = unicode(r'D:\pachong\pans\%s' % (label), 'utf-8')  # folder for the set
    print 'Starting download: %s' % (label)
    if not os.path.exists(path):
        os.mkdir(path)
    p = 0
    for k in range(1, 100):  # walk through the pages of the set
        hrefnew = href
        if k != 1:
            hrefnew = href + '_%d' % k
        hrefnew += '.html'
        try:  # a missing page means the set is finished; move on to the next set
            html2 = urllink(hrefnew)
        except Exception:
            print 'Set finished\n'
            break
        # re.findall returns an empty list instead of raising, so test for that
        picurl = re.findall('img alt.*?src="(.*?)"', html2)
        if not picurl:  # image missing on this page; move on to the next page
            print 'One image is missing here'
            continue
        for n in picurl:  # a page may hold several images, so download each one
            p += 1
            if not re.findall('http', n, re.S):  # some image URLs lack the site prefix
                n = 'http://www.dazui88.com' + n
            print 'Downloading image, URL: ' + n
            retu = requests.get(n, stream=True)
            picpath = unicode(r'D:\pachong\pans\%s\%s' % (label, str(p)) + '.jpg', 'utf-8')
            with open(picpath, 'wb') as f:  # closed automatically, even on error
                for chunk in retu.iter_content(chunk_size=1024 * 8):
                    if chunk:
                        f.write(chunk)


def spider():
    for i in range(2, 46):  # list pages to crawl
        link1 = "http://www.dazui88.com/tag/pans/list_86_%d.html" % i
        html = urllink(link1)
        plist = re.findall("<p>.*?</p>", html, re.S)
        pool = Pool(4)  # four worker threads per list page
        pool.map(downloadpic, plist)
        pool.close()
        pool.join()


if __name__ == "__main__":
    spider()
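The encoding step in `urllink` can be illustrated without any network access. The sketch below is in Python 3 and deliberately swaps `chardet` for a plain try/except fallback, so it is not the script's method, only a simplified stand-in; `to_text` is a hypothetical name. One caveat: some GBK byte sequences also happen to be valid UTF-8, so this heuristic can guess wrong on short inputs.

```python
def to_text(raw):
    """Decode page bytes: try UTF-8 first, then fall back to GBK, ignoring bad bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("gbk", "ignore")


if __name__ == "__main__":
    sample = "中文"
    print(to_text(sample.encode("utf-8")) == sample)
    print(to_text(sample.encode("gbk")) == sample)
```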

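***************************** Divider *********************************

Version 1.2

This release adds several features:

1. First, I finally fixed the garbled Chinese folder names; folders with Chinese names are now created automatically. I have to complain: Chinese encoding handling in Python is a real mess, full of puzzles.

2. Tidied up the script's output so it is clearer and easier to read.

3. Fixed downloading when a single page contains multiple images.

4. Fixed some images failing to download because their URLs lacked the site prefix.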
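Items 1 and 4 can be sketched in a few lines of Python 3, where strings are Unicode by default and most of the folder-name trouble goes away. This is only an illustration under assumptions: `absolute_url` and `make_album_dir` are hypothetical helpers, the relative image path is made up, and `urljoin` is used here in place of the script's manual prefix check.

```python
import os
import tempfile
from urllib.parse import urljoin

BASE_URL = "http://www.dazui88.com"  # site root taken from the script


def absolute_url(src):
    """Return src unchanged if it is already absolute, else resolve it against BASE_URL."""
    return urljoin(BASE_URL, src)


def make_album_dir(root, label):
    """Create (if needed) a folder named after the set; label may contain Chinese text."""
    path = os.path.join(root, label.strip())
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    print(absolute_url("/uploads/1.jpg"))  # hypothetical relative path
    print(absolute_url("http://example.com/2.jpg"))
    with tempfile.TemporaryDirectory() as root:
        print(os.path.isdir(make_album_dir(root, " 套图示例 ")))
```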

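***************************** Divider *********************************

Version 1.0

I started learning about web crawling this morning, and in the afternoon I set about writing my first site-scraping script.

The target is a site I picked at random that mostly hosts photos of girls; the goal is to download every photo set under one of its categories

(example URL: http://www.dazui88.com/tag/toutiao/list_130_1.html)

As veterans will know, a set usually contains several dozen images, but sites like this typically show only one or a few per page, which makes browsing much less pleasant,

so downloading all the images in one go is far more convenient.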

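Technical requirements and difficulties:

Overall it is still fairly rough, but it is perfectly usable and mainly served as practice.

1. I wanted to use each set's name as its folder name, but Chinese names came out garbled, so this is not implemented yet;

2. Images are downloaded with the requests library; BeautifulSoup looks like it might be a bit better, but I haven't tried it yet;

3. There is no multithreading, buffer pool, or the like, so it is rather slow, though still acceptable;

4. The script has a known problem: if one image in a set is missing from the site, the rest of that set is skipped. This is fixable, but not fixed yet;

5. The script could be packaged as a standalone application, but that takes time; maybe when I'm free;

6. The other sections of this site share essentially the same URL structure, so a few seconds of editing is enough to scrape other categories, or even every image on the site, though the speed needs improvement.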
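The paging scheme and the fix hinted at in item 4 can be sketched as follows (Python 3). The URL pattern, `href.html` for page 1 and `href_k.html` for page k, comes from the script; `page_url`, `crawl_set` and the injected `fetch` / `find_images` callables are hypothetical names for this example, and the fake pages replace real network calls.

```python
def page_url(href, k):
    """URL of page k of a set: page 1 is href.html, page k is href_k.html."""
    suffix = "" if k == 1 else "_%d" % k
    return href + suffix + ".html"


def crawl_set(href, fetch, find_images, max_pages=99):
    """Collect image URLs page by page.

    fetch(url) returns HTML or raises when the page does not exist;
    find_images(html) returns a list of image URLs (possibly empty).
    Both are caller-supplied, hypothetical helpers.
    """
    found = []
    for k in range(1, max_pages + 1):
        try:
            html = fetch(page_url(href, k))
        except Exception:
            break  # no such page: the set is finished
        images = find_images(html)
        if not images:
            continue  # image missing on this page: skip it and keep going
        found.extend(images)
    return found


if __name__ == "__main__":
    # Fake site: page 2 exists but has lost its image, page 4 does not exist
    pages = {"s.html": "a.jpg", "s_2.html": "", "s_3.html": "c.jpg"}
    print(crawl_set("s", lambda url: pages[url], lambda html: [html] if html else []))
```

Separating "page does not exist" (stop) from "page has no image" (skip) is what keeps one lost image from cutting off the rest of a set.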

Code:

# -*- coding: utf-8 -*-
# Python 2 script (urllib2, print statements, unicode()).
# NOTE: the regex patterns, backslashes in Windows paths and "\n" escapes were
# damaged in the published text; the forms below are reconstructed from context.

import os
import re
import urllib2

import chardet
import requests


def urllink(link):  # fetch a page's HTML and normalize it to UTF-8
    html_1 = urllib2.urlopen(link, timeout=120).read()
    encoding_dict = chardet.detect(html_1)
    web_encoding = encoding_dict['encoding']
    if web_encoding == 'utf-8' or web_encoding == 'UTF-8':
        html = html_1
    else:
        html = html_1.decode('gbk', 'ignore').encode('utf-8')
    return html


def spider():
    m = 0
    for i in range(1, 12):  # list pages to crawl
        link1 = "http://www.dazui88.com/tag/tgod/list_80_%d.html" % i
        html = urllink(link1)
        plist = re.findall("<p>.*?</p>", html, re.S)
        for j in plist:  # crawl each set on this list page
            m += 1
            href = 'http://www.dazui88.com' + re.search(r'href="(.*?)\.html', j).group(1)  # base URL of the set
            label = (re.findall('alt="(.*?)"', j, re.S))[0].strip()  # name of the set
            path = unicode(r'D:\pachong\tgod\%s %s' % (str(m), label), 'utf-8')  # folder for the set
            print 'Starting download of set %d: %s' % (m, label)
            if not os.path.exists(path):
                os.mkdir(path)
            p = 0
            for k in range(1, 100):  # walk through the pages of the set
                hrefnew = href
                if k != 1:
                    hrefnew = href + '_%d' % k
                hrefnew += '.html'
                try:  # a missing page means the set is finished; move on to the next set
                    html2 = urllink(hrefnew)
                except Exception:
                    print 'Set finished\n'
                    break
                # re.findall returns an empty list instead of raising, so test for that
                picurl = re.findall('img alt.*?src="(.*?)"', html2)
                if not picurl:  # image missing on this page; move on to the next page
                    print 'One image is missing here'
                    continue
                for n in picurl:  # a page may hold several images, so download each one
                    p += 1
                    print 'Downloading image, URL: ' + n
                    retu = requests.get(n, stream=True)
                    picpath = unicode(r'D:\pachong\tgod\%s %s\%s' % (str(m), label, str(p)) + '.jpg', 'utf-8')
                    with open(picpath, 'wb') as f:  # closed automatically, even on error
                        for chunk in retu.iter_content(chunk_size=1024 * 8):
                            if chunk:
                                f.write(chunk)


if __name__ == "__main__":
    spider()
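To make the regex step easier to follow, here is a small Python 3 sketch that runs against a made-up HTML fragment instead of the live site. The patterns are my best reading of the script's intent, and `SAMPLE`, `split_entries` and `parse_entry` are hypothetical; the real site's markup may well differ.

```python
import re

# Hypothetical list-page fragment; the real site's markup may differ.
SAMPLE = (
    '<p><a href="/pans/123.html"><img alt=" Example set " src="/t/1.jpg"></a></p>\n'
    '<p><a href="/pans/456.html"><img alt="Another set" src="/t/2.jpg"></a></p>'
)

ENTRY_RE = re.compile(r"<p>.*?</p>", re.S)   # one entry per <p>...</p> block
HREF_RE = re.compile(r'href="(.*?)\.html')   # set path without the .html suffix
ALT_RE = re.compile(r'alt="(.*?)"', re.S)    # set name from the alt attribute


def split_entries(page):
    """Split a list page into one HTML fragment per set."""
    return ENTRY_RE.findall(page)


def parse_entry(fragment):
    """Return (set base path without .html, set name) for one fragment."""
    href = HREF_RE.search(fragment).group(1)
    label = ALT_RE.findall(fragment)[0].strip()
    return href, label


if __name__ == "__main__":
    for entry in split_entries(SAMPLE):
        print(parse_entry(entry))
```

The non-greedy `.*?` matters here: a greedy `.*` with `re.S` would swallow everything from the first `<p>` to the last `</p>` as a single entry.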
