Implementing a Web Crawler (Spider) in Python


        self.inbody = True
        SGMLParser.reset(self)

    def handle_data(self, text):
        # Collect text only while the parser is outside the <head> element
        if self.inbody:
            self.text += text

    def start_head(self, text):
        self.inbody = False

    def end_head(self):
        self.inbody = True


if __name__ == "__main__":
    parser = Html2txt()
    parser.feed(urllib.urlopen("").read())  # the URL is missing in the source
    parser.close()
    print parser.text.strip()
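The listing above relies on sgmllib, which exists only in Python 2. A minimal sketch of the same idea for Python 3, assuming the standard html.parser module is an acceptable substitute. The names HeadlessTextParser and html_to_text are made up for this illustration and do not come from the original article.

```python
from html.parser import HTMLParser


class HeadlessTextParser(HTMLParser):
    """Collect character data that appears outside the <head> element."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.inbody = True  # same flag as in the Python 2 listing

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.inbody = False

    def handle_endtag(self, tag):
        if tag == "head":
            self.inbody = True

    def handle_data(self, data):
        if self.inbody:
            self.text += data


def html_to_text(markup):
    """Return the text of an HTML string, skipping everything in <head>."""
    parser = HeadlessTextParser()
    parser.feed(markup)
    parser.close()
    return parser.text.strip()
```

Like the original, this keeps the contents of script and style elements that sit in the body; filtering those out would need extra tag checks.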

Downloading a web page with Python

import httplib

conn = httplib.HTTPConnection("")  # the host name is missing in the source
conn.request("GET", "/index.html")
r1 = conn.getresponse()
print r1.status, r1.reason
data = r1.read()
print data
conn.close()

Downloading a web page with Python is extremely simple!

from urllib import urlopen

webdata = urlopen("").read()  # the URL is missing in the source
print webdata
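In Python 3 the urlopen function lives in urllib.request. A minimal sketch of the same download for Python 3; the function name fetch and the 10-second default timeout are illustrative choices, not part of the original.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    # The with-block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

fetch returns bytes; call .decode() with the page's encoding to get text.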

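Dive Into Python covers this.

Downloading web page content with Python's pycurl module

Python works well for downloading web page content. I had been experimenting with the urllib module, but I heard that the pycurl module is better than urllib, so I gave it a try. Without further ado, here is the code: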

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import StringIO
import pycurl


def write(fstr, x):
    # Save the string fstr to the file named x
    f = open(x, 'w')
    f.write(fstr)
    f.close()


html = StringIO.StringIO()
c = pycurl.Curl()
myurl = ''  # the URL is missing in the source
c.setopt(pycurl.URL, myurl)

# Write callback: the response body is appended to the StringIO buffer
c.setopt(pycurl.WRITEFUNCTION, html.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)

# Maximum number of redirects; guards against redirect traps
c.setopt(pycurl.MAXREDIRS, 5)

# Connection timeout and overall timeout, in seconds
c.setopt(pycurl.CONNECTTIMEOUT, 60)
c.setopt(pycurl.TIMEOUT, 300)

# Pretend to be a browser
c.setopt(pycurl.USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)")

# Perform the request; blocks until the transfer finishes
c.perform()

# Print the HTTP status code, e.g. 200 (optional)
print c.getinfo(pycurl.HTTP_CODE)

# Print the page content
print html.getvalue()

# Save it as down.txt
write(html.getvalue(), "down.txt")

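Installation packages for Python's pycurl module are available for download.

Different systems need different builds, so check which one matches yours.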
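For readers who cannot install pycurl, the options used above have rough counterparts in the Python 3 standard library. The following is only a sketch under that assumption, not a drop-in replacement: LimitedRedirectHandler, build_request and download are hypothetical names, and the urllib timeout applies to each blocking socket operation rather than to the whole transfer as pycurl.TIMEOUT does.

```python
import urllib.request

USER_AGENT = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; "
              "SV1; .NET CLR 1.1.4322)")


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Same intent as pycurl.MAXREDIRS = 5: give up on long redirect chains.
    max_redirections = 5


def build_request(url):
    """Attach the browser-like User-Agent, like pycurl.USERAGENT."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download(url, timeout=300):
    """Fetch url, following at most 5 redirects; return (status, body)."""
    opener = urllib.request.build_opener(LimitedRedirectHandler)
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.status, response.read()
```

Redirects are followed by default in urllib.request, so there is no counterpart to FOLLOWLOCATION here; the handler subclass only lowers the limit.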

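To summarize, here are several ways to download a web page with Python.

import urllib2

fd = urllib2.urlopen(url_link)
data = fd.read()

This is the most concise approach, and it naturally uses the GET method.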

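Using the GET method

import socket
import urllib2


def GetHtmlSource(url):
    try:
        htmSource = ''
        req = urllib2.Request(url)
        # No data argument: supplying one would turn the request into a POST
        fd = urllib2.urlopen(req)
        while 1:
            # Read the response in 1024-byte chunks
            data = fd.read(1024)
            if not len(data):
                break
            htmSource += data
        fd.close()
        del fd
        del req
        # The page is expected to be GBK-encoded (Windows code page 936)
        htmSource = htmSource.decode('cp936')
        # formatStr is the author's own helper; its definition is not shown
        htmSource = formatStr(htmSource)
        return htmSource
    except socket.error, err:
        str_err = "%s" % err
        return ""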
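The read loop and the cp936 decoding step in GetHtmlSource can be tried without a network connection. A small Python 3 sketch under that assumption; read_in_chunks and decode_page are illustrative names, and replacing undecodable bytes instead of raising is a choice made here, not something the original does.

```python
def read_in_chunks(stream, chunk_size=1024):
    """Read a binary file-like object to the end, chunk_size bytes at a time."""
    chunks = []
    while True:
        data = stream.read(chunk_size)
        if not data:  # empty bytes means end of stream
            break
        chunks.append(data)
    # Joining once avoids repeatedly copying a growing string.
    return b"".join(chunks)


def decode_page(raw, encoding="cp936"):
    """Decode page bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")
```

Decoding happens only after all chunks are joined, because a chunk boundary can fall in the middle of a two-byte GBK character.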

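Using the GET method with httplib

import httplib
import traceback
import urlparse


def GetHtmlSource_Get(htmurl):
    htmSource = ""
    try:
        # Split the URL so that only the host part is used for the connection
        urlx = urlparse.urlsplit(htmurl)
        conn = httplib.HTTPConnection(urlx.netloc)
        conn.connect()
        conn.putrequest("GET", htmurl, None)
        conn.putheader("Content-Length", 0)
        conn.putheader("Connection", "close")
        conn.endheaders()
        res = conn.getresponse()
        htmSource = res.read()
        conn.close()
    except Exception:
        traceback.print_exc()
    return htmSource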
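The function above hands only the host part of the URL to HTTPConnection. A short Python 3 sketch (where urlsplit lives in urllib.parse) of how a URL breaks down into the host and the request target; split_target is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlsplit


def split_target(url):
    """Return (host, request_target) for an absolute HTTP URL."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path means the site root
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target
```

The listing sends the full URL as the request target. Many servers accept that, but sending only the path and query, as returned here, is the usual form for direct requests.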

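Using the POST method

import httplib
import socket


def GetHtmlSource_Post(getString):
    htmSource = ""
    url = httplib.urlsplit("")  # the URL is missing in the source
    conn = httplib.HTTPConnection(url.netloc)
    conn.connect()
    conn.putrequest("POST", "/sipo/zljs/hyjs-jieguo.jsp")
    conn.putheader("Content-Length", len(getString))
    conn.putheader("Content-Type", "application/x-")  # value is truncated in the source
    conn.putheader("Connection", "Keep-Alive")
    conn.endheaders()
    # Send the form data as the request body
    conn.send(getString)
    f = conn.getresponse()
    if not f:
        raise socket.error, "timed out"
    htmSource = f.read()
    f.close()
    conn.close()
    return htmSource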
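A Python 3 sketch of the same POST flow with http.client. It assumes the truncated Content-Type in the listing was meant to be the standard form encoding, application/x-www-form-urlencoded; that is a guess, as are the names build_form_post and post_form and the 30-second timeout.

```python
import http.client
from urllib.parse import urlencode


def build_form_post(fields):
    """Encode a dict as a form body and return (body_bytes, headers)."""
    body = urlencode(fields).encode("ascii")
    headers = {
        # Assumed value; the original listing shows only "application/x-"
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(body)),
        "Connection": "close",
    }
    return body, headers


def post_form(host, path, fields, timeout=30):
    """POST fields to http://host/path and return (status, response_bytes)."""
    body, headers = build_form_post(fields)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("POST", path, body=body, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Building the body with urlencode keeps Content-Length in step with what is actually sent, which the original achieves by calling len(getString).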

This article comes from a CSDN blog; please credit the source when reposting.

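A vertical search crawler built with Django + Python + BeautifulSoup

This project uses Python and BeautifulSoup to crawl specific data, and Django to build a management platform that coordinates the crawling work.

I am very fond of the Django admin backend, so I use it to manage the crawled links. This lets the crawler adapt to later requirements, such as crawling at scheduled times or periodically re-crawling addresses that have already been fetched.

The database is sqlite3, which ships with Python, so setup is easy.

I happen to be building a movie recommendation system these days and need some movie data. The example in this article crawls specific data from Douban Movies.

Step 1: Create the Django model

The crawler follows the approach of Nutch, in simplified form.

Each crawl task starts by looking up the links in the database that have not been saved yet (is_save=False) and putting them on the crawl list.

You can also filter the links according to your own needs.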
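The selection step described above can be sketched with the sqlite3 module alone, without Django. This is a rough illustration under that assumption, not the author's implementation: the table name crawl_url and all function names are invented here, and only the column names url, weight and is_save are borrowed from the article's model.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the link table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawl_url ("
        " url TEXT UNIQUE NOT NULL,"
        " weight INTEGER NOT NULL DEFAULT 0,"
        " is_save INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_url(conn, url, weight=1):
    """Queue a link; the UNIQUE constraint silently skips duplicates."""
    conn.execute(
        "INSERT OR IGNORE INTO crawl_url (url, weight) VALUES (?, ?)",
        (url, weight),
    )


def pending_urls(conn, max_depth=None):
    """Return links not yet saved (is_save = 0), shallowest first."""
    sql = "SELECT url FROM crawl_url WHERE is_save = 0"
    params = ()
    if max_depth is not None:
        # Optional filter: ignore links deeper than max_depth
        sql += " AND weight <= ?"
        params = (max_depth,)
    sql += " ORDER BY weight, url"
    return [row[0] for row in conn.execute(sql, params)]


def mark_saved(conn, url):
    """Flag a link as fetched and stored."""
    conn.execute("UPDATE crawl_url SET is_save = 1 WHERE url = ?", (url,))
```

A crawl task would call pending_urls at the start, fetch each link, and call mark_saved afterwards; the max_depth argument is one example of filtering links to suit your own needs.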

Python code:

from django.db import models


class Crawl_URL(models.Model):
    url = models.URLField('crawl URL', max_length=100, unique=True)
    weight = models.SmallIntegerField('crawl depth', default=0)  # crawl depth, starting at 1
    is_save = models.BooleanField('saved', default=False)
    date = models.DateTimeField('saved at', auto_now_add=True, blank=True, null=True)

    def __unicode__(self):
        return self.url


Then generate ...
