Implementing a Web Crawler (Spider) in Python


How do I extract the main text of a web page in Python? Thanks.

import urllib.request

url = ""
response = urllib.request.urlopen(url)
page = response.read()

Extracting the text from a web page in Python

# Python 2 only: sgmllib was removed in Python 3.
import urllib
from sgmllib import SGMLParser

class Html2txt(SGMLParser):
    def reset(self):
        self.text = ''
        self.inbody = True
        SGMLParser.reset(self)

    def handle_data(self, text):
        # Collect character data unless we are inside <head>.
        if self.inbody:
            self.text += text

    def start_head(self, attrs):
        self.inbody = False

    def end_head(self):
        self.inbody = True

if __name__ == "__main__":
    parser = Html2txt()
    parser.feed(urllib.urlopen("").read())
    parser.close()
    print parser.text.strip()
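The sgmllib module used above exists only in Python 2. As a rough Python 3 counterpart, the sketch below applies the same idea (collect character data, skip whatever sits inside the head element) with the standard library's html.parser. The names Html2Txt and html_to_text are made up for this example, and skipping script and style elements is an addition that the original parser does not have.

```python
from html.parser import HTMLParser


class Html2Txt(HTMLParser):
    """Collect text that appears outside <head>, <script> and <style>."""

    SKIPPED_TAGS = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep character data only when no skipped element is open.
        if self.skip_depth == 0:
            self.parts.append(data)

    @property
    def text(self):
        return "".join(self.parts).strip()


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = Html2Txt()
    parser.feed(html)
    parser.close()
    return parser.text
```

The function takes an already decoded str, so a downloaded page has to be decoded from bytes before it is passed in.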

Downloading a web page in Python

# Python 2 (httplib was renamed http.client in Python 3).
import httplib

conn = httplib.HTTPConnection("")
conn.request("GET", "/index.html")
r1 = conn.getresponse()
print r1.status, r1.reason
data = r1.read()
print data
conn.close()

Downloading a web page with Python is super simple!

# Python 2
from urllib import urlopen

webdata = urlopen("").read()
print webdata

This is covered in Dive Into Python.
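In Python 3 the urlopen function lives in urllib.request and read() returns bytes, so a page usually has to be decoded before it can be handled as text. A minimal sketch follows; the helper name fetch_text and the UTF-8 fallback used when no charset is declared are choices made for this example, not part of the original.

```python
import urllib.request


def fetch_text(url, timeout=10, default_encoding="utf-8"):
    """Download a URL and return its body as a str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when one is declared.
        charset = response.headers.get_content_charset() or default_encoding
    # errors="replace" keeps a mislabelled page from raising UnicodeDecodeError.
    return raw.decode(charset, errors="replace")
```

Calling fetch_text with the address of the page you want returns the decoded page as a single string.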

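Downloading web page content with Python's pycurl module

Downloading web pages with Python works quite well. I had been experimenting with the urllib module, but I heard that the pycurl module is better than urllib, so I gave it a try. Without further ado, here is the code.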

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2
import StringIO
import pycurl

def write(fstr, x):
    # Save the string fstr to the file named x.
    f = open(x, 'w')
    f.write(fstr)
    f.close()

html = StringIO.StringIO()
c = pycurl.Curl()
myurl = ''

c.setopt(pycurl.URL, myurl)

# Write callback: response data is appended to the in-memory buffer
c.setopt(pycurl.WRITEFUNCTION, html.write)

c.setopt(pycurl.FOLLOWLOCATION, 1)

# Maximum number of redirects; guards against redirect traps
c.setopt(pycurl.MAXREDIRS, 5)

# Connection timeout and overall timeout, in seconds
c.setopt(pycurl.CONNECTTIMEOUT, 60)
c.setopt(pycurl.TIMEOUT, 300)

# Imitate a browser
c.setopt(pycurl.USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)")

# Perform the request; blocks until the transfer finishes
c.perform()

# Print the HTTP status code, for example 200 (optional)
print c.getinfo(pycurl.HTTP_CODE)

# Print the page content
print html.getvalue()

# Save it to down.txt
write(html.getvalue(), "down.txt")

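Installation packages for the pycurl module can be found here.

Different systems need different builds, so check which one matches yours.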

To summarize, here are several ways to download a web page in Python.

fd = urllib2.urlopen(url_link)
data = fd.read()

This is the most concise way, and it uses the GET method.

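Using the GET method (urllib2)

# Python 2
import socket
import urllib2

def GetHtmlSource(url):
    try:
        htmSource = ''
        req = urllib2.Request(url)
        fd = urllib2.urlopen(req)  # no data argument, so the request is a GET
        while 1:
            data = fd.read(1024)
            if not len(data):
                break
            htmSource += data
        fd.close()
        htmSource = htmSource.decode('cp936')
        # formatStr is the author's own helper; its definition is not shown.
        htmSource = formatStr(htmSource)
        return htmSource
    except socket.error as err:
        str_err = "%s" % err
        return ""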
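The read loop above pulls the response in 1024-byte pieces and decodes it as cp936 (GBK) afterwards. In Python 3 the pieces are bytes, and the order matters: a two-byte GBK character can straddle two chunks, so the sketch below joins all chunks first and decodes once. The function names read_all and decode_page are invented for this illustration, and an in-memory BytesIO stands in for the network response.

```python
import io


def read_all(stream, chunk_size=1024):
    """Read a binary file-like object to the end in fixed-size chunks."""
    chunks = []
    while True:
        data = stream.read(chunk_size)
        if not data:  # empty bytes means end of stream
            break
        chunks.append(data)
    return b"".join(chunks)


def decode_page(raw, encoding="cp936"):
    """Decode the complete byte string in one step."""
    return raw.decode(encoding, errors="replace")


# Stand-in for a response object: 8 bytes read 3 at a time, so the
# chunk boundaries fall in the middle of two-byte characters.
fake_response = io.BytesIO("网页正文".encode("cp936"))
page = decode_page(read_all(fake_response, chunk_size=3))
print(len(page))
```

Any object with a read(n) method that returns bytes, such as the response from urllib.request.urlopen, can be passed to read_all in place of the BytesIO.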

Using the GET method (httplib)

# Python 2
import httplib
import traceback
from urlparse import urlsplit

def GetHtmlSource_Get(htmurl):
    htmSource = ""
    conn = None
    try:
        urlx = urlsplit(htmurl)
        conn = httplib.HTTPConnection(urlx.netloc)
        conn.connect()
        conn.putrequest("GET", htmurl, None)
        conn.putheader("Content-Length", 0)
        conn.putheader("Connection", "close")
        conn.endheaders()
        res = conn.getresponse()
        htmSource = res.read()
    except Exception:
        traceback.print_exc()
    finally:
        if conn is not None:
            conn.close()
    return htmSource
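The function above splits the URL to find the host to connect to. A Python 3 sketch of that step with http.client follows, assuming hypothetical helper names split_for_connection and http_get. Unlike the original, it sends only the path and query as the request target rather than the full URL, which is the usual form for a direct (non-proxy) request.

```python
import http.client
from urllib.parse import urlsplit


def split_for_connection(url):
    """Return (host, request_target) for http.client.HTTPConnection."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path means the site root
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target


def http_get(url, timeout=10):
    """Send a GET request and return (status, body_bytes)."""
    host, target = split_for_connection(url)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", target, headers={"Connection": "close"})
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()  # runs whether or not the request succeeded
```

HTTPConnection accepts a host string that includes a port, so the netloc can be passed through unchanged. For https URLs, http.client.HTTPSConnection would be needed instead.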

Using the POST method

# Python 2
import httplib
import socket
import traceback
from urlparse import urlsplit

def GetHtmlSource_Post(getString):
    htmSource = ""
    conn = None
    try:
        url = urlsplit("")
        conn = httplib.HTTPConnection(url.netloc)
        conn.connect()
        conn.putrequest("POST", "/sipo/zljs/hyjs-jieguo.jsp")
        conn.putheader("Content-Length", len(getString))
        # The Content-Type value is truncated in this copy of the code.
        conn.putheader("Content-Type", "application/x-")
        conn.putheader("Connection", "Keep-Alive")
        conn.endheaders()
        conn.send(getString)
        f = conn.getresponse()
        if not f:
            raise socket.error("timed out")
        htmSource = f.read()
        f.close()
    except Exception:
        traceback.print_exc()
    finally:
        if conn is not None:
            conn.close()
    return htmSource
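The POST function above sends a ready-made body string and sets Content-Length by hand. As a Python 3 sketch of how such a body and its headers could be prepared, the example below encodes form fields with urllib.parse.urlencode. The helper names build_form_post and http_post are hypothetical, and the content type application/x-www-form-urlencoded is an assumption, since the header value is cut off in the original.

```python
import http.client
from urllib.parse import urlencode


def build_form_post(fields):
    """Return (body_bytes, headers) for a form-encoded POST."""
    # urlencode percent-encodes non-ASCII text, so the result is pure ASCII.
    body = urlencode(fields).encode("ascii")
    headers = {
        # Assumed content type; the original header value is truncated.
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(body)),
    }
    return body, headers


def http_post(host, path, fields, timeout=10):
    """POST form fields and return (status, body_bytes)."""
    body, headers = build_form_post(fields)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("POST", path, body=body, headers=headers)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Content-Length counts bytes, not characters, which is why it is computed after the body has been encoded.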

This article comes from a CSDN blog; please credit the source when reposting.

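A vertical search crawler built with Django + Python + BeautifulSoup

Python and BeautifulSoup do the crawling and extract the specific data, while Django provides a management platform that coordinates the crawl.

I am very fond of the Django admin backend, so this time I use it to manage the crawled links, which lets the crawler cope with all sorts of later requirements.

Examples include crawling in scheduled time slots and periodically re-crawling addresses that have already been fetched.

The database is sqlite3, which ships with Python, so it is very convenient.

I happen to be building a movie recommendation system these days and need some movie data.

The example in this article crawls specific data from Douban Movies.

Step 1: Create the Django model

This follows the crawling approach used by Nutch, in simplified form.

Each crawl task starts by looking up in the database the links that have not been saved yet (is_save=False) and putting them into the crawl list.

You can also filter the links according to your own needs.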

Python code:


from django.db import models

class Crawl_URL(models.Model):
    url = models.URLField('Crawl URL', max_length=100, unique=True)
    weight = models.SmallIntegerField('Crawl depth', default=0)  # crawl depth, counted from 1
    is_save = models.BooleanField('Saved', default=False)
    date = models.DateTimeField('Saved at', auto_now_add=True, blank=True, null=True)

    def __unicode__(self):
        return self.url
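The workflow described in this step (load the links with is_save=False into the crawl list, then mark them once they are stored) can be illustrated without Django by using the sqlite3 module directly. This is only a sketch: the table name crawl_url, the helper functions, and the example.com addresses are assumptions for the illustration. In the Django project itself the same lookup would go through the model, for example Crawl_URL.objects.filter(is_save=False).

```python
import sqlite3


def create_schema(conn):
    """Create a table that mirrors the fields of the Crawl_URL model."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS crawl_url (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE,
            weight INTEGER NOT NULL DEFAULT 0,
            is_save INTEGER NOT NULL DEFAULT 0
        )
        """
    )


def add_url(conn, url, weight=0):
    # INSERT OR IGNORE skips URLs that are already present (UNIQUE column).
    conn.execute(
        "INSERT OR IGNORE INTO crawl_url (url, weight) VALUES (?, ?)",
        (url, weight),
    )


def pending_urls(conn):
    """Return the URLs not yet saved, shallowest depth first."""
    rows = conn.execute(
        "SELECT url FROM crawl_url WHERE is_save = 0 ORDER BY weight, id"
    )
    return [row[0] for row in rows]


def mark_saved(conn, url):
    conn.execute("UPDATE crawl_url SET is_save = 1 WHERE url = ?", (url,))


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
create_schema(conn)
add_url(conn, "http://example.com/a")
add_url(conn, "http://example.com/b", weight=1)
add_url(conn, "http://example.com/a")  # duplicate, ignored
mark_saved(conn, "http://example.com/a")
print(pending_urls(conn))
```

Ordering by weight means shallower links are crawled first, which is one possible policy; the original text leaves the filtering and ordering up to the reader.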

Then generate…
