Python Web Crawler Internship Report


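Portia framework:

Portia is a crawler framework that lets users with no programming background scrape web pages through a visual interface.

newspaper framework:

newspaper is a Python crawler framework for extracting news and articles and for analyzing their content.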

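Python-goose framework:

The Python-goose framework can extract the following information:

the main body of an article;
the main image of an article;
any YouTube/Vimeo videos embedded in the article;
the meta description;
the meta tags.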
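To make the last two items concrete, the following is a minimal sketch of reading the meta description and meta tags from a page. It uses only the standard library's html.parser and is not Python-goose's own API; the names MetaCollector and extract_meta and the sample HTML are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags that carry both a name and a content attribute matter here.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html):
    """Return a dict mapping meta names (lowercased) to their content."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


# Hypothetical sample page, used only to show the call.
sample = (
    "<html><head>"
    "<meta name='description' content='A short summary.'>"
    "<meta name='keywords' content='python, crawler'>"
    "</head><body></body></html>"
)
meta = extract_meta(sample)
print(meta["description"])
print(meta["keywords"])
```

A dedicated extraction library does considerably more than this (body detection, image ranking), so the sketch only shows the simplest part of the job.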

5. Hands-on Data Scraping (Scraping Movie Data from Douban)

1 Analyzing the Web Page

    import urllib.request

    # Fetch the HTML source of every list page
    def getHtml():
        data = []
        pageNum = 1
        pageSize = 0
        try:
            # Six pages of 25 entries each: offsets 0, 25, ..., 125
            while pageSize <= 125:
                # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                #            'Referer': None}  # Note: if the page still cannot be fetched, set the target site's host here
                # opener = urllib.request.build_opener()
                # opener.addheaders = [headers]
                # The URL string literals are missing from the source; "..." marks the gaps.
                url = "..." + str(pageSize) + "...filter=" + str(pageNum)
                # data['html%s' % i] = urllib.request.urlopen(url).read().decode("utf-8")
                data.append(urllib.request.urlopen(url).read().decode("utf-8"))
                pageSize += 25
                pageNum += 1
                print(pageSize, pageNum)
        except Exception as e:
            raise e
        return data
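The paging loop and the commented-out request headers above can be sketched as two small helpers. This is only an illustration under stated assumptions: build_page_urls and fetch are hypothetical names, the example.com address is a placeholder rather than the address used in the report, and the User-Agent is sent through urllib.request.Request instead of a custom opener.

```python
import urllib.request


def build_page_urls(base_url, page_size, page_count):
    """Return one URL per page by appending the start offset to base_url."""
    return [base_url + str(i * page_size) for i in range(page_count)]


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Download one page as text, sending a browser-like User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder address (assumption); no request is sent here.
urls = build_page_urls("https://example.com/list?start=", 25, 3)
print(urls)
```

Building the URL list first keeps the offset arithmetic separate from the network calls, so it can be checked without touching the site; pages would then be downloaded with something like [fetch(u) for u in urls].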

2 Scraping the Data

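    from bs4 import BeautifulSoup

    def getData(html):
        title = []  # movie titles
        # rating_num = []  # ratings
        range_num = []  # rankings
        # rating_people_num = []  # number of people who rated
        movie_author = []  # directors
        data = {}
        # Parse the HTML with bs4
        soup = BeautifulSoup(html, "html.parser")
        for li in soup.find("ol", attrs={'class': 'grid_view'}).find_all("li"):
            title.append(li.find("span", class_="title").text)
            # rating_num.append(li.find("div", class_='star').find("span", class_='rating_num').text)
            rank = li.find("div", class_='pic').find("em").text
            range_num.append(rank)
            # spans = li.find("div", class_='star').find_all("span")
            # for x in range(len(spans)):
            #     if x <= 2:
            #         pass
            #     else:
            #         rating_people_num.append(spans[x].string[-len(spans[x].string):-3])
            info = li.find("div", class_='bd').text.lstrip()
            # The director's name ends where the cast ("主演") begins, or at "..." if the line is cut short
            index = info.find("主")
            if index == -1:
                index = info.find("...")
            print(rank)
            # Special case for entry 210; the rank is a string, so compare against "210"
            if rank == "210":
                index = 60
                # print("aaa")
            # print(info[4:index])
            # Skip the 4-character "导演: " prefix
            movie_author.append(info[4:index])
        data['title'] = title
        # data['rating_num'] = rating_num
        data['range_num'] = range_num
        # data['rating_people_num'] = rating_people_num
        data['movie_author'] = movie_author
        return data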
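The director extraction above depends on slicing the text between the "导演: " prefix and the first "主" (the start of "主演", the cast) or a "..." truncation marker. A minimal sketch of that slicing as a standalone function follows; extract_director is a hypothetical name, the sample line is illustrative input rather than scraped output, and the fallback to the end of the string is an addition not present in the report.

```python
def extract_director(info, prefix_len=4):
    """Return the director part of a '导演: ... 主演: ...' line."""
    # The name ends where the cast section starts ...
    end = info.find("主")
    if end == -1:
        # ... or where the page truncated the line with an ellipsis.
        end = info.find("...")
    if end == -1:
        # Assumption: with no marker at all, keep everything after the prefix.
        end = len(info)
    # Skip the "导演: " prefix (4 characters) and trim surrounding whitespace.
    return info[prefix_len:end].strip()


# Illustrative input line (assumption about the page's text layout).
sample = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."
print(extract_director(sample))
```

Searching for the single character "主" is fragile, since a director whose name contains that character would be cut short; searching for "主演" would be a more conservative marker.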

3 Organizing and Converting the Data

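    # Write the scraped data to an HTML table.
    # datas is a list of the dictionaries returned by getData, one per page.
    # nowtime is a timestamp string whose definition is not shown in the source.
    def getMovies(datas):
        f = open('//douban_movie.html', 'w', encoding='utf-8')
        f.write("<html>")
        f.write("<head><meta charset='UTF-8'><title>Insert title here</title></head>")
        f.write("<body>")
        f.write("<h1>Scraping Douban Movies</h1>")
        f.write("<h4>Author: 刘文斌</h4>")
        f.write("<h4>Time: " + nowtime + "</h4>")
        f.write("<hr>")
        # The border and font size values are missing from the source; "..." marks the gaps.
        f.write("<table width='800px' border='...' align=center>")
        f.write("<thead>")
        f.write("<tr>")
        f.write("<th><font size='...' color=green>Movie</font></th>")
        # f.write("<th width='50px'><font size='...' color=green>Rating</font></th>")
        f.write("<th width='50px'><font size='...' color=green>Rank</font></th>")
        # f.write("<th width='100px'><font size='...' color=green>Number of ratings</font></th>")
        f.write("<th><font size='...' color=green>Director</font></th>")
        f.write("</tr>")
        f.write("</thead>")
        f.write("<tbody>")
        for data in datas:
            # Each page holds 25 movies
            for i in range(0, 25):
                f.write("<tr>")
                f.write("<td style='color:orange;text-align:center'>%s</td>" % data['title'][i])
                # f.write("<td style='color:blue;text-align:center'>%s</td>" % data['rating_num'][i])
                f.write("<td style='color:red;text-align:center'>%s</td>" % data['range_num'][i])
                # ... the rest of the listing is cut off in the source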
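The row-writing step can be sketched more compactly by building each table row as a string and escaping the cell values. This is an illustration under assumptions, not the report's own code: render_rows and write_table are hypothetical names, each movie is represented as one dictionary rather than as parallel lists, and the styling is reduced to a single property.

```python
import html


def render_rows(movies):
    """Return the <tr> rows for a list of movie dictionaries as one string."""
    rows = []
    for movie in movies:
        cells = "".join(
            # html.escape keeps characters such as & and < from breaking the markup.
            "<td style='text-align:center'>%s</td>" % html.escape(str(movie[key]))
            for key in ("title", "range_num", "movie_author")
        )
        rows.append("<tr>%s</tr>" % cells)
    return "\n".join(rows)


def write_table(path, movies):
    """Write a minimal HTML table to path; the with-block closes the file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("<table><tbody>\n")
        f.write(render_rows(movies))
        f.write("\n</tbody></table>\n")


# Illustrative record (assumption), not scraped data.
sample = [{"title": "A & B", "range_num": "1", "movie_author": "Someone"}]
print(render_rows(sample))
```

Opening the file in a with-block guarantees it is closed even if a write fails, and separating render_rows from write_table lets the markup be checked without creating a file.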
