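Meta tags

V. Data scraping in practice (scraping movie data from Douban)

1. Analyzing the web page

    import urllib.request
    from bs4 import BeautifulSoup

    # Fetch the HTML source of each list page
    def _getHtml():
        data = []
        pageNum = 1
        pageSize = 0
        try:
            while (pageSize <= 125):
                # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
                #            'Referer': None}  # Note: if the page still cannot be fetched, the host of the target site can be set here
                #
                # opener = urllib.request.build_opener()
                # opener.addheaders = [headers]
                url = "https://movie.douban.com/top250?start=" + str(pageSize) + "&filter=" + str(pageNum)
                # data['html%s' % i] = urllib.request.urlopen(url).read().decode("utf-8")
                data.append(urllib.request.urlopen(url).read().decode("utf-8"))
                pageSize += 25
                pageNum += 1
                print(pageSize, pageNum)
        except Exception as e:
            raise e
        return data

2. Scraping the data

    def _getData(html):
        title = []                 # movie title
        # rating_num = []          # rating
        range_num = []             # rank
        # rating_people_num = []   # number of ratings
        movie_author = []          # director
        data = []                  # brackets lost in the source; a list is assumed here
        # Parse the HTML with bs4
        soup = BeautifulSoup(html, "html.parser")
        for li in soup.find("ol", attrs={"class": "grid_view"}).find_all("li"):
            title.append(li.find("span", class_="title").text)
            # rating_num.append(li.find("div", class_="star").find("span", class_="rating_num").text)
            range_num.append(li.find("div", class_="pic").find("em").text)
            # spans = li.find([...]).find_all([...])
            # for x in range(len(spans)):
            #     if x [...]

[...]

    def _getMovies(datas):
        [...]
        f.write("<head><meta charset='UTF-8'><title>Insert title here</title></head>")
        f.write("<body><h1>Scraping Douban Movies</h1>")
        f.write("<h4>Author: 刘文斌</h4>")
        f.write("<h4>Time: " + nowtime + "</h4>")
        f.write("<hr>")
        f.write("<table width='800px' border='1' align='center'>")
        f.write("<thead><tr>")
        f.write("<th><font size='5' color='green'>Movie</font></th>")
        # f.write("<th width='50px'>[...]Rating[...]</th>")
        f.write("<th [...]>[...]Rank[...]</th>")
        # f.write("<th width='100px'>[...]Number of ratings[...]</th>")
        f.write("<th [...]>[...]Director[...]</th>")
        f.write("</tr></thead>")
        f.write("<tbody>")
        for data in datas:
            for i in range(0, 25):
                [...]
                f.write("<td style='color:orange;text-align:center'>%s</td>" % data[i])
                # f.write([...])  # further columns, styled blue, red, and black
                [...]
        f.write("</tbody></table></body></html>")
        f.close()

    if __name__ == '__main__':
        datas = []
        htmls = _getHtml()
        for i in range(len(htmls)):
            data = _getData(htmls[i])
            datas.append(data)
        _getMovies(datas)

4. Saving and displaying the data

The result is shown in the figure that follows:

5. Technical difficulties and key points

Data scraping in practice (scraping housing data from Fang.com)

    from bs4 import BeautifulSoup
    import requests

    rep = requests.get("http://newhouse.fang.com/top/")
    rep.encoding = "gb2312"  # set the response encoding
    html = rep.text
    soup = BeautifulSoup(html, "html.parser")
    f = open("[...]/fang.html", [...])
    [...]
    # Page heading: <center>Top 3 New-Home Sales</center>
    # Table: <table border='1px' width='1000px' height=[...]>
    # Column headings (<h2>): Address, Sales volume, Average price
    [...]
        # One row per listing: name in red (5px), chengjiaoliang in blue, then junjia
        [...] % name)
        [...] % chengjiaoliang)
        [...] % junjia)
        print(name)

VI. Summary

Instructor's comments:

Grade:

Supervising instructor: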
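Supplementary sketch: the paging loop in _getHtml steps the start parameter by 25 from 0 to 125, and its commented-out lines hint at sending a browser-like User-Agent. The snippet below illustrates both ideas using only the standard library. The helper names build_page_urls and build_request, the shortened User-Agent string, and the empty filter value are assumptions made for illustration; they are not taken from the report's code.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://movie.douban.com/top250"
PAGE_STEP = 25  # films per list page, matching pageSize += 25 in _getHtml


def build_page_urls(last_start=125):
    """Return one URL per list page for start = 0, 25, ..., last_start."""
    urls = []
    for start in range(0, last_start + 1, PAGE_STEP):
        # Assumption: the filter parameter is left empty here.
        query = urlencode({"start": start, "filter": ""})
        urls.append(BASE_URL + "?" + query)
    return urls


def build_request(url, user_agent="Mozilla/5.0"):
    """Wrap a URL in a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

Each Request returned by build_request can be passed to urllib.request.urlopen in place of the bare URL string, which may help when a site rejects the default Python User-Agent.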