Heritrix

Heritrix can easily produce many errors when too many JARs have been added to the project, so watch out for conflicting libraries. Many of the JARs in lib conflict with one another and differ only in version; the file names make this obvious.

In org.archive.crawler.frontier.AdaptiveRevisitFrontier, the @Override annotation on finalTasks() causes a compile error. Comment it out and the class compiles:

    // @Override
    public void finalTasks() {
        // by default do nothing
    }

In the PreconditionEnforcer class (package org.archive.crawler.prefetch), comment out the body of private boolean considerRobotsPreconditions(CrawlURI curi) and have it return false. This effectively disables the robots.txt precondition check.

Right-click the project and create a package to hold the classes you write to customize Heritrix. First, extend FrontierScheduler to write a class that handles links. The code is as follows:

    package my;

    import java.util.logging.Logger;

    import org.archive.crawler.datamodel.CandidateURI;
    import org.archive.crawler.postprocessor.FrontierScheduler;

    public class FrontierSchedulerFor163Mobile extends FrontierScheduler {

        private static Logger LOGGER =
                Logger.getLogger(FrontierSchedulerFor163Mobile.class.getName());

        public FrontierSchedulerFor163Mobile(String name) {
            super(name);
        }

        protected void schedule(CandidateURI caUri) {
            String url = caUri.toString();
            try {
                // Only URLs matching the target site, robots.txt, or DNS lookups are considered.
                if (url.indexOf(/* literal missing in the source */) != -1
                        || url.indexOf("robots.txt") != -1
                        || url.indexOf("dns:") != -1) {
                    if (url.indexOf(/* literal missing in the source */) != -1) {
                        return;
                    }
                    // Skip archives, executables, media and Office files.
                    if (url.endsWith(".zip") || url.endsWith(".exe") || url.endsWith(".pdf")
                            || url.endsWith(".doc") || url.endsWith(".xls") || url.endsWith(".rar")
                            || url.endsWith(".swf") || url.endsWith(".rmvb") || url.endsWith(".wmv")
                            || url.endsWith(".asf") || url.endsWith(".ppt") || url.endsWith(".mpg")
                            || url.endsWith(".mp3") || url.endsWith(".iso") || url.endsWith(".wma")
                            || url.endsWith(".dat") || url.endsWith(".ape") || url.endsWith(".ask")
                            || url.endsWith(".csf") || url.endsWith(".mkv") || url.endsWith(".vod")
                            || url.endsWith(".rn")) {
                        return;
                    }
                    // Schedule only URLs without a fragment.
                    if (url.indexOf("#") == -1) {
                        getController().getFrontier().schedule(caUri);
                    }
                } else {
                    return;
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
            }
        }
    }

This class filters out unwanted audio and video files, archives, executables, Office documents and similar files, so that only the URIs that should be crawled are scheduled.

A class that starts Heritrix programmatically:

    package my;

    import java.io.File;

    import javax.management.InvalidAttributeValueException;

    import org.archive.crawler.event.CrawlStatusListener;
    import org.archive.crawler.framework.CrawlController;
    import org.archive.crawler.framework.exceptions.InitializationException;
    import org.archive.crawler.settings.XMLSettingsHandler;

    public class StartHeritrixByEclipse {

        public static void main(String[] args) throws InterruptedException {
            // Path to order.xml
            String orderFile = "D:/Documents and Settings/admin/workspace/heritrix_1/jobs/keyanchu-20100827131710296/order.xml";
            File file = null;                     // the order.xml file
            CrawlStatusListener listener = null;  // listener
            XMLSettingsHandler handler = null;    // handler that reads order.xml
            CrawlController controller = null;    // the Heritrix controller
            try {
                file = new File(orderFile);
                handler = new XMLSettingsHandler(file);
                handler.initialize();             // read the settings in order.xml
                controller = new CrawlController();
                controller.initialize(handler);   // initialize the controller from those settings
                if (listener != null) {
                    controller.addCrawlStatusListener(listener); // register the listener
                }
                controller.requestCrawlStart();   // start crawling

                // Wait for as long as Heritrix is still running.
                while (true) {
                    if (controller.isRunning() == false) {
                        break;
                    }
                    Thread.sleep(1000);
                }
                // Heritrix is no longer running, so stop.
                controller.requestCrawlStop();
            } catch (InvalidAttributeValueException e) {
                e.printStackTrace();
            } catch (InitializationException e) {
                e.printStackTrace();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

The execution flow of this startup code is described in detail on pages 304-308 of the book 开发自己的搜索引擎 Lucene+Heritrix (Developing Your Own Search Engine: Lucene+Heritrix).

Next, write a queue-assignment class for the frontier (the link factory) that decides how URIs are distributed among the crawl threads. It overrides getClassKey, adds the ELFHash algorithm, and gives robots.txt URIs special handling. [Note: this class must also be registered in the Heritrix properties file.] In the figure (not reproduced here), the second-to-last line lists all the queue-assignment policies; add the class name of your own policy there.

    package org.archive.crawler.frontier;

    import java.util.logging.Level;
    import java.util.logging.Logger;

    import org.apache.commons.httpclient.URIException;
    import org.archive.crawler.datamodel.CandidateURI;
    import org.archive.crawler.framework.CrawlController;
    import org.archive.crawler.frontier.QueueAssignmentPolicy;
    import org.archive.net.UURI;
    import org.archive.net.UURIFactory;

    public class ELFHashQueueAssignmentPolicy extends QueueAssignmentPolicy {

        private static final Logger logger =
                Logger.getLogger(ELFHashQueueAssignmentPolicy.class.getName());
        private static String DEFAULT_CLASS_KEY = "default.";
        private static final String DNS = "dns";

        public ELFHashQueueAssignmentPolicy() {
        }

        @Override
        public String getClassKey(CrawlController controller, CandidateURI cauri) {
            String uri = cauri.getUURI().toString();
            String scheme = cauri.getUURI().getScheme();
            String candidate = null;
            String name = null;
            long hash = 0;
            try {
                name = cauri.getUURI().getName();
            } catch (URIException e1) {
                e1.printStackTrace();
            }
            try {
                if (scheme.equals(DNS)) {
                    if (cauri.getVia() != null) {
                        // Special handling for DNS: treat as being
                        // of the same class as the triggering URI.
                        // When a URI includes a port, this ensures
                        // the DNS lookup goes atop the host:port
                        // queue that triggered it, rather than
                        // some other host queue
                        UURI viaUuri = UURIFactory.getInstance(cauri.flattenVia());
                        candidate = viaUuri.getAuthorityMinusUserinfo();
                        // adopt scheme of triggering URI
                        // scheme = viaUuri.getScheme();
                        hash = ELFHash(viaUuri.toString());
                        candidate = candidate + Long.toString(hash % 10);
                    } else {
                        candidate = cauri.getUURI().getReferencedHost();
                    }
                } else {
                    candidate = cauri.getUURI().getAuthorityMinusUserinfo();
                    if (name != null && name.equals("robots.txt")) {
                        // Hash the triggering URI so robots.txt joins the queue that requested it.
                        hash = ELFHash(UURIFactory.getInstance(cauri.flattenVia()).toString());
                    } else {
                        hash = ELFHash(uri);
                    }
                    // Spread each host over up to ten queues.
                    candidate = candidate + Long.toString(hash % 10);
                }
                if (candidate == null || candidate.length() == 0) {
                    candidate = DEFAULT_CLASS_KEY;
                }
            } catch (URIException e) {
                logger.log(Level.INFO, "unable to extract class key; using default", e);
                candidate = DEFAULT_CLASS_KEY;
            }
            return candidate.replace(':', '#');
        }

        public String getClassKey(String uri) {
            long hash = ELFHash(uri);
            String a = Long.toString(hash % 100);
            return a;
        }

        // The loop body is damaged in the source; it is restored here
        // from the standard form of the ELF hash.
        public static long ELFHash(String str) {
            long hash = 0;
            long x = 0;
            for (int i = 0; i < str.length(); i++) {
                hash = (hash << 4) + str.charAt(i);
                if ((x = hash & 0xF0000000L) != 0) {
                    hash ^= (x >> 24);
                    hash &= ~x;
                }
            }
            return (hash & 0x7FFFFFFF);
        }
    }

Garbled characters in URLs crawled by Heritrix [partially solved]

In org.archive.crawler.writer.MirrorWriterProcessor.joinParts(), re-decode the assembled path from ISO-8859-1 bytes as UTF-8 before returning it:

    StringBuffer sb = new StringBuffer(length());
    String ss = null;
    sb.append(mainPart.asStringBuffer());
    if (null != uniquePart) {
        sb.append(uniquePart);
    }
    if (suffixAtEnd) {
        if (null != query) {
            sb.append(/* literal missing in the source */);
            sb.append(query);
        }
        if (null != suffix) {
            sb.append('.');
            sb.append(suffix);
        }
    } else {
        if (null != suffix) {
            sb.append('.');
            sb.append(suffix);
        }
        if (null != query) {
            sb.append(query);
        }
    }
    try {
        ss = new String(sb.toString().getBytes("ISO-8859-1"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return ss;

Modify the public CrawlURI next() method in org.archive.crawler.frontier.WorkQueueFrontier. The method is also briefly explained here; see the comments in the source for details. Change it as follows. The changed part, shown in red in the original, is the block between the two "Modified by Guo Yun" comments:

    /**
     * Get the next URL to crawl from the scheduler.
     */
    public CrawlURI next() throws InterruptedException, EndedException {
        while (true) { // loop until an exception occurs or the crawl ends
            // Modified by Guo Yun: when the queues hold no crawlable URL, reload the seeds and continue
            synchronized (this) {
                if (this.controller.getFrontier().isEmpty()) { // no crawlable URL left
                    loadSeeds(); // reload the seeds
                    // Wake all crawl threads. Note: notifyAll() is called here without
                    // holding the ToePool monitor, which throws IllegalMonitorStateException.
                    this.controller.getToePool().notifyAll();
                }
            }
            // Modified by Guo Yun: when the queues hold no crawlable URL, reload the seeds and continue

            long now = System.currentTimeMillis(); // time at which this fetch starts

            // Check for pause and stop commands and bandwidth limits; Heritrix may end here.
            preNext(now);

            /*
             * Allow at most one thread to fill the ready queue (readyClassQueues).
             */
            if (readyFiller.tryAcquire()) { // succeeds only if no other thread is filling
                try {
                    // free slots = target ready size - current ready size
                    int activationsNeeded = targetSizeForReadyQueues()
                            - readyClassQueues.size();
                    // While slots are free and inactive queues exist,
                    // move inactive queues into the ready queue.
                    while (activationsNeeded > 0 && !inactiveQueues.isEmpty()) {
                        activateInactiveQueue(); // promote one inactive queue
                        activationsNeeded--;
                    }
                } finally {
                    readyFiller.release(); // must release so the next caller can fill
                }
            }

            WorkQueue readyQ = null; // the ready work queue
            // Take and remove the head of the ready queue. If it is empty, wait up to
            // DEFAULT_WAIT, here 1000 milliseconds (1 second).
            Object key = readyClassQueues.poll(DEFAULT_WAIT, TimeUnit.MILLISECONDS); // the classKey

            if (key != null) {
                readyQ = (WorkQueue) this.allQueues.get(key); // look up the WorkQueue by classKey
            }

            if (readyQ != null) {
                // ... (the listing breaks off here in the source)
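The refill step in next() combines two standard java.util.concurrent idioms: a one-permit Semaphore tried with tryAcquire(), so that at most one thread tops up the ready queue while the others skip ahead, and a timed poll() on a blocking queue. The following is a minimal sketch of that pattern in isolation, not Heritrix code. The class ReadyFillerSketch, its fields, and the sample keys are hypothetical stand-ins for readyFiller, readyClassQueues and inactiveQueues.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the frontier's ready/inactive queue bookkeeping. */
final class ReadyFillerSketch {
    // One permit: at most one thread refills the ready queue at a time.
    private final Semaphore readyFiller = new Semaphore(1);
    private final BlockingQueue<String> readyKeys = new LinkedBlockingQueue<>();
    // Touched only while the permit is held, so a plain queue is enough here.
    private final Queue<String> inactiveKeys = new ArrayDeque<>();
    private final int targetReadySize;

    ReadyFillerSketch(int targetReadySize, String... inactive) {
        this.targetReadySize = targetReadySize;
        for (String key : inactive) {
            inactiveKeys.add(key);
        }
    }

    /** Tops up readyKeys from inactiveKeys unless another thread is already doing so. */
    void fillReady() {
        if (!readyFiller.tryAcquire()) {
            return; // someone else is filling; do not block
        }
        try {
            int needed = targetReadySize - readyKeys.size();
            while (needed > 0 && !inactiveKeys.isEmpty()) {
                readyKeys.add(inactiveKeys.poll());
                needed--;
            }
        } finally {
            readyFiller.release(); // always hand the permit back
        }
    }

    /** Returns the next ready key, or null if none arrives within waitMillis. */
    String nextKey(long waitMillis) {
        fillReady();
        try {
            return readyKeys.poll(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }

    public static void main(String[] args) {
        ReadyFillerSketch sketch =
                new ReadyFillerSketch(2, "a.example#1", "b.example#2", "c.example#3");
        String key;
        while ((key = sketch.nextKey(10)) != null) {
            System.out.println(key); // keys come out in insertion order
        }
    }
}
```

Using tryAcquire() rather than acquire() means a thread that loses the race goes straight to the timed poll instead of queuing behind the filler, which is the behavior the comments in the listing describe.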