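[Cover page]
Bachelor's thesis of Su Hang (苏杭), Department of Computer Science and Technology, Peking University
Implementation and Verification of an Extensible, Efficient Link Extraction Model
Advisor's comments:
Link extraction is an important component of a web crawling system. Su Hang's thesis work is an outstanding contribution to this component.
The work covered in the thesis reflects a general understanding of search engine technology. The link extraction module takes robustness, correctness, completeness, efficiency, and extensibility as its design goals. Having fully recognized the shortcomings of traditional link extraction methods, the author proposes a new design approach and implements it. The module comprises four processes: information extraction, information refinement, information analysis, and information storage. It has been successfully deployed in the "Tianwang" search engine. The thesis is rich in content, involves a large amount of work, and is strongly systematic; it is a valuable thesis.
Throughout the graduation project, Su Hang showed a conscientious attitude, worked hard and with focus, and demonstrated a strong drive to improve and a solid working style, contributing to the development of "Tianwang".
Advisor: Yan Hongfei (闫宏飞), June 18, 2003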
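Abstract (Chinese version)
With the ever wider development and application of the WWW (World Wide Web), search engines have become an important tool for finding information on it. In implementing a search engine system, how to discover more, and more varied, Web resources through link extraction is one of the important factors affecting search engine performance. This thesis summarizes five goals required in designing a link extraction module: robustness, correctness, completeness, efficiency, and extensibility. It analyzes the shortcomings of traditional link extraction methods from these perspectives and, as an improvement, proposes a new design approach. The thesis divides link extraction into four processes: information extraction, information refinement, information analysis, and information storage. Information extraction obtains the initial URI (Uniform Resource Identifier) data from a document by HTML grammar analysis; the refinement stage applies a URI resolution algorithm to refine the initial data; the analysis stage then screens and filters it further; finally, the results are stored in a doubly linked list structure. Based on this method, the thesis implements a new link extraction model and applies it to the Peking University Tianwang WWW search engine. After sufficient experimental data had been gathered, the new model was compared comprehensively with the traditional method on each metric. The results show that the model has a clear advantage.
Keywords: search engine, link extraction, Uniform Resource Identifier (URI)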
Abstract
As the World Wide Web becomes more and more popular, search engines have become an essential tool for finding information on the web. Among the many factors to consider in implementing a search engine, how to find more web resources with a URL extractor is a very important one. This thesis summarizes the basic objectives in the design of a URL extractor module, which are robustness, correctness, completeness, efficiency, and extensibility. With these objectives in mind, the thesis analyzes the weaknesses of the original design and proposes a new and more powerful design method. The new method is divided into four procedures: Information Extraction, Information Refinement, Information Analysis, and Information Storage. During the Information Extraction process, the initial URI (Uniform Resource Identifier) information is extracted from the HTML source using an HTML BNF analysis method. The initial raw data is refined and normalized in the Information Refinement process and filtered in the Information Analysis phase. Finally, the remaining correct and useful data is stored in a doubly linked list structure. By implementing this method, the thesis has produced a new URL extractor module and applied it to the PKU Tianwang WWW search engine. After sufficient test data had been acquired, the new module was compared comprehensively with the original one. According to the comparison, the new module performs much better in many respects.
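
The four procedures named in the abstract can be illustrated with a minimal sketch in Python. It is not the Tianwang implementation, which this section does not show. It assumes the standard-library `html.parser` stands in for the HTML BNF analysis, `urllib.parse.urljoin` for the URI resolution algorithm, a simple scheme check plus de-duplication for the analysis phase, and `collections.deque` for the doubly linked list. The names `LINK_ATTRS`, `RawLinkCollector`, and `extract_links` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit

# Hypothetical table of link-bearing tags; the thesis's actual list is not given here.
LINK_ATTRS = {"a": "href", "area": "href", "frame": "src", "iframe": "src"}


class RawLinkCollector(HTMLParser):
    """Stage 1 (extraction): collect raw URI strings from link-bearing tags."""

    def __init__(self):
        super().__init__()
        self.base = None  # value of the first <base href=...>, if any
        self.raw = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and self.base is None:
            self.base = attrs["href"]
        wanted = LINK_ATTRS.get(tag)
        if wanted:
            value = (attrs.get(wanted) or "").strip()
            if value:
                self.raw.append(value)


def extract_links(html, page_url):
    """Run the four stages over one page and return the surviving links."""
    collector = RawLinkCollector()
    collector.feed(html)  # HTMLParser tolerates unclosed tags and unquoted values
    collector.close()
    base = urljoin(page_url, collector.base) if collector.base else page_url

    seen = set()
    links = deque()  # assumption: stands in for the doubly linked list
    for raw in collector.raw:
        # Stage 2 (refinement): resolve against the base URI and drop the fragment.
        absolute, _fragment = urldefrag(urljoin(base, raw))
        # Stage 3 (analysis): keep only http(s) links with a host, skip duplicates.
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if absolute in seen:
            continue
        seen.add(absolute)
        # Stage 4 (storage): append to the result structure.
        links.append(absolute)
    return links
```

Calling `extract_links(page_source, "http://example.com/dir/a.html")` returns the absolute, de-duplicated links in document order. Keeping the stages separate means the filter in stage 3 can be replaced without touching the parser, which is one way to read the extensibility goal.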