An Approach of Extracting Information for Maritime Unstructured Text Based on Rules
Abstract: Structuring maritime data is an important step in maritime safety research. A large amount of maritime information is available on the Internet, but most of it consists of unstructured documents in differing formats. This paper proposes a rule-based method for extracting maritime information that converts free text into structured data. Web crawlers collect the text to be processed from maritime-related web pages. Based on the content of the collected text, the extraction task is defined as four data items: time, location, vessel name, and accident type. A custom maritime lexicon, built from the extraction task and its common trigger words, is used for Chinese word segmentation and part-of-speech tagging of the free text. Extraction rules are then compiled from an analysis of a large accident corpus and applied to produce structured maritime data. The method is evaluated on accident details published on the website of the Yangtze River Maritime Bureau (长江海事局). Precision and recall are 100% and 91% for time, 94.52% and 69% for location, 97.75% and 86% for vessel name, and 96.67% and 87% for accident type.
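To make the pipeline in the abstract more concrete, the following is a minimal sketch of rule-based extraction for the four data items, together with the precision and recall measures used in the evaluation. It is not the authors' implementation: the abstract does not give the lexicon or the rules, so the trigger words, the regular expressions, the sample sentence, and the names `extract` and `precision_recall` are all assumptions made for illustration. The sketch also replaces the segmentation and part-of-speech tagging step with plain regular expressions so that it runs on the standard library alone.

```python
import re

# Hypothetical trigger lexicon for accident types (not the paper's lexicon).
ACCIDENT_TYPES = ["碰撞", "搁浅", "触礁", "沉没", "火灾"]

# Hypothetical extraction rules, one per data item.
TIME_RULE = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日(?:\d{1,2}时(?:\d{1,2}分)?)?")
VESSEL_RULE = re.compile(r"“([^”]+)”轮")            # quoted name followed by 轮
LOCATION_RULE = re.compile(r"在(.+?)(?:水域|附近|处)")  # text between 在 and a place trigger

# Invented sample sentence in the style of an accident report.
SAMPLE = "2015年3月12日8时30分,“长江01”轮在武汉长江大桥附近水域与“江顺8”轮发生碰撞。"


def extract(text):
    """Apply each rule to the text and return one structured record."""
    time_match = TIME_RULE.search(text)
    location_match = LOCATION_RULE.search(text)
    accident = next((t for t in ACCIDENT_TYPES if t in text), None)
    return {
        "time": time_match.group(0) if time_match else None,
        "location": location_match.group(1) if location_match else None,
        "vessels": VESSEL_RULE.findall(text),
        "accident_type": accident,
    }


def precision_recall(extracted, gold):
    """Compare per-document extractions with gold values.

    precision = correct / number of values extracted
    recall    = correct / number of gold values
    A value of None in `extracted` means the rule produced nothing.
    """
    correct = sum(1 for e, g in zip(extracted, gold) if e is not None and e == g)
    produced = sum(1 for e in extracted if e is not None)
    precision = correct / produced if produced else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(extract(SAMPLE))
```

Under these definitions a rule that never produces a wrong value but sometimes produces nothing has perfect precision and lower recall, which is the pattern reported for the time item (100% precision, 91% recall).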