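Data Cleaning: Filtering Chinese Text from Web Pages

Implementation of Chinese-text filtering for data cleaning

Prepared by: Ma Fei

Department: R&D Center

Date prepared: May 23, 2016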

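Contents

I. Overview

II. Filtering Rules

III. Implementation

3.1 Table descriptions

3.2 Initialization

3.3 Package description

IV. Program Source

4.1 Package specification

4.2 Package body

4.3 Source code attachment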

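I. Overview

Per the project requirements, the HTML code stored in the HBase distributed database must be filtered so that only Chinese characters, punctuation marks, and a subset of HTML tags are retained.

The filtering is implemented here in PL/SQL.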

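II. Filtering Rules

1. Retain … and …; filter out all other characters.

2. Retain the …, …, … and … tags, keeping the Chinese characters and punctuation marks inside them; filter out all other characters.

3. Retain … and ….

4. Return the final result as a document encoded in UTF-8.

5. Store the retention rules in a dedicated rule table.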
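As a rough illustration of these rules outside the database, the Python sketch below scans a string once, copying whitelisted tags whole and keeping only Chinese characters and punctuation marks elsewhere. The tag list, the punctuation set, and the function names are placeholders chosen for the example; the actual lists are held in the rule table and are not reproduced here.

```python
# Hypothetical whitelist; the real tag names are stored in the rule table.
KEEP_TAGS = ("<p>", "</p>", "<br>")
# Hypothetical punctuation set (full-width and half-width marks).
KEEP_PUNCT = set("，。！？；：、,.!?;:")


def is_chinese(ch: str) -> bool:
    """True for code points in U+4E00..U+9FA5, the range used by the package."""
    return "\u4e00" <= ch <= "\u9fa5"


def filter_html(text: str) -> str:
    out = []
    pos = 0
    while pos < len(text):
        # A whitelisted tag is copied whole (case-insensitive match).
        tag = next(
            (t for t in KEEP_TAGS if text[pos:pos + len(t)].lower() == t),
            None,
        )
        if tag is not None:
            out.append(text[pos:pos + len(tag)])
            pos += len(tag)
            continue
        # Otherwise keep the character only if it is Chinese or punctuation.
        ch = text[pos]
        if is_chinese(ch) or ch in KEEP_PUNCT:
            out.append(ch)
        pos += 1
    return "".join(out)


print(filter_html('<div class="a"><p>你好，world!</p></div>'))
```

With these placeholder lists the call prints `<p>你好，!</p>`: the `div` wrapper and the Latin letters are dropped, while the whitelisted tag, the Chinese characters, and the punctuation survive.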

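III. Implementation

3.1 Table descriptions

Table: UTF8_NOFILTER_CH

Description: characters and keywords that are exempt from filtering.

Code:

    create table UTF8_NOFILTER_CH
    (
      CH NVARCHAR2(10),   -- character or keyword to keep
      SM NVARCHAR2(100),
      BZ VARCHAR2(20)     -- rule category; the package uses 'BDFH' and 'HTML'
    );

Table: UTF8_WRITE_CH_TMP

Description: holds the web page content one character per row.

Code:

    create table UTF8_WRITE_CH_TMP
    (
      XH INTEGER,         -- assumed column: the package body references it as xh
      V  NVARCHAR2(1),
      V2 VARCHAR2(10),
      R  INTEGER
    );

Table: UTF8

Description: holds the test data. Column V stores the original web page content, column V2 stores the filtered content, and XH is the sequence number and primary key that identifies each page to be processed.

Code:

    create table UTF8
    (
      V  NCLOB,
      V2 NCLOB,
      XH INTEGER primary key
    );

Table: UTF8_WRITE_LINE_TMP

Description: holds the characters of UTF8_WRITE_CH_TMP merged into lines.

Code:

    create table UTF8_WRITE_LINE_TMP
    (
      XH INTEGER,         -- assumed column: the package body references it as xh
      V  NVARCHAR2(4000),
      R  INTEGER
    );

Table: UTF8_HANDLE_LINE_TMP

Description: holds each line of UTF8_WRITE_LINE_TMP after the filtering rules have been applied.

Code:

    create table UTF8_HANDLE_LINE_TMP
    (
      XH INTEGER,         -- assumed column: the package body references it as xh
      V  NVARCHAR2(4000),
      R  INTEGER
    );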

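3.2 Initialization

Rule table: UTF8_NOFILTER_CH

Description: loads the characters that are exempt from filtering. At present these are the full-width and half-width punctuation marks and the HTML keywords that must be retained.

Code: .\scripts\insert\UTF8_NOFILTER_CH.sql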

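3.3 Package description

Subprograms declared in the package specification:

filter

Description: applies the filtering rules to the web page content in column V of every row of table UTF8.

Interface: procedure filter;

filter(n_xh int)

Description: applies the filtering rules to the web page content in column V of a single row of table UTF8.

Interface: procedure filter(n_xh int);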

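Subprograms in the package body:

read_ch_from_nclob

Description: reads the NCLOB one character at a time into table utf8_write_ch_tmp.

Code: see procedure read_ch_from_nclob in the package body.

read_line_from_ch

Description: assembles the characters in table utf8_write_ch_tmp into lines.

Code: see procedure read_line_from_ch in the package body.

write_line_to_nclob_init

Description: initializes the NCLOB before it is written for the first time.

Code: see procedure write_line_to_nclob_init in the package body.

write_line_to_nclob

Description: writes a processed line into the NCLOB column.

Code: see procedure write_line_to_nclob in the package body.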

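is_ch_filter

Description: checks whether a character is a UTF-8 encoded Chinese character.

Input: one character.

Output: 1 if it is, 0 if it is not.

Code: see function is_ch_filter in the package body.

is_sign_filter

Description: checks whether a character is a UTF-8 encoded full-width or half-width punctuation mark.

Input: one character.

Output: 1 if it is, 0 if it is not.

Code: see function is_sign_filter in the package body.

is_valid_filter

Description: checks whether the text at a given position is a UTF-8 encoded HTML keyword. The supported keywords are the rows of table UTF8_NOFILTER_CH with BZ = 'HTML'.

Input: the current character, its position, and the line being scanned.

Output: a positive number (the length of the matched keyword) if it is, 0 if it is not.

Code: see function is_valid_filter in the package body.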

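process

Description: applies the filtering rules to one line from table UTF8_WRITE_LINE_TMP.

Input: the line before processing.

Output: the filtered line.

Code: see function process in the package body.

handle_line

Description: loops over all lines in table UTF8_WRITE_LINE_TMP and applies the filtering rules to each.

Code: see procedure handle_line in the package body.

handle_exception

Description: post-processes the filtered lines, deleting invalid lines and correcting the others.

Code: see procedure handle_exception in the package body.

merge_into_nclob

Description: merges the lines in table UTF8_HANDLE_LINE_TMP and writes them to column V2 of table UTF8.

Code: see procedure merge_into_nclob in the package body.

filter

Description: applies the filtering rules to the web page content in column V of every row of table UTF8.

Code: see procedure filter in the package body.

filter(n_xh int)

Description: applies the filtering rules to the web page content in column V of a single row of table UTF8.

Code: see procedure filter(n_xh int) in the package body.

init

Description: performs the initialization required before filtering.

Code: see procedure init in the package body.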

IV. Program Source

4.1 Package specification

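    create or replace package dp_filter_ch is

      -- Method:  filter
      -- Purpose: applies the filtering rules to the web page content in
      --          column V of every row of table UTF8.
      -- Rules:
      --   1. keep Chinese characters
      --   2. keep full-width and half-width punctuation marks
      --   3. keep reserved HTML keywords; see rule table utf8_nofilter_ch
      procedure filter;

      -- Method:  filter(n_xh int)
      -- Purpose: applies the filtering rules to the web page content in
      --          column V of a single row of table UTF8.
      -- Rules:
      --   1. keep Chinese characters
      --   2. keep full-width and half-width punctuation marks
      --   3. keep reserved HTML keywords; see rule table utf8_nofilter_ch
      procedure filter(n_xh int);

    end dp_filter_ch;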

4.2 Package body

    create or replace package body dp_filter_ch is

      -- XH of the web page currently being processed; set by init
      p_xh int;

      -- Reads column V of the current page one character at a time
      -- into utf8_write_ch_tmp.
      procedure read_ch_from_nclob is
        directions nclob;
        buffer     nvarchar2(1);
        amount     int := 1;
        n_len      int;
        n_exists   int;
      begin
        -- load the page only if its characters are not in the table yet
        select count(0) into n_exists from utf8_write_ch_tmp where xh = p_xh;
        if n_exists = 0 then
          select v, length(v) into directions, n_len from utf8 where xh = p_xh;
          dbms_lob.open(directions, dbms_lob.lob_readonly);
          for i in 1 .. n_len loop
            dbms_lob.read(directions, amount, i, buffer);
            insert into utf8_write_ch_tmp (xh, r, v) values (p_xh, i, buffer);
            -- commit every 500 characters
            if mod(i, 500) = 0 then
              commit;
            end if;
          end loop;
          commit;
          dbms_lob.close(directions);
        end if;
      exception
        when others then
          null; -- any error is ignored
      end;

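      -- Assembles the characters in utf8_write_ch_tmp into lines. A line ends
      -- at a line feed (ASCII 10) or once it holds 500 characters.
      procedure read_line_from_ch is
        v_buffer nvarchar2(4000);
        n_offset int := 0;
      begin
        delete from utf8_write_line_tmp where xh = p_xh;
        commit;
        v_buffer := '';
        for i in (select xh, v, r, ascii(v) ch
                    from utf8_write_ch_tmp t
                   where xh = p_xh
                   order by r) loop
          -- nvl is needed because the length of an empty string is NULL
          if i.ch <> 10 and nvl(length(v_buffer), 0) < 500 then
            v_buffer := v_buffer || i.v;
          else
            n_offset := n_offset + 1;
            insert into utf8_write_line_tmp (xh, r, v) values (p_xh, n_offset, v_buffer);
            -- start the next line; a character other than the line feed is kept
            if i.ch = 10 then
              v_buffer := chr(10);
            else
              v_buffer := chr(10) || i.v;
            end if;
          end if;
          if mod(n_offset, 20) = 0 then
            commit;
          end if;
        end loop;
        -- write the last line under its own line number
        n_offset := n_offset + 1;
        insert into utf8_write_line_tmp (xh, r, v) values (p_xh, n_offset, v_buffer);
        commit;
      end;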
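To make the splitting rule of read_line_from_ch easier to follow, here is a minimal Python sketch of the same idea: a line ends at a line feed or once it holds 500 characters. The function name `split_lines` and the `max_len` parameter are illustrative and not part of the package, and the sketch does not store a line feed at the start of each new line as the procedure does.

```python
from typing import List


def split_lines(text: str, max_len: int = 500) -> List[str]:
    """Split text at line feeds, and also whenever a line reaches max_len."""
    lines = []
    buf = ""
    for ch in text:
        if ch != "\n" and len(buf) < max_len:
            buf += ch
        else:
            lines.append(buf)
            # A line feed is consumed; any other character starts the next line.
            buf = "" if ch == "\n" else ch
    lines.append(buf)  # flush whatever is left after the last character
    return lines


print(split_lines("第一行\n第二行"))
print(split_lines("abcdef", max_len=4))
```

The second call shows the length cap at work: no character is lost when the cap is reached, because the character that did not fit opens the next line.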

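      -- Initializes column V2 of the current page before the first write.
      procedure write_line_to_nclob_init is
        directions nclob;
        buffer     nvarchar2(100) := '';
        amount     int := length(buffer);
        offset     int := 1;
        n_len      int;
      begin
        update utf8 set v2 = empty_clob() where xh = p_xh;
        select v2, length(v2) into directions, n_len from utf8 where xh = p_xh for update;
        dbms_lob.open(directions, dbms_lob.lob_readwrite);
        -- DBMS_LOB.WRITE needs an amount of at least 1, so an empty buffer is skipped
        if nvl(amount, 0) > 0 then
          dbms_lob.write(directions, amount, offset, buffer);
        end if;
        update utf8 set v2 = directions where xh = p_xh;
        dbms_lob.close(directions);
        commit;
      end;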

      -- Appends one processed line, followed by a line feed, to column V2.
      procedure write_line_to_nclob(v_tmp nvarchar2) is
        directions nclob;
        buffer     nvarchar2(4000) := v_tmp || chr(10);
        amount     int := length(buffer);
        n_len      int;
      begin
        select v2 into directions from utf8 where xh = p_xh;
        -- initialize the NCLOB on the first write
        if directions is null then
          write_line_to_nclob_init;
        end if;
        select v2, length(v2) into directions, n_len from utf8 where xh = p_xh for update;
        dbms_lob.open(directions, dbms_lob.lob_readwrite);
        dbms_lob.writeappend(directions, amount, buffer);
        update utf8 set v2 = directions where xh = p_xh;
        dbms_lob.close(directions);
        commit;
      end;

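      -- Returns 1 if P_str is a UTF-8 encoded Chinese character
      -- (U+4E00 to U+9FA5), otherwise 0.
      function is_ch_filter(P_str nvarchar2) return int is
        v_set    varchar2(10) := 'UTF8';
        n_begin  int := to_number('4E00', 'XXXXXXXX');
        n_end    int := to_number('9FA5', 'XXXXXXXX');
        n_exists int;
        n_ret    int := 0;
      begin
        -- a Chinese character occupies three bytes in UTF-8
        if lengthb(P_str) = 3 then
          -- DUMP with format 1010 includes the character set name
          select instr(dump(P_str, 1010), v_set) into n_exists from dual;
          if n_exists > 0 then
            -- ASCIISTR yields '\XXXX'; strip the backslash and compare the code point
            if to_number(replace(asciistr(P_str), '\'), 'XXXX') between n_begin and n_end then
              n_ret := 1;
            end if;
          end if;
        end if;
        return n_ret;
      end;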
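The two tests made by is_ch_filter (three bytes in UTF-8, then a code point between U+4E00 and U+9FA5) can be tried out in a few lines of Python. This is only a sketch of the same check, not a port of the function; it reuses the name for readability and has no counterpart to the character-set test done with DUMP.

```python
def is_ch_filter(ch: str) -> int:
    """Return 1 if ch is a single character in U+4E00..U+9FA5, else 0."""
    if len(ch) != 1:
        return 0
    # Characters in this range always occupy three bytes in UTF-8,
    # which is what the LENGTHB(...) = 3 test relies on.
    if len(ch.encode("utf-8")) != 3:
        return 0
    return 1 if 0x4E00 <= ord(ch) <= 0x9FA5 else 0


for sample in ("中", "，", "a"):
    print(sample, is_ch_filter(sample))
```

The full-width comma is a useful case: it also takes three bytes in UTF-8, but its code point U+FF0C lies outside the range, so it is left to the punctuation check.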

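      -- Returns 1 if P_str is listed as a punctuation mark (BZ = 'BDFH')
      -- in the rule table, otherwise 0.
      function is_sign_filter(P_str nvarchar2) return int is
        n_exists int := 0;
      begin
        select count(0)
          into n_exists
          from dual
         where exists (select 1
                         from utf8_nofilter_ch t
                        where t.bz = 'BDFH'
                          and t.ch = asciistr(P_str));
        return sign(n_exists);
      end;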
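Because is_sign_filter compares ASCIISTR(P_str) with the CH column, the rule rows for full-width punctuation presumably hold the escaped form, such as \FF0C for the full-width comma. The Python sketch below imitates that escaping for non-ASCII characters and performs the same lookup against an in-memory set. The set contents are hypothetical stand-ins for the rule rows, and the `asciistr` helper is a simplified imitation that leaves every ASCII character unchanged.

```python
def asciistr(text: str) -> str:
    """Escape non-ASCII characters as \\XXXX per UTF-16 code unit."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged in this sketch
        else:
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\%02X%02X" % (units[i], units[i + 1]))
    return "".join(out)


# Hypothetical stand-in for the rows of UTF8_NOFILTER_CH with BZ = 'BDFH'.
PUNCTUATION_RULES = {asciistr(c) for c in "，。、,."}


def is_sign_filter(ch: str) -> int:
    """Return 1 if the escaped form of ch is in the rule set, else 0."""
    return 1 if asciistr(ch) in PUNCTUATION_RULES else 0


print(asciistr("，"))
print(is_sign_filter("。"), is_sign_filter("x"))
```

Storing the escaped form keeps the comparison independent of how the client encodes full-width characters, which may be why the lookup is written this way.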

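      -- Returns the length of the HTML keyword (BZ = 'HTML' in the rule table)
      -- that starts at position P_pos of P_str, or 0 if there is none.
      function is_valid_filter(P_ch nvarchar2, P_pos int, P_str nvarchar2) return int is
      begin
        if P_ch = '<' then
          -- len is counted in characters, since SUBSTR counts characters
          for i in (select asciistr(t.ch) ch, length(t.ch) len
                      from utf8_nofilter_ch t
                     where t.bz = 'HTML') loop
            if asciistr(upper(substr(P_str, P_pos, i.len))) = asciistr(upper(i.ch)) then
              return i.len;
            end if;
          end loop;
        end if;
        return 0;
      end;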

      -- Applies the filtering rules to one line and returns the filtered line.
      function process(P_str nvarchar2) return nvarchar2 is
        v_tmp nvarchar2(20);
        n_pos int := 1;
        n_len int;
        v_ret nvarchar2(4000);
      begin
        v_ret := '';
        loop
          v_tmp := substr(P_str, n_pos, 1);
          -- exit when no character is left
          if v_tmp is null then
            exit;
          end if;
          if is_ch_filter(v_tmp) = 1 or is_sign_filter(v_tmp) = 1 then
            -- Chinese character or punctuation mark: keep it, move one character
            v_ret := v_ret || v_tmp;
            n_pos := n_pos + 1;
          else
            n_len := is_valid_filter(v_tmp, n_pos, P_str);
            if n_len > 0 then
              -- retained keyword: keep it, move by the keyword length
              v_ret := v_ret || substr(P_str, n_pos, n_len);
              n_pos := n_pos + n_len;
            else
              -- anything else: drop it, move one character
              n_pos := n_pos + 1;
            end if;
          end if;
        end loop;
        return v_ret;
      end;

      -- Filters every line of the current page and stores the result
      -- in utf8_handle_line_tmp.
      procedure handle_line is
        v_buffer nvarchar2(4000);
      begin
        delete utf8_handle_line_tmp where xh = p_xh;
        for i in (select r, v
                    from utf8_write_line_tmp t
                   where xh = p_xh
                   order by r) loop
          v_buffer := process(i.v);
          insert into utf8_handle_line_tmp (xh, r, v) values (p_xh, i.r, v_buffer);
        end loop;
        commit;
      end;

      -- Appends the filtered lines, in order, to column V2 of table UTF8.
      procedure merge_into_nclob is
      begin
        for i in (select r, v
                    from utf8_handle_line_tmp
                   where xh = p_xh
                   order by r) loop
          write_line_to_nclob(i.v);
        end loop;
      end;

      -- Cleans up the lines of the current page after filtering.
      procedure handle_exception is
      begin
        -- delete lines that consist of a single line feed (ASCII 10)
        delete from utf8_write_line_tmp t
         where xh = p_xh and length(v) = 1 and ascii(v) = 10;
        -- delete lines that are empty or only 1 or 2 bytes long
        delete from utf8_handle_line_tmp t
         where xh = p_xh and (ascii(t.v) is null or lengthb(t.v) in (1, 2));
        commit;
        -- remove the two-character sequences '..', ',,', '.,' and ',.', in that order
        update utf8_handle_line_tmp
           set v = replace(replace(replace(replace(v, '..'), ',,'), '.,'), ',.')
         where xh = p_xh;
        -- strip a leading '.' or ','
        update utf8_handle_line_tmp
           set v = substr(v, 2)
         where xh = p_xh and (v like ',%' or v like '.%');
        -- strip a trailing '.'
        update utf8_handle_line_tmp
           set v = substr(v, 1, length(v) - 1)
         where xh = p_xh and v like '%.';
        -- delete a line when nothing but the following is left of it
        delete from utf8_handle_line_tmp where xh = p_xh and v is null;
        delete from utf8_handle_line_tmp where xh = p_xh and v = '.';
        delete from utf8_handle_line_tmp where xh = p_xh and v = ',';
        delete from utf8_handle_line_tmp where xh = p_xh and v = '..';
        delete from utf8_handle_line_tmp where xh = p_xh and v = '...';
        commit;
      end;
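The cleanup steps of handle_exception are easier to check on a handful of sample lines. The Python sketch below applies the same sequence to a list of strings: drop very short lines, remove doubled half-width punctuation, strip a leading '.' or ',' and a trailing '.', then drop lines with nothing useful left. The function name `clean_lines` and the sample data are made up for the illustration.

```python
from typing import List


def clean_lines(lines: List[str]) -> List[str]:
    cleaned = []
    for line in lines:
        # Drop empty lines and lines that are only 1 or 2 bytes long in UTF-8.
        if not line or len(line.encode("utf-8")) in (1, 2):
            continue
        # Remove doubled half-width punctuation, in the procedure's order.
        for pair in ("..", ",,", ".,", ",."):
            line = line.replace(pair, "")
        # Strip one leading '.' or ',' and one trailing '.'.
        if line[:1] in (".", ","):
            line = line[1:]
        if line.endswith("."):
            line = line[:-1]
        # Drop lines that are now empty or hold only leftover punctuation.
        if line in ("", ".", ",", "..", "..."):
            continue
        cleaned.append(line)
    return cleaned


sample = ["你好..世界", ",开头", "结尾.", "..", "ab", "正文"]
print(clean_lines(sample))
```

A single Chinese character takes three bytes in UTF-8, so the byte-length test only removes lines made of one or two ASCII characters, which after filtering can only be stray punctuation.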

      -- Prepares the package for filtering page n_xh.
      procedure init(n_xh int) is
      begin
        p_xh := n_xh;
        update utf8 set v2 = null where xh = p_xh;
        commit;
      end;
