Extracting the content of .htm files in Python with nltk's clean_html


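This post shows how to extract the text content of .htm files in Python using nltk's clean_html function. Readers who need this can use it as a reference.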


import os
import codecs
import re
# import nltk
# from pdf_extract import extract_pattern  # not used in this script


# clean_html is a library function from nltk
def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()


# os.walk yields three values for each directory it visits:
# 1. the directory path, 2. the names of its subdirectories (without path),
# 3. the names of its files
for htm_dir, i2, htm_names in os.walk('E:/companyProject/searchEngine/kg_database' + os.sep + 'fangji'):
    for htm_name_temp in htm_names:
        # splitext also copes with names that contain no dot or several dots
        temp_name, temp_type = os.path.splitext(htm_name_temp)
        if temp_type == '.htm':
            # The source files are GBK-encoded
            f = codecs.open(os.path.join(htm_dir, htm_name_temp), 'r', 'gbk')
            raw_data = f.read()
            f.close()
            final_data = clean_html(raw_data)
            f2 = codecs.open(os.path.join('E:/companyProject/searchEngine/kg_database/tempfile', temp_name + '.txt'), 'w', 'gbk')
            # print >>f2, final_data
            f2.write(final_data)
            f2.close()
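To see what the regex steps do without touching any files, the function can be tried on a small in-memory string. The following is a minimal sketch, not part of the original post: the `sample` HTML is made up for illustration, and `clean_html` is repeated here only so the snippet runs on its own.

```python
import re


def clean_html(html):
    """Regex-based tag stripper, following the same steps as the NLTK copy."""
    # Drop <script>/<style> blocks together with their contents
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Drop HTML comments (they may contain '>' characters)
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Replace every remaining tag with a single space
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Turn &nbsp; into a space, then halve runs of spaces twice
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()


# Hypothetical input, invented for this example
sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1 > 0;</script></head>"
    "<body><!-- a > comment --><p>Hello,&nbsp;<b>world</b>!</p></body></html>"
)

result = clean_html(sample)
print(result)
```

Note that every tag becomes a space, so text that was adjacent in the HTML (such as `world` and `!` here) ends up separated. The two space-collapsing passes only halve each run of spaces twice, so very long runs of spaces can still survive in the output.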




Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.


