Converting HTML to Plain Text in Python

2020-02-23 06:21:07

Source: Reposted

Contributor: Community member

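This article shows, by example, how to convert HTML to plain text in Python. It is shared here for reference. The details are as follows:

A project today required converting HTML to plain text. A quick search online showed that Python offers many different ways to do this.

Here are the two methods I tried myself today, for the benefit of later readers: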

Method 1:

1. Install nltk; it is available from PyPI.

(Note: it depends on the following packages: numpy, PyYAML)

2. Test code:
>>> import nltk
>>> aa = r'''
<html>
 <body>
Project: DeHTML 
Description: 
This small script is intended to allow conversion from HTML markup to
plain text.
 </body>
</html>
'''
>>> aa
'\n<html>\n <body>\n Project: DeHTML \n Description: \n This small script is intended to allow conversion from HTML markup to \n plain text.\n </body>\n </html>\n '
>>> print nltk.clean_html(aa)
Project: DeHTML
 Description :
 This small script is intended to allow conversion from HTML markup to
 plain text.
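
Note that `nltk.clean_html` belongs to NLTK 2.x; in NLTK 3.0 and later, calling it raises `NotImplementedError` and points to BeautifulSoup's `get_text()` instead. For simple, well-formed markup like the sample above, a similar result can be approximated with the standard library alone. The following is a minimal sketch in Python 3, not NLTK's implementation: the helper name `strip_tags` is hypothetical, and the regular expressions assume the input has no `>` characters inside attribute values.

```python
import re
from html import unescape


def strip_tags(html):
    """Hypothetical helper: reduce simple HTML to plain text using regexes."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1\s*>", "", html)
    # Drop comments before tags, because a comment may contain '>'.
    text = re.sub(r"(?s)<!--.*?-->", "", text)
    # Replace each remaining tag with a space so neighbouring words stay apart.
    text = re.sub(r"(?s)<[^>]*>", " ", text)
    # Decode character references such as &amp; and &nbsp;.
    text = unescape(text)
    # Collapse every run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = """
<html>
 <body>
Project: DeHTML
Description:
This small script is intended to allow conversion from HTML markup to
plain text.
 </body>
</html>
"""
print(strip_tags(sample))
```

Unlike the nltk output shown above, this sketch joins everything onto one line, because all whitespace (including line breaks) is collapsed. Regex-based stripping is fragile on malformed or complex pages; a real HTML parser is the safer choice there.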

Method 2:

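If nltk feels too heavyweight for such a small task, you can write the conversion yourself. The code is as follows:
from HTMLParser import HTMLParser  # Python 2 module name; in Python 3 use html.parser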
from re import sub
from sys import stderr
from traceback import print_exc

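class _DeHTMLParser(HTMLParser):
    """Collects the text content of an HTML document, discarding the tags."""
    def __init__(self):
        HTMLParser.__init__(self)
        # Text fragments gathered so far, in document order.
        self.__text = []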

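    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if len(text) > 0:
            # Collapse runs of spaces, tabs, and line breaks into one space.
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')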