STRIP TAGS AND JAVASCRIPT FROM HTML PAGE, LEAVING ONLY SAFE TAGS (PYTHON RECIPE)

OpenUiz 2020. 3. 17. 14:35

URL: http://code.activestate.com/recipes/52281/

Sometimes we get HTML input from the user. We want to allow only valid, safe tags; we want all tags to be balanced (for example, an unclosed <b> would leave all the following text on the page bold); and we want to strip out all JavaScript.

This recipe demonstrates how to do this using the sgmllib parser to parse HTML. Note that sgmllib exists only in Python 2; it was removed in Python 3.

import sgmllib, string

class StrippingParser(sgmllib.SGMLParser):

    # These are the HTML tags that we will leave intact
    valid_tags = ('b', 'a', 'i', 'br', 'p')

    from htmlentitydefs import entitydefs # replace entitydefs from sgmllib
    
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = ""
        self.endTagList = [] 
        
    def handle_data(self, data):
        if data:
            self.result = self.result + data

    def handle_charref(self, name):
        self.result = "%s&#%s;" % (self.result, name)
        
    def handle_entityref(self, name):
        if name in self.entitydefs:
            x = ';'
        else:
            # this breaks nonstandard entities that end with ';'
            x = ''
        self.result = "%s&%s%s" % (self.result, name, x)
    
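    def unknown_starttag(self, tag, attrs):
        """ Delete all tags except for legal ones """
        if tag in self.valid_tags:
            self.result = self.result + '<' + tag
            for k, v in attrs:
                # Drop event handlers (onclick, onload, ...) and any value
                # that starts with 'javascript', ignoring case.
                if k[0:2].lower() != 'on' and v[0:10].lower() != 'javascript':
                    self.result = '%s %s="%s"' % (self.result, k, v)
            # <br> is a void element and never takes a closing tag
            if tag != 'br':
                endTag = '</%s>' % tag
                self.endTagList.insert(0,endTag)
            self.result = self.result + '>'

    def unknown_endtag(self, tag):
        if tag in self.valid_tags:
            remTag = '</%s>' % tag
            # Emit the closing tag only if a matching opening tag is pending;
            # list.remove() would raise ValueError on a stray closing tag.
            if remTag in self.endTagList:
                self.result = "%s</%s>" % (self.result, tag)
                self.endTagList.remove(remTag)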

    def cleanup(self):
        """ Append missing closing tags """
        for endTag in self.endTagList:
            self.result = self.result + endTag
        

def strip(s):
    """ Strip illegal HTML tags from string s """
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return parser.result
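
The recipe above runs only on Python 2. Below is a minimal sketch of how the same idea might look on Python 3, assuming html.parser.HTMLParser as a stand-in for sgmllib. It is not part of the original recipe: the void_tags tuple, the open_tags list, and the escaping of attribute values with html.escape are assumptions added for illustration.

```python
from html import escape
from html.entities import entitydefs
from html.parser import HTMLParser


class StrippingParser(HTMLParser):
    """Python 3 sketch of the recipe, built on html.parser instead of sgmllib."""

    # These are the HTML tags that we will leave intact
    valid_tags = ('b', 'a', 'i', 'br', 'p')
    # Assumption (not in the original): tags that never take a closing tag
    void_tags = ('br',)

    def __init__(self):
        # convert_charrefs=False delivers entity and character references
        # through their own callbacks, so they can be passed through as-is.
        super().__init__(convert_charrefs=False)
        self.result = []
        self.open_tags = []  # most recently opened tag first

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append('&#%s;' % name)

    def handle_entityref(self, name):
        # Like the original: only known entities get the trailing ';'
        self.result.append('&%s%s' % (name, ';' if name in entitydefs else ''))

    def handle_starttag(self, tag, attrs):
        if tag not in self.valid_tags:
            return
        kept = []
        for k, v in attrs:
            v = k if v is None else v  # attribute written without a value
            if k.startswith('on') or v.strip().lower().startswith('javascript'):
                continue
            # Assumption: re-escape the value so a quote cannot end the attribute
            kept.append(' %s="%s"' % (k, escape(v, quote=True)))
        self.result.append('<%s%s>' % (tag, ''.join(kept)))
        if tag not in self.void_tags:
            self.open_tags.insert(0, tag)

    def handle_endtag(self, tag):
        # Only tags that were kept and are still open get a closing tag
        if tag in self.open_tags:
            self.result.append('</%s>' % tag)
            self.open_tags.remove(tag)

    def cleanup(self):
        """Append missing closing tags."""
        for tag in self.open_tags:
            self.result.append('</%s>' % tag)
        self.open_tags = []


def strip(s):
    """Strip illegal HTML tags from string s."""
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return ''.join(parser.result)
```

With this sketch, strip('<b>bold <i>text') returns '<b>bold <i>text</i></b>', and strip('<a href="javascript:alert(1)" onclick="x()">link</a>') returns '<a>link</a>'. As in the original, the contents of <script> tags still come through as text.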

Getting rid of JavaScript is hard. This code only drops attribute values that start with 'javascript' (such as 'javascript:' URLs) and attributes whose names start with 'on' (onClick and similar handlers).

The <script> tags themselves are removed, but their contents are still printed as part of the text, and for all I know 'vbscript:' URLs may be legal in IE.
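
One possible way to narrow these two gaps, sketched for Python 3 and not taken from the original recipe: allow only known URL schemes instead of rejecting 'javascript', and skip text that appears inside <script> or <style>. The names SAFE_SCHEMES, is_safe_url, ScriptDroppingParser, and text_without_scripts are hypothetical, and the scheme list is an assumption to adjust as needed.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumption: '' covers relative URLs; extend the tuple for other schemes.
SAFE_SCHEMES = ('', 'http', 'https', 'mailto')


def is_safe_url(value):
    """Return True if value is relative or uses an allow-listed scheme."""
    # Remove tabs and newlines, which browsers ignore inside a URL scheme
    cleaned = ''.join(ch for ch in value if ch not in '\t\r\n').strip()
    return urlsplit(cleaned).scheme.lower() in SAFE_SCHEMES


class ScriptDroppingParser(HTMLParser):
    """Collect text, skipping everything inside <script> and <style>."""

    skipped_tags = ('script', 'style')

    def __init__(self):
        super().__init__()
        self.result = []
        self.skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.skipped_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skipped_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.result.append(data)


def text_without_scripts(s):
    """Return the text of s with all tags and script/style bodies removed."""
    parser = ScriptDroppingParser()
    parser.feed(s)
    parser.close()
    return ''.join(parser.result)
```

Here is_safe_url rejects both 'javascript:alert(1)' and 'VBScript:msgbox(1)' because neither scheme is on the list, while text_without_scripts('Hi<script>alert(1)</script> there') returns 'Hi there'. An allow-list is the more cautious design choice because unknown schemes are refused by default. For untrusted input in production, a maintained sanitizer library is likely a safer choice than either sketch.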
