

URL: http://code.activestate.com/recipes/52281/


Sometimes we get HTML input from the user. We want to allow only valid, safe tags; we want all tags to be balanced (for example, an unclosed <b> will leave all the following text on the page bold); and we want to strip out all JavaScript.

This recipe demonstrates how to do this using the sgmllib parser to parse HTML.

Python, 59 lines


import sgmllib

class StrippingParser(sgmllib.SGMLParser):

    # These are the HTML tags that we will leave intact
    valid_tags = ('b', 'a', 'i', 'br', 'p')

    from htmlentitydefs import entitydefs # replace entitydefs from sgmllib

    def __init__(self):
        # The base class must be initialised, otherwise feed() fails
        sgmllib.SGMLParser.__init__(self)
        self.result = ""
        self.endTagList = []

    def handle_data(self, data):
        if data:
            self.result = self.result + data

    def handle_charref(self, name):
        self.result = "%s&#%s;" % (self.result, name)

    def handle_entityref(self, name):
        if name in self.entitydefs:
            x = ';'
        else:
            # this breaks nonstandard entities that end with ';'
            x = ''
        self.result = "%s&%s%s" % (self.result, name, x)

    def unknown_starttag(self, tag, attrs):
        """ Delete all tags except for legal ones """
        if tag in self.valid_tags:
            self.result = self.result + '<' + tag
            for k, v in attrs:
                # Drop event handlers (on...) and values starting with 'javascript'
                if k[0:2].lower() != 'on' and v[0:10].lower() != 'javascript':
                    self.result = '%s %s="%s"' % (self.result, k, v)
            endTag = '</%s>' % tag
            # Remember the closing tag, innermost first, for cleanup()
            self.endTagList.insert(0, endTag)
            self.result = self.result + '>'

    def unknown_endtag(self, tag):
        if tag in self.valid_tags:
            self.result = "%s</%s>" % (self.result, tag)
            remTag = '</%s>' % tag
            # Guard against a closing tag that was never opened
            if remTag in self.endTagList:
                self.endTagList.remove(remTag)

    def cleanup(self):
        """ Append missing closing tags """
        for endTag in self.endTagList:
            self.result = self.result + endTag

def strip(s):
    """ Strip illegal HTML tags from string s """
    parser = StrippingParser()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return parser.result
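
The recipe targets Python 2; sgmllib and htmlentitydefs are not available in Python 3. As a rough sketch only, the same idea might be carried over to Python 3 with html.parser.HTMLParser. The names StrippingParser3 and strip3 are made up for this illustration and are not part of the recipe. The sketch also deviates from the recipe in three places, all of them assumptions on my part: attribute values are re-escaped, <br> is treated as having no closing tag, and a closing tag that was never opened is dropped.

```python
from html import escape
from html.entities import entitydefs
from html.parser import HTMLParser


class StrippingParser3(HTMLParser):
    """Hypothetical Python 3 counterpart of the recipe's StrippingParser."""

    # Same allowlist as the recipe
    valid_tags = ('b', 'a', 'i', 'br', 'p')

    def __init__(self):
        # convert_charrefs=False keeps entity and character references as
        # separate events, like the sgmllib handlers in the recipe.
        super().__init__(convert_charrefs=False)
        self.result = []
        self.end_tags = []

    def handle_data(self, data):
        self.result.append(data)

    def handle_charref(self, name):
        self.result.append('&#%s;' % name)

    def handle_entityref(self, name):
        # Only known entities get the trailing ';', as in the recipe
        suffix = ';' if name in entitydefs else ''
        self.result.append('&%s%s' % (name, suffix))

    def handle_starttag(self, tag, attrs):
        if tag not in self.valid_tags:
            return
        kept = []
        for k, v in attrs:
            v = v or ''  # HTMLParser reports value-less attributes as None
            if k[:2].lower() != 'on' and v[:10].lower() != 'javascript':
                # Assumption: re-escape the value, since HTMLParser unescapes it
                kept.append(' %s="%s"' % (k, escape(v, quote=True)))
        self.result.append('<%s%s>' % (tag, ''.join(kept)))
        if tag != 'br':  # assumption: <br> never needs a closing tag
            self.end_tags.insert(0, '</%s>' % tag)

    def handle_endtag(self, tag):
        end = '</%s>' % tag
        # Emit the closing tag only if a matching allowed tag is still open
        if end in self.end_tags:
            self.result.append(end)
            self.end_tags.remove(end)

    def cleanup(self):
        """Append missing closing tags, innermost first."""
        self.result.extend(self.end_tags)
        self.end_tags = []


def strip3(s):
    """Strip disallowed HTML tags from string s (Python 3 sketch)."""
    parser = StrippingParser3()
    parser.feed(s)
    parser.close()
    parser.cleanup()
    return ''.join(parser.result)
```

With this sketch, strip3('<b>bold <i>text') gives '<b>bold <i>text</i></b>', and the href and onclick attributes of '<a href="javascript:alert(1)" onclick="x()" title="t">link</a>' are dropped while title is kept.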


Getting rid of JavaScript is hard. Our code only handles attribute values that start with 'javascript' (such as 'javascript:' URLs) and attributes whose names start with 'on' (onClick and similar handlers).
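
To make the limits of that check concrete, here is the attribute test from unknown_starttag pulled out into a standalone function. The name attr_allowed and the sample values are invented for this illustration. It shows what the prefix test rejects and, as far as I can tell, a couple of values it lets through.

```python
def attr_allowed(name, value):
    """Same test as the recipe: reject on... names and 'javascript' prefixes."""
    return name[:2].lower() != 'on' and value[:10].lower() != 'javascript'


# Hypothetical sample attributes, for illustration only
samples = [
    ('href', 'http://example.com/'),   # kept
    ('href', 'JavaScript:alert(1)'),   # dropped: the comparison is case-insensitive
    ('onClick', 'doEvil()'),           # dropped: name starts with 'on'
    ('href', ' javascript:alert(1)'),  # kept: a leading space defeats the prefix test
    ('href', 'vbscript:msgbox(1)'),    # kept: other schemes are not checked
]

for name, value in samples:
    print('%-8s %-24r %s' % (name, value, 'kept' if attr_allowed(name, value) else 'dropped'))
```

The last two samples are the reason the check should be read as a first line of defence rather than a complete filter.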

The <script> tags themselves are stripped, but their contents will still be printed as part of the text. The code also does nothing about other script schemes; for all I know, 'vbscript:' URLs may be legal in IE.
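
One possible way around the <script> problem is to keep a counter while the parser is inside a script element and ignore any text seen in the meantime. The sketch below assumes Python 3 and html.parser.HTMLParser rather than the recipe's sgmllib. TextWithoutScripts and text_without_scripts are hypothetical names, and the class only collects text, so it would still have to be combined with the tag filtering.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collect text, skipping anything inside <script> ... </script>."""

    skip_tags = ('script',)

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def text_without_scripts(s):
    parser = TextWithoutScripts()
    parser.feed(s)
    parser.close()
    return ''.join(parser.parts)
```

For example, text_without_scripts('a<script>alert(1)</script>b') returns 'ab'.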
