CloudBot/plugins/urlparse.py

from util import hook, http, urlnorm
import re

titler = re.compile(r'(?si)<title>(.+?)</title>');

def parse(url):
    """ an improved version of our parsing code - now regex powered """
    url = urlnorm.normalize(url.encode('utf-8'))
    url = url.decode('utf-8')
    # add http if its missing
    if url[:7] != "http://" and url[:8] != "https://":
        url = "http://" + url
    try:
        # get the title
        request = http.open(url)
        real_url = request.geturl()
        text = request.read()
        text = text.decode('utf8')
        match = titler.search(text)
        title = match.group(1)
    except:
        return "Could not parse URL! Are you sure its valid?"
        
    title = http.unescape(title)
    
    # if the url has been redirected, show us
    if real_url == url:
        return title
    else:
        return u"%s [%s]" % (title, real_url)

@hook.command
def title(inp):
    ".title <url> -- gets the title of a web page"
    return parse(inp)
First :D 2011-11-20 10:23:31 +01:00			`from util import hook, http, urlnorm`
			`import re`

Dumped the buggy auto parser. replaced it with a faster command-based system that doesn't fail as much 2012-03-21 11:52:15 +01:00			`titler = re.compile(r'(?si)<title>(.+?)</title>');`
Removed poken factoid prefixes, added more HTML entity codes to the URL parser 2011-11-26 01:15:43 +01:00
Dumped the buggy auto parser. replaced it with a faster command-based system that doesn't fail as much 2012-03-21 11:52:15 +01:00			`def parse(url):`
			`""" an improved version of our parsing code - now regex powered """`
			`url = urlnorm.normalize(url.encode('utf-8'))`
			`url = url.decode('utf-8')`
			`# add http if its missing`
			`if url[:7] != "http://" and url[:8] != "https://":`
			`url = "http://" + url`
Made url parsing toggleable, added more advenced regex 2012-03-04 02:23:39 +01:00			`try:`
Dumped the buggy auto parser. replaced it with a faster command-based system that doesn't fail as much 2012-03-21 11:52:15 +01:00			`# get the title`
			`request = http.open(url)`
			`real_url = request.geturl()`
			`text = request.read()`
			`text = text.decode('utf8')`
			`match = titler.search(text)`
			`title = match.group(1)`
			`except:`
			`return "Could not parse URL! Are you sure its valid?"`

Changed code to unescape titles properly 2012-02-29 09:43:11 +01:00			`title = http.unescape(title)`
Dumped the buggy auto parser. replaced it with a faster command-based system that doesn't fail as much 2012-03-21 11:52:15 +01:00
			`# if the url has been redirected, show us`
			`if real_url == url:`
			`return title`
Improved URL parser 2011-11-30 13:51:43 +01:00			`else:`
Dumped the buggy auto parser. replaced it with a faster command-based system that doesn't fail as much 2012-03-21 11:52:15 +01:00			`return u"%s [%s]" % (title, real_url)`

			`@hook.command`
			`def title(inp):`
			`".title <url> -- gets the title of a web page"`
			`return parse(inp)`