CloudBot/plugins/urlparse.py

from util import hook, http, urlnorm
import re

titler = re.compile(r'(?si)<title>(.+?)</title>')


def get_title(url):
    url = urlnorm.normalize(url.encode('utf-8'))
    url = url.decode('utf-8')
    # add http:// if it's missing
    if "://" not in url:
        url = "http://" + url
    try:
        # get the title
        request = http.open(url)
        real_url = request.geturl()
        text = request.read()
        text = text.decode('utf8')
        match = titler.search(text)
        title = match.group(1)
    # any failure (bad URL, network error, undecodable page, no <title>) ends up here
    except Exception:
        return "Could not parse URL! Are you sure it's valid?"

    title = http.unescape(title)

    # if the URL was redirected, show the final URL as well
    if real_url == url:
        return title
    else:
        return u"%s [%s]" % (title, real_url)


@hook.command
def title(inp):
    "title <url> -- gets the title of a web page"
    return get_title(inp)
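
The title-matching and redirect-reporting steps of get_title can be tried without the bot or a network connection. The following is a minimal sketch, not the plugin's own code: it assumes Python 3 and swaps in the standard library's html.unescape for the plugin's http.unescape. The names extract_title and format_result are hypothetical helpers introduced only for this illustration, and the strip() call is an addition the plugin does not make.

```python
import html
import re

# Same pattern as the plugin: case-insensitive, "." also matches newlines,
# and the non-greedy group stops at the first closing tag.
TITLE_RE = re.compile(r'(?si)<title>(.+?)</title>')


def extract_title(page_text):
    """Return the unescaped <title> text, or None if the page has no title.

    Hypothetical helper: the plugin itself lets a failed match raise and
    reports it through its except branch instead of returning None.
    """
    match = TITLE_RE.search(page_text)
    if match is None:
        return None
    # html.unescape stands in for the plugin's http.unescape (assumption).
    # strip() is an addition; the plugin keeps surrounding whitespace.
    return html.unescape(match.group(1)).strip()


def format_result(title, requested_url, real_url):
    """Append the final URL in brackets when the request was redirected."""
    if real_url == requested_url:
        return title
    return "%s [%s]" % (title, real_url)
```

For example, extract_title("<title>Tom &amp; Jerry</title>") returns "Tom & Jerry", and format_result("T", "http://a", "http://b") returns "T [http://b]". Returning None for a missing title keeps the "no title" case separate from network errors, which the plugin's single except branch does not distinguish.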