CloudBot/plugins/urlparse.py

from util import hook, http, urlnorm
import BeautifulSoup

ignored_urls = ["http://google.com", "http://youtube.com",
                "http://pastebin.com", "http://mibpaste.com",
                "http://fpaste.com", "http://git.io"]

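def parse(match):
    """Return the page title for a URL, or "fail" if none can be fetched."""
    url = urlnorm.normalize(match.encode('utf-8'))
    if url in ignored_urls:
        # Previously this path fell through and returned None.
        return "fail"
    url = url.decode('utf-8')
    try:
        soup = BeautifulSoup.BeautifulSoup(http.get(url))
        title = soup.title.string
    except Exception:
        # Network errors, unparsable pages, and pages without a <title>
        # (soup.title is None) all end up here.
        return "fail"
    # A <title> with nested markup or no text yields None.
    return title if title else "fail"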
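
# A minimal sketch of the title-extraction step, assuming Python 3 and only the
# standard library. The plugin itself uses BeautifulSoup 3 and util.http; this
# swaps in html.parser and works on an HTML string, so no network access is
# involved. TitleParser and extract_title are hypothetical names, not part of
# CloudBot.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the whitespace-normalized page title, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or None
```

# Returning None for a missing or empty title lets the caller decide how to
# report the failure, instead of relying on a sentinel string such as "fail".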

# In a raw string, \" keeps the backslash, and the regex engine reads \" as a literal double quote.
@hook.regex(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))")
def urlparser(match, say=None, bot=None):
    try:
        enabled = bot.config["plugins"]["urlparse"]["enabled"]
    except KeyError:
        enabled = False

    if not enabled:
        return

    url = urlnorm.normalize(match.group().encode('utf-8'))
    # Default to http:// when the matched text carries no scheme.
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    for x in ignored_urls:
        if x in url:
            return
    title = parse(url)
    if title == "fail":
        return
    title = http.unescape(title)
    realurl = http.get_url(url)
    if realurl == url:
        say(u"(Link) %s" % title)
    else:
        # The URL redirected, so show where it actually ends up.
        say(u"(Link) %s [%s]" % (title, realurl))
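
# A minimal sketch of the scheme and ignore-list checks from urlparser, pulled
# out into standalone functions so they can be tried without a running bot.
# It assumes Python 3; ensure_scheme, is_ignored and IGNORED_URLS are
# hypothetical names, and the list is copied from ignored_urls above.

```python
IGNORED_URLS = ["http://google.com", "http://youtube.com",
                "http://pastebin.com", "http://mibpaste.com",
                "http://fpaste.com", "http://git.io"]


def ensure_scheme(url):
    """Prefix http:// unless the URL already starts with http:// or https://."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


def is_ignored(url, ignored=IGNORED_URLS):
    """True when any ignored entry occurs in the URL (substring match, as in the plugin)."""
    return any(entry in url for entry in ignored)
```

# Because the check is a plain substring match against http:// entries, a URL
# such as http://www.google.com/ or https://google.com/ is not treated as
# ignored. Comparing parsed hostnames would be one way to close that gap.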

# Commit history (from the blame view):
# 2011-11-20 10:23:31 +01:00  First :D
# 2011-11-25 16:36:44 +01:00  Changed factoid commands, added URL parser
# 2011-11-26 01:15:43 +01:00  Removed poken factoid prefixes, added more HTML entity codes to the URL parser
# 2011-11-30 13:51:43 +01:00  Improved URL parser
# 2012-02-29 07:58:10 +01:00  fixed urltools.py
# 2012-02-29 08:32:42 +01:00  update urltools.py for git.io
# 2012-02-29 09:43:11 +01:00  Changed code to unescape titles properly
# 2012-03-04 02:23:39 +01:00  Made url parsing toggleable, added more advenced regex