Sync up with the fork
commit d6317cd2ce
@@ -1,9 +1,11 @@
language: python
python:
  - "2.7"
  - "2.6"
  - "2.7"
  - "3.2"
  - "3.3"
before_install: sudo apt-get install libxml2-dev libxslt-dev
# command to install dependencies
install: pip install -r requirements.txt --use-mirrors
# command to run tests
script: python setup.py install && nosetests src/breadability/tests
script: python setup.py install && nosetests tests
@@ -0,0 +1,3 @@
Rick Harding (original author)
Michal Belica (current maintainer)
nhnifong
@@ -0,0 +1,71 @@
.. :changelog:

Changelog for readability
==========================

- Added property ``Article.main_text`` for getting text annotated with
  semantic HTML tags (<em>, <strong>, ...).
- Join node with 1 child of the same type. From
  ``<div><div>...</div></div>`` we get ``<div>...</div>``.
- Don't change <div> to <p> if it contains <p> elements.
- Renamed test generation helper 'readability_newtest' -> 'readability_test'.
- Renamed package to readability.
- Added support for Python >= 3.2.
- Py3k compatible package 'charade' is used instead of 'chardet'.

0.1.11 (Dec 12th 2012)
-----------------------
- Add argparse to the install requires for python < 2.7

0.1.10 (Sept 13th 2012)
-----------------------
- Updated scoring bonus and penalty with ``,`` and ``"`` characters.

0.1.9 (Aug 27th 2012)
----------------------
- In case of an issue dealing with candidates we need to act like we didn't
  find any candidates for the article content. #10

0.1.8 (Aug 27th 2012)
----------------------
- Add code/tests for an empty document.
- Fixes #9 to handle xml parsing issues.

0.1.7 (July 21st 2012)
----------------------
- Change the encode 'replace' kwarg into a normal arg for older python
  versions.

0.1.6 (June 17th 2012)
----------------------
- Fix the link removal, add tests and a place to process other bad links.

0.1.5 (June 16th 2012)
----------------------
- Start to look at removing bad links from content in the conditional cleaning
  state. This was really used for the scripting.com site's garbage.

0.1.4 (June 16th 2012)
----------------------
- Add a test generation helper readability_newtest script.
- Add tests and fixes for the scripting news parse failure.

0.1.3 (June 15th 2012)
----------------------
- Add actual testing of full articles for regression tests.
- Update parser to properly clean after winner doc node is chosen.

0.1.2 (May 28th 2012)
----------------------
- Bugfix: #4 issue with logic of the 100char bonus points in scoring.
- Garden with PyLint/PEP8.
- Add a bunch of tests to readable/scoring code.

0.1.1 (May 11th 2012)
---------------------
- Fix bugs in scoring to help in getting the right content.
- Add concept of -d which shows scoring/decisions on nodes.
- Update command line client to be able to pipe output to other tools.

0.1.0 (May 6th 2012)
--------------------
- Initial release and upload to PyPI.
@@ -0,0 +1,10 @@
Copyright (c) 2013 Rick Harding, Michal Belica and contributors

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -1,2 +1,3 @@
include README.rst
include NEWS.txt
include CHANGELOG.rst
include LICENSE.rst
@@ -1,61 +0,0 @@
# Makefile to help automate tasks
WD := $(shell pwd)
PY := bin/python
PIP := bin/pip
PEP8 := bin/pep8
NOSE := bin/nosetests

# ###########
# Tests rule!
# ###########
.PHONY: test
test: venv develop $(NOSE)
	$(NOSE) --with-id -s src/breadability/tests

$(NOSE):
	$(PIP) install nose pep8 pylint coverage

# #######
# INSTALL
# #######
.PHONY: all
all: venv develop

venv: bin/python
bin/python:
	virtualenv .

.PHONY: clean_venv
clean_venv:
	rm -rf bin include lib local man share

.PHONY: develop
develop: lib/python*/site-packages/readability_lxml.egg-link
lib/python*/site-packages/readability_lxml.egg-link:
	$(PY) setup.py develop


# ###########
# Development
# ###########
.PHONY: clean_all
clean_all: clean_venv
	if [ -d dist ]; then \
		rm -r dist; \
	fi


# ###########
# Deploy
# ###########
.PHONY: dist
dist:
	$(PY) setup.py sdist

.PHONY: upload
upload:
	$(PY) setup.py sdist upload

.PHONY: version_update
version_update:
	$(EDITOR) setup.py src/breadability/__init__.py NEWS.txt
@@ -0,0 +1,7 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals


__version__ = "0.1.14"
@@ -0,0 +1,101 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from sys import version_info


PY3 = version_info[0] == 3


if PY3:
    bytes = bytes
    unicode = str
else:
    bytes = str
    unicode = unicode
string_types = (bytes, unicode,)


try:
    import urllib2 as urllib
except ImportError:
    import urllib.request as urllib


def unicode_compatible(cls):
    """
    Decorator for unicode compatible classes. Method ``__unicode__``
    has to be implemented for the decorator to work as expected.
    """
    if PY3:
        cls.__str__ = cls.__unicode__
        cls.__bytes__ = lambda self: self.__str__().encode("utf8")
    else:
        cls.__str__ = lambda self: self.__unicode__().encode("utf8")

    return cls
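A quick way to see what the decorator does (a minimal Python 3 sketch; the hard-coded `PY3` flag and the `Greeting` class are illustrative, not part of the module):

```python
PY3 = True  # on Python 3 the decorator wires __str__/__bytes__ from __unicode__


def unicode_compatible(cls):
    # same logic as the module's decorator, reproduced for a standalone demo
    if PY3:
        cls.__str__ = cls.__unicode__
        cls.__bytes__ = lambda self: self.__str__().encode("utf8")
    else:
        cls.__str__ = lambda self: self.__unicode__().encode("utf8")
    return cls


@unicode_compatible
class Greeting(object):
    def __unicode__(self):
        return "héllo"


g = Greeting()
print(str(g))    # héllo
print(bytes(g))  # b'h\xc3\xa9llo'
```

The class only defines `__unicode__`; the decorator derives the other two protocols from it, which is what lets `OriginalDocument` below declare a single rendering method.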
def to_string(object):
    return to_unicode(object) if PY3 else to_bytes(object)


def to_bytes(object):
    try:
        if isinstance(object, bytes):
            return object
        elif isinstance(object, unicode):
            return object.encode("utf8")
        else:
            # try encode instance to bytes
            return instance_to_bytes(object)
    except UnicodeError:
        # recover from codec error and use 'repr' function
        return to_bytes(repr(object))


def to_unicode(object):
    try:
        if isinstance(object, unicode):
            return object
        elif isinstance(object, bytes):
            return object.decode("utf8")
        else:
            # try decode instance to unicode
            return instance_to_unicode(object)
    except UnicodeError:
        # recover from codec error and use 'repr' function
        return to_unicode(repr(object))


def instance_to_bytes(instance):
    if PY3:
        if hasattr(instance, "__bytes__"):
            return bytes(instance)
        elif hasattr(instance, "__str__"):
            return unicode(instance).encode("utf8")
    else:
        if hasattr(instance, "__str__"):
            return bytes(instance)
        elif hasattr(instance, "__unicode__"):
            return unicode(instance).encode("utf8")

    return to_bytes(repr(instance))


def instance_to_unicode(instance):
    if PY3:
        if hasattr(instance, "__str__"):
            return unicode(instance)
        elif hasattr(instance, "__bytes__"):
            return bytes(instance).decode("utf8")
    else:
        if hasattr(instance, "__unicode__"):
            return unicode(instance)
        elif hasattr(instance, "__str__"):
            return bytes(instance).decode("utf8")

    return to_unicode(repr(instance))
@@ -0,0 +1,89 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from itertools import groupby
from lxml.sax import saxify, ContentHandler
from .utils import is_blank, shrink_text
from ._compat import to_unicode


_SEMANTIC_TAGS = frozenset((
    "a", "abbr", "acronym", "b", "big", "blink", "blockquote", "cite", "code",
    "dd", "del", "dfn", "dir", "dl", "dt", "em", "h", "h1", "h2", "h3", "h4",
    "h5", "h6", "i", "ins", "kbd", "li", "marquee", "menu", "ol", "pre", "q",
    "s", "samp", "strike", "strong", "sub", "sup", "tt", "u", "ul", "var",
))


class AnnotatedTextHandler(ContentHandler):
    """A class for converting an HTML DOM into annotated text."""

    @classmethod
    def parse(cls, dom):
        """Converts DOM into paragraphs."""
        handler = cls()
        saxify(dom, handler)
        return handler.content

    def __init__(self):
        self._content = []
        self._paragraph = []
        self._dom_path = []

    @property
    def content(self):
        return self._content

    def startElementNS(self, name, qname, attrs):
        namespace, name = name

        if name in _SEMANTIC_TAGS:
            self._dom_path.append(to_unicode(name))

    def endElementNS(self, name, qname):
        namespace, name = name

        if name == "p" and self._paragraph:
            self._append_paragraph(self._paragraph)
        elif name in ("ol", "ul", "pre") and self._paragraph:
            self._append_paragraph(self._paragraph)
            self._dom_path.pop()
        elif name in _SEMANTIC_TAGS:
            self._dom_path.pop()

    def endDocument(self):
        if self._paragraph:
            self._append_paragraph(self._paragraph)

    def _append_paragraph(self, paragraph):
        paragraph = self._process_paragraph(paragraph)
        self._content.append(paragraph)
        self._paragraph = []

    def _process_paragraph(self, paragraph):
        current_paragraph = []

        for annotation, items in groupby(paragraph, key=lambda i: i[1]):
            if annotation and "li" in annotation:
                for text, _ in items:
                    text = shrink_text(text)
                    current_paragraph.append((text, annotation))
            else:
                text = "".join(i[0] for i in items)
                text = shrink_text(text)
                current_paragraph.append((text, annotation))

        return tuple(current_paragraph)

    def characters(self, content):
        if is_blank(content):
            return

        if self._dom_path:
            pair = (content, tuple(sorted(frozenset(self._dom_path))))
        else:
            pair = (content, None)

        self._paragraph.append(pair)
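The grouping step in `_process_paragraph` can be tried in isolation. Below, `shrink_text` is stood in for by a plain whitespace collapse (an assumption; the real helper lives in `.utils`), and `paragraph` is a made-up example of the `(text, annotation)` pairs the SAX handler collects:

```python
from itertools import groupby


def shrink_text(text):
    # hypothetical stand-in for readability's shrink_text helper:
    # collapse runs of whitespace and trim the ends
    return " ".join(text.split())


# (text, annotation) pairs as collected by the handler's characters() method
paragraph = [
    ("Hello ", None),
    ("wor", ("em",)),
    ("ld", ("em",)),
    ("!", None),
]

merged = []
for annotation, items in groupby(paragraph, key=lambda i: i[1]):
    # adjacent runs with identical annotations collapse into one chunk
    text = shrink_text("".join(i[0] for i in items))
    merged.append((text, annotation))

print(merged)  # [('Hello', None), ('world', ('em',)), ('!', None)]
```

Because `groupby` only merges *adjacent* equal keys, two `<em>` text nodes split by the parser fuse back into one annotated chunk without disturbing the surrounding plain text.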
@@ -0,0 +1,130 @@
# -*- coding: utf8 -*-

"""Generate a clean nice starting html document to process for an article."""

from __future__ import absolute_import

import re
import logging
import charade

from lxml.etree import tostring, tounicode, XMLSyntaxError
from lxml.html import document_fromstring, HTMLParser

from ._compat import unicode, to_bytes, to_unicode, unicode_compatible
from .utils import cached_property


logger = logging.getLogger("readability")


TAG_MARK_PATTERN = re.compile(to_bytes(r"</?[^>]*>\s*"))
def determine_encoding(page):
    encoding = "utf8"
    text = TAG_MARK_PATTERN.sub(to_bytes(" "), page)

    # don't venture to guess
    if not text.strip() or len(text) < 10:
        return encoding

    # try to enforce UTF-8
    diff = text.decode(encoding, "ignore").encode(encoding)
    sizes = len(diff), len(text)

    # 99% of UTF-8
    if abs(len(text) - len(diff)) < max(sizes) * 0.01:
        return encoding

    # try to detect the encoding
    encoding_detector = charade.detect(text)
    if encoding_detector["encoding"]:
        encoding = encoding_detector["encoding"]

    return encoding
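The round-trip size check above ("99% of UTF-8") can be isolated like this (a sketch; `probably_utf8` is a hypothetical name, not part of the module):

```python
def probably_utf8(raw):
    # decode ignoring errors, re-encode, and compare byte counts; if fewer
    # than 1% of the bytes were dropped, treat the payload as UTF-8
    diff = raw.decode("utf8", "ignore").encode("utf8")
    sizes = len(diff), len(raw)
    return abs(len(raw) - len(diff)) < max(sizes) * 0.01


print(probably_utf8("žluťoučký kůň".encode("utf8")))    # True
print(probably_utf8("žluťoučký kůň".encode("cp1250")))  # False
```

A genuine UTF-8 payload survives the decode/encode round trip byte for byte, while a legacy single-byte encoding loses most of its accented characters to the `"ignore"` handler, so the size difference blows past the 1% budget and the code falls through to charade detection.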
BREAK_TAGS_PATTERN = re.compile(to_unicode(r"(?:<\s*[bh]r[^>]*>\s*)+"), re.IGNORECASE)
def convert_breaks_to_paragraphs(html):
    """
    Converts <hr> tags and runs of <br> tags into paragraph breaks.
    """
    logger.debug("Converting multiple <br> & <hr> tags into <p>.")

    return BREAK_TAGS_PATTERN.sub(_replace_break_tags, html)


def _replace_break_tags(match):
    tags = match.group()

    if to_unicode("<hr") in tags:
        return to_unicode("</p><p>")
    elif tags.count(to_unicode("<br")) > 1:
        return to_unicode("</p><p>")
    else:
        return tags


UTF8_PARSER = HTMLParser(encoding="utf8")
def build_document(html_content, base_href=None):
    """Requires that the `html_content` not be None"""
    assert html_content is not None

    if isinstance(html_content, unicode):
        html_content = html_content.encode("utf8", "replace")

    try:
        document = document_fromstring(html_content, parser=UTF8_PARSER)
    except XMLSyntaxError:
        raise ValueError("Failed to parse document contents.")

    if base_href:
        document.make_links_absolute(base_href, resolve_base_href=True)
    else:
        document.resolve_base_href()

    return document


@unicode_compatible
class OriginalDocument(object):
    """The original document to process."""

    def __init__(self, html, url=None):
        self._html = html
        self._url = url

    @property
    def url(self):
        """Source URL of HTML document."""
        return self._url

    def __unicode__(self):
        """Renders the document as a string."""
        return tounicode(self.dom)

    @cached_property
    def dom(self):
        """Parsed HTML document from the input."""
        html = self._html
        if not isinstance(html, unicode):
            encoding = determine_encoding(html)
            html = html.decode(encoding)

        html = convert_breaks_to_paragraphs(html)
        document = build_document(html, self._url)

        return document

    @cached_property
    def links(self):
        """Links within the document."""
        return self.dom.findall(".//a")

    @cached_property
    def title(self):
        """Title attribute of the parsed document."""
        title_element = self.dom.find(".//title")
        if title_element is None or title_element.text is None:
            return ""
        else:
            return title_element.text.strip()
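The behaviour of `BREAK_TAGS_PATTERN` and `_replace_break_tags` can be checked with plain strings (a self-contained re-run of the same regex and replacement logic, without the `to_unicode` wrappers):

```python
import re

BREAK_TAGS_PATTERN = re.compile(r"(?:<\s*[bh]r[^>]*>\s*)+", re.IGNORECASE)


def _replace_break_tags(match):
    tags = match.group()
    if "<hr" in tags:
        return "</p><p>"   # an <hr> anywhere in the run splits the paragraph
    elif tags.count("<br") > 1:
        return "</p><p>"   # two or more consecutive <br> do too
    else:
        return tags        # a single <br> is left alone

html = "<p>one<br><br>two<br>three<hr>four</p>"
result = BREAK_TAGS_PATTERN.sub(_replace_break_tags, html)
print(result)  # <p>one</p><p>two<br>three</p><p>four</p>
```

The regex grabs each maximal run of `<br>`/`<hr>` tags in one match, so the replacement function can decide per run whether the break is strong enough to count as a new paragraph.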
@@ -0,0 +1,460 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import

import re
import logging

from copy import deepcopy
from operator import attrgetter
from pprint import PrettyPrinter
from lxml.html.clean import Cleaner
from lxml.etree import tounicode, tostring
from lxml.html import fragment_fromstring, fromstring

from .document import OriginalDocument
from .annotated_text import AnnotatedTextHandler
from .scoring import (score_candidates, get_link_density, get_class_weight,
                      is_unlikely_node)
from .utils import cached_property, shrink_text


html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
                       style=True, links=True, meta=False, add_nofollow=False,
                       page_structure=False, processing_instructions=True,
                       embedded=False, frames=False, forms=False,
                       annoying_tags=False, remove_tags=None,
                       kill_tags=("noscript", "iframe"),
                       remove_unknown_tags=False, safe_attrs_only=False)


SCORABLE_TAGS = ("div", "p", "td", "pre", "article")
ANNOTATION_TAGS = (
    "a", "abbr", "acronym", "b", "big", "blink", "blockquote", "br", "cite",
    "code", "dd", "del", "dir", "dl", "dt", "em", "font", "h", "h1", "h2",
    "h3", "h4", "h5", "h6", "hr", "i", "ins", "kbd", "li", "marquee", "menu",
    "ol", "p", "pre", "q", "s", "samp", "span", "strike", "strong", "sub",
    "sup", "tt", "u", "ul", "var",
)
NULL_DOCUMENT = """
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    </head>
    <body>
    </body>
</html>
"""

logger = logging.getLogger("readability")
def ok_embedded_video(node):
    """Check if this embed/video is an ok one to count."""
    good_keywords = ('youtube', 'blip.tv', 'vimeo')

    node_str = tounicode(node)
    for key in good_keywords:
        if key in node_str:
            return True

    return False
def build_base_document(dom, return_fragment=True):
    """
    Builds a base document with the body as root.

    :param dom: Parsed lxml tree (Document Object Model).
    :param bool return_fragment: If True only <div> fragment is returned.
        Otherwise full HTML document is returned.
    """
    body_element = dom.find(".//body")

    if body_element is None:
        fragment = fragment_fromstring('<div id="readabilityBody"/>')
        fragment.append(dom)
    else:
        body_element.tag = "div"
        body_element.set("id", "readabilityBody")
        fragment = body_element

    return document_from_fragment(fragment, return_fragment)


def build_error_document(return_fragment=True):
    """
    Builds an empty error document with the body as root.

    :param bool return_fragment: If True only <div> fragment is returned.
        Otherwise full HTML document is returned.
    """
    fragment = fragment_fromstring(
        '<div id="readabilityBody" class="parsing-error"/>')

    return document_from_fragment(fragment, return_fragment)


def document_from_fragment(fragment, return_fragment):
    if return_fragment:
        document = fragment
    else:
        document = fromstring(NULL_DOCUMENT)
        body_element = document.find(".//body")
        body_element.append(fragment)

    document.doctype = "<!DOCTYPE html>"
    return document
def check_siblings(candidate_node, candidate_list):
    """
    Looks through siblings for content that might also be related.
    Things like preambles, content split by ads that we removed, etc.
    """
    candidate_css = candidate_node.node.get("class")
    potential_target = candidate_node.content_score * 0.2
    sibling_target_score = potential_target if potential_target > 10 else 10
    parent = candidate_node.node.getparent()
    siblings = parent.getchildren() if parent is not None else []

    for sibling in siblings:
        append = False
        content_bonus = 0

        if sibling is candidate_node.node:
            append = True

        # Give a bonus if the sibling node and the top candidate have the
        # same class name
        if candidate_css and sibling.get("class") == candidate_css:
            content_bonus += candidate_node.content_score * 0.2

        if sibling in candidate_list:
            adjusted_score = candidate_list[sibling].content_score + content_bonus

            if adjusted_score >= sibling_target_score:
                append = True

        if sibling.tag == "p":
            link_density = get_link_density(sibling)
            content = sibling.text_content()
            content_length = len(content)

            if content_length > 80 and link_density < 0.25:
                append = True
            elif content_length < 80 and link_density == 0:
                if ". " in content:
                    append = True

        if append:
            logger.debug("Sibling appended: %s %r", sibling.tag, sibling.attrib)
            if sibling.tag not in ("div", "p"):
                # We have a node that isn't a common block level element, like
                # a form or td tag. Turn it into a div so it doesn't get
                # filtered out later by accident.
                sibling.tag = "div"

            candidate_node.node.append(sibling)

    return candidate_node
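The sibling threshold above amounts to max(10, 20% of the winner's score). A tiny sketch (the function name is illustrative, not part of the module):

```python
def sibling_target_score(content_score):
    # siblings must reach 20% of the winning candidate's score,
    # but never less than an absolute floor of 10
    potential_target = content_score * 0.2
    return potential_target if potential_target > 10 else 10


print(sibling_target_score(30))   # 10  (the floor applies)
print(sibling_target_score(200))  # 40.0
```

For weakly scored winners the fixed floor dominates, so junk siblings near a marginal article still have to clear a meaningful bar.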
def clean_document(node):
    """Cleans up the final document we return as the readable article."""
    if node is None or len(node) == 0:
        return None

    logger.debug("Cleaning document.")
    to_drop = []

    for n in node.iter():
        logger.debug("Cleaning node: %s %r", n.tag, n.attrib)
        # clean out any in-line style properties
        if "style" in n.attrib:
            n.set("style", "")

        # remove embedded objects unless it's a wanted video
        if n.tag in ("object", "embed") and not ok_embedded_video(n):
            logger.debug("Dropping node %s %r", n.tag, n.attrib)
            to_drop.append(n)

        # clean headings with bad css or high link density
        if n.tag in ("h1", "h2", "h3", "h4") and get_class_weight(n) < 0:
            logger.debug("Dropping <%s>, it's insignificant", n.tag)
            to_drop.append(n)

        if n.tag in ("h3", "h4") and get_link_density(n) > 0.33:
            logger.debug("Dropping <%s>, it's insignificant", n.tag)
            to_drop.append(n)

        # drop block elements without content or children
        if n.tag in ("div", "p"):
            text_content = shrink_text(n.text_content())
            if len(text_content) < 5 and not n.getchildren():
                logger.debug("Dropping %s %r without content.", n.tag, n.attrib)
                to_drop.append(n)

        # finally try out the conditional cleaning of the target node
        if clean_conditionally(n):
            to_drop.append(n)

    drop_nodes_with_parents(to_drop)

    return node


def drop_nodes_with_parents(nodes):
    for node in nodes:
        if node.getparent() is not None:
            logger.debug("Dropping node with parent %s %r", node.tag, node.attrib)
            node.drop_tree()
def clean_conditionally(node):
    """Remove the node if it looks like bad content based on rules."""
    logger.debug('Cleaning conditionally node: %s %r', node.tag, node.attrib)

    if node.tag not in ('form', 'table', 'ul', 'div', 'p'):
        # this is not the tag you're looking for
        logger.debug('Node cleared: %s %r', node.tag, node.attrib)
        return

    weight = get_class_weight(node)
    # content_score = look up the content score for this node we found
    # before, else default to 0
    content_score = 0

    if weight + content_score < 0:
        logger.debug('Dropping conditional node')
        logger.debug('Weight + score < 0')
        return True

    commas_count = node.text_content().count(',')
    if commas_count < 10:
        logger.debug("There are %d commas so we're processing more.", commas_count)

        # If there are not very many commas, and the number of
        # non-paragraph elements is more than paragraphs or other ominous
        # signs, remove the element.
        p = len(node.findall('.//p'))
        img = len(node.findall('.//img'))
        li = len(node.findall('.//li')) - 100
        inputs = len(node.findall('.//input'))

        embed = 0
        embeds = node.findall('.//embed')
        for e in embeds:
            if ok_embedded_video(e):
                embed += 1
        link_density = get_link_density(node)
        content_length = len(node.text_content())

        remove_node = False

        if li > p and node.tag != 'ul' and node.tag != 'ol':
            logger.debug('Conditional drop: li > p and not ul/ol')
            remove_node = True
        elif inputs > p / 3.0:
            logger.debug('Conditional drop: inputs > p/3.0')
            remove_node = True
        elif content_length < 25 and (img == 0 or img > 2):
            logger.debug('Conditional drop: len < 25 and 0/>2 images')
            remove_node = True
        elif weight < 25 and link_density > 0.2:
            logger.debug('Conditional drop: weight small and link is dense')
            remove_node = True
        elif weight >= 25 and link_density > 0.5:
            logger.debug('Conditional drop: weight big but link heavy')
            remove_node = True
        elif (embed == 1 and content_length < 75) or embed > 1:
            logger.debug('Conditional drop: embed w/o much content or many embed')
            remove_node = True

        if remove_node:
            logger.debug('Node will be removed')
        else:
            logger.debug('Node cleared: %s %r', node.tag, node.attrib)
        return remove_node

    # nope, don't remove anything
    logger.debug('Node cleared final.')
    return False
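The density thresholds above can be summarised in one pure function (a sketch with precomputed counts passed in; the real code derives the counts from the node itself, and the `is_list` flag stands in for its `node.tag != 'ul'/'ol'` check):

```python
def should_drop(weight, content_length, link_density,
                p, li, inputs, img, embed, is_list=False):
    # mirrors the conditional-cleaning thresholds; counts are assumed to be
    # precomputed for the node under test
    if li > p and not is_list:
        return True   # list-heavy block that is not itself a ul/ol
    if inputs > p / 3.0:
        return True   # form-heavy block
    if content_length < 25 and (img == 0 or img > 2):
        return True   # little text and no (or too many) images
    if weight < 25 and link_density > 0.2:
        return True   # weak class weight and dense links
    if weight >= 25 and link_density > 0.5:
        return True   # strong block that is still mostly links
    if (embed == 1 and content_length < 75) or embed > 1:
        return True   # embeds without much surrounding content
    return False


# a link farm: no class weight, lots of links
print(should_drop(weight=0, content_length=300, link_density=0.6,
                  p=4, li=0, inputs=0, img=1, embed=0))  # True
```

Each rule targets a distinct kind of non-article block (navigation lists, forms, thin image wrappers, link farms, embed shells), which is why they are checked in sequence rather than combined into one score.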
def prep_article(doc):
    """Once we've found our target article we want to clean it up.

    Clean out:
    - inline styles
    - forms
    - strip empty <p>
    - extra tags
    """
    return clean_document(doc)


def find_candidates(document):
    """
    Finds candidate nodes for the readable version of the article.

    Here we're going to remove unlikely nodes, find scores on the rest,
    clean up, and return the final best match.
    """
    nodes_to_score = set()
    should_remove = set()

    for node in document.iter():
        if is_unlikely_node(node):
            logger.debug("We should drop unlikely: %s %r", node.tag, node.attrib)
            should_remove.add(node)
        elif is_bad_link(node):
            logger.debug("We should drop bad link: %s %r", node.tag, node.attrib)
            should_remove.add(node)
        elif node.tag in SCORABLE_TAGS:
            nodes_to_score.add(node)

    return score_candidates(nodes_to_score), should_remove


def is_bad_link(node):
    """
    Helper to determine if the node is a link that is useless.

    We've hit articles with far too many links that should be cleaned out
    because they're just there to pollute the space. See tests for examples.
    """
    if node.tag != "a":
        return False

    name = node.get("name")
    href = node.get("href")
    if name and not href:
        return True

    if href:
        href_parts = href.split("#")
        if len(href_parts) == 2 and len(href_parts[1]) > 25:
            return True

    return False
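The link test can be exercised without lxml by using a minimal stand-in node (`FakeNode` is hypothetical; in the module the argument is an lxml element with the same `tag`/`get` interface):

```python
class FakeNode:
    """Hypothetical stand-in for an lxml element: a tag plus attributes."""

    def __init__(self, tag, **attrs):
        self.tag = tag
        self._attrs = attrs

    def get(self, key):
        return self._attrs.get(key)


def is_bad_link(node):
    # same logic as the module's helper, reproduced for a standalone demo
    if node.tag != "a":
        return False
    name = node.get("name")
    href = node.get("href")
    if name and not href:
        return True
    if href:
        href_parts = href.split("#")
        if len(href_parts) == 2 and len(href_parts[1]) > 25:
            return True
    return False


print(is_bad_link(FakeNode("a", name="comments")))                   # True: named anchor, no href
print(is_bad_link(FakeNode("a", href="/story#" + "x" * 30)))         # True: oversized fragment
print(is_bad_link(FakeNode("a", href="https://example.com/about")))  # False
```

Named anchors with no destination and links whose only payload is a long `#fragment` are treated as page plumbing rather than content.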
class Article(object):
    """Parsed readable object"""

    def __init__(self, html, url=None, return_fragment=True):
        """
        Create the Article we're going to use.

        :param html: The string of HTML we're going to parse.
        :param url: The url so we can adjust the links to still work.
        :param return_fragment: Should we return a <div> fragment or
            a full <html> document.
        """
        self._original_document = OriginalDocument(html, url=url)
        self._return_fragment = return_fragment

    def __str__(self):
        return tostring(self._readable())

    def __unicode__(self):
        return tounicode(self._readable())

    @cached_property
    def dom(self):
        """Parsed lxml tree (Document Object Model) of the given html."""
        try:
            dom = self._original_document.dom
            # cleaning doesn't return, it just wipes in place
            html_cleaner(dom)
            return leaf_div_elements_into_paragraphs(dom)
        except ValueError:
            return None

    @cached_property
    def candidates(self):
        """Generates list of candidates from the DOM."""
        dom = self.dom
        if dom is None or len(dom) == 0:
            return None

        candidates, unlikely_candidates = find_candidates(dom)
        drop_nodes_with_parents(unlikely_candidates)

        return candidates

    @cached_property
    def main_text(self):
        dom = deepcopy(self.readable_dom).get_element_by_id("readabilityBody")
        return AnnotatedTextHandler.parse(dom)

    @cached_property
    def readable(self):
        return tounicode(self.readable_dom)

    @cached_property
    def readable_dom(self):
        return self._readable()

    def _readable(self):
        """The readable parsed article"""
        if not self.candidates:
            logger.warning("No candidates found in document.")
            return self._handle_no_candidates()

        # right now we return the highest scoring candidate content
        best_candidates = sorted((c for c in self.candidates.values()),
                                 key=attrgetter("content_score"), reverse=True)

        printer = PrettyPrinter(indent=2)
        logger.debug(printer.pformat(best_candidates))

        # since we have several candidates, check the winner's siblings
        # for extra content
        winner = best_candidates[0]
        updated_winner = check_siblings(winner, self.candidates)
        updated_winner.node = prep_article(updated_winner.node)
        if updated_winner.node is not None:
            dom = build_base_document(updated_winner.node, self._return_fragment)
        else:
            logger.warning('Had candidates but failed to find a cleaned winning DOM.')
            dom = self._handle_no_candidates()

        return self._remove_orphans(dom.get_element_by_id("readabilityBody"))

    def _remove_orphans(self, dom):
        for node in dom.iterdescendants():
            if len(node) == 1 and tuple(node)[0].tag == node.tag:
                node.drop_tag()

        return dom

    def _handle_no_candidates(self):
        """
        If we fail to find a good candidate we need to find something else.
        """
        # since we've not found a good candidate, fall back to the whole
        # cleaned document
        if self.dom is not None and len(self.dom):
            dom = prep_article(self.dom)
            dom = build_base_document(dom, self._return_fragment)
            return self._remove_orphans(dom.get_element_by_id("readabilityBody"))
        else:
            logger.warning("No document to use.")
            return build_error_document(self._return_fragment)
def leaf_div_elements_into_paragraphs(document):
|
||||
"""
|
||||
Turn some block elements that don't have children block level
|
||||
elements into <p> elements.
|
||||
|
||||
Since we can't change the tree as we iterate over it, we must do this
|
||||
before we process our document.
|
||||
"""
|
||||
for element in document.iter(tag="div"):
|
||||
child_tags = tuple(n.tag for n in element.getchildren())
|
||||
if "div" not in child_tags and "p" not in child_tags:
|
||||
logger.debug("Changing leaf block element <%s> into <p>", element.tag)
|
||||
element.tag = "p"
|
||||
|
||||
return document
|
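The leaf-`<div>` transformation above can be sketched without lxml. This is a minimal illustration of the same rule (a `<div>` with no `<div>` or `<p>` children becomes a `<p>`) using only the standard library's `xml.etree`; the function name and sample markup are mine, not the library's API:

```python
from xml.etree import ElementTree as ET


def leaf_divs_to_paragraphs(root):
    # A <div> with no <div> or <p> children is a leaf block element,
    # so re-tag it as a paragraph (mirrors the logic above).
    for element in root.iter("div"):
        child_tags = tuple(child.tag for child in element)
        if "div" not in child_tags and "p" not in child_tags:
            element.tag = "p"
    return root


root = ET.fromstring("<body><div><div>text</div></div><div>leaf</div></body>")
leaf_divs_to_paragraphs(root)
```

Only tags are mutated during iteration, not the tree structure, so iterating with `iter("div")` stays safe.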
@ -0,0 +1,251 @@
# -*- coding: utf8 -*-

"""Handle dealing with scoring nodes and content for our parsing."""

from __future__ import absolute_import
from __future__ import division, print_function

import re
import logging

from hashlib import md5
from lxml.etree import tostring
from ._compat import to_bytes
from .utils import normalize_whitespace


# A series of sets of attributes we check to help in determining if a node is
# a potential candidate or not.
CLS_UNLIKELY = re.compile(
    "combx|comment|community|disqus|extra|foot|header|menu|remark|rss|shoutbox|"
    "sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|tweet|"
    "twitter|social|breadcrumb",
    re.IGNORECASE
)
CLS_MAYBE = re.compile(
    "and|article|body|column|main|shadow|entry",
    re.IGNORECASE
)
CLS_WEIGHT_POSITIVE = re.compile(
    "article|body|content|entry|main|page|pagination|post|text|blog|story",
    re.IGNORECASE
)
CLS_WEIGHT_NEGATIVE = re.compile(
    "combx|comment|com-|contact|foot|footer|footnote|head|masthead|media|meta|"
    "outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|"
    "widget",
    re.IGNORECASE
)

logger = logging.getLogger("readability")


def check_node_attributes(pattern, node, *attributes):
    """
    Searches the given attributes for a match against the pattern and
    returns True if any of them matches.
    """
    for attribute_name in attributes:
        attribute = node.get(attribute_name)
        if attribute is not None and pattern.search(attribute):
            return True

    return False


def generate_hash_id(node):
    """
    Generates a hash_id for the node in question.

    :param node: lxml etree node
    """
    try:
        content = tostring(node)
    except Exception:
        logger.exception("Generating of hash failed")
        content = to_bytes(repr(node))

    hash_id = md5(content).hexdigest()
    return hash_id[:8]


def get_link_density(node, node_text=None):
    """
    Computes the ratio of the text in links contained in the node to
    all the text in the node. It is computed from the number of
    characters in the texts.

    :parameter Element node:
        HTML element in which link density is computed.
    :parameter string node_text:
        Text content of the given node if it was obtained before.
    :returns float:
        Returns the computed value 0 <= density <= 1, where 0 means
        no links and 1 means that the node contains only links.
    """
    if node_text is None:
        node_text = node.text_content()
    node_text = normalize_whitespace(node_text.strip())

    text_length = len(node_text)
    if text_length == 0:
        return 0.0

    links_length = sum(map(_get_normalized_text_length, node.findall(".//a")))
    return links_length / text_length


def _get_normalized_text_length(node):
    return len(normalize_whitespace(node.text_content().strip()))


def get_class_weight(node):
    """
    Computes the weight of an element according to its class/id.

    We're using sets to help efficiently check for existence of matches.
    """
    weight = 0

    if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "class"):
        weight -= 25
    if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "class"):
        weight += 25

    if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "id"):
        weight -= 25
    if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "id"):
        weight += 25

    return weight


def is_unlikely_node(node):
    """
    Short helper for checking unlikely status.

    If the class or id are in the unlikely list, and there's not also a
    class/id in the likely list, then it might need to be removed.
    """
    unlikely = check_node_attributes(CLS_UNLIKELY, node, "class", "id")
    maybe = check_node_attributes(CLS_MAYBE, node, "class", "id")

    return bool(unlikely and not maybe and node.tag != "body")


def score_candidates(nodes):
    """Given a list of potential nodes, find some initial scores to start with."""
    MIN_HIT_LENGTH = 25
    candidates = {}

    for node in nodes:
        logger.debug("* Scoring candidate %s %r", node.tag, node.attrib)

        # if the node has no parent it knows of, then it ends up creating
        # a body & html tag to parent the html fragment
        parent = node.getparent()
        if parent is None:
            logger.debug("Skipping candidate - parent node is 'None'.")
            continue

        grand = parent.getparent()
        if grand is None:
            logger.debug("Skipping candidate - grand parent node is 'None'.")
            continue

        # if the paragraph is < `MIN_HIT_LENGTH` characters, don't even count it
        inner_text = node.text_content().strip()
        if len(inner_text) < MIN_HIT_LENGTH:
            logger.debug("Skipping candidate - inner text < %d characters.", MIN_HIT_LENGTH)
            continue

        # initialize readability data for the parent;
        # add the parent node if it isn't in the candidate list
        if parent not in candidates:
            candidates[parent] = ScoredNode(parent)

        if grand not in candidates:
            candidates[grand] = ScoredNode(grand)

        # add a point for the paragraph itself as a base
        content_score = 1

        if inner_text:
            # add 0.25 points for any commas within this paragraph
            commas_count = inner_text.count(",")
            content_score += commas_count * 0.25
            logger.debug("Bonus points for %d commas.", commas_count)

            # subtract 0.5 points for each double quote within this paragraph
            double_quotes_count = inner_text.count('"')
            content_score += double_quotes_count * -0.5
            logger.debug("Penalty points for %d double-quotes.", double_quotes_count)

            # for every 100 characters in this paragraph, add another point
            # up to 3 points
            length_points = len(inner_text) / 100
            content_score += min(length_points, 3.0)
            logger.debug("Bonus points for length of text: %f", length_points)

        # add the score to the parent
        logger.debug("Bonus points for parent %s %r with score %f: %f",
                     parent.tag, parent.attrib, candidates[parent].content_score,
                     content_score)
        candidates[parent].content_score += content_score
        # the grand node gets half
        logger.debug("Bonus points for grand %s %r with score %f: %f",
                     grand.tag, grand.attrib, candidates[grand].content_score,
                     content_score / 2.0)
        candidates[grand].content_score += content_score / 2.0

        if node not in candidates:
            candidates[node] = ScoredNode(node)
        candidates[node].content_score += content_score

    for candidate in candidates.values():
        adjustment = 1.0 - get_link_density(candidate.node)
        candidate.content_score *= adjustment
        logger.debug("Link density adjustment for %s %r: %f",
                     candidate.node.tag, candidate.node.attrib, adjustment)

    return candidates


class ScoredNode(object):
    """
    Scored nodes we use to track possible article matches.

    We might have a bunch of these so we use __slots__ to keep memory usage
    down.
    """
    __slots__ = ('node', 'content_score')

    def __init__(self, node):
        """Given a node, set an initial score and weigh based on css and id."""
        self.node = node
        self.content_score = 0

        if node.tag in ('div', 'article'):
            self.content_score = 5
        if node.tag in ('pre', 'td', 'blockquote'):
            self.content_score = 3

        if node.tag in ('address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li', 'form'):
            self.content_score = -3
        if node.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th'):
            self.content_score = -5

        self.content_score += get_class_weight(node)

    @property
    def hash_id(self):
        return generate_hash_id(self.node)

    def __repr__(self):
        if self.node is None:
            return "<NullScoredNode with score {0:0.1f}>".format(self.content_score)

        return "<ScoredNode {0} {1}: {2:0.1f}>".format(
            self.node.tag,
            self.node.attrib,
            self.content_score
        )
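The density computed by `get_link_density` above reduces to a simple character-count ratio. A minimal pure-Python sketch of the same arithmetic, with the node text and link texts passed in as plain strings so lxml is not needed (the helper names here are illustrative, not part of the library):

```python
import re


def normalize_ws(text):
    # collapse runs of whitespace to one space, mirroring
    # utils.normalize_whitespace in spirit
    return re.sub(r"\s+", " ", text)


def link_density(node_text, link_texts):
    # ratio of characters inside <a> elements to all text in the node:
    # 0.0 means no links, 1.0 means the node is nothing but links
    text = normalize_ws(node_text.strip())
    if not text:
        return 0.0
    links = sum(len(normalize_ws(t.strip())) for t in link_texts)
    return links / len(text)
```

For example, `link_density("read more here", ["here"])` counts 4 link characters against 14 total, so candidates made mostly of navigation links score close to 1.0 and get their content score scaled down hard by the `1.0 - density` adjustment above.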
@ -0,0 +1,87 @@
# -*- coding: utf8 -*-

"""
A fast python port of arc90's readability tool

Usage:
    readability [options] <resource>
    readability --version
    readability --help

Arguments:
    <resource>  URL or file path to process in readable form.

Options:
    -f, --fragment  Output html fragment by default.
    -b, --browser   Open the parsed content in your web browser.
    -d, --debug     Output the detailed scoring information for debugging
                    parsing.
    -v, --verbose   Increase logging verbosity to DEBUG.
    --version       Display program's version number and exit.
    -h, --help      Display this help message and exit.
"""

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals


import logging
import locale
import webbrowser

from tempfile import NamedTemporaryFile
from docopt import docopt
from .. import __version__
from .._compat import urllib
from ..readable import Article


HEADERS = {
    "User-Agent": "Readability (Readable content parser) Version/%s" % __version__,
}


def parse_args():
    return docopt(__doc__, version=__version__)


def main():
    args = parse_args()
    logger = logging.getLogger("readability")

    if args["--verbose"]:
        logger.setLevel(logging.DEBUG)

    resource = args["<resource>"]
    if resource.startswith("www"):
        resource = "http://" + resource

    url = None
    if resource.startswith("http://") or resource.startswith("https://"):
        url = resource

        request = urllib.Request(url, headers=HEADERS)
        response = urllib.urlopen(request)
        content = response.read()
        response.close()
    else:
        with open(resource, "r") as file:
            content = file.read()

    document = Article(content, url=url, return_fragment=args["--fragment"])
    if args["--browser"]:
        html_file = NamedTemporaryFile(mode="wb", suffix=".html", delete=False)

        content = document.readable.encode("utf8")
        html_file.write(content)
        html_file.close()

        webbrowser.open(html_file.name)
    else:
        encoding = locale.getpreferredencoding()
        content = document.readable.encode(encoding)
        print(content)


if __name__ == '__main__':
    main()
@ -0,0 +1,127 @@
# -*- coding: utf8 -*-

"""
Helper to generate a new set of article test files for readability.

Usage:
    readability_test --name <name> <url>
    readability_test --version
    readability_test --help

Arguments:
    <url>  The url of content to fetch for the article.html

Options:
    -n <name>, --name=<name>  Name of the test directory.
    --version                 Show program's version number and exit.
    -h, --help                Show this help message and exit.
"""

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from os import mkdir
from os.path import join, dirname, pardir, exists as path_exists
from docopt import docopt
from .. import __version__
from .._compat import to_unicode, urllib


TEST_PATH = join(
    dirname(__file__),
    pardir, pardir,
    "tests/test_articles"
)

TEST_TEMPLATE = '''# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from os.path import join, dirname
from readability.readable import Article
from ...compat import unittest


class TestArticle(unittest.TestCase):
    """
    Test the scoring and parsing of the article from URL below:
    %(source_url)s
    """

    def setUp(self):
        """Load up the article for us"""
        article_path = join(dirname(__file__), "article.html")
        with open(article_path, "rb") as file:
            self.document = Article(file.read(), "%(source_url)s")

    def tearDown(self):
        """Drop the article"""
        self.document = None

    def test_parses(self):
        """Verify we can parse the document."""
        self.assertIn('id="readabilityBody"', self.document.readable)

    def test_content_exists(self):
        """Verify that some content exists."""
        self.assertIn("#&@#&@#&@", self.document.readable)

    def test_content_does_not_exist(self):
        """Verify we cleaned out some content that shouldn't exist."""
        self.assertNotIn("", self.document.readable)
'''


def parse_args():
    return docopt(__doc__, version=__version__)


def make_test_directory(name):
    """Generates a new directory for tests."""
    directory_name = "test_" + name.replace(" ", "_")
    directory_path = join(TEST_PATH, directory_name)

    if not path_exists(directory_path):
        mkdir(directory_path)

    return directory_path


def make_test_files(directory_path, url):
    init_file = join(directory_path, "__init__.py")
    open(init_file, "a").close()

    data = TEST_TEMPLATE % {
        "source_url": to_unicode(url)
    }

    test_file = join(directory_path, "test.py")
    with open(test_file, "w") as file:
        file.write(data)


def fetch_article(directory_path, url):
    """Get the content of the url and make it the article.html"""
    opener = urllib.build_opener()
    opener.addheaders = [("Accept-Charset", "utf-8")]

    response = opener.open(url)
    html_data = response.read()
    response.close()

    path = join(directory_path, "article.html")
    with open(path, "wb") as file:
        file.write(html_data)


def main():
    """Run the script."""
    args = parse_args()
    directory = make_test_directory(args["--name"])
    make_test_files(directory, args["<url>"])
    fetch_article(directory, args["<url>"])


if __name__ == "__main__":
    main()
@ -0,0 +1,58 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

import re


def is_blank(text):
    """
    Returns ``True`` if the string contains only whitespace characters
    or is empty. Otherwise ``False`` is returned.
    """
    return not text or text.isspace()


def shrink_text(text):
    return normalize_whitespace(text.strip())


MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE)


def normalize_whitespace(text):
    """
    Translates multiple whitespace into a single space character.
    If there is at least one new line character, the chunk is
    replaced by a single LF (Unix new line) character.
    """
    return MULTIPLE_WHITESPACE_PATTERN.sub(_replace_whitespace, text)


def _replace_whitespace(match):
    text = match.group()

    if "\n" in text or "\r" in text:
        return "\n"
    else:
        return " "


def cached_property(getter):
    """
    Decorator that converts a method into a memoized property.
    The decorator works as expected only for classes with an
    attribute '__dict__' and immutable properties.
    """
    def decorator(self):
        key = "_cached_property_" + getter.__name__

        if not hasattr(self, key):
            setattr(self, key, getter(self))

        return getattr(self, key)

    decorator.__name__ = getter.__name__
    decorator.__module__ = getter.__module__
    decorator.__doc__ = getter.__doc__

    return property(decorator)
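The memoizing decorator above can be exercised like this. The decorator body is copied so the sketch is self-contained; the `Page` class and its call counter are hypothetical, just to show that the getter runs exactly once per instance:

```python
def cached_property(getter):
    # same memoizing decorator as defined above
    def decorator(self):
        key = "_cached_property_" + getter.__name__
        if not hasattr(self, key):
            setattr(self, key, getter(self))
        return getattr(self, key)
    decorator.__name__ = getter.__name__
    decorator.__module__ = getter.__module__
    decorator.__doc__ = getter.__doc__
    return property(decorator)


class Page(object):  # hypothetical class for demonstration
    calls = 0

    @cached_property
    def title(self):
        Page.calls += 1  # count how many times the getter actually runs
        return "Example"


page = Page()
first = page.title   # computes and caches the value on the instance
second = page.title  # served from the cached instance attribute
```

After both accesses `Page.calls` is still 1, which is why the library wraps expensive steps like parsing and candidate scoring in it.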
@ -0,0 +1,508 @@
import re
from lxml.etree import tounicode
from lxml.etree import tostring
from lxml.html.clean import Cleaner
from lxml.html import fragment_fromstring
from lxml.html import fromstring
from operator import attrgetter
from pprint import PrettyPrinter

from breadability.document import OriginalDocument
from breadability.logconfig import LOG
from breadability.logconfig import LNODE
from breadability.scoring import score_candidates
from breadability.scoring import get_link_density
from breadability.scoring import get_class_weight
from breadability.scoring import is_unlikely_node
from breadability.utils import cached_property


html_cleaner = Cleaner(scripts=True, javascript=True, comments=True,
                       style=True, links=True, meta=False, add_nofollow=False,
                       page_structure=False, processing_instructions=True,
                       embedded=False, frames=False, forms=False,
                       annoying_tags=False, remove_tags=None,
                       remove_unknown_tags=False, safe_attrs_only=False)


BASE_DOC = """
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    </head>
    <body>
    </body>
</html>
"""
SCORABLE_TAGS = ['div', 'p', 'td', 'pre', 'article']


def drop_tag(doc, *tags):
    """Helper to remove any nodes that match the html tags passed in.

    :param *tags: one or more html tag strings to remove e.g. style, script
    """
    for tag in tags:
        found = doc.iterfind(".//" + tag)
        for n in found:
            LNODE.log(n, 1, "Dropping tag")
            n.drop_tree()
    return doc


def is_bad_link(a_node):
    """Helper to determine if the link is something to clean out.

    We've hit articles with many links that should be cleaned out because
    they're just there to pollute the space. See tests for examples.
    """
    if a_node.tag == 'a':
        name = a_node.get('name')
        href = a_node.get('href')
        if name and not href:
            return True

        if href:
            url_bits = href.split('#')
            if len(url_bits) == 2:
                if len(url_bits[1]) > 25:
                    return True
    return False


def ok_embedded_video(node):
    """Check if this embed/video is an ok one to count."""
    keep_keywords = ['youtube', 'blip.tv', 'vimeo']
    node_str = tounicode(node)
    for key in keep_keywords:
        if key in node_str:
            return True
    return False


def build_base_document(html, fragment=True):
    """Return a base document with the body as root.

    :param html: Parsed Element object
    :param fragment: Should we return a <div> doc fragment or a full <html>
        doc.
    """
    if html.tag == 'body':
        html.tag = 'div'
        found_body = html
    else:
        found_body = html.find('.//body')

    if found_body is None:
        frag = fragment_fromstring('<div/>')
        frag.set('id', 'readabilityBody')
        frag.append(html)

        if not fragment:
            output = fromstring(BASE_DOC)
            insert_point = output.find('.//body')
            insert_point.append(frag)
        else:
            output = frag
    else:
        found_body.tag = 'div'
        found_body.set('id', 'readabilityBody')

        if not fragment:
            output = fromstring(BASE_DOC)
            insert_point = output.find('.//body')
            insert_point.append(found_body)
        else:
            output = found_body

    output.doctype = "<!DOCTYPE html>"
    return output


def build_error_document(html, fragment=True):
    """Return an empty error document with the body as root.

    :param fragment: Should we return a <div> doc fragment or a full <html>
        doc.
    """
    frag = fragment_fromstring('<div/>')
    frag.set('id', 'readabilityBody')
    frag.set('class', 'parsing-error')

    if not fragment:
        output = fromstring(BASE_DOC)
        insert_point = output.find('.//body')
        insert_point.append(frag)
    else:
        output = frag

    output.doctype = "<!DOCTYPE html>"
    return output


def transform_misused_divs_into_paragraphs(doc):
    """Turn all divs that don't have children block level elements into p's.

    Since we can't change the tree as we iterate over it, we must do this
    before we process our document.

    The idea is that we process all divs and, if the div does not contain
    another list of divs, then we replace it with a p tag instead, appending
    its contents/children to it.
    """
    for elem in doc.iter(tag='div'):
        child_tags = [n.tag for n in elem.getchildren()]
        if 'div' not in child_tags:
            # if there is no div inside of this div...then it's a leaf
            # node in a sense.
            # We need to create a <p> and put all its contents in there.
            # We'll just stringify it, then regex replace the first/last
            # div bits to turn them into <p> vs <div>.
            LNODE.log(elem, 1, 'Turning leaf <div> into <p>')
            orig = tounicode(elem).strip()
            started = re.sub(r'^<\s*div', '<p', orig)
            ended = re.sub(r'div>$', 'p>', started)
            elem.getparent().replace(elem, fromstring(ended))
    return doc


def check_siblings(candidate_node, candidate_list):
    """Look through siblings for content that might also be related.

    Things like preambles, content split by ads that we removed, etc.
    """
    candidate_css = candidate_node.node.get('class')
    potential_target = candidate_node.content_score * 0.2
    sibling_target_score = potential_target if potential_target > 10 else 10
    parent = candidate_node.node.getparent()
    siblings = parent.getchildren() if parent is not None else []

    for sibling in siblings:
        append = False
        content_bonus = 0

        if sibling is candidate_node.node:
            LNODE.log(sibling, 1, 'Sibling is the node so append')
            append = True

        # Give a bonus if sibling nodes and the top candidate have the
        # same class name
        if candidate_css and sibling.get('class') == candidate_css:
            content_bonus += candidate_node.content_score * 0.2

        if sibling in candidate_list:
            adjusted_score = candidate_list[sibling].content_score + \
                content_bonus

            if adjusted_score >= sibling_target_score:
                append = True

        if sibling.tag == 'p':
            link_density = get_link_density(sibling)
            content = sibling.text_content()
            content_length = len(content)

            if content_length > 80 and link_density < 0.25:
                append = True
            elif content_length < 80 and link_density == 0:
                if ". " in content:
                    append = True

        if append:
            LNODE.log(sibling, 1, 'Sibling being appended')
            if sibling.tag not in ['div', 'p']:
                # We have a node that isn't a common block level element, like
                # a form or td tag. Turn it into a div so it doesn't get
                # filtered out later by accident.
                sibling.tag = 'div'

            if candidate_node.node != sibling:
                candidate_node.node.append(sibling)

    return candidate_node


def clean_document(node):
    """Clean up the final document we return as the readable article."""
    if node is None or len(node) == 0:
        return

    LNODE.log(node, 2, "Processing doc")
    clean_list = ['object', 'h1']
    to_drop = []

    # If there is only one h2, they are probably using it as a header and
    # not a subheader, so remove it since we already have a header.
    if len(node.findall('.//h2')) == 1:
        LOG.debug('Adding H2 to list of nodes to clean.')
        clean_list.append('h2')

    for n in node.iter():
        LNODE.log(n, 2, "Cleaning iter node")
        # clean out any in-line style properties
        if 'style' in n.attrib:
            n.set('style', '')

        # remove all of the following tags
        # Clean a node of all elements of type "tag".
        # (Unless it's a youtube/vimeo video. People love movies.)
        is_embed = n.tag in ('object', 'embed')
        if n.tag in clean_list:
            allow = False

            # Allow youtube and vimeo videos through as people usually
            # want to see those.
            if is_embed:
                if ok_embedded_video(n):
                    allow = True

            if not allow:
                LNODE.log(n, 2, "Dropping Node")
                to_drop.append(n)

        if n.tag in ['h1', 'h2', 'h3', 'h4']:
            # clean headings:
            # if the heading has no css weight or a high link density,
            # remove it
            if get_class_weight(n) < 0 or get_link_density(n) > .33:
                LNODE.log(n, 2, "Dropping <hX>, it's insignificant")
                to_drop.append(n)

        # clean out extra <p>
        if n.tag == 'p':
            # if the p has no children and has no content...well then down
            # with it.
            if not n.getchildren() and len(n.text_content()) < 5:
                LNODE.log(n, 2, 'Dropping extra <p>')
                to_drop.append(n)

        # finally try out the conditional cleaning of the target node
        if clean_conditionally(n):
            to_drop.append(n)

    [n.drop_tree() for n in to_drop if n.getparent() is not None]
    return node


def clean_conditionally(node):
    """Remove the node if it looks like bad content based on rules."""
    target_tags = ['form', 'table', 'ul', 'div', 'p']

    LNODE.log(node, 2, 'Cleaning conditionally node.')

    if node.tag not in target_tags:
        # this is not the tag you're looking for
        LNODE.log(node, 2, 'Node cleared.')
        return

    weight = get_class_weight(node)
    # content_score = look up the content score for this node we found
    # before, else default to 0
    content_score = 0

    if weight + content_score < 0:
        LNODE.log(node, 2, 'Dropping conditional node')
        LNODE.log(node, 2, 'Weight + score < 0')
        return True

    if node.text_content().count(',') < 10:
        LOG.debug("There aren't 10 ,s so we're processing more")

        # If there are not very many commas, and the number of
        # non-paragraph elements is more than paragraphs or other ominous
        # signs, remove the element.
        p = len(node.findall('.//p'))
        img = len(node.findall('.//img'))
        li = len(node.findall('.//li')) - 100
        inputs = len(node.findall('.//input'))

        embed = 0
        embeds = node.findall('.//embed')
        for e in embeds:
            if ok_embedded_video(e):
                embed += 1
        link_density = get_link_density(node)
        content_length = len(node.text_content())

        remove_node = False

        if li > p and node.tag != 'ul' and node.tag != 'ol':
            LNODE.log(node, 2, 'Conditional drop: li > p and not ul/ol')
            remove_node = True
        elif inputs > p / 3.0:
            LNODE.log(node, 2, 'Conditional drop: inputs > p/3.0')
            remove_node = True
        elif content_length < 25 and (img == 0 or img > 2):
            LNODE.log(node, 2,
                      'Conditional drop: len < 25 and 0/>2 images')
            remove_node = True
        elif weight < 25 and link_density > 0.2:
            LNODE.log(node, 2,
                      'Conditional drop: weight small and link is dense')
            remove_node = True
        elif weight >= 25 and link_density > 0.5:
            LNODE.log(node, 2,
                      'Conditional drop: weight big but link heavy')
            remove_node = True
        elif (embed == 1 and content_length < 75) or embed > 1:
            LNODE.log(node, 2,
                      'Conditional drop: embed w/o much content or many embeds')
            remove_node = True

        if remove_node:
            LNODE.log(node, 2, 'Node will be removed')
        else:
            LNODE.log(node, 2, 'Node cleared')
        return remove_node

    # nope, don't remove anything
    LNODE.log(node, 2, 'Node cleared final.')
    return False


def prep_article(doc):
    """Once we've found our target article we want to clean it up.

    Clean out:
    - inline styles
    - forms
    - strip empty <p>
    - extra tags
    """
    doc = clean_document(doc)
    return doc


def find_candidates(doc):
    """Find candidate nodes for the readable version of the article.

    Here we're going to remove unlikely nodes, find scores on the rest,
    and clean up and return the final best match.
    """
    scorable_node_tags = SCORABLE_TAGS
    nodes_to_score = []
    should_remove = []

    for node in doc.iter():
        if is_unlikely_node(node):
            LOG.debug('We should drop unlikely: ' + str(node))
            should_remove.append(node)
            continue
        if node.tag == 'a' and is_bad_link(node):
            LOG.debug('We should drop bad link: ' + str(node))
            should_remove.append(node)
            continue
        if node.tag in scorable_node_tags and node not in nodes_to_score:
            nodes_to_score.append(node)
    return score_candidates(nodes_to_score), should_remove


class Article(object):
    """Parsed readable object"""
    _should_drop = []

    def __init__(self, html, url=None, fragment=True):
        """Create the Article we're going to use.

        :param html: The string of html we're going to parse.
        :param url: The url so we can adjust the links to still work.
        :param fragment: Should we return a <div> fragment or a full <html>
            doc.
        """
        LOG.debug('Url: ' + str(url))
        self.orig = OriginalDocument(html, url=url)
        self.fragment = fragment

    def __str__(self):
        return tostring(self._readable)

    def __unicode__(self):
        return tounicode(self._readable)

    @cached_property(ttl=600)
    def doc(self):
        """The doc is the parsed xml tree of the given html."""
        try:
            doc = self.orig.html
            # cleaning doesn't return, just wipes in place
            html_cleaner(doc)
            doc = drop_tag(doc, 'noscript', 'iframe')
            doc = transform_misused_divs_into_paragraphs(doc)
            return doc
        except ValueError:
            return None

    @cached_property(ttl=600)
    def candidates(self):
        """Generate the list of candidates from the doc."""
        doc = self.doc
        if doc is not None and len(doc):
            candidates, should_drop = find_candidates(doc)
            self._should_drop = should_drop
            return candidates
        else:
            return None

    @cached_property(ttl=600)
    def readable(self):
        return tounicode(self._readable)

    @cached_property(ttl=600)
    def _readable(self):
        """The readable parsed article"""
        if self.candidates:
            LOG.debug('Candidates found:')
            pp = PrettyPrinter(indent=2)

            # cleanup by removing the should_drop we spotted.
            [n.drop_tree() for n in self._should_drop
             if n.getparent() is not None]

            # right now we return the highest scoring candidate content
            by_score = sorted([c for c in self.candidates.values()],
                              key=attrgetter('content_score'), reverse=True)
            LOG.debug(pp.pformat(by_score))

            # since we have several candidates, check the winner's siblings
            # for extra content
            winner = by_score[0]
            LOG.debug('Selected winning node: ' + str(winner))
            updated_winner = check_siblings(winner, self.candidates)
            LOG.debug('Begin final prep of article')
            updated_winner.node = prep_article(updated_winner.node)
            if updated_winner.node is not None:
                doc = build_base_document(updated_winner.node, self.fragment)
            else:
                LOG.warning('Had candidates but failed to find a cleaned winning doc.')
                doc = self._handle_no_candidates()
        else:
            LOG.warning('No candidates found: using document.')
            LOG.debug('Begin final prep of article')
            doc = self._handle_no_candidates()

        return doc

    def _handle_no_candidates(self):
        """If we fail to find a good candidate we need to find something else."""
        # since we've not found a good candidate we should help this
|
||||
if self.doc is not None and len(self.doc):
|
||||
# cleanup by removing the should_drop we spotted.
|
||||
[n.drop_tree() for n in self._should_drop
|
||||
if n.getparent() is not None]
|
||||
doc = prep_article(self.doc)
|
||||
doc = build_base_document(doc, self.fragment)
|
||||
else:
|
||||
LOG.warning('No document to use.')
|
||||
doc = build_error_document(self.fragment)
|
||||
|
||||
return doc
|
@ -0,0 +1,237 @@
"""Handle dealing with scoring nodes and content for our parsing."""
import re
from hashlib import md5
from lxml.etree import tounicode

from breadability.logconfig import LNODE
from breadability.logconfig import LOG

# A series of sets of attributes we check to help in determining if a node is
# a potential candidate or not.
CLS_UNLIKELY = re.compile(('combx|comment|community|disqus|extra|foot|header|'
    'menu|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|'
    'pager|perma|popup|tweet|twitter'), re.I)
CLS_MAYBE = re.compile('and|article|body|column|main|shadow', re.I)
CLS_WEIGHT_POSITIVE = re.compile(('article|body|content|entry|hentry|main|'
    'page|pagination|post|text|blog|story'), re.I)
CLS_WEIGHT_NEGATIVE = re.compile(('combx|comment|com-|contact|foot|footer|'
    'footnote|head|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|'
    'sidebar|sponsor|shopping|tags|tool|widget'), re.I)


def check_node_attr(node, attr, checkset):
    value = node.get(attr) or ""
    check = checkset.search(value)
    if check:
        return True
    else:
        return False


def generate_hash_id(node):
    """Generate a hash_id for the node in question.

    :param node: lxml etree node

    """
    content = tounicode(node)
    hashed = md5()
    try:
        hashed.update(content.encode('utf-8', "replace"))
    except Exception, e:
        LOG.error("BOOM! " + str(e))

    return hashed.hexdigest()[0:8]


def get_link_density(node, node_text=None):
    """Generate a value for the number of links in the node.

    :param node: parsed lxml etree node
    :param node_text: if we already have the text_content() make this easier
        on us.
    :returns float:

    """
    link_length = sum([len(a.text_content()) or 0
        for a in node.findall(".//a")])
    # For each img, give 50 bonus chars worth of length.
    # Tweaking this 50 down a notch should help if we hit false positives.
    link_length = max(link_length -
        sum([50 for img in node.findall(".//img")]), 0)
    if node_text:
        text_length = len(node_text)
    else:
        text_length = len(node.text_content())
    return float(link_length) / max(text_length, 1)


def get_class_weight(node):
    """Get an element's class/id weight.

    We're using sets to help efficiently check for existence of matches.

    """
    weight = 0
    if check_node_attr(node, 'class', CLS_WEIGHT_NEGATIVE):
        weight = weight - 25
    if check_node_attr(node, 'class', CLS_WEIGHT_POSITIVE):
        weight = weight + 25

    if check_node_attr(node, 'id', CLS_WEIGHT_NEGATIVE):
        weight = weight - 25
    if check_node_attr(node, 'id', CLS_WEIGHT_POSITIVE):
        weight = weight + 25

    return weight


def is_unlikely_node(node):
    """Short helper for checking unlikely status.

    If the class or id are in the unlikely list, and there's not also a
    class/id in the likely list then it might need to be removed.

    """
    unlikely = check_node_attr(node, 'class', CLS_UNLIKELY) or \
        check_node_attr(node, 'id', CLS_UNLIKELY)

    maybe = check_node_attr(node, 'class', CLS_MAYBE) or \
        check_node_attr(node, 'id', CLS_MAYBE)

    if unlikely and not maybe and node.tag != 'body':
        return True
    else:
        return False


def score_candidates(nodes):
    """Given a list of potential nodes, find some initial scores to start"""
    MIN_HIT_LENGTH = 25
    candidates = {}

    for node in nodes:
        LNODE.log(node, 1, "Scoring Node")

        content_score = 0
        # if the node has no parent it knows of, then it ends up creating a
        # body and html tag to parent the html fragment.
        parent = node.getparent()
        grand = parent.getparent() if parent is not None else None
        innertext = node.text_content()

        if parent is None or grand is None:
            LNODE.log(
                node, 1,
                "Skipping candidate because parent/grand are none")
            continue

        # If this paragraph is less than 25 characters, don't even count it.
        if innertext and len(innertext) < MIN_HIT_LENGTH:
            LNODE.log(
                node, 1,
                "Skipping candidate because not enough content.")
            continue

        # Initialize readability data for the parent.
        # if the parent node isn't in the candidate list, add it
        if parent not in candidates:
            candidates[parent] = ScoredNode(parent)

        if grand not in candidates:
            candidates[grand] = ScoredNode(grand)

        # Add a point for the paragraph itself as a base.
        content_score += 1

        if innertext:
            # Add 0.25 points for any commas within this paragraph
            content_score += innertext.count(',') * 0.25
            LNODE.log(node, 1,
                "Bonus points for ,: " + str(innertext.count(',')))

            # Subtract 0.5 points for each double quote within this paragraph
            content_score += innertext.count('"') * (-0.5)
            LNODE.log(node, 1,
                'Penalty points for ": ' + str(innertext.count('"')))

            # For every 100 characters in this paragraph, add another point.
            # Up to 3 points.
            length_points = len(innertext) / 100

            if length_points > 3:
                content_score += 3
            else:
                content_score += length_points
            LNODE.log(
                node, 1,
                "Length/content points: {0} : {1}".format(length_points,
                    content_score))

        # Add the score to the parent.
        LNODE.log(node, 1, "From this current node.")
        candidates[parent].content_score += content_score
        LNODE.log(
            candidates[parent].node,
            1,
            "Giving parent bonus points: " + str(
                candidates[parent].content_score))
        # The grandparent gets half.
        LNODE.log(candidates[grand].node, 1, "Giving grand bonus points")
        candidates[grand].content_score += (content_score / 2.0)
        LNODE.log(
            candidates[grand].node,
            1,
            "Giving grand bonus points: " + str(
                candidates[grand].content_score))

    for candidate in candidates.values():
        adjustment = 1 - get_link_density(candidate.node)
        LNODE.log(
            candidate.node,
            1,
            "Getting link density adjustment: {0} * {1} ".format(
                candidate.content_score, adjustment))
        candidate.content_score = candidate.content_score * (adjustment)

    return candidates


class ScoredNode(object):
    """We need Scored nodes we use to track possible article matches

    We might have a bunch of these so we use __slots__ to keep memory usage
    down.

    """
    __slots__ = ['node', 'content_score']

    def __repr__(self):
        """Helpful representation of our Scored Node"""
        return "{0}: {1:0.1F}\t{2}".format(
            self.hash_id,
            self.content_score,
            self.node)

    def __init__(self, node):
        """Given a node, set an initial score and weight based on css and id"""
        self.node = node
        content_score = 0
        if node.tag in ['div', 'article']:
            content_score = 5

        if node.tag in ['pre', 'td', 'blockquote']:
            content_score = 3

        if node.tag in ['address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li',
                'form']:
            content_score = -3
        if node.tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th']:
            content_score = -5

        content_score += get_class_weight(node)
        self.content_score = content_score

    @property
    def hash_id(self):
        return generate_hash_id(self.node)
@ -1,59 +1,81 @@
from setuptools import setup, find_packages
import sys
import os

here = os.path.abspath(os.path.dirname(__file__))
README = open(os.path.join(here, 'README.rst')).read()
NEWS = open(os.path.join(here, 'NEWS.txt')).read()
from os.path import abspath, dirname, join
from setuptools import setup, find_packages
from readability import __version__


VERSION_SUFFIX = "%d.%d" % sys.version_info[:2]
CURRENT_DIRECTORY = abspath(dirname(__file__))


with open(join(CURRENT_DIRECTORY, "README.rst")) as readme:
    with open(join(CURRENT_DIRECTORY, "CHANGELOG.rst")) as changelog:
        long_description = "%s\n\n%s" % (readme.read(), changelog.read())


version = '0.1.14'
install_requires = [
    # List your project dependencies here.
    # For more details, see:
    # http://packages.python.org/distribute/setuptools.html#declaring-dependencies
    'chardet',
    'lxml',
    "docopt>=0.6.1,<0.7",
    "charade",
    "lxml>=2.0",
]
tests_require = [
    'coverage',
    'nose',
    'pep8',
    'pylint',
    "coverage",
    "nose",
]


if sys.version_info < (2, 7):
    # Require argparse since it's not in the stdlib yet.
    install_requires.append('argparse')
    install_requires.append('unittest2')
    install_requires.append("unittest2")


setup(
    name='breadability',
    version=version,
    description="Redone port of Readability API in Python",
    long_description=README + '\n\n' + NEWS,
    name="readability",
    version=__version__,
    description="Port of Readability HTML parser in Python",
    long_description=long_description,
    keywords=[
        "readability",
        "readable",
        "parsing",
        "HTML",
        "content",
    ],
    author="Michal Belica",
    author_email="miso.belica@gmail.com",
    url="https://github.com/miso-belica/readability.py",
    license="BSD",
    classifiers=[
        # Get strings from
        # http://pypi.python.org/pypi?%3Aaction=list_classifiers
        "Development Status :: 5 - Production/Stable",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: BSD License",
        "Operating System :: OS Independent",
        "Programming Language :: Python",
        "Programming Language :: Python :: 2",
        "Programming Language :: Python :: 2.6",
        "Programming Language :: Python :: 2.7",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.2",
        "Programming Language :: Python :: 3.3",
        "Programming Language :: Python :: Implementation :: CPython",
        "Topic :: Internet :: WWW/HTTP",
        "Topic :: Software Development :: Pre-processors",
        "Topic :: Text Processing :: Filters",
        "Topic :: Text Processing :: Markup :: HTML",

    ],
    keywords='readable parsing html content bookie',
    author='Rick Harding',
    author_email='rharding@mitechie.com',
    url='http://docs.bmark.us',
    license='BSD',
    packages=find_packages('src'),
    package_dir={'': 'src'},
    packages=find_packages(),
    include_package_data=True,
    zip_safe=False,
    install_requires=install_requires,
    tests_require=tests_require,
    extras_require={
        'test': tests_require
    },
    test_suite="tests.run_tests.run",
    entry_points={
        'console_scripts': [
            'breadability=breadability:client.main',
            'breadability_newtest=breadability:newtest.main',
        "console_scripts": [
            "readability = readability.scripts.client:main",
            "readability-%s = readability.scripts.client:main" % VERSION_SUFFIX,
            "readability_test = readability.scripts.test_helper:main",
            "readability_test-%s = readability.scripts.test_helper:main" % VERSION_SUFFIX,
        ]
    }
)
@ -1,115 +0,0 @@
"""Generate a clean nice starting html document to process for an article."""

import chardet
import re
from lxml.etree import tostring
from lxml.etree import tounicode
from lxml.etree import XMLSyntaxError
from lxml.html import document_fromstring
from lxml.html import HTMLParser

from breadability.logconfig import LOG
from breadability.utils import cached_property


utf8_parser = HTMLParser(encoding='utf-8')


def get_encoding(page):
    text = re.sub('</?[^>]*>\s*', ' ', page)
    enc = 'utf-8'
    if not text.strip() or len(text) < 10:
        return enc  # can't guess
    try:
        diff = text.decode(enc, 'ignore').encode(enc)
        sizes = len(diff), len(text)
        # 99% of utf-8
        if abs(len(text) - len(diff)) < max(sizes) * 0.01:
            return enc
    except UnicodeDecodeError:
        pass
    res = chardet.detect(text)
    enc = res['encoding']
    # print '->', enc, "%.2f" % res['confidence']
    if enc == 'MacCyrillic':
        enc = 'cp1251'
    if not enc:
        enc = 'utf-8'
    return enc


def replace_multi_br_to_paragraphs(html):
    """Convert multiple <br>s into paragraphs"""
    LOG.debug('Replacing multiple <br/> to <p>')
    rep = re.compile("(<br[^>]*>[ \n\r\t]*){2,}", re.I)
    return rep.sub('</p><p>', html)


def build_doc(page):
    """Requires that the `page` not be None"""
    if page is None:
        LOG.error("Page content is None, can't build_doc")
        return ''
    if isinstance(page, unicode):
        page_unicode = page
    else:
        enc = get_encoding(page)
        page_unicode = page.decode(enc, 'replace')
    try:
        doc = document_fromstring(
            page_unicode.encode('utf-8', 'replace'),
            parser=utf8_parser)
        return doc
    except XMLSyntaxError, exc:
        LOG.error('Failed to parse: ' + str(exc))
        raise ValueError('Failed to parse document contents.')


class OriginalDocument(object):
    """The original document to process"""
    _base_href = None

    def __init__(self, html, url=None):
        self.orig_html = html
        self.url = url

    def __str__(self):
        """Render out our document as a string"""
        return tostring(self.html)

    def __unicode__(self):
        """Render out our document as a string"""
        return tounicode(self.html)

    def _parse(self, html):
        """Generate an lxml document from our html."""
        html = replace_multi_br_to_paragraphs(html)
        doc = build_doc(html)

        # doc = html_cleaner.clean_html(doc)
        base_href = self.url
        if base_href:
            LOG.debug('Making links absolute')
            doc.make_links_absolute(base_href, resolve_base_href=True)
        else:
            doc.resolve_base_href()
        return doc

    @cached_property(ttl=600)
    def html(self):
        """The parsed html document from the input"""
        return self._parse(self.orig_html)

    @cached_property(ttl=600)
    def links(self):
        """Links within the document"""
        return self.html.findall(".//a")

    @cached_property(ttl=600)
    def title(self):
        """Pull the title attribute out of the parsed document"""
        titleElem = self.html.find('.//title')
        if titleElem is None or titleElem.text is None:
            return ''
        else:
            return titleElem.text
@ -1,190 +0,0 @@
"""Setup a logging helper for our module.


Helpers:
    LOG - our active logger instance
    set_logging_level(level) - adjust the current logging level
"""
import logging
import sys
import time
from collections import namedtuple
from hashlib import md5
from lxml.etree import tounicode


# For pretty log messages, if available
try:
    import curses
except ImportError:
    curses = None

LOGLEVEL = "WARNING"


# Logging bits stolen and adapted from:
# http://www.tornadoweb.org/documentation/_modules/tornado/options.html
LogOptions = namedtuple('LogOptions', [
    'loglevel',
    'log_file_prefix',
    'log_file_max_size',
    'log_file_num_backups',
    'log_to_stderr',
])

options = LogOptions(
    loglevel=LOGLEVEL,
    log_file_prefix="",
    log_file_max_size=100 * 1000 * 1000,
    log_file_num_backups=5,
    log_to_stderr=True,
)


def set_logging_level(level):
    """Adjust the current logging level.

    Expects a string such as DEBUG, WARNING, INFO, etc.

    """
    logging.getLogger('breadable').setLevel(getattr(logging, level))


def enable_pretty_logging():
    """Turns on formatted logging output as configured.

    This is called automatically by `parse_command_line`.
    """
    root_logger = logging.getLogger()
    if options.log_file_prefix:
        channel = logging.handlers.RotatingFileHandler(
            filename=options.log_file_prefix,
            maxBytes=options.log_file_max_size,
            backupCount=options.log_file_num_backups)
        channel.setFormatter(_LogFormatter(color=False))
        root_logger.addHandler(channel)

    if (options.log_to_stderr or
            (options.log_to_stderr is None and not root_logger.handlers)):
        # Set up color if we are in a tty and curses is installed
        color = False
        if curses and sys.stderr.isatty():
            try:
                curses.setupterm()
                if curses.tigetnum("colors") > 0:
                    color = True
            except Exception:
                pass
        channel = logging.StreamHandler()
        channel.setFormatter(_LogFormatter(color=color))
        root_logger.addHandler(channel)


class LogHelper(object):
    """Helper to allow us to log as we want for debugging"""
    scoring = 1
    removing = 2
    _active = False

    _actions = None

    def __init__(self, log, actions=None, content=False):
        if actions is None:
            self._actions = tuple()
        else:
            self._actions = actions

        self._log = log
        self.content = content

    @property
    def actions(self):
        """Return a tuple of the actions we want to log"""
        return self._actions

    def activate(self):
        """Turn on this logger."""
        self._active = True

    def deactivate(self):
        """Turn off the logger"""
        self._active = False

    def log(self, node, action, description):
        """Write out our log info based on the node and event specified.

        We only log this information if we are at the DEBUG loglevel.

        """
        if self._active:
            content = tounicode(node)
            hashed = md5()
            try:
                hashed.update(content.encode('utf-8', errors="replace"))
            except Exception, exc:
                LOG.error("Cannot hash the current node." + str(exc))
            hash_id = hashed.hexdigest()[0:8]
            # if hash_id in ['9c880b27', '8393b7d7', '69bfebdd']:
            print(u"{0} :: {1}\n{2}".format(
                hash_id,
                description,
                content.replace(u"\n", u"")[0:202],
            ))


class _LogFormatter(logging.Formatter):
    def __init__(self, color, *args, **kwargs):
        logging.Formatter.__init__(self, *args, **kwargs)
        self._color = color
        if color:
            # The curses module has some str/bytes confusion in python3.
            # Most methods return bytes, but only accept strings.
            # The explicit calls to unicode() below are harmless in python2,
            # but will do the right conversion in python3.
            fg_color = unicode(curses.tigetstr("setaf") or
                curses.tigetstr("setf") or "", "ascii")
            self._colors = {
                logging.DEBUG: unicode(
                    curses.tparm(fg_color, curses.COLOR_CYAN),
                    "ascii"),
                logging.INFO: unicode(
                    curses.tparm(fg_color, curses.COLOR_GREEN),
                    "ascii"),
                logging.WARNING: unicode(
                    curses.tparm(fg_color, curses.COLOR_YELLOW),  # Yellow
                    "ascii"),
                logging.ERROR: unicode(
                    curses.tparm(fg_color, curses.COLOR_RED),  # Red
                    "ascii"),
            }
            self._normal = unicode(curses.tigetstr("sgr0"), "ascii")

    def format(self, record):
        try:
            record.message = record.getMessage()
        except Exception, e:
            record.message = "Bad message (%r): %r" % (e, record.__dict__)
        record.asctime = time.strftime(
            "%y%m%d %H:%M:%S", self.converter(record.created))
        prefix = '[%(levelname)1.1s %(asctime)s %(module)s:%(lineno)d]' % \
            record.__dict__
        if self._color:
            prefix = (self._colors.get(record.levelno, self._normal) +
                prefix + self._normal)
        formatted = prefix + " " + record.message
        if record.exc_info:
            if not record.exc_text:
                record.exc_text = self.formatException(record.exc_info)
            if record.exc_text:
                formatted = formatted.rstrip() + "\n" + record.exc_text
        return formatted.replace("\n", "\n    ")


# Set up log level and pretty console logging by default
logging.getLogger('breadable').setLevel(getattr(logging, LOGLEVEL))
enable_pretty_logging()
LOG = logging.getLogger('breadable')
LNODE = LogHelper(LOG,
    actions=(LogHelper.scoring, LogHelper.removing),
    content=True
)
@ -1,109 +0,0 @@
import argparse
import codecs
import urllib2
from os import mkdir
from os import path

from breadability import VERSION


TESTPATH = path.join(
    path.dirname(path.dirname(__file__)),
    'tests', 'test_articles')

TESTTPL = """
import os
try:
    # Python < 2.7
    import unittest2 as unittest
except ImportError:
    import unittest

from breadability.readable import Article


class TestArticle(unittest.TestCase):
    \"\"\"Test the scoring and parsing of the Article\"\"\"

    def setUp(self):
        \"\"\"Load up the article for us\"\"\"
        article_path = os.path.join(os.path.dirname(__file__), 'article.html')
        self.article = open(article_path).read()

    def tearDown(self):
        \"\"\"Drop the article\"\"\"
        self.article = None

    def test_parses(self):
        \"\"\"Verify we can parse the document.\"\"\"
        doc = Article(self.article)
        self.assertTrue('id="readabilityBody"' in doc.readable)

    def test_content_exists(self):
        \"\"\"Verify that some content exists.\"\"\"
        pass

    def test_content_does_not_exist(self):
        \"\"\"Verify we cleaned out some content that shouldn't exist.\"\"\"
        pass
"""


def parse_args():
    desc = "breadability helper to generate a new set of article test files."
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--version',
        action='version', version=VERSION)

    parser.add_argument('-n', '--name',
        action='store',
        required=True,
        help='Name of the test directory')

    parser.add_argument('url', metavar='URL', type=str, nargs=1,
        help='The url of content to fetch for the article.html')

    args = parser.parse_args()
    return args


def make_dir(name):
    """Generate a new directory for tests."""
    dir_name = 'test_' + name.replace(' ', '_')
    updated_name = path.join(TESTPATH, dir_name)
    mkdir(updated_name)
    return updated_name


def make_files(dirname):
    init_file = path.join(dirname, '__init__.py')
    test_file = path.join(dirname, 'test.py')
    open(init_file, "a").close()
    with open(test_file, 'w') as f:
        f.write(TESTTPL)


def fetch_article(dirname, url):
    """Get the content of the url and make it the article.html"""
    opener = urllib2.build_opener()
    opener.addheaders = [('Accept-Charset', 'utf-8')]
    url_response = opener.open(url)
    dl_html = url_response.read().decode('utf-8')

    fh = codecs.open(path.join(dirname, 'article.html'), "w", "utf-8")
    fh.write(dl_html)
    fh.close()


def main():
    """Run the script."""
    args = parse_args()
    new_dir = make_dir(args.name)
    make_files(new_dir)
    fetch_article(new_dir, args.url[0])


if __name__ == '__main__':
    main()
@ -1,14 +0,0 @@
from os import path


TEST_DIR = path.dirname(__file__)


def load_snippet(filename):
    """Helper to fetch in the content of a test snippet"""
    return open(path.join(TEST_DIR, 'test_snippets', filename)).read()


def load_article(filename):
    """Helper to fetch in the content of a test article"""
    return open(path.join(TEST_DIR, 'test_articles', filename)).read()
@ -1,49 +0,0 @@
from collections import defaultdict

try:
    # Python < 2.7
    import unittest2 as unittest
except ImportError:
    import unittest

from breadability.document import OriginalDocument
from breadability.tests import load_snippet


class TestOriginalDocument(unittest.TestCase):

    """Verify we can process html into a document to work off of."""

    def test_readin_min_document(self):
        """Verify we can read in a min html document"""
        doc = OriginalDocument(load_snippet('document_min.html'))
        self.assertTrue(str(doc).startswith(u'<html>'))
        self.assertEqual(doc.title, 'Min Document Title')

    def test_readin_with_base_url(self):
        """Passing a url should update links to be absolute links"""
        doc = OriginalDocument(
            load_snippet('document_absolute_url.html'),
            url="http://blog.mitechie.com/test.html")
        self.assertTrue(str(doc).startswith(u'<html>'))

        # find the links on the page and make sure each one starts with our
        # base url we told it to use.
        links = doc.links
        self.assertEqual(len(links), 3)
        # we should have two links that start with our blog url
        # and one link that starts with amazon
        link_counts = defaultdict(int)
        for link in links:
            if link.get('href').startswith('http://blog.mitechie.com'):
                link_counts['blog'] += 1
            else:
                link_counts['other'] += 1

        self.assertEqual(link_counts['blog'], 2)
        self.assertEqual(link_counts['other'], 1)

    def test_no_br_allowed(self):
        """We convert all <br/> tags to <p> tags"""
        doc = OriginalDocument(load_snippet('document_min.html'))
        self.assertIsNone(doc.html.find('.//br'))
@ -1,61 +0,0 @@
import time


#
# © 2011 Christopher Arndt, MIT License
#
class cached_property(object):
    '''Decorator for read-only properties evaluated only once within TTL
    period.

    It can be used to create a cached property like this::

        import random

        # the class containing the property must be a new-style class
        class MyClass(object):
            # create property whose value is cached for ten minutes
            @cached_property(ttl=600)
            def randint(self):
                # will only be evaluated every 10 min. at maximum.
                return random.randint(0, 100)

    The value is cached in the '_cache' attribute of the object instance that
    has the property getter method wrapped by this decorator. The '_cache'
    attribute value is a dictionary which has a key for every property of the
    object which is wrapped by this decorator. Each entry in the cache is
    created only when the property is accessed for the first time and is a
    two-element tuple with the last computed property value and the last time
    it was updated in seconds since the epoch.

    The default time-to-live (TTL) is 300 seconds (5 minutes). Set the TTL to
    zero for the cached value to never expire.

    To expire a cached property value manually just do::

        del instance._cache[<property name>]

    '''
    def __init__(self, ttl=300):
        self.ttl = ttl

    def __call__(self, fget, doc=None):
        self.fget = fget
        self.__doc__ = doc or fget.__doc__
        self.__name__ = fget.__name__
        self.__module__ = fget.__module__
        return self

    def __get__(self, inst, owner):
        now = time.time()
        try:
            value, last_update = inst._cache[self.__name__]
            if self.ttl > 0 and now - last_update > self.ttl:
                raise AttributeError
        except (KeyError, AttributeError):
            value = self.fget(inst)
            try:
                cache = inst._cache
            except AttributeError:
                cache = inst._cache = {}
            cache[self.__name__] = (value, now)
        return value
@ -0,0 +1,9 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

try:
    import unittest2 as unittest
except ImportError:
    import unittest
@ -0,0 +1,310 @@
<!DOCTYPE html>
<html lang="cs-CZ" prefix="og: http://ogp.me/ns#">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<title>Automatické zabezpečení | Zdroják</title>
<link rel="apple-touch-icon" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-iphone.png"/>
<link rel="apple-touch-icon" sizes="72x72" href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-ipad.png"/>
<link rel="apple-touch-icon" sizes="114x114"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-iphone-retina.png"/>
<link rel="apple-touch-icon" sizes="144x144"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/touch-icon-ipad-retina.png"/>
<link rel="apple-touch-icon" sizes="512x512"
href="http://www.zdrojak.cz/wp-content/themes/zdrojak/img/apple-touch-icon-itunes.png"/>
<meta name="apple-mobile-web-app-title" content="Zdrojak.cz"/>
<link rel="profile" href="http://gmpg.org/xfn/11"/>
<link rel="pingback" href="http://www.zdrojak.cz/xmlrpc.php"/>

<!-- This site is optimized with the Yoast WordPress SEO plugin v1.4.4 - http://yoast.com/wordpress/seo/ -->
<meta name="description" content="Nespoléhejte se na to, že do kódu nezapomenete na všechna místa připsat ošetření dat. Snažte se aplikaci navrhnout tak, aby se na nic zapomenout nedalo. Za cenu o něco složitějšího jádra bude veškerý kód, který ho používá, obvykle taky mnohem jednodušší."/>
<link rel="canonical" href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" />
<meta property='og:locale' content='cs_CZ'/>
<meta property='og:title' content='Automatické zabezpečení - Zdroják'/>
<meta property='og:description' content='Nespoléhejte se na to, že do kódu nezapomenete na všechna místa připsat ošetření dat. Snažte se aplikaci navrhnout tak, aby se na nic zapomenout nedalo. Za cenu o něco složitějšího jádra bude veškerý kód, který ho používá, obvykle taky mnohem jednodušší.'/>
<meta property='og:url' content='http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/'/>
<meta property='og:site_name' content='Zdroják'/>
<meta property='og:type' content='article'/>
<meta property='og:image' content='http://www.zdrojak.cz/wp-content/uploads/2008/11/security-122617975844093.png'/>
<!-- / Yoast WordPress SEO plugin. -->

<link rel="alternate" type="application/rss+xml" title="Zdroják » RSS zdroj" href="http://www.zdrojak.cz/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják » RSS komentářů" href="http://www.zdrojak.cz/comments/feed/" />
<link rel="alternate" type="application/rss+xml" title="Zdroják » RSS komentářů pro Automatické zabezpečení" href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/feed/" />
<link rel='stylesheet' id='admin-bar-css' href='http://www.zdrojak.cz/wp-includes/css/admin-bar.min.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='kindle-css-css' href='http://www.zdrojak.cz/wp-content/plugins/omSocialButtons/kindle/kindle.css?ver=3.5.1' type='text/css' media='all' />
<link rel='stylesheet' id='twentytwelve-style-css' href='http://www.zdrojak.cz/wp-content/themes/zdrojak/style.css?ver=5f54643' type='text/css' media='all' />
<script type='text/javascript' src='http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js?ver=3.5.1'></script>
<meta name="generator" content="WordPress 3.5.1" />
<!-- Google Plus -->
<script type="text/javascript">
window.___gcfg = {lang: 'cs'}; (function () {
var po = document.createElement('script');
po.type = 'text/javascript';
po.async = true;
po.src = 'https://apis.google.com/js/plusone.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(po, s);
})();
</script>
<!-- Google Plus -->
<style type="text/css">.recentcomments a{display:inline !important;padding:0 !important;margin:0 !important;}</style>
<style type="text/css" media="print">#wpadminbar { display:none; }</style>
<style type="text/css" id="custom-background-css">
body.custom-background { background-color: #e6e6e6; }
</style>
<link rel="shortcut icon" href="/favicon.ico"/>
<link rel="author" href="/humans.txt"/>
</head>
<body class="single single-post postid-3773 single-format-standard admin-bar no-customize-support custom-background">
<div id="page" class="hfeed site">


<div class="site-top">
<div class="wrapper">
</div> <div id="main" class="wrapper">
<div id="primary" class="site-content">
<div id="content" role="main">

<article id="post-3773" class="post-3773 post type-post status-publish format-standard hentry category-ruzne tag-bezpecnost tag-ruzne full">
<header class="entry-header">
<h1 class="entry-title">Automatické zabezpečení</h1>
</header>

<div class="entry-content">
<p>Úroveň zabezpečení aplikace bych rozdělil do tří úrovní:</p>
<ol>
<li>Aplikace zabezpečená není, neošetřuje uživatelské vstupy ani své výstupy.</li>
<li>Aplikace se o zabezpečení snaží, ale takovým způsobem, že na ně lze zapomenout.</li>
<li>Aplikace se o zabezpečení stará sama, prakticky se nedá udělat chyba.</li>
</ol>
<p>Jak se tyto úrovně projevují v jednotlivých oblastech?</p>
<h2><a href="http://php.vrana.cz/cross-site-scripting.php">XSS</a></h2>
<p>Druhou úroveň představuje ruční ošetřování pomocí <a href="http://www.php.net/manual/en/function.htmlspecialchars.php"><kbd>htmlspecialchars</kbd></a>. Třetí úroveň zdánlivě reprezentuje automatické ošetřování v šablonách, např. v <a href="http://doc.nette.org/cs/templating"><strong>Nette Latte</strong></a>. Proč píšu zdánlivě? Problém je v tom, že ošetření se dá obvykle snadno zakázat, např. v Latte pomocí <code>{!$var}</code>. Viděl jsem šablony plné vykřičníků i na místech, kde být neměly. Autor to vysvětlil tak, že psaní <code>{$var}</code> někde způsobovalo problémy, které po přidání vykřičníku zmizely, tak je začal psát všude.</p>
<pre class="brush: php"><?php
$safeHtml = $texy->process($content_texy);
$content = Html::el()->setHtml($safeHtml);
// v šabloně pak můžeme použít {$content}
?></pre>
<p>Ideální by bylo, když by už samotná metoda <code>process()</code> vracela instanci <code>Html</code>.</p>
<div class="social-buttons"><div class="wrapper"> <!-- Facebook -->
<div id="fb-root"></div>
<script>(function(d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (d.getElementById(id)) return;
js = d.createElement(s); js.id = id;
js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=556615607691233";
fjs.parentNode.insertBefore(js, fjs);
}(document, 'script', 'facebook-jssdk'));</script>
<!-- Facebook -->

<div class="facebook">
<div class="fb-like"
data-href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-width="450" data-layout="button_count"></div>
</div><div class="twitter">
<a href="https://twitter.com/share" class="twitter-share-button" data-via="zdrojak" data-url="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-lang="en" data-text="Automatické zabezpečení"></a>
<script>!function (d, s, id) {
var js, fjs = d.getElementsByTagName(s)[0];
if (!d.getElementById(id)) {
js = d.createElement(s);
js.id = id;
js.src = "//platform.twitter.com/widgets.js";
fjs.parentNode.insertBefore(js, fjs);
}
}(document, "script", "twitter-wjs");</script>
</div><div class="google">
<div class="g-plusone"
data-href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/" data-width="300" data-size="medium"></div>
</div><div class="kindle">
<div class="kindleWidget kindleLight">
<img src="https://d1xnn692s7u6t6.cloudfront.net/black-15.png"/><span>Kindle</span>
</div>
</div>
</div></div> <a name="bottom"></a>
</div>

</article>


</div>
<tr id="tr-comment-23965" class="comment byuser comment-author-okbob even thread-odd thread-alt depth-1 recent new">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2005/01/okbob.png' class='avatar avatar-16 photo' height='16' width='16' />okbob</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23965">trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T09:29:35+00:00">30.1.2013 v 07:29</time>
</td>
</tr><tr id="tr-comment-23966" class="comment odd alt depth-2 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Aleš Roubíček</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23966">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T09:37:35+00:00">30.1.2013 v 07:37</time>
</td>
</tr><tr id="tr-comment-23967" class="comment even depth-3 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Futrál</td>
<th class="comment-title"><span class="line"></span><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23967">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T11:23:28+00:00">30.1.2013 v 09:23</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23968" class="comment odd alt depth-2 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Futrál</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23968">Re: trochu jiný přístup</a></th>
<td class="comments-date">
<time datetime="2013-01-30T11:24:59+00:00">30.1.2013 v 09:24</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23969" class="comment even thread-even depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />Monty</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23969">Jaké ošetření sloupce?</a></th>
<td class="comments-date">
<time datetime="2013-01-30T14:30:19+00:00">30.1.2013 ve 12:30</time>
</td>
</tr><tr id="tr-comment-23971" class="comment byuser comment-author-jakub-vrana bypostauthor odd alt depth-2 recent new">
<td class="author"><img alt='' src='/wp-content/uploads/avatars/2004/12/jakubvrana.jpg' class='avatar avatar-16 photo' height='16' width='16' />Jakub Vrána</td>
<th class="comment-title"><span class="line"></span><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23971">Re: Jaké ošetření sloupce?</a></th>
<td class="comments-date">
<time datetime="2013-01-30T20:14:25+00:00">30.1.2013 ve 18:14</time>
</td>
</tr></div>
</div>
<tr id="tr-comment-23973" class="comment even thread-odd thread-alt depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />bene</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23973">Re: Automatické zabezpečení</a></th>
<td class="comments-date">
<time datetime="2013-01-31T08:58:47+00:00">31.1.2013 v 06:58</time>
</td>
</tr></div>
<tr id="tr-comment-23974" class="comment odd alt thread-even depth-1 recent new">
<td class="author"><img alt='' src='http://0.gravatar.com/avatar/ad516503a11cd5ca435acc9bb6523536?s=16' class='avatar avatar-16 photo avatar-default' height='16' width='16' />5o</td>
<th class="comment-title"><a href="http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/?show=comments#comment-23974">ACL assertion</a></th>
<td class="comments-date">
<time datetime="2013-01-31T12:04:27+00:00">31.1.2013 v 10:04</time>
</td>
</tr></div>
</table>
</div>

<div class="comments-actions">
<a class="btn add-comment" href="?show=comments#respond">Přidat komentář</a>
<a class="btn btn-primary show-comments" href="?show=comments#comments">Zobrazit komentáře</a>
</div>
</div> <nav class="nav-single no-print">
<h3 class="assistive-text">Navigace pro příspěvky</h3>
<span class="nav-previous">
<a href="http://www.zdrojak.cz/clanky/tvorba-moderniho-eshopu-kategorie-a-parametricke-hledani/" rel="next"><i class="icon-left"></i>Tvorba moderního eshopu: kategorie a parametrické hledání</a> </span>
<span class="nav-next">
<a href="http://www.zdrojak.cz/clanky/bezpecny-sandboxovany-iframe/" rel="prev">Bezpečný sandboxovaný iframe<i class="icon-right"></i></a> </span>
</nav>

<div class="visible-print">Zdroj: http://www.zdrojak.cz/?p=3773</div>


</div>

</div>


<div id="secondary" class="widget-area" role="complementary">
<aside id="recent_category_posts-0" class="with-thumb widget widget_recent_entries"> <h3 class="widget-title">Mobilní vývoj</h3> <div class="thumbnail">
<img width="180" height="180" src="http://www.zdrojak.cz/wp-content/uploads/2013/02/develcz.jpg" class="attachment- wp-post-image" alt="Mobilní vývoj" title="Mobilní vývoj" /> </div>
<ul>
<li><a href="http://www.zdrojak.cz/clanky/reportaz-z-devel-cz-konference-2013/" title="Reportáž z Devel.cz konference 2013">Reportáž z Devel.cz konference 2013</a></li>
<li><a href="http://www.zdrojak.cz/clanky/webapi-firefox-os/" title="Prohlédněte si možnosti WebAPI ve Firefox OS">Prohlédněte si možnosti WebAPI ve Firefox OS</a></li>
<li><a href="http://www.zdrojak.cz/clanky/nova-vyvojarska-konzole-v-google-play/" title="Nová Vývojářská konzole v Google Play">Nová Vývojářská konzole v Google Play</a></li>
</ul>
</aside> <aside id="recent-posts-0" class="widget widget_recent_entries"> <h3 class="widget-title">Nejnovější příspěvky</h3> <ul>
<li>
<a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/" title="Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/poslete-zdrojak-do-kindlu-novinky/" title="Pošlete Zdroják do Kindlu a další jarní novinky">Pošlete Zdroják do Kindlu a další jarní novinky</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/formulare-v-html5-a-nove-atributy/" title="Formuláře v HTML5 a nové atributy">Formuláře v HTML5 a nové atributy</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/modularni-webove-aplikace-s-minimem-rucni-prace/" title="Modulární webové aplikace s minimem ruční práce">Modulární webové aplikace s minimem ruční práce</a>
</li>
<li>
<a href="http://www.zdrojak.cz/clanky/zapier-propojovani-api/" title="Zapier – dejte propojování API ten správný šmrnc">Zapier – dejte propojování API ten správný šmrnc</a>
</li>
</ul>
</aside><aside id="related_posts-0" class="widget widget_related_posts"><h3 class="widget-title">Související články</h3><ul><li><a href="http://www.zdrojak.cz/clanky/do-hlubin-implementaci-javascriptu-1-dil-uvod/" rel="bookmark">Do hlubin implementací JavaScriptu: 1. díl – úvod</a></li><li><a href="http://www.zdrojak.cz/clanky/uvodni-analyza-pro-moderni-e-shop/" rel="bookmark">Úvodní analýza pro moderní e-shop</a></li><li><a href="http://www.zdrojak.cz/clanky/django-databazovy-model/" rel="bookmark">Django: Databázový model</a></li><li><a href="http://www.zdrojak.cz/clanky/dojo-toolkit-pokrocile-techniky/" rel="bookmark">Dojo Toolkit: pokročilé techniky</a></li><li><a href="http://www.zdrojak.cz/clanky/uvod-do-architektury-mvc/" rel="bookmark">Úvod do architektury MVC</a></li></ul></aside><aside id="recent-comments-0" class="widget widget_recent_comments"><h3 class="widget-title">Nejnovější komentáře</h3><ul id="recentcomments"><li class="recentcomments">Martin Kozák u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24239">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">maras u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24238">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">Ladislav Thon u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24237">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">satai u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24236">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li><li class="recentcomments">Ladislav Thon u <a href="http://www.zdrojak.cz/clanky/mloc-js-staticke-typovani/?show=comments#comment-24235">Typické! O maďarské konferenci mloc.js a statickém typování při vývoji webových aplikací</a></li></ul></aside> </div><!-- #secondary -->
</div>
<div id="bottom" class="site-bottom no-print">
<div class="wrapper">
<aside class="follow">
<h3>Sledujte</h3>

<div class="content">
<ul class="menu">
<li>
<a href="/feed" class="rss">
<i class="icon-rss"></i>
<span class="assistive-text">RSS</span>
</a>
</li>
<li>
<a href="/mail" class="mail" title="Zpravodaj">
<i class="icon-mail"></i>
<span class="assistive-text">Zpravodaj</span>
</a>
</li>
<li>
<a target="_blank" href="https://twitter.com/zdrojak" title="@zdrojak" class="twitter">
<i class="icon-twitter"></i>
<span class="assistive-text">Twitter @zdrojak</span>
</a>
</li>
<li>
<a target="_blank" href="https://plus.google.com/101725826130888424314/posts" class="googleplus">
<i class="icon-gplus"></i>
<span class="assistive-text">Google + Zdroják</span>
</a>
</li>
</ul>
</div>
</aside> </div>
</div>


</div>

<script type='text/javascript' src='http://www.zdrojak.cz/wp-includes/js/admin-bar.min.js?ver=3.5.1'></script>
<script type='text/javascript'>
/* <![CDATA[ */
(function k(){window.$SendToKindle&&window.$SendToKindle.Widget?$SendToKindle.Widget.init({"title":".entry-title","published":".entry-date","content":".post","exclude":".no-kindle"}):setTimeout(k,500);})();
/* ]]> */
</script>
<script type='text/javascript' src='http://d1xnn692s7u6t6.cloudfront.net/widget.js'></script>
<script type='text/javascript' src='http://www.zdrojak.cz/wp-content/themes/zdrojak/js/zdrojak.min.js?ver=5f54643'></script>

<div id="webstats">
<script src="http://c1.navrcholu.cz/code?site=72;t=t1x1" type="text/javascript"></script>
<noscript>
<div>
<img src="http://c1.navrcholu.cz/hit?site=72;t=t1x1;ref=;jss=0" width="1" height="1" alt=""/>
</div>
</noscript>

<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-30960355-1']);
_gaq.push(['_trackPageview']);

(function () {
var ga = document.createElement('script');
ga.type = 'text/javascript';
ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(ga, s);
})();
</script>
</div>
<!-- d0e88e1c2777b950889c170b9b92adf8 -->
</body>
</html>
@ -0,0 +1,21 @@
<html>
<head>
<meta http-equiv="charset" content="utf-8"/>
<title>This is title of document</title>
</head>
<body>
<div>Inline text is not so good, but it's here.</div>
<div class="article">
<div class="wrapper">
<p>
Paragraph is more <em>better</em>.
This text is very <strong>pretty</strong> 'cause she's girl.
</p>
<p>
This is not <big>crap</big> so <dfn title="Make me readable">readability</dfn> me :)
</p>
</div>
</div>
<div>And some next not so <b>good</b> text.</div>
</body>
</html>
@ -0,0 +1,18 @@
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Paragraphs</title>
</head>
<body>
<div>
<h1>Nadpis H1, ktorý chce byť prvý s textom ale predbehol ho "title"</h1>
<p>
Toto je prvý odstavec a to je fajn.
</p>
<p>
Tento text je tu aby vyplnil prázdne miesto v srdci súboru.
Aj súbory majú predsa city.
</p>
</div>
</body>
</html>
@ -0,0 +1,35 @@
# -*- coding: utf8 -*-

from __future__ import print_function

import sys
import atexit
import nose

from os.path import dirname, abspath


DEFAULT_PARAMS = [
    "nosetests",
    "--with-coverage",
    "--cover-package=readability",
    "--cover-erase",
]


@atexit.register
def exit_function(msg="Shutting down"):
    print(msg, file=sys.stderr)


def run(argv=[]):
    sys.exitfunc = exit_function

    nose.run(
        argv=DEFAULT_PARAMS + argv,
        defaultTest=abspath(dirname(__file__)),
    )


if __name__ == "__main__":
    run(sys.argv[1:])
@ -0,0 +1,169 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from lxml.html import fragment_fromstring, document_fromstring
from readability.readable import Article
from readability.annotated_text import AnnotatedTextHandler
from .compat import unittest
from .utils import load_snippet, load_article


class TestAnnotatedText(unittest.TestCase):
    def test_simple_document(self):
        dom = fragment_fromstring("<p>This is\n\tsimple\ttext.</p>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("This is\nsimple text.", None),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_empty_paragraph(self):
        dom = fragment_fromstring("<div><p>Paragraph <p>\t \n</div>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("Paragraph", None),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_multiple_paragraphs(self):
        dom = fragment_fromstring("<div><p> 1 first<p> 2\tsecond <p>3\rthird </div>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("1 first", None),
            ),
            (
                ("2 second", None),
            ),
            (
                ("3\nthird", None),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_single_annotation(self):
        dom = fragment_fromstring("<div><p> text <em>emphasis</em> <p> last</div>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("text", None),
                ("emphasis", ("em",)),
            ),
            (
                ("last", None),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_recursive_annotation(self):
        dom = fragment_fromstring("<div><p> text <em><i><em>emphasis</em></i></em> <p> last</div>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("text", None),
                ("emphasis", ("em", "i")),
            ),
            (
                ("last", None),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_annotations_without_explicit_paragraph(self):
        dom = fragment_fromstring("<div>text <strong>emphasis</strong>\t<b>hmm</b> </div>")
        annotated_text = AnnotatedTextHandler.parse(dom)

        expected = [
            (
                ("text", None),
                ("emphasis", ("strong",)),
                ("hmm", ("b",)),
            ),
        ]
        self.assertEqual(annotated_text, expected)

    def test_process_paragraph_with_chunked_text(self):
        handler = AnnotatedTextHandler()
        paragraph = handler._process_paragraph([
            (" 1", ("b", "del")),
            (" 2", ("b", "del")),
            (" 3", None),
            (" 4", None),
            (" 5", None),
            (" 6", ("em",)),
        ])

        expected = (
            ("1 2", ("b", "del")),
            ("3 4 5", None),
            ("6", ("em",)),
        )
        self.assertEqual(paragraph, expected)

    def test_include_heading(self):
        dom = document_fromstring(load_snippet("h1_and_2_paragraphs.html"))
        annotated_text = AnnotatedTextHandler.parse(dom.find("body"))

        expected = [
            (
                ('Nadpis H1, ktorý chce byť prvý s textom ale predbehol ho "title"', ("h1",)),
                ("Toto je prvý odstavec a to je fajn.", None),
            ),
            (
                ("Tento text je tu aby vyplnil prázdne miesto v srdci súboru.\nAj súbory majú predsa city.", None),
            ),
        ]
        self.assertSequenceEqual(annotated_text, expected)

    def test_real_article(self):
        article = Article(load_article("zdrojak_automaticke_zabezpeceni.html"))
        annotated_text = article.main_text

        expected = [
            (
                ("Automatické zabezpečení", ("h1",)),
                ("Úroveň zabezpečení aplikace bych rozdělil do tří úrovní:", None),
            ),
            (
                ("Aplikace zabezpečená není, neošetřuje uživatelské vstupy ani své výstupy.", ("li", "ol")),
                ("Aplikace se o zabezpečení snaží, ale takovým způsobem, že na ně lze zapomenout.", ("li", "ol")),
                ("Aplikace se o zabezpečení stará sama, prakticky se nedá udělat chyba.", ("li", "ol")),
            ),
            (
                ("Jak se tyto úrovně projevují v jednotlivých oblastech?", None),
            ),
            (
                ("XSS", ("a", "h2")),
                ("Druhou úroveň představuje ruční ošetřování pomocí", None),
                ("htmlspecialchars", ("a", "kbd")),
                (". Třetí úroveň zdánlivě reprezentuje automatické ošetřování v šablonách, např. v", None),
                ("Nette Latte", ("a", "strong")),
                (". Proč píšu zdánlivě? Problém je v tom, že ošetření se dá obvykle snadno zakázat, např. v Latte pomocí", None),
                ("{!$var}", ("code",)),
                (". Viděl jsem šablony plné vykřičníků i na místech, kde být neměly. Autor to vysvětlil tak, že psaní", None),
                ("{$var}", ("code",)),
                ("někde způsobovalo problémy, které po přidání vykřičníku zmizely, tak je začal psát všude.", None),
            ),
            (
                ("<?php\n$safeHtml = $texy->process($content_texy);\n$content = Html::el()->setHtml($safeHtml);\n// v šabloně pak můžeme použít {$content}\n?>", ("pre",)),
            ),
            (
                ("Ideální by bylo, když by už samotná metoda", None),
                ("process()", ("code",)),
                ("vracela instanci", None),
                ("Html", ("code",)),
                (".", None),
            ),
        ]
        self.assertSequenceEqual(annotated_text, expected)
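The tests above pin down the shape of the annotated-text output: a list of paragraphs, each a tuple of `(text, annotation-tags)` chunks. A minimal stdlib sketch of that flattening idea is below; `annotate` is an illustrative stand-in, not the library's real `AnnotatedTextHandler`, and it assumes a well-formed XML fragment rather than the lenient lxml parsing the tests use:

```python
import xml.etree.ElementTree as ET


def annotate(fragment):
    """Flatten each <p> of a well-formed fragment into (text, tags) chunks.

    Toy approximation of the annotated-text structure asserted by the
    tests above: plain text gets annotation None, text inside an inline
    element gets a tuple of tag names.
    """
    root = ET.fromstring(fragment)
    paragraphs = []
    for p in root.iter("p"):
        chunks = []
        if p.text and p.text.strip():
            chunks.append((p.text.strip(), None))
        for child in p:
            # text inside the inline element carries its tag as annotation
            if child.text and child.text.strip():
                chunks.append((child.text.strip(), (child.tag,)))
            # text after the closing tag is unannotated again
            if child.tail and child.tail.strip():
                chunks.append((child.tail.strip(), None))
        paragraphs.append(tuple(chunks))
    return paragraphs
```

For example, `annotate("<div><p>text <em>emphasis</em></p><p>last</p></div>")` yields the same two-paragraph structure as `test_single_annotation` expects.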
@ -1,11 +1,12 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

import os
try:
    # Python < 2.7
    import unittest2 as unittest
except ImportError:
    import unittest

from breadability.readable import Article
from readability.readable import Article
from ...compat import unittest


class TestAntipopeBlog(unittest.TestCase):
@ -0,0 +1,91 @@
|
||||
# -*- coding: utf8 -*-
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division, print_function, unicode_literals
|
||||
|
||||
from collections import defaultdict
|
||||
from readability._compat import to_unicode, to_bytes
|
||||
from readability.document import (OriginalDocument, determine_encoding,
|
||||
convert_breaks_to_paragraphs)
|
||||
from .compat import unittest
|
||||
from .utils import load_snippet
|
||||
|
||||
|
||||
class TestOriginalDocument(unittest.TestCase):
|
||||
"""Verify we can process html into a document to work off of."""
|
||||
|
||||
def test_convert_br_tags_to_paragraphs(self):
|
||||
returned = convert_breaks_to_paragraphs(
|
||||
"<div>HI<br><br>How are you?<br><br> \t \n <br>Fine\n I guess</div>")
|
||||
|
||||
self.assertEqual(returned,
|
||||
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
|
||||
|
||||
def test_convert_hr_tags_to_paragraphs(self):
|
||||
returned = convert_breaks_to_paragraphs(
|
||||
"<div>HI<br><br>How are you?<hr/> \t \n <br>Fine\n I guess</div>")
|
||||
|
||||
self.assertEqual(returned,
|
||||
"<div>HI</p><p>How are you?</p><p>Fine\n I guess</div>")
|
||||
|
||||
def test_readin_min_document(self):
|
||||
"""Verify we can read in a min html document"""
|
||||
doc = OriginalDocument(load_snippet('document_min.html'))
|
||||
self.assertTrue(to_unicode(doc).startswith('<html>'))
|
||||
self.assertEqual(doc.title, 'Min Document Title')
|
||||
|
||||
    def test_readin_with_base_url(self):
        """Passing a url should update links to be absolute links"""
        doc = OriginalDocument(
            load_snippet('document_absolute_url.html'),
            url="http://blog.mitechie.com/test.html")
        self.assertTrue(to_unicode(doc).startswith('<html>'))

        # find the links on the page and make sure each one starts with our
        # base url we told it to use.
        links = doc.links
        self.assertEqual(len(links), 3)
        # we should have two links that start with our blog url
        # and one link that starts with amazon
        link_counts = defaultdict(int)
        for link in links:
            if link.get('href').startswith('http://blog.mitechie.com'):
                link_counts['blog'] += 1
            else:
                link_counts['other'] += 1

        self.assertEqual(link_counts['blog'], 2)
        self.assertEqual(link_counts['other'], 1)

    def test_no_br_allowed(self):
        """We convert all <br/> tags to <p> tags"""
        doc = OriginalDocument(load_snippet('document_min.html'))
        self.assertIsNone(doc.dom.find('.//br'))

    def test_empty_title(self):
        """An empty <title> element yields an empty title string"""
        document = OriginalDocument("<html><head><title></title></head><body></body></html>")
        self.assertEqual(document.title, "")

    def test_title_only_with_tags(self):
        """A <title> containing only other tags yields an empty title string"""
        document = OriginalDocument("<html><head><title><em></em></title></head><body></body></html>")
        self.assertEqual(document.title, "")

    def test_no_title(self):
        """A document with no <title> element yields an empty title string"""
        document = OriginalDocument("<html><head></head><body></body></html>")
        self.assertEqual(document.title, "")

    def test_encoding(self):
        text = "ľščťžýáíéäúňôůě".encode("iso-8859-2")
        # smoke test: detection should complete and return some encoding name
        encoding = determine_encoding(text)
        self.assertTrue(encoding)

    def test_encoding_short(self):
        text = "ľščťžýáíé".encode("iso-8859-2")
        encoding = determine_encoding(text)
        self.assertEqual(encoding, "utf8")

        text = to_bytes("ľščťžýáíé")
        encoding = determine_encoding(text)
        self.assertEqual(encoding, "utf8")
@ -0,0 +1,23 @@
# -*- coding: utf8 -*-

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

from os.path import abspath, dirname, join


TEST_DIR = abspath(dirname(__file__))


def load_snippet(file_name):
    """Helper to fetch in the content of a test snippet."""
    file_path = join(TEST_DIR, "data/snippets", file_name)
    with open(file_path, "rb") as file:
        return file.read()


def load_article(file_name):
    """Helper to fetch in the content of a test article."""
    file_path = join(TEST_DIR, "data/articles", file_name)
    with open(file_path, "rb") as file:
        return file.read()
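For readers skimming the diff, the break-collapsing behavior that `test_convert_br_tags_to_paragraphs` and `test_convert_hr_tags_to_paragraphs` assert can be sketched in isolation. This regex version is an illustrative approximation written to match the expected strings in those tests, not the library's actual implementation of `convert_breaks_to_paragraphs`:

```python
import re

def convert_breaks_to_paragraphs_sketch(html):
    """Collapse runs of <br>/<hr/> tags (with optional whitespace
    between them) into a single paragraph break, </p><p>."""
    # one or more break tags, each optionally self-closed and
    # optionally followed by whitespace
    pattern = re.compile(r"(?:<(?:br|hr)\s*/?>\s*)+")
    return pattern.sub("</p><p>", html)

# same input as test_convert_br_tags_to_paragraphs above
result = convert_breaks_to_paragraphs_sketch(
    "<div>HI<br><br>How are you?<br><br> \t \n <br>Fine\n I guess</div>")
```

Note the sketch reproduces the asserted output exactly, including the unbalanced leading `</p>` inside the `<div>`, since the replacement does not try to open a paragraph before the first text run.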