{ Snipperize } /html
Snippets about html
HTML Content Extractor
The Algorithm Of Content Extraction

Extracting the main content from a web page is a complex task. A general, robust solution would probably require natural language understanding, the ability to render and visually analyze a webpage (at least partially), and so on. In short, an advanced AI. However, there are simpler ways to solve this if you are willing to accept the tradeoffs. You just need to find the right heuristics. My content extractor class started out as a PHP port of HTML::ContentExtractor, so some of the ideas below might seem familiar if you've used that Perl module.

First, assume that content = text. Text is measured both by word count and by character count. Short, solitary lines of text are probably not part of the main content. Furthermore, webpage blocks that contain a lot of links and little unlinked text probably contain navigation elements. So, we remove anything that is not the main content and return what's left.

To determine whether something is part of the content (and should be retained) or is unimportant (and can be discarded), the script applies the following rules:

- The length of a text block. If it's shorter than the given threshold, the block is deleted.
- The links / unlinked_words ratio. If it's higher than a certain threshold, the block is deleted.
- A number of "uninteresting" tags are unconditionally removed. These include <script>, <style>, <img> and so on.

You can specify both the minimal text length and the ratio threshold when using the script. If you don't, they will be calculated automatically. The minimal length will be set to the average length of a text block in the current webpage, and the maximum link/text ratio will be set to average ratio * 1.30. In my experience, these defaults work fairly well for a wide range of websites.

Technical: All the thresholds and other values are calculated at the level of DOM nodes that correspond to predefined "container" tags (e.g. <div>, <table>, <ul>, etc.). The DOM tree is traversed recursively (depth-first) in two passes. The first pass removes the "uninteresting" tags and calculates the average values used for auto-configuration. The second pass applies the thresholds. The script is not Unicode-safe.

More Ideas

- Automatically remove blocks that contain predefined phrases, for example "all rights reserved". HTML::ContentExtractor does this.
- Use a different HTML segmentation logic. Using the DOM for segmentation may be inefficient in complex, markup-heavy pages. At the other end of the scale, primitive websites that don't embrace "semantic markup" may be impossible to segment using the DOM.
- Use Bayesian filters to find blocks of text that are not part of the content, for example various copyright notices and sponsored links.
- Invent a heuristic to filter out comments on blog posts and forums. Here's an idea: find a short text block that mentions "Comments" and has a relatively large text block before it.
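
The two-pass approach described under The Algorithm Of Content Extraction can be sketched roughly as follows. This is not the original class, only a minimal illustration under several assumptions: it uses Python's standard html.parser instead of a real DOM, flattens the page into one block per container tag rather than deleting nodes recursively, measures length by word count only, expects well-nested markup, and ignores text that sits outside any container. The names BlockCollector, link_ratio, extract_content, CONTAINER_TAGS and SKIPPED_TAGS are hypothetical.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"div", "table", "ul", "td"}  # assumed set of "container" tags
SKIPPED_TAGS = {"script", "style"}             # content dropped unconditionally


class BlockCollector(HTMLParser):
    """First pass: split the page into text blocks, one per container tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # every block seen, in document order
        self.stack = []      # container blocks that are currently open
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.link_depth = 0  # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
            if self.stack:
                self.stack[-1]["links"] += 1
        elif tag in CONTAINER_TAGS:
            block = {"words": [], "links": 0, "unlinked": 0}
            self.blocks.append(block)
            self.stack.append(block)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CONTAINER_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        words = data.split()
        block = self.stack[-1]  # text belongs to the innermost open container
        block["words"].extend(words)
        if not self.link_depth:
            block["unlinked"] += len(words)


def link_ratio(block):
    """links / unlinked_words, guarding against division by zero."""
    return block["links"] / max(block["unlinked"], 1)


def extract_content(html, min_words=None, max_ratio=None):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    blocks = [b for b in collector.blocks if b["words"]]
    if not blocks:
        return ""
    # Auto-configuration: average block length, and average ratio * 1.30.
    if min_words is None:
        min_words = sum(len(b["words"]) for b in blocks) / len(blocks)
    if max_ratio is None:
        max_ratio = 1.30 * sum(link_ratio(b) for b in blocks) / len(blocks)
    # Second pass: keep only blocks that satisfy both thresholds.
    kept = [b for b in blocks
            if len(b["words"]) >= min_words and link_ratio(b) <= max_ratio]
    return "\n".join(" ".join(b["words"]) for b in kept)


SAMPLE = """
<html><body>
<div><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact</a></div>
<div>Extracting the main content from a web page is a complex task,
but simple heuristics on block length and link density go a long way.</div>
<div>All rights reserved</div>
<script>var tracking = "ignore this text";</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_content(SAMPLE))
```

With the automatic thresholds, the navigation block is rejected by the link ratio and the short footer by the length check, so only the long paragraph should remain. Passing min_words and max_ratio explicitly overrides the auto-configuration, which mirrors the two optional parameters mentioned in the text.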
Python / html, parser, content, extractor, dom / by ThePeppersStudio (151 days, 21.41 hours ago)
htmlSQL
htmlSQL is an experimental PHP class that lets you access HTML values using an SQL-like syntax. This means that you don't have to write complex functions or regular expressions to extract specific values.
PHP / html, sql, parser / by ThePeppersStudio (152 days, 19.07 hours ago)
Strip HTML Tags
Objective-C / strip, html, tags, NSScanner / by ThePeppersStudio (158 days, 13.86 hours ago)
Decode HTML Entities
Python / decode, html, htmlentities, htmlentitydefs, name2codepoint / by ThePeppersStudio (190 days, 15.27 hours ago)
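The snippet body for this entry is not present on the page. Going only by its tags (htmlentitydefs, name2codepoint), a decoder along these lines seems plausible. This is a sketch, not the original snippet: it assumes Python 3, where the Python 2 module htmlentitydefs is available as html.entities, and the names decode_entities and _ENTITY_RE are hypothetical.

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import ...

# Matches named (&amp;), decimal (&#232;) and hexadecimal (&#x263A;) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace HTML entity references with the characters they stand for."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))
            if body.startswith("#"):
                return chr(int(body[1:]))
            return chr(name2codepoint[body])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid code point: leave the text untouched.
            return match.group(0)

    return _ENTITY_RE.sub(replace, text)


if __name__ == "__main__":
    print(decode_entities("caf&eacute; &amp; cr&#232;me"))
```

In Python 3.4 and later, the standard library function html.unescape covers the same ground and is usually the simpler choice.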
Python HTML/CSS to PDF converter
pisa is an html2pdf converter that uses the ReportLab Toolkit, HTML5lib and pyPdf. It supports HTML 5 and CSS 2.1 (and some of CSS 3). It is written entirely in pure Python, so it is platform independent. The main benefit of this tool is that a user with Web skills like HTML and CSS can generate PDF templates very quickly without learning new technologies. It integrates easily into Python frameworks like CherryPy, KID Templating, TurboGears, Django, Zope, Plone, Google AppEngine (GAE) etc. (see the 'demo' folder for examples).
Python / html, pdf, css / by ThePeppersStudio (227 days, 12.58 hours ago)

