FiveFilters.org

Support Center

Site Patterns

Last Updated: Jun 24, 2014 01:02AM CEST

This information is for customers of our self-hosted version of Full-Text RSS. The information refers to version 3.0 of Full-Text RSS. For customers on the premium hosted plan, please contact us if you need us to improve extraction for particular sites.

Site patterns allow you to specify what should be extracted from specific sites.

Site patterns can be used if our automatic content extraction fails to pick out the correct content block for a particular site, or if additional fine tuning is required (e.g. to strip undesirable elements within the content block, to include images which are not being included, to follow a single-page link on multi-page articles.).

We have an experimental tool to help you create a simple one through a point-and-click interface.

Important: The rest of this document is for advanced users. If you feel you need a new site config file for something you're trying to extract and you're uncomfortable creating one yourself, you can ask us to create one for you. Request a site config file.

How it Works

After we fetch the contents of a URL, we use the hostname (e.g. mysite.example.org) and check to see if a config file exists for that hostname. If there is a matching file (e.g. mysite.example.org.txt or .example.org.txt) in one of the config folders, we will fetch the rules it contains. If no such file is found, we attempt to detect the appropriate content block and title automatically. When there is a matching site config, we first try to use the patterns within to extract the content. If these patterns fail to match, we will, by default, revert to auto-detection — this gives us another chance to get at the content (useful when site redesigns invalidate our stored patterns).

A simple example of what might be found in the site config file for processing Wikipedia entries:

body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no

Fingerprints: Matching Without Hostnames

Some sites and blogging platforms (e.g. WordPress.com, Posterous.com, Blogger.com) allow users to map their own domains to their accounts. So, for example, a Posterous user might choose to map example.org to example.posterous.com. This poses a problem for Full-Text RSS because if the content on this site was previously extracted using our Posterous site config file (.posterous.com.txt) it will no longer match when accessed at example.org (unless we save a copy of the .posterous.com.txt as example.org.txt).

To get around this problem, we have introduced an additional way to identify sites and trigger site-specific extraction. Since Full-Text RSS 2.9, it's now possible to associate a fragment of HTML as a site's identifier (or fingerprint). Fingerprints, rather than identifying a site using its domain names, use HTML fragment strings which are likely to appear in the HTML as identifiers for particular sites. Going back to our Posterous example, we can use '<meta content="Posterous" name="generator" />' as an identifier string for Posterous and map it to the .posterous.com.txt site config file. These can be added in custom_config.php (to see how these should look, open config.php and look for $options->fingerprints).

Location of site config files

In the site_config folder you will find two subfolders: 'standard' and 'custom'. FTR comes with a number of existing site patterns in the 'standard' folder. It's possible to change these, but we suggest users place their own site patterns in the custom folder to prevent future updates overwriting their site patterns.

Global rules

Global site config files accept everything regular site config file does, but are applied to all sites, whether or not a specific site config matches. There is a global site config in the site_config/standard/ folder. To add your own rules, the global site config file should be named global.txt and placed inside site_config/custom/.

Site config merging

Full-Text RSS looks for site config files in the following order:

  1. URL hostname match or wildcard match in the site_config/custom/
  2. URL hostname match or wildcard match in the site_config/standard/
  3. fingerprint match (HTML fragment mapping to hostname) in site_config/custom/
  4. fingerprint match (HTML fragment mapping to hostname) in site_config/standard/
  5. global rules in site_config/custom/global.txt
  6. global rules in site_config/standard/global.txt

Any matching files are merged, with rules from files higher up in the list taking priority. If, however, one of these files contains autodetect_on_failure: no then the files beneath it will not be merged.

XPath

To select elements for extraction or removal, we use XPath. If you're not familiar with the syntax, there's a nice tutorial here: XPath 1.0 tutorial.

Pattern Format

The pattern format has been borrowed from Instapaper. You'll find plenty of examples by opening the text files inside the site_config/standard folder or by browsing our GitHub repository. We make use of the patterns provided by Instapaper and, in the same spirit, have made available our own additions for anyone to use.

We currently recognise the following directives:

title: [XPath]

The page title. XPaths evaluating to strings are also accepted. Multiple statements accepted. Will evaluate in order until a result is found. If not specified or no match found, it will be auto-detected. The title is placed in the <title> element, inside the feed's item element.

Note: if the input URL points to a feed, the item's title in the feed will not be changed - we assume item titles in feeds are not truncated. Since Full-Text RSS 3.1 you can use the extracted title. To do so you should pass &use_extracted_title in the querystring. To enable/disable this for for all feeds, see the config file - specifically $options->favour_feed_titles

Example
		title: //h1[@id='title']

body: [XPath]

The body of the article. Multiple statements accepted. Will evaluate in order until a result is found. If not specified or no match found, it will be auto-detected. The body is placed in the <description> element, inside the feed's item element.

Example
		body: //div[@id='body']
		# also possible to specify multiple 
		# elements to be concatenated:
		body: //div[@class='summary'] | //div[@id='body']

date: [XPath]

The publication date. XPaths evaluating to strings are also accepted. Multiple statements accepted. Will evaluate in order until a result is found. The date is placed in the <pubDate> element, inside the feed's item element.

Example
		date: //span[@class='date']

author: [XPath]

The author(s) of the piece. XPaths evaluating to strings are also accepted. Multiple statements accepted. Will evaluate in order until a result is found. Each author is placed in a <dc:author> element, inside the feed's item element.

Example
		author: //span[@class='author']

strip: [XPath]

Strip any matching elements and their children. Multiple statements accepted.

Example
		strip: //div[@class='hidden']
		strip: //div[@id='content']//p[@class='promo']

strip_id_or_class: [string]

Strip any element whose @id or @class contains this substring. Multiple statements accepted.

Example
		strip_id_or_class: hidden
		strip_id_or_class: navigation

strip_image_src: [string]

Strip any <img> element where @src attribute contains this substring. Multiple statements accepted.

Example
		strip_image_src: /advert/
		strip_image_src: /tracker/

tidy: [yes|no] (default: yes)

Preprocess with Tidy. Tidy usually helps clean up the HTML for processing. It can, however, sometimes make matters worse. If it does, try setting this to no. (This setting may affect the final DOM tree produced, and with it affect your xpath expressions – so if your xpath is failing to match the desired elements, try setting this to 'no' to see if helps.)

Example
		tidy: no

prune: [yes|no] (default: yes)

Strip elements within body that do not resemble content elements. Sometimes this leads to elements which you'd like to keep from being stripped. If that happens, set this to no.

Example
		prune: no

autodetect_on_failure: [yes|no] (default: yes)

If set to no, we will not attempt to auto-detect the title or content block if the given title/body expressions fail to match.

Example
		autodetect_on_failure: no

single_page_link: [XPath]

Identifies a link element or URL pointing to the page holding the entire article. This is useful for sites which split their articles across multiple pages. Links to such pages tend to display the first page with links to the other pages at the bottom. Often there is also a link to a page which displays the entire article on one page (e.g. 'print view' or 'single page'). This should be an XPath expression identifying the link to that page. If present and we find a match, we will retrieve that page and the site config options will be applied to the new page.

Examples
		single_page_link: //span[@class='singlePage']/a
		single_page_link: //a[contains(@href, '/print/')]

single_page_link_in_feed: [XPath]

Same as above, but applied to item description HTML taken from feed. Please be aware that the same article URL may appear in a variety of feeds which do not always contain the same item description. If both single_page_link and single_page_link_in_feed appear in the site config, single_page_link_in_feed will be ignored.

Example
		single_page_link_in_feed: //a

next_page_link: [XPath]

Identifies a link element or URL pointing to the next page of a multi-page article. This is useful for sites which split their articles across multiple pages but do not offer a single page view (if a single page view is provided, please use single_page_link instead - it'll be much faster). If present and we find a match, we will retrieve that page and the site config options will be applied to the new page. If the next_page_link matches a link on this new page, the process will continue. Finally the content will be joined together. (Introduced in version 3.0.)

Examples
		next_page_link: //a[@id='next-page']
		next_page_link: //a[contains(text(), 'Next page')]

replace_string([string to find]): [replacement string]

Simple find and replace to be performed on HTML before processing.

Example
		replace_string(<p />): <br /><br />
		# alternatively in version 3.0 you can write the above as: 
		find_string: <p /> 
		replace_string: <br /><br />

parser: [string]

By default we rely on PHP’s fast libxml parser. For sites where this proves problematic, you can now specify html5lib – a PHP implementation of a HTML parser based on the HTML5 spec. (Introduced in version 3.0.)

Note: html5lib is a slower parser and still quite buggy.

Example
		parser: html5lib

test_url: [string]

A URL to use to test the pattern. In future, we’ll have a tool which will use this to automatically test if the patterns in the file are still valid. Must be URL of an article from this site, not the site's front page. One or more.

Example
		test_url: http://www.example.org/2011/05/what-a-day/

# comments

Lines beginning with # are ignored.

Example
		# this is an advert
		strip: //img[@class='ad']
help@fivefilters.org
http://assets2.desk.com/
false
fivefilters
Loading
seconds ago
a minute ago
minutes ago
an hour ago
hours ago
a day ago
days ago
about
false
Invalid characters found
/customer/en/portal/articles/autocomplete