This information is for customers of our self-hosted version of Full-Text RSS. The information refers to version 3.0 of Full-Text RSS. For customers on the premium hosted plan, please contact us if you need us to improve extraction for particular sites.
Site patterns allow you to specify what should be extracted from specific sites.
Site patterns can be used if our automatic content extraction fails to pick out the correct content block for a particular site, or if additional fine-tuning is required (e.g. to strip undesirable elements within the content block, to include images that would otherwise be missed, or to follow a single-page link on multi-page articles).
We have an experimental tool to help you create a simple site config file through a point-and-click interface.
Important: The rest of this document is for advanced users. If you feel you need a new site config file for something you're trying to extract and you're uncomfortable creating one yourself, you can ask us to create one for you. Request a site config file.
How it Works
After we fetch the contents of a URL, we use the hostname (e.g. mysite.example.org) and check to see if a config file exists for that hostname. If there is a matching file (e.g. mysite.example.org.txt or .example.org.txt) in one of the config folders, we will fetch the rules it contains. If no such file is found, we attempt to detect the appropriate content block and title automatically. When there is a matching site config, we first try to use the patterns within to extract the content. If these patterns fail to match, we will, by default, revert to auto-detection — this gives us another chance to get at the content (useful when site redesigns invalidate our stored patterns).
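To make the lookup concrete, here is a minimal sketch (in Python, purely for illustration — Full-Text RSS itself is PHP) of how a hostname maps to candidate config filenames. The handling of a leading 'www.' is an assumption, not taken from the text above:

```python
from urllib.parse import urlparse

def candidate_config_files(url):
    """Return config filenames to try for a URL, most specific first:
    the exact hostname, then wildcard files produced by stripping
    leading subdomains (so .example.org.txt covers mysite.example.org)."""
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[4:]  # assumption: a leading 'www.' is ignored for matching
    labels = host.split('.')
    candidates = [host + '.txt']
    for i in range(1, len(labels) - 1):
        candidates.append('.' + '.'.join(labels[i:]) + '.txt')
    return candidates
```

Each candidate would be looked up first in site_config/custom/, then in site_config/standard/, before falling back to auto-detection.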
A simple example of what might be found in the site config file for processing Wikipedia entries:
body: //div[@id = 'content']
strip_id_or_class: editsection
strip_id_or_class: toc
prune: no
Fingerprints: Matching Without Hostnames
Some sites and blogging platforms (e.g. WordPress.com, Posterous.com, Blogger.com) allow users to map their own domains to their accounts. So, for example, a Posterous user might choose to map example.org to example.posterous.com. This poses a problem for Full-Text RSS because if the content on this site was previously extracted using our Posterous site config file (.posterous.com.txt) it will no longer match when accessed at example.org (unless we save a copy of the .posterous.com.txt as example.org.txt).
To get around this problem, we have introduced an additional way to identify sites and trigger site-specific extraction. Since Full-Text RSS 2.9, it's now possible to associate a fragment of HTML as a site's identifier (or fingerprint). Fingerprints, rather than identifying a site using its domain names, use HTML fragment strings which are likely to appear in the HTML as identifiers for particular sites. Going back to our Posterous example, we can use '<meta content="Posterous" name="generator" />' as an identifier string for Posterous and map it to the .posterous.com.txt site config file. These can be added in custom_config.php (to see how these should look, open config.php and look for $options->fingerprints).
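For illustration, a fingerprint entry in custom_config.php might look like the fragment below. The array shape shown here is an assumption modelled on the default entries — open config.php and check $options->fingerprints for the exact format before copying it:

```php
// custom_config.php (illustrative fragment - verify the array shape
// against $options->fingerprints in config.php)
// The hostname below ends in .posterous.com, so the wildcard site
// config file .posterous.com.txt will match it.
$options->fingerprints['<meta content="Posterous" name="generator" />'] =
    array('hostname' => 'fingerprint.posterous.com', 'head' => true);
```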
Location of site config files
In the site_config folder you will find two subfolders: 'standard' and 'custom'. FTR comes with a number of existing site patterns in the 'standard' folder. It's possible to change these, but we suggest users place their own site patterns in the custom folder to prevent future updates overwriting their site patterns.
Global site config files accept everything a regular site config file does, but their rules are applied to all sites, whether or not a site-specific config matches. There is a global site config in the site_config/standard/ folder. To add your own rules, create a global site config file named global.txt and place it inside site_config/custom/.
Site config merging
Full-Text RSS looks for site config files in the following order:
- URL hostname match or wildcard match in site_config/custom/
- URL hostname match or wildcard match in site_config/standard/
- fingerprint match (HTML fragment mapping to hostname) in site_config/custom/
- fingerprint match (HTML fragment mapping to hostname) in site_config/standard/
- global rules in site_config/custom/global.txt
- global rules in site_config/standard/global.txt
Any matching files are merged, with rules from files higher up in the list taking priority. If, however, one of these files contains autodetect_on_failure: no then the files beneath it will not be merged.
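The merge rule above can be sketched as follows (a simplification for illustration, assuming each parsed file is a flat dict of directive to value; real directives such as strip accept multiple values per file, which this ignores):

```python
def merge_site_configs(configs):
    """Merge matched site config files, given highest priority first.
    Directives already set by a higher-priority file win; if a file
    contains autodetect_on_failure: no, lower-priority files are not
    merged at all."""
    merged = {}
    for cfg in configs:
        for key, value in cfg.items():
            merged.setdefault(key, value)  # higher-priority value wins
        if cfg.get('autodetect_on_failure') == 'no':
            break  # do not merge any files beneath this one
    return merged
```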
To select elements for extraction or removal, we use XPath. If you're not familiar with the syntax, there's a nice tutorial here: XPath 1.0 tutorial.
The pattern format has been borrowed from Instapaper. You'll find plenty of examples by opening the text files inside the site_config/standard folder or by browsing our GitHub repository. We make use of the patterns provided by Instapaper and, in the same spirit, have made available our own additions for anyone to use.
We currently recognise the following directives:
title: [xpath]
The page title. XPaths evaluating to strings are also accepted. Multiple statements accepted; they will be evaluated in order until a result is found. If not specified, or if no match is found, the title will be auto-detected. The title is placed in the <title> element, inside the feed's item element.
Note: if the input URL points to a feed, the item's title in the feed will not be changed - we assume item titles in feeds are not truncated. Since Full-Text RSS 3.1 you can use the extracted title instead. To do so, pass &use_extracted_title in the querystring. To enable or disable this for all feeds, see the relevant option in the config file.
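For example, a request might look like this (the endpoint path and input URL here are illustrative - adjust them for your installation):

```
http://example.org/full-text-rss/makefulltextfeed.php?url=example.com%2Ffeed&use_extracted_title
```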
body: [xpath]
The body of the article. Multiple statements accepted; they will be evaluated in order until a result is found. If not specified, or if no match is found, the body will be auto-detected. The body is placed in the <description> element, inside the feed's item element.
body: //div[@id='body']
# also possible to specify multiple
# elements to be concatenated:
body: //div[@class='summary'] | //div[@id='body']
date: [xpath]
The publication date. XPaths evaluating to strings are also accepted. Multiple statements accepted; they will be evaluated in order until a result is found. The date is placed in the <pubDate> element, inside the feed's item element.
author: [xpath]
The author(s) of the piece. XPaths evaluating to strings are also accepted. Multiple statements accepted; they will be evaluated in order until a result is found. Each author is placed in a <dc:author> element, inside the feed's item element.
strip: [xpath]
Strip any matching elements and their children. Multiple statements accepted.
strip: //div[@class='hidden']
strip: //div[@id='content']//p[@class='promo']
strip_id_or_class: [string]
Strip any element whose @id or @class attribute contains this substring. Multiple statements accepted.
strip_id_or_class: hidden
strip_id_or_class: navigation
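Conceptually, each strip_id_or_class value behaves like a substring-matching XPath expression. The helper below (Python, for illustration only - this is not FTR's actual code) shows the equivalent XPath for a given value:

```python
def strip_id_or_class_xpath(substring):
    """Translate a strip_id_or_class value into the equivalent XPath:
    match any element whose @id or @class contains the substring."""
    return ("//*[contains(@id, '%s') or contains(@class, '%s')]"
            % (substring, substring))
```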
strip_image_src: [string]
Strip any <img> element whose @src attribute contains this substring. Multiple statements accepted.
strip_image_src: /advert/
strip_image_src: /tracker/
tidy: [yes|no] (default: yes)
Preprocess the HTML with Tidy. Tidy usually helps clean up the HTML for processing. It can, however, sometimes make matters worse. If it does, try setting this to no. (This setting may affect the final DOM tree produced, and with it your XPath expressions - so if your XPath expressions are failing to match the desired elements, try setting this to no to see if it helps.)
prune: [yes|no] (default: yes)
Strip elements within the body that do not resemble content elements. Sometimes this results in elements you'd like to keep being stripped. If that happens, set this to no.
autodetect_on_failure: [yes|no] (default: yes)
If set to no, we will not attempt to auto-detect the title or content block if the given title/body expressions fail to match.
single_page_link: [xpath]
Identifies a link element or URL pointing to the page holding the entire article. This is useful for sites which split their articles across multiple pages. Such sites tend to display the first page with links to the other pages at the bottom. Often there is also a link to a page which displays the entire article on one page (e.g. 'print view' or 'single page'). This should be an XPath expression identifying the link to that page. If present and a match is found, we will retrieve that page and apply the site config options to it.
single_page_link: //span[@class='singlePage']/a
single_page_link: //a[contains(@href, '/print/')]
single_page_link_in_feed: [xpath]
Same as single_page_link, but applied to the item description HTML taken from the feed. Please be aware that the same article URL may appear in a variety of feeds which do not always contain the same item description. If both single_page_link and single_page_link_in_feed appear in the site config, single_page_link_in_feed will be ignored.
next_page_link: [xpath]
Identifies a link element or URL pointing to the next page of a multi-page article. This is useful for sites which split their articles across multiple pages but do not offer a single-page view (if a single-page view is provided, please use single_page_link instead - it'll be much faster). If present and a match is found, we will retrieve that page and apply the site config options to it. If next_page_link also matches a link on this new page, the process will continue. Finally, the content of all pages will be joined together. (Introduced in version 3.0.)
next_page_link: //a[@id='next-page']
next_page_link: //a[contains(text(), 'Next page')]
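The follow-and-join behaviour can be sketched like this (Python, for illustration; the fetch, extract_body and next_page_url callables stand in for FTR's internal fetching and extraction - they are not part of the product):

```python
def fetch_multipage(url, fetch, extract_body, next_page_url, limit=10):
    """Follow next_page_link matches and join the page bodies (sketch).
    fetch(url) -> html, extract_body(html) -> str,
    next_page_url(html) -> next URL or None."""
    parts, seen = [], set()
    while url and url not in seen and len(parts) < limit:
        seen.add(url)  # guard against pages that link back to each other
        page = fetch(url)
        parts.append(extract_body(page))
        url = next_page_url(page)
    return ''.join(parts)
```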
replace_string([string to find]): [replacement string]
Simple find and replace to be performed on HTML before processing.
replace_string(<p />): <br /><br />
# alternatively in version 3.0 you can write the above as:
find_string: <p />
replace_string: <br /><br />
parser: [libxml|html5lib] (default: libxml)
By default we rely on PHP's fast libxml parser. For sites where this proves problematic, you can specify html5lib - a PHP implementation of an HTML parser based on the HTML5 parsing spec. (Introduced in version 3.0.)
Note: html5lib is a slower parser and still quite buggy.
test_url: [url]
A URL to use to test the patterns. In future, we'll have a tool which uses this to automatically check whether the patterns in the file are still valid. Must be the URL of an article from the site, not the site's front page. Multiple statements accepted.
Lines beginning with # are ignored.
# this is an advert
strip: //img[@class='ad']