Usage and Request Parameters

The simplest way to use Full-Text RSS is to use the form provided.

In the URL field, enter the URL of a partial feed or web page and click ‘Create Feed’. The resulting page should show you a newly generated feed with the full content. To use this feed in your application, copy the URL in the address bar. You can now use this new URL in place of the original partial feed URL.

If you're a developer and need to integrate Full-Text RSS in your application, we have a simple code example to give you an idea of how it can be used.

Form Fields

In addition to the URL, you can also specify a number of other options in the form:

Max Items

Set the maximum number of feed items we should process. The smaller the number, the faster the new feed is produced.

If your URL refers to a standard web page, this will have no effect: you will only get 1 item.

Link Handling

By default, links within the content are preserved. Change this field if you'd like links removed, or included as footnotes.

Extraction Failure Handling

If extraction fails for an item, Full-Text RSS (FTR) can either remove the item from the feed or keep it in.

Keeping the item will keep the title, URL and original description (if any) found in the feed. In addition, FTR inserts a message before the original description notifying you that extraction failed.

Include Excerpt

Check the box and we'll include a brief plain text excerpt from the extracted content in the output.

JSON Output

If selected, we'll output JSON instead of RSS.

Debug

Check the box to see what's happening behind the scenes.

Query String Parameters

Using the form is the simplest way to generate a Full-Text RSS URL, but you can also construct one yourself. The form fields above are turned into query string parameters when you submit the form. Let's look at those parameters here, and a few more that are not presented on the form.

These parameters are to be appended on to the base URL. The base URL is where you installed Full-Text RSS, e.g. http://example.org/full-text-rss/. Because this will differ from installation to installation, in this guide we'll simply refer to the endpoints using the filenames only: extract.php and makefulltextfeed.php.

This page describes the two endpoints offered by Full-Text RSS:

  1. Article Extraction
  2. Feed Conversion

If you've restricted access to Full-Text RSS, the final section on API keys will tell you how to pass your key along in the request.

NOTE ON ENCODING

When constructing URLs without using the form, make sure you URL-encode the parameter values (anything after the '=' and before the '&'). In PHP the function to use is urlencode(). If you're doing it by hand, you can paste the parameter values into a web-based encoder and click 'Encode' to get the encoded value.
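If you build request URLs from a script rather than by hand, the encoding step can be automated. The following is a minimal Python sketch, not part of Full-Text RSS itself. The base URL is the placeholder used in this guide, and the helper name build_url is hypothetical. Python's urlencode() encodes each value in the same spirit as PHP's urlencode() (spaces become '+').

```python
from urllib.parse import urlencode

# Placeholder base URL from this guide; replace with your own installation.
BASE_URL = "http://example.org/full-text-rss/"


def build_url(endpoint: str, params: dict) -> str:
    """Join the base URL, an endpoint filename and URL-encoded parameters."""
    return BASE_URL + endpoint + "?" + urlencode(params)


feed_url = build_url(
    "makefulltextfeed.php",
    {"url": "http://feeds.bbci.co.uk/news/rss.xml", "max": 5},
)
print(feed_url)
```

Only the parameter values are encoded; the '=' and '&' separators are added by urlencode() and must stay unencoded.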

1. Article Extraction

To extract article content from a web page and get a simple JSON response, use the following endpoint:

/extract.php?url=[url]

Request Parameters

When making HTTP requests, you can pass the following parameters to extract.php in a GET or POST request.

NOTE

The configuration file will ultimately determine if and how many of these parameters can be used.

Parameter | Value | Description

url | string (URL) | This is the only required parameter. It should be the URL to a standard HTML page. You can omit the 'http://' prefix if you like.

inputhtml | string (HTML) | If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded, and you will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).

content | 0, html5 (default), text | If set to 0, the extracted content will not be included in the output. If set to text, we'll convert the extracted HTML to plain text. By default, text output is wrapped at 70 characters. You can use text0 to disable wrapping, or set wrapping to a value between 20 and 1000 characters, e.g. text80 will wrap at 80 characters.

links | preserve (default), footnotes, remove | Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

images | 1 (default), 0 | If set to 0, images and associated elements (img, figure, figcaption) will be removed from the output.

xss | 0, 1 (default) | Use this to enable/disable XSS filtering. It is enabled by default, but if your application/framework/CMS already filters HTML for XSS vulnerabilities, you can disable XSS filtering here.

    If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: when enabled this will remove certain elements you may want to preserve, such as iframes.

lang | 0, 1 (default), 2, 3 | Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

    0: Ignore language.
    1: Use article metadata (e.g. HTML lang attribute). (Default value)
    2: As above, but guess the language if it's not specified.
    3: Always guess the language, whether it's specified or not.

debug | [no value], rawhtml, parsedhtml | If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

    If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

    If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

    Note: Full-Text RSS will stop execution after HTML output if either rawhtml or parsedhtml is passed. Otherwise it will continue showing debug output until the end.

parser | html5php, libxml | The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter. Note: if the Gumbo PHP extension is available, that will be used regardless of this parameter or site config file directives.

siteconfig | string | Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

proxy | 0, 1, string (proxy name) | This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).
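To illustrate how these parameters combine, here is a hedged Python sketch that checks a few values against the lists above before building an extract.php query string. The function names valid_content and extract_query are hypothetical, and the checks only mirror what this page documents; your configuration file may allow less.

```python
import re
from urllib.parse import urlencode

# Link handling values documented for the links parameter.
LINK_MODES = {"preserve", "footnotes", "remove"}


def valid_content(value: str) -> bool:
    """Accept 0, html5, text, text0, or textN with N between 20 and 1000."""
    if value in {"0", "html5", "text", "text0"}:
        return True
    match = re.fullmatch(r"text(\d+)", value)
    return match is not None and 20 <= int(match.group(1)) <= 1000


def extract_query(url: str, content: str = "html5",
                  links: str = "preserve", **extra) -> str:
    """Return a relative extract.php request with URL-encoded parameters."""
    if not valid_content(content):
        raise ValueError(f"unsupported content value: {content!r}")
    if links not in LINK_MODES:
        raise ValueError(f"unsupported links value: {links!r}")
    # Any further parameters (images, xss, lang, ...) are passed through as-is.
    params = {"url": url, "content": content, "links": links, **extra}
    return "extract.php?" + urlencode(params)


print(extract_query("http://example.org/article",
                    content="text80", links="footnotes", images=0))
```

Validating on the client side is optional; it simply catches typos such as text5 before a request is made.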

Response (example)

Simple JSON output containing extracted article title, content, and more. It was produced from the following input URL: http://www.truthdig.com/report/print/make_america_ungovernable_20170205

{
    "title": "Make America Ungovernable",
    "excerpt": "By Chris Hedges Mr. Fish / Truthdig Donald Trump’s regime is rapidl…",
    "date": "2017-02-05T23:34:57+00:00",
    "author": null,
    "language": "en",
    "url": "http://www.truthdig.com/report/item/make_america_ungovernable_20170…",
    "effective_url": "http://www.truthdig.com/report/print/make_america_ungovernable_2017…",
    "domain": "truthdig.com",
    "word_count": 2284,
    "og_url": "http://www.truthdig.com/report/print/make_america_ungovernable_2017…",
    "og_title": "Make America Ungovernable: Chris Hedges",
    "og_description": "The window to overthrow the Trump regime is rapidly closing. We mus…",
    "og_image": null,
    "og_type": "article",
    "twitter_card": null,
    "twitter_site": "@truthdig",
    "twitter_creator": "@truthdig",
    "twitter_image": null,
    "twitter_title": "Make America Ungovernable | Truthdig: Drilling Beneath the Headline…",
    "twitter_description": "The window to overthrow the Trump regime is rapidly closing. We mus…",
    "content": "<h4 class=\"date\">Posted on Feb 5</h4>…"
}

Note: For brevity the output above is truncated.
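A response like the one above can be read with any JSON parser. The Python sketch below is illustrative only: the field names are taken from the example response, the sample data is a shortened copy of it, and the helper name summarize is hypothetical. It treats every field as optional, since author is null in the example and content is omitted when content=0 is requested.

```python
import json

# Shortened copy of the example response above, re-serialized for the demo.
sample_response = json.dumps({
    "title": "Make America Ungovernable",
    "author": None,
    "language": "en",
    "domain": "truthdig.com",
    "word_count": 2284,
    "content": '<h4 class="date">Posted on Feb 5</h4>',
})


def summarize(response_text: str) -> dict:
    """Pick a few fields from an extract.php JSON response, with fallbacks."""
    data = json.loads(response_text)
    return {
        "title": data.get("title"),
        "author": data.get("author") or "unknown",  # null in the example
        "language": data.get("language"),
        "word_count": data.get("word_count", 0),
        "has_content": bool(data.get("content")),   # absent if content=0
    }


print(summarize(sample_response))
```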

2. Feed Conversion

To transform a partial feed to a full-text feed, pass the URL (encoded) in the querystring to the following URL:

makefulltextfeed.php?url=[url]

All the parameters in the form at the top of this page can be passed in this way. Examine the URL in the address bar after you click 'Create Feed' to see the values.

Request Parameters

When making HTTP requests, you can pass the following parameters to makefulltextfeed.php in a GET request. Most of these parameters have default values suitable for news enthusiasts who simply want to subscribe to a full-text feed in their news reading application. If that's what you're doing, you can safely ignore the details here. For developers, or others who need more control over the output produced by Full-Text RSS, this section should give you an idea of what you can do.

We do not provide form fields for all of these parameters, but you can modify the URL in your browser after clicking 'Create Feed' to use them.

NOTE

The configuration file will ultimately determine if and how many of these parameters can be used.

Parameter | Value | Description

url | string (URL) | This is the only required parameter. It should be the URL to a partial feed or a standard HTML page. You can omit the 'http://' prefix if you like.

format | rss (default), json | The default Full-Text RSS output is RSS. The only other valid output format is JSON. To get JSON output, pass format=json in the querystring. Exclude it from the URL (or set it to 'rss') if you'd like RSS.

summary | 0 (default), 1 | If set to 1, an excerpt will be included for each item in the output.

content | 0, 1 (default), html5 | If set to 0, the extracted content will not be included in the output. If set to html5, we'll output HTML5.

links | preserve (default), footnotes, remove | Links can either be preserved, made into footnotes, or removed. None of these options affect the link text, only the hyperlink itself.

images | 1 (default), 0 | If set to 0, images and associated elements (img, figure, figcaption) will be removed from the output.

exc | 0 (default), 1 | If Full-Text RSS fails to extract the article body, the generated feed item will include a message saying extraction failed followed by the original item description (if present in the original feed). You can ask Full-Text RSS to remove such items from the generated feed completely by passing 1 in this parameter.

accept | auto (default), feed, html | Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or regular HTML page. It's a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you've been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

    Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server's response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.

xss | 0 (default), 1 | Use this to enable XSS filtering. We have not enabled this by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it's good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMSs which display feed content - the content should be treated like any other user-submitted content.

    If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side - although there's client-side XSS filtering available too.

    If enabled, we'll pass retrieved HTML content through htmLawed (safe flag on and style attributes denied). Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

callback | string | This is for JSONP use. If you're requesting JSON output, you can also specify a callback function (JavaScript client-side function) to receive the Full-Text RSS JSON output.

lang | 0, 1 (default), 2, 3 | Language detection. If you'd like Full-Text RSS to find the language of the articles it processes, you can use one of the following values:

    0: Ignore language.
    1: Use article metadata (e.g. HTML lang attribute) or feed metadata. (Default value)
    2: As above, but guess the language if it's not specified.
    3: Always guess the language, whether it's specified or not.

    If language detection is enabled and a match is found, the language code will be returned in the <dc:language> element inside the <item> element.

debug | [no value], rawhtml, parsedhtml | If this parameter is present, Full-Text RSS will output the steps it is taking behind the scenes to help you debug problems.

    If the parameter value is rawhtml, Full-Text RSS will output the HTTP response (headers and body) of the first response after redirects.

    If the parameter value is parsedhtml, Full-Text RSS will output the reconstructed HTML (after its own parsing). This version is what the extraction rules are applied to, and it may differ from the original (rawhtml) output. If your extraction rules are not picking out any elements, this will likely help identify the problem.

    Note: Full-Text RSS will stop execution after HTML output if either rawhtml or parsedhtml is passed. Otherwise it will continue showing debug output until the end.

parser | html5php, libxml | The default parser is libxml as it's the fastest. HTML5-PHP is an HTML5 parser implemented in PHP. It's slower than libxml, but can often produce better results. You can request HTML5-PHP be used as the parser in a site-specific config file (to ensure it gets used for all URLs for that site), or explicitly via this request parameter. Note: if the Gumbo PHP extension is available, that will be used regardless of this parameter or site config file directives.

siteconfig | string | Site-specific extraction rules are usually stored in text files in the site_config folder. You can also submit extraction rules directly in your request using this parameter.

proxy | 0, 1, string (proxy name) | This parameter has no effect if proxy servers have not been entered in the config file. If they have been entered and enabled, you can pass the following values: 0 to disable proxy use (uses direct connection), 1 for default proxy behaviour (whatever is set in the config), or a string to identify a specific proxy server (has to match the name given to the proxy in the config file).

Feed-only parameters

These parameters only apply to web feeds. They have no effect when the input URL points to a web page.

Parameter | Value | Description

use_extracted_title | 0 (default), 1 | By default, if the input URL points to a feed, item titles in the generated feed will not be changed - we assume item titles in feeds are not truncated. If you'd like them to be replaced with titles Full-Text RSS extracts, use this parameter in the request. To enable/disable this for all feeds, see the config file - specifically $options->favour_feed_titles.

use_effective_url | 0 (default), 1 | When we extract content for feed items, we often end up at a different URL than the one in the original feed. This is often a result of URL shorteners or tracking services being used by the feed publisher. We include the final (effective) URL we reached to get the content inside the dc:identifier field. If you enable this, we'll also use this URL in place of the original item URL in the new feed we produce. To enable/disable this for all feeds, see the config file - specifically $options->favour_effective_url.

max | number | The maximum number of feed items to process. (The default and upper limit will be found in the configuration file.)
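As a rough illustration of how a feed conversion URL might be assembled, the Python sketch below builds a makefulltextfeed.php query string and leaves out any option that already equals its documented default, which keeps the resulting feed URL short. The DEFAULTS table is copied from the parameter descriptions above, and feed_query is a hypothetical helper name; the defaults in your own configuration file may differ.

```python
from urllib.parse import urlencode

# Request defaults as documented on this page (assumed, not read from config).
DEFAULTS = {
    "format": "rss", "summary": "0", "content": "1", "links": "preserve",
    "images": "1", "exc": "0", "accept": "auto", "xss": "0", "lang": "1",
    "use_extracted_title": "0", "use_effective_url": "0",
}


def feed_query(url: str, **options) -> str:
    """Return a relative makefulltextfeed.php request, skipping defaults."""
    params = {"url": url}
    for name, value in options.items():
        value = str(value)
        if DEFAULTS.get(name) != value:  # keep only non-default settings
            params[name] = value
    return "makefulltextfeed.php?" + urlencode(params)


print(feed_query("http://feeds.bbci.co.uk/news/rss.xml",
                 format="json", max=3, exc=1, links="preserve", accept="feed"))
```

In this call links=preserve is dropped because it matches the default, while max has no documented request default and is always kept.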

Response (example)

JSON output produced for the BBC feed http://feeds.bbci.co.uk/news/rss.xml. You can also request regular RSS.

{
    "rss": {
        "@attributes": {
            "version": "2.0"
        },
        "channel": {
            "title": "BBC News - Home",
            "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
            "description": "The latest stories from the Home section of the BBC News web site.",
            "ttl": 15,
            "image": {
                "title": "BBC News - Home",
                "link": "http://www.bbc.co.uk/news/#sa-ns_mchannel=rss&amp;ns_source=PublicR…",
                "url": "http://news.bbcimg.co.uk/nol/shared/img/bbc_news_120x60.gif"
            },
            "item": [
                {
                    "title": "Russia's Putin visits annexed Crimea",
                    "link": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-europe-27344029#sa-ns_mchannel=rss&…",
                    "description": "President Putin: \"[Crimeans have] proved their loyalty to a histor…",
                    "content_encoded": "<!-- Adding hypertab -->&#13;\n&#13;\n&#13;\n<!-- end of hypertab -…",
                    "pubDate": "Fri, 09 May 2014 15:02:04 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-europe-27344029",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751301_ycst2i…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74751000/jpg/_74751302_ycst2i…"
                            }
                        }
                    ]
                },
                {
                    "title": "Harris 'assaulted daughter's friend'",
                    "link": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&ns_source=…",
                    "guid": "http://www.bbc.co.uk/news/uk-27340134#sa-ns_mchannel=rss&amp;ns_sou…",
                    "description": "Rolf Harris arrives at court flanked by his wife and daughter Rolf …",
                    "content_encoded": "<!-- Embedding the video player -->&#13;\n<!-- This is the embedd…",
                    "pubDate": "Fri, 09 May 2014 15:21:52 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/uk-27340134",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740642_hi0221…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74740000/jpg/_74740643_hi0221…"
                            }
                        }
                    ]
                },
                {
                    "title": "Nigeria 'ignored' school warning",
                    "link": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "guid": "http://www.bbc.co.uk/news/world-africa-27344863#sa-ns_mchannel=rss&…",
                    "description": "Nigeria's military had advance warning of the attack on a school at…",
                    "content_encoded": "<div class=\"caption full-width\">&#13;\n <img src=\"http://news.b…",
                    "pubDate": "Fri, 09 May 2014 15:48:34 +0000",
                    "dc_language": "en-gb",
                    "dc_format": "text/html",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-africa-27344863",
                    "media_thumbnail": [
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749855_747495…"
                            }
                        },
                        {
                            "@attributes": {
                                "url": "http://news.bbcimg.co.uk/media/images/74749000/jpg/_74749856_747495…"
                            }
                        }
                    ]
                }
            ]
        }
    }
}

Note: For brevity the output above is truncated.
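The JSON feed nests its items under rss, channel and item, as the example above shows. The following Python sketch walks that structure; it is a hedged illustration only. The field names come from the example response, the sample data is a cut-down copy of it, and feed_items is a hypothetical helper. Wrapping a lone item object in a list is a defensive assumption, not documented behaviour.

```python
import json

# Cut-down copy of the example feed above (one item, a few fields).
sample_feed = json.dumps({
    "rss": {
        "@attributes": {"version": "2.0"},
        "channel": {
            "title": "BBC News - Home",
            "item": [
                {
                    "title": "Russia's Putin visits annexed Crimea",
                    "dc_language": "en-gb",
                    "dc_identifier": "http://www.bbc.co.uk/news/world-europe-27344029",
                }
            ],
        },
    }
})


def feed_items(feed_text: str) -> list:
    """Return title, URL and language for each item in a JSON feed."""
    channel = json.loads(feed_text)["rss"]["channel"]
    items = channel.get("item", [])
    if isinstance(items, dict):  # assumption: tolerate a single bare item
        items = [items]
    return [
        {
            "title": item.get("title"),
            # dc_identifier holds the effective URL; fall back to link.
            "url": item.get("dc_identifier") or item.get("link"),
            "language": item.get("dc_language"),
        }
        for item in items
    ]


for entry in feed_items(sample_feed):
    print(entry["title"], entry["url"])
```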

API Keys

To restrict access to your copy of Full-Text RSS, you can specify API keys in the config file.

NOTE

Feeds produced by Full-Text RSS are intended to be publicly accessible to work with feed readers. As such, the API key should not appear in the final URL for feeds.

Parameter | Value | Description

key | string or number | This parameter has two functions.

    If you're calling Full-Text RSS programmatically, it's better to use this parameter to provide the API key index number together with the hash parameter (see below) so that the actual API key does not get sent in the HTTP request.

    If you pass the actual API key in this parameter, the hash parameter is not required. If you pass the actual API key, Full-Text RSS will find the index number and generate the hash value automatically and redirect to a new URL to hide the API key. If you'd like to link to a generated feed publicly while protecting your API key, make sure you copy and paste the URL that results after the redirect.

    If you've configured Full-Text RSS to require a key, an invalid key will result in an error message.

hash | string | A SHA-1 hash value of the API key (actual key, not index number) and the URL supplied in the url parameter, concatenated. This parameter must be passed along with the API key's index number using the key parameter (see above). In PHP, for example: $hash = sha1($api_key.$url);

key_redirect | 0 or 1 (default) | When supplying the API key with the key parameter, Full-Text RSS will generate a new URL and issue an HTTP redirect to the new URL to hide the API key (see description above). If you'd like to avoid an HTTP redirect, you can pass 0 in this parameter. We do not recommend you subscribe to feeds generated in this way.
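For callers not using PHP, the hash can be computed with any SHA-1 implementation. Below is a minimal Python sketch mirroring the PHP one-liner above. The key value, index number and helper name signed_query are made up for illustration, and the sketch assumes the hash is taken over the plain (not URL-encoded) url value, since that is what the PHP example concatenates.

```python
import hashlib
from urllib.parse import urlencode


def signed_query(endpoint: str, url: str, key_index: int, api_key: str) -> str:
    """Build a request that sends the key index and hash, not the key itself."""
    # Equivalent of PHP's sha1($api_key.$url): 40 lowercase hex characters.
    digest = hashlib.sha1((api_key + url).encode("utf-8")).hexdigest()
    return endpoint + "?" + urlencode(
        {"url": url, "key": key_index, "hash": digest}
    )


# Hypothetical key and index; use the values from your own config file.
request = signed_query("makefulltextfeed.php",
                       "http://example.org/feed", 1, "secret")
print(request)
```

Because only the index number and the hash travel in the URL, the resulting feed address can be shared without exposing the key.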