Feature request

Most of the time, Full-Text RSS does a great job of identifying the content of a webpage without the need for a custom rule. Sometimes, however, it misses an element, and a custom rule then has to be written from scratch.

It would be great if one could specify in a custom rule to use the automated content but add (or subtract) certain elements to/from it to avoid building the rule from scratch.

I’m not sure if this is technically possible, but it would be good to see it included if it is.

Thanks for considering this!

Hi Anthony,

Thanks for the feature request.

Firstly, the subtracting scenario is already possible.

If you create a site config for a site where automatic extraction is already good but there are elements you’d like removed from the output, you don’t have to tell it where the body is; you can simply enter strip rules. For example:

strip: //div[contains(@class, 'to-be-removed')]

or

strip_id_or_class: to-be-removed

Full-Text RSS will then do its auto-detection and apply your strip rules on the result.
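Put together, a strip-only site config might look something like the sketch below. The class names and the test URL are placeholders rather than rules for a real site; the point is simply that no body rule is present, so auto-detection still chooses the content block.

```
# Hypothetical site config: there is no body rule, so auto-detection is used.
# The strip rules are then applied to whatever was detected.
strip: //div[contains(@class, 'to-be-removed')]
strip_id_or_class: another-unwanted-block

# Placeholder article URL for trying the config out
test_url: http://example.org/some-article
```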

Inserting elements is not currently possible, but it’s something we’ve been thinking about. The common scenarios where you would need to do this are:

  • The detected body element is actually only one part of the article. Here you currently have to write a body rule and specify either the parent element (containing all body elements) or more than one element. For example:

body: //div[contains(@class, 'first-part')] | //div[contains(@class, 'rest-of-article')]

or

body: //div[contains(@class, 'first-part') or contains(@class, 'rest-of-article')]

  • The detected body element contains everything, but some elements are being stripped out. This can often be resolved by disabling pruning for the site:

prune: no
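As a rough sketch of both workarounds in one site config, assuming a hypothetical site whose article is split across two divs (the class names and URL are placeholders):

```
# Hypothetical example: select both parts of the article with an XPath union.
body: //div[contains(@class, 'first-part')] | //div[contains(@class, 'rest-of-article')]

# Keep elements that the pruning step would otherwise remove.
prune: no

# Unwanted elements can still be stripped from the selected body.
strip_id_or_class: to-be-removed

# Placeholder article URL for trying the config out
test_url: http://example.org/some-article
```

With an explicit body rule like this, auto-detection is no longer relied on for the body, so the rule has to cover every part of the article you want returned.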

Note that sometimes we cannot preserve elements users want because they rely on JavaScript to load. Our site config creator loads the article with JS disabled, so you can see what we can realistically return: http://siteconfig.fivefilters.org

Hope that’s some help.

Best, Keyvan

Thanks for the reply. After I hit send, I realised that it was already possible to subtract elements from the auto-detected content.

The example you cite is exactly the scenario I was talking about. On some sites, their HTML is hideous and badly formed so extracting the right content requires quite a complex rule (I quite enjoy the challenge!). Often the auto detect is so close to getting it right that just adding one more element will crack it. Would be great to get this kind of functionality.

I’ll give the prune function a go next time.

Anthony L

Thanks Anthony, will think about this in future updates. :)