Find & replace

I need to find and replace that tag:

 

And here is my config file:

# Generated by FiveFilters.org's web-based selection tool

# Place this file inside your site_config/custom/ folder

# Source: http://siteconfig.fivefilters.org/grab.php?url=http%3A%2F%2Fauto.ahram.org.eg%2FNews%2F12686.aspx

body: //div[contains(concat(' ',normalize-space(@class),' '),' img_Block ')]//img | //p

strip: //div[contains(concat(' ',normalize-space(@class),' '),' columns ') and (contains(concat(' ',normalize-space(@class),' '),' padding ')) and (contains(concat(' ',normalize-space(@class),' '),' m_m '))]//div[contains(concat(' ',normalize-space(@class),' '),' adv ') and (contains(concat(' ',normalize-space(@class),' '),' hide-for-small '))]//div
strip: //a

# Parser
tidy: yes
parser: html5lib

# Clean HTML after processing

prune: no

test_url: http://auto.ahram.org.eg/News/12686.aspx


Hi Mohammed,

You can use the following to find and replace:

find_string:

 


replace_string:

This would be entered in the txt file that we would create for that specific domain, correct?

Thanks!

Jojo Valdez

Yes, that’s correct. So if you wanted this replacement to happen on all requests to example.org, you would create or edit the site config file example.org.txt and add those lines to it.

Exactly where do we put find_string and replace_string in the txt file? Do they have to be inside certain tags, or can they go anywhere in that text file? Can we also put HTML code in the find and replace?

Thanks!

Hi there, they can be anywhere in the text file. But find_string should precede replace_string as they’re treated as a set. You can see examples here https://github.com/fivefilters/ftr-site-config/search?utf8=✓&q=find_string
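As a rough illustration of the pairing rule, here is a minimal Python sketch. It is not Full-Text RSS's actual code: the function names and the sample config contents are hypothetical, and it assumes each find_string is matched as a plain literal and paired with the next replace_string that follows it.

```python
def parse_replacements(config_text):
    """Pair each find_string with the replace_string that follows it.

    Hypothetical helper: other directives (strip, body, ...) are ignored.
    """
    pairs = []
    pending_find = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "directive: value" line
        key, value = key.strip(), value.strip()
        if key == "find_string":
            pending_find = value
        elif key == "replace_string" and pending_find is not None:
            pairs.append((pending_find, value))
            pending_find = None
    return pairs


def apply_replacements(html, pairs):
    """Apply each (find, replace) pair in order as a literal substring replace."""
    for find, replace in pairs:
        if find:  # an empty search string would match everywhere
            html = html.replace(find, replace)
    return html


# Hypothetical contents of example.org.txt
config = """
strip: //a
find_string: <noscript>
replace_string: <div>
find_string: TEST
replace_string:
"""

pairs = parse_replacements(config)
result = apply_replacements("<noscript>TEST image</noscript>", pairs)
print(pairs)
print(result)
```

In this sketch the position of the pair relative to other directives does not matter; only the find_string-then-replace_string order within each pair does.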

Can I use this:

find_string: <strong>RELATED .*? </strong>
replace_string:

Hi there, we do not currently support wildcards or regex in the find_string/replace_string directives.

But if the intention is to remove the element, you can achieve something similar with our strip directive and XPath:

strip: //strong[contains(., 'RELATED')]
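To see why the find_string attempt above cannot match, here is a small Python sketch contrasting a literal replacement with a regex one. It assumes find_string behaves like a plain substring replace (consistent with the note that wildcards and regex are not supported); the sample HTML is made up for illustration.

```python
import re

# Made-up sample markup for illustration only.
html = "<p><strong>RELATED stories </strong></p>"
pattern = "<strong>RELATED .*? </strong>"

# Literal replacement: searches for the characters ".*?" themselves,
# so nothing in the sample matches and the HTML is returned unchanged.
literal_result = html.replace(pattern, "")

# Regex replacement, shown only for contrast: this is the behaviour
# that find_string does NOT provide.
regex_result = re.sub(pattern, "", html)

print(literal_result)
print(regex_result)
```

This is why the strip directive with an XPath expression is the practical route for removing an element whose text varies.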

What if your HTML actually looks like this:

<div id='article'>
 <p>Para 1</p>
 <p>Para 2</p>
 <p>Para 3</p>
 <p><strong>RELATED:</strong></p>
 <ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
 </ul>
</div>

If you remove only the <strong> element, you’ll still be left with Item 1, 2 and 3. But XPath can still help: use two strip directives, the first to remove the <ul> element that follows the “RELATED:” paragraph and the second to remove that paragraph itself. The order matters, because the first expression relies on the paragraph still being present:

strip: //div[@id='article']/p[contains(.,'RELATED:')]/following-sibling::ul[1]
strip: //div[@id='article']/p[contains(.,'RELATED:')]

You can experiment with XPath and the example above here:

http://www.xpathtester.com/xpath/60f55de5a8bab796d2bb4e54a4e99356

If you want to be more precise, you can edit the XPath to make sure it targets <p> elements which contain “RELATED” inside a <strong> element:

strip: //div[@id='article']/p[strong[contains(.,'RELATED')]]

Hi! If the RSS feed is on a subdomain, do I create a custom txt file named after the subdomain URL? The original site resides on the main domain, but the RSS feed resides on a subdomain. Because it’s a new custom txt file, it is blank. So:

find_string: TEST
replace_string:

If it finds “TEST” anywhere in that feed, including the feed title, it should delete it, correct?

Thanks!

I’m having an issue.

I am grabbing the rss feed:
https://rss.medicalnewstoday.com/featurednews.xml

But after rendering it with Full-Text RSS, all the titles have “Medical News Today:” before the actual title.
I want to delete the “Medical News Today” but keep the original title. I tried find and replace, and even strip, but it’s not working for me. Also, because the RSS feed resides on a subdomain, do I need to name the custom site config after the subdomain?

Patiently waiting
Thanks!

Hi there,

If the RSS feed is on a subdomain, do I create a custom txt file named after the subdomain URL? The original site resides on the main domain, but the RSS feed resides on a subdomain.

No, you should name the custom site config file after the domain that is actually used to retrieve the web page. So for example, if feeds.example.org/rss is the feed URL and it contains items which load from example.org, your site config file should be example.org.txt and not feeds.example.org.txt.

This is also true if redirects are involved for whatever reason, e.g. a URL shortener or an analytics service at a different domain. So if feeds.example.org/rss has items with URLs of the form https://bit.ly/xyz which redirect to example.org/..., Full-Text RSS will follow the redirect(s) and look for a site config file matching the final URL (in this example, example.org.txt).
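The naming rule can be sketched in a few lines of Python. This is a simplified illustration, not Full-Text RSS's actual lookup: the function name and the item URL are hypothetical, the caller is assumed to pass the final URL after any redirects, and the real lookup may apply additional matching rules not covered here.

```python
from urllib.parse import urlparse


def site_config_filename(final_url: str) -> str:
    """Return '<hostname>.txt' for the URL the article is finally loaded from.

    Hypothetical helper: it does not follow redirects itself.
    """
    host = urlparse(final_url).hostname
    if not host:
        raise ValueError(f"no hostname in URL: {final_url!r}")
    return f"{host}.txt"


feed_url = "https://feeds.example.org/rss"
# Hypothetical item URL, i.e. where an article in the feed ends up loading from.
item_final_url = "https://example.org/news/story.html"

# The config file is named after the item's host, not the feed's host.
config_file = site_config_filename(item_final_url)
print(config_file)
```

Passing the feed URL instead would yield feeds.example.org.txt, which is the file name the reply above says not to use.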

I am grabbing the rss feed:
https://rss.medicalnewstoday.com/featurednews.xml

But after rendering it with Full-Text RSS, all the titles have “Medical News Today:” before the actual title. I want to delete the “Medical News Today” but keep the original title.

When given a feed, Full-Text RSS prioritises the item titles in the feed over the title extracted from the item URL. But in situations like this, you can tell Full-Text RSS to use the titles extracted from the articles instead of the titles found in the feed. To do that you’d pass:

&use_extracted_title=1

In this case, the extracted titles don’t contain the text you’re trying to remove (it only appears in the feed), so there’s nothing else to do in the site config file. The final URL simply becomes something like this:

http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Frss.medicalnewstoday.com%2Ffeaturednews.xml&use_extracted_title=1
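If you build these request URLs in a script, the feed URL needs to be percent-encoded as in the example above. A minimal Python sketch, where the function name is hypothetical and the endpoint and the use_extracted_title parameter are taken from the example URL in this thread:

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above; adjust for your own install.
ENDPOINT = "http://ftr.fivefilters.org/makefulltextfeed.php"


def full_text_feed_url(feed_url: str, use_extracted_title: bool = False) -> str:
    """Build a makefulltextfeed.php URL with the feed URL percent-encoded."""
    params = {"url": feed_url}
    if use_extracted_title:
        params["use_extracted_title"] = "1"
    return f"{ENDPOINT}?{urlencode(params)}"


url = full_text_feed_url(
    "https://rss.medicalnewstoday.com/featurednews.xml",
    use_extracted_title=True,
)
print(url)
```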

Hope that’s some help.

So I want to get rid of the content of this div in the XML feed:

div class="one_half" readability="12"

I put this in the custom text file:

strip: //div[@class='one_half']

How long does this take to come into effect? When I put the feed into Full-Text RSS, nothing happens.

Thanks!

Can you please start a new thread for this, and include the name you gave the site config file and the URL you’re giving Full-Text RSS?