Best strategy for processing multiple feeds

desk-user · December 30, 2014, 5:05pm

I submitted a similar question at the end of an older thread (Turning APC off), but wasn’t sure if that thread is still actively monitored. My apologies for the duplication.

I am using FTR to monitor several client feeds (300+) and am trying to figure out the most efficient way to extract NEW content. I currently check the feeds every few hours for new articles. When a new article is found, I store it in our db, and from that point forward I only extract articles newer than the last one stored. Most of the feeds do not publish every day, and if they do they rarely publish more than one article a day. Still, because I can’t know if, when or how often they publish, I am checking regularly. Also, in order to do the date compare, I have to pull back whatever is in the feed, check each article’s pub date, stop once I reach an older date, and move on to the next feed.

Currently, I am looping through my list of hundreds of feeds and calling makefulltextfeed.php for each feed: array('format'=>'json','max'=>100,'summary'=>1,'url'=>$this->feed_url)

I am able to do this for about 70 at a time before I get a server error (500), which I am presently trying to debug.

I’m wondering if there is a more efficient way to do this. From another thread (Feature Request: Support array of URLs on the extract.php endpoint) I see I can combine URLs in a single request to makefulltextfeed.php. I’m wondering if this strategy supports hundreds of URLs concatenated together, and if this would be more efficient. I could break it into fewer URLs per request, if that would help.

Also, wondering if there is some more efficient way in which I can accomplish the date compare, to just check for new content.

Your assistance is immensely appreciated. This is the last piece of the puzzle for publishing this service.

This tool has been incredibly useful!

fivefilters · January 1, 2015, 6:03pm

Hi Marc,

Thanks for the question.

My suggestion for making it more efficient would be to monitor the feeds (using original feed URLs) and then when you notice new items, pull in those new items using Full-Text RSS. This will require more code on your part. But if you do it, you’ll avoid having Full-Text RSS pull in content that you have previously stored in an earlier request.
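As a rough illustration of that monitoring step, here is a minimal Python sketch that reads an original RSS 2.0 feed and returns only the items published after the last one you stored. The function name `find_new_items` is made up for this example, and it assumes each item has `link` and `pubDate` elements and that `last_seen` is a timezone-aware datetime. Atom feeds would need different element names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def find_new_items(feed_xml, last_seen):
    """Return (link, published) pairs for RSS items newer than last_seen.

    Hypothetical helper: assumes RSS 2.0 <item> elements with <link> and
    <pubDate>, and a timezone-aware last_seen.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub = item.findtext("pubDate")
        if not link or not pub:
            continue  # skip items we cannot date or address
        try:
            published = parsedate_to_datetime(pub.strip())
        except (TypeError, ValueError):
            continue  # unparseable date, ignore this item
        if published.tzinfo is None:
            # Assumption: treat dates without an offset as UTC.
            published = published.replace(tzinfo=timezone.utc)
        if published > last_seen:
            fresh.append((link.strip(), published))
    # Oldest first, so items can be stored in publication order.
    return sorted(fresh, key=lambda pair: pair[1])
```

Only the links returned here would then be passed to Full-Text RSS, so feeds with nothing new cost you one small fetch and no extraction.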

If you do not want to do this, I would reduce the 'max'=>100 parameter value to something much smaller. Otherwise, for large feeds, which you’re monitoring regularly anyway, you’ll be asking Full-Text RSS to pull in up to 100 items per request. It’s unlikely you’re going to encounter 100 new items published for a feed between your requests, so reduce this to 5 and you’ll be asking Full-Text RSS to do a lot less processing, and it should return much quicker.
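For example, a request with the smaller limit might be built like this in Python. The parameter names (`url`, `format`, `max`, `summary`) are the ones from your own call; the base URL and the helper name `build_ftr_url` are placeholders for wherever your copy is installed.

```python
from urllib.parse import urlencode


def build_ftr_url(base, feed_url, max_items=5):
    """Build a makefulltextfeed.php request URL with a small 'max' value.

    Hypothetical helper: 'base' is wherever your Full-Text RSS copy lives.
    """
    params = {
        "url": feed_url,
        "format": "json",
        "max": max_items,   # 5 instead of 100, as suggested above
        "summary": 1,
    }
    return base.rstrip("/") + "/makefulltextfeed.php?" + urlencode(params)


# Placeholder addresses, not real endpoints.
request_url = build_ftr_url("http://localhost/full-text-rss", "http://example.org/feed")
```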

Full-Text RSS has no knowledge of prior requests for a URL at the moment, other than basic caching. So it doesn’t know if between requests for a feed, there were new items published or not. Your own application will have to keep track of that (most feed readers do this already). In a future version we’ll probably have caching for individual feed items. So requesting a feed that’s already had items pulled in from an earlier request will result in those items being loaded from disk rather than remotely. But that’s not here yet.
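If it helps, the bookkeeping on your side can be quite small. Below is a minimal Python sketch, assuming a SQLite table that stores one "last seen" timestamp per feed. The table name `feed_state` and the helper names are invented for this example, and your existing db would work just as well.

```python
import sqlite3
from datetime import datetime


def open_state(path=":memory:"):
    """Open (or create) the hypothetical per-feed state table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feed_state ("
        "feed_url TEXT PRIMARY KEY, last_seen TEXT NOT NULL)"
    )
    return conn


def get_last_seen(conn, feed_url):
    """Return the stored datetime for a feed, or None if never seen."""
    row = conn.execute(
        "SELECT last_seen FROM feed_state WHERE feed_url = ?", (feed_url,)
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def set_last_seen(conn, feed_url, when):
    """Record the publication date of the newest stored item for a feed."""
    conn.execute(
        "INSERT OR REPLACE INTO feed_state (feed_url, last_seen) VALUES (?, ?)",
        (feed_url, when.isoformat()),
    )
    conn.commit()
```

On each pass you would read the stored date for a feed, keep only items newer than it, and write back the newest date once those items are saved.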

Hope this is some help.