NY Times wants log-in

Hi, I’m using self-hosted v3.3 with the latest site_config from GitHub. When I generate a feed from:

http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml

(requesting ~10 articles or more), after the first ~5 articles the full text feed contains mostly links to the NY Times log-in page.

Is there a workaround for this, perhaps a way to configure full-text-rss to pass a cookie as a credential?

Thanks in advance.

PS: Thanks, Full-Text RSS is wonderful :slight_smile:

Mark

Hi Mark, thanks for reporting this. We’re going to look into it and see if there’s anything we can do. If the first 5 articles are always retrieved okay, a short-term solution might be to limit results to 5 items in Full-Text RSS (e.g. using the max parameter: &max=5).
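For reference, a capped request URL could be put together along these lines. This is only a rough sketch, written in Python rather than PHP purely for illustration: the host ftr.example.org is a placeholder for your own install, and the `url` parameter name is an assumption about how the feed address is passed. Only `max` and makefulltextfeed.php are taken from this thread.

```python
from urllib.parse import urlencode

FEED_URL = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"


def build_ftr_url(base, feed_url, max_items):
    """Build a makefulltextfeed.php URL limited to max_items articles.

    `base` is the address of a self-hosted install (placeholder below);
    the `url` parameter name is assumed, `max` is the one mentioned above.
    """
    query = urlencode({"url": feed_url, "max": max_items})
    return f"{base}/makefulltextfeed.php?{query}"


# ftr.example.org is a hypothetical host, not a real install
print(build_ftr_url("http://ftr.example.org", FEED_URL, 5))
```

The feed address is percent-encoded so that its own query string, if any, does not get mixed up with the Full-Text RSS parameters.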

Thanks. This is what I’m doing for now but sadly it means I’m missing quite a lot of articles due to a slow Feedly refresh interval that I cannot control. Thanks in advance!

Mark

Did anyone think of a solution for this, please? I’m now running 3.4. Would the advert detection feature help with this?

Mark

Ah, this is interesting. Digging into the problem, the limit of 5 corresponds to the default number of parallel fetches that HumbleHttpAgent uses. Increasing the default to 25 seems to work around the problem for this site, although I don’t know what else that will break.

Strangely, I haven’t managed to reproduce the symptoms by manually fetching items from the feed. I’d love to understand exactly what’s going on here.

Mark

Hi Mark, thanks for the interesting update.

We haven’t had time to investigate this yet, but from what you describe, my guess is that the reason this workaround works has to do with cookies.

HumbleHttpAgent looks for cookies in HTTP responses and sends them on subsequent requests. So these cookies might be what NYTimes uses to block subsequent requests. There are a few caveats, though:

  • Cookies are only stored in memory while processing a single request to makefulltextfeed.php; they are not reused when you issue another call to makefulltextfeed.php.
  • By default, HumbleHttpAgent sends requests for the first 5 URLs in a feed in parallel, so those first 5 requests contain no cookie data (no response has been received yet).
  • If you’re processing 10 feed items, the second set of 5 requests could contain cookie data if the first set resulted in cookies.
  • If the feed contains 10 or 25 items and you increase parallel requests to 10 or 25, no cookie data is sent, because the whole lot goes out in parallel (leaving no time for response parsing between requests).
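The timing described in these points can be illustrated with a toy model. This is a sketch in Python, not the actual HumbleHttpAgent code: `simulate` is a made-up helper, and it assumes the site sets a cookie on every response and that all requests in a batch go out before any response is parsed.

```python
def simulate(num_items, parallel):
    """Return, for each feed item, whether its request would carry a cookie."""
    jar_has_cookie = False  # shared cookie jar, empty at the start of the run
    sent_with_cookie = []
    for start in range(0, num_items, parallel):
        batch = range(start, min(start + parallel, num_items))
        # every request in the batch is sent before any response is parsed
        for _ in batch:
            sent_with_cookie.append(jar_has_cookie)
        # assumption: the responses to this batch set a cookie in the jar
        jar_has_cookie = True
    return sent_with_cookie


print(simulate(10, 5))   # first 5 False, last 5 True
print(simulate(10, 25))  # all 10 False
```

With 5 parallel requests, only items 6 to 10 are sent with a cookie, which would match the point where the log-in links start to appear. With 25 parallel requests, nothing is sent with a cookie.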

We’ll try to test to see if this is in fact the cause. If it is, we’ll try to implement a workaround soon.

Ah, spot on - cookies seem to be the issue. Sadly, disabling cookies breaks the site completely. It looks like some sort of cookie filtering is required.

Mark

Hi Mark,

Here’s a suggested patch for this issue by Dave Vasilevsky. We haven’t tested yet, but hope to incorporate it into the next version once we have.

Dave writes:

“The patch causes FTR to only apply cookies to redirects of the same item. This way, subsequent items don’t see cookies from previous items, so sites can’t use cookies to limit the number of pages we process. I’ve tested with all three HumbleHttpAgent methods, curl_multi, request pool, and file_get_contents. All work ok. It’s possible—though unlikely—that this patch could break some feeds. Maybe there’s a feed out there whose items depend on cookies set by previous items, or by the feed itself.”
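The idea Dave describes, one cookie jar per feed item instead of one shared jar, can be sketched as follows. This is an illustration in Python, not a translation of the patch: `PerItemCookies` and the cookie name `meter` are hypothetical, and the domain/path matching that a real cookie jar performs is left out.

```python
from collections import defaultdict


class PerItemCookies:
    """Keep a separate cookie store for each original item URL."""

    def __init__(self):
        # original item URL -> {cookie name: cookie value}
        self._jars = defaultdict(dict)

    def store(self, orig, cookies):
        """Remember cookies set while following redirects for item `orig`."""
        self._jars[orig].update(cookies)

    def get(self, orig):
        """Return a Cookie header value for item `orig`, or None if it has none."""
        jar = self._jars.get(orig)  # .get() avoids creating an empty jar
        if not jar:
            return None
        return "; ".join(f"{name}={value}" for name, value in jar.items())

    def clear(self):
        """Forget everything, e.g. once all items have been fetched."""
        self._jars.clear()


cookies = PerItemCookies()
cookies.store("http://example.org/item1", {"meter": "1"})  # hypothetical cookie
print(cookies.get("http://example.org/item1"))  # meter=1
print(cookies.get("http://example.org/item2"))  # None
```

Because each jar is keyed by the original item URL, a redirect for item 1 still sees item 1's cookies, while item 2 starts with none.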


 libraries/humble-http-agent/CookieJar.php       |  2 +-
 libraries/humble-http-agent/HumbleHttpAgent.php | 47 +++++++++++++++++--------
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/libraries/humble-http-agent/CookieJar.php b/libraries/humble-http-agent/CookieJar.php
index ac346b5..350b706 100644
--- a/libraries/humble-http-agent/CookieJar.php
+++ b/libraries/humble-http-agent/CookieJar.php
@@ -229,7 +229,7 @@ class CookieJar
 }

 // return array of set-cookie values extracted from HTTP response headers (string $h)
-public function extractCookies($h) {
+public static function extractCookies($h) {
 $x = 0;
 $lines = 0;
 $headers = array();
diff --git a/libraries/humble-http-agent/HumbleHttpAgent.php b/libraries/humble-http-agent/HumbleHttpAgent.php
index 11b30b5..f972891 100644
--- a/libraries/humble-http-agent/HumbleHttpAgent.php
+++ b/libraries/humble-http-agent/HumbleHttpAgent.php
@@ -34,7 +34,7 @@ class HumbleHttpAgent
 protected $curlOptions;
 protected $minimiseMemoryUse = false; //TODO
 protected $method;
-protected $cookieJar;
-$this->cookieJar = new CookieJar();
 // set request options (redirect must be 0)
 // HTTP PECL (http://php.net/manual/en/http.request.options.php)
 $this->requestOptions = array(
@@ -284,6 +283,7 @@ class HumbleHttpAgent
 $this->debug("Following redirects #$redirects...");
 $this->fetchAllOnce($this->redirectQueue, $isRedirect=true);
 }
+$this->deleteCookies();
 }

 // fetch all URLs without following redirects
@@ -326,7 +326,7 @@ class HumbleHttpAgent
 }
 $httpRequest = new HttpRequest($req_url, $_meth, $this->requestOptions);
 // send cookies, if we have any
-if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
+if ($cookies = $this->getCookies($orig, $req_url)) {
 $this->debug("...sending cookies: $cookies");
 $httpRequest->addHeaders(array('Cookie' => $cookies));
 }
@@ -374,10 +374,7 @@ class HumbleHttpAgent
 }
 if ($this->validateURL($redirectURL)) {
 $this->debug('Redirect detected. Valid URL: '.$redirectURL);
-// store any cookies
-$cookies = $request->getResponseHeader('set-cookie');
-if ($cookies && !is_array($cookies)) $cookies = array($cookies);
-if ($cookies) $this->cookieJar->storeCookies($url, $cookies);
+$this->storeCookies($orig, $url);
 $this->redirectQueue[$orig] = $redirectURL;
 } else {
 $this->debug('Redirect detected. Invalid URL: '.$redirectURL);
@@ -459,7 +456,7 @@ class HumbleHttpAgent
 // add referer for picky sites
 $headers[] = 'Referer: '.$this->referer;
 // send cookies, if we have any
-if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
+if ($cookies = $this->getCookies($orig, $req_url)) {
 $this->debug("...sending cookies: $cookies");
 $headers[] = 'Cookie: '.$cookies;
 }
@@ -496,9 +493,7 @@ class HumbleHttpAgent
 }
 if ($this->validateURL($redirectURL)) {
 $this->debug('Redirect detected. Valid URL: '.$redirectURL);
-// store any cookies
-$cookies = $this->cookieJar->extractCookies($this->requests[$orig]['headers']);
-if (!empty($cookies)) $this->cookieJar->storeCookies($url, $cookies);
+$this->storeCookies($orig, $url);
 $this->redirectQueue[$orig] = $redirectURL;
 } else {
 $this->debug('Redirect detected. Invalid URL: '.$redirectURL);
@@ -557,7 +552,7 @@ class HumbleHttpAgent
 $httpContext['http']['header'] .= $this->getUserAgent($req_url)."\r\n";
 // add referer for picky sites
 $httpContext['http']['header'] .= 'Referer: '.$this->referer."\r\n";
-if ($cookies = $this->cookieJar->getMatchingCookies($req_url)) {
+if ($cookies = $this->getCookies($orig, $req_url)) {
 $this->debug("...sending cookies: $cookies");
 $httpContext['http']['header'] .= 'Cookie: '.$cookies."\r\n";
 }
@@ -589,9 +584,7 @@ class HumbleHttpAgent
 }
 if ($this->validateURL($redirectURL)) {
 $this->debug('Redirect detected. Valid URL: '.$redirectURL);
-// store any cookies
-$cookies = $this->cookieJar->extractCookies($this->requests[$orig]['headers']);
-if (!empty($cookies)) $this->cookieJar->storeCookies($url, $cookies);
+$this->storeCookies($orig, $url);
 $this->redirectQueue[$orig] = $redirectURL;
 } else {
 $this->debug('Redirect detected. Invalid URL: '.$redirectURL);
@@ -709,6 +702,30 @@ class HumbleHttpAgent
 }
 return false;
 }
+
+protected function getCookies($orig, $req_url) {
+$jar = $this->cookieJar[$orig];
+if (!isset($jar)) {
+return null;
+}
+return $jar->getMatchingCookies($req_url);
+}
+
+protected function storeCookies($orig, $url) {
+$headers = $this->requests[$orig]['headers'];
+$cookies = CookieJar::extractCookies($headers);
+if (empty($cookies)) {
+return;
+}
+if (!isset($this->cookieJar[$orig])) {
+$this->cookieJar[$orig] = new CookieJar();
+}
+$this->cookieJar[$orig]->storeCookies($url, $cookies);
+}
+
+protected function deleteCookies() {
+$this->cookieJar = array();
+}
 }

 // gzdecode from http://www.php.net/manual/en/function.gzdecode.php#82930

Brilliant, thanks - I’ve just spotted your reply!

I’ve tried patch, gpatch and git patch, and none of them like the patch above as copied and pasted from the forum. I’m not sure if it’s being corrupted somehow. It would be great if you could mail me a copy or point me at a clean patch to download, please, so I can easily try it without resorting to manual application.

Thanks in advance

Mark

Hi Mark, I’ve tried pasting it here: https://gist.github.com/fivefilters/0a758b6d64ce4fb5728c

Let me know if that’s any better.

Thanks very much, that’s better - I was able to apply that with patch -l -p1 :slight_smile:

A very quick test suggests this looks good. I’ll leave this patch in place and report back when all of my feeds have had a chance to sync.

Thanks again, Mark

Yes, this patch seems to work without any side effects - all of my feeds have now refreshed. Thanks again - looking forward to seeing what else is in v3.5 :slight_smile:

Mark

Thanks for the update, Mark. Glad to hear it. :slight_smile:

Mark, Keyvan,
Could you please provide some detail on how to apply this patch? Thanks so much!
Patrick

To apply the patch, I used the following command in the shell:
patch HumbleHttpAgent.php -i humblehttpagent_patch.diff -o updatedfile

And I receive the following error message:
patching file HumbleHttpAgent.php
Hunk #1 FAILED at 229.
1 out of 1 hunk FAILED -- saving rejects to file updatedfile.rej
patching file HumbleHttpAgent.php
patch: src/patch.c:1208: apply_hunk: Assertion `outstate->after_newline' failed.
Aborted (core dumped)

Could you please provide the appropriate way to apply this patch? Thanks so much.
Patrick

Hi Patrick,

We hope to have a new release with this patch included - and when we do, we’ll backport it to the public version too.

I don’t know why it’s failing for you (might be because the patch affects two files, rather than one, or maybe the formatting was affected when I uploaded it). For now, it might be easier to make the changes by hand. The patch provided by Dave, which I linked to above, affects two files: CookieJar.php and HumbleHttpAgent.php - both in the libraries/humble-http-agent/ folder.

Start with the visual difference in this Gist: https://gist.github.com/fivefilters/0a758b6d64ce4fb5728c

There’s only one small change in CookieJar.php, which is to make a public method static. Look at lines 13 and 14 in the Gist above.

Then open HumbleHttpAgent.php and, working from line 26 in the patch above, add the lines in green and remove the lines in red - a few lines surrounding the changes are included which should help locate where the changes need to be made.

Hope that’s some help. Please reply if you still have trouble.