Syndication feed fetchers, HTTP redirects, and conditional GET
In response to my entry on how ETag values are specific to a URL, a Wandering Thoughts reader asked me in email what a syndication feed reader (fetcher) should do when it encounters a temporary HTTP redirect, in the context of conditional GET. I think this is a good question, especially if we approach it pragmatically.
The specification-compliant answer is that every final (non-redirected) URL must have its ETag and Last-Modified values tracked separately. If you make a conditional GET for URL A because you know its ETag or Last-Modified (or both) and you get a temporary HTTP redirection to another URL B that you don't have an ETag or Last-Modified for, you can't make a conditional GET for URL B. This means you have to ensure that If-None-Match and especially If-Modified-Since aren't copied from the original HTTP request to the newly re-issued request for the redirect target. And when you make another request for URL A later, you can't send a conditional GET using ETag or Last-Modified values you got from successfully fetching URL B; you either have to use the last values observed for URL A or make an unconditional GET. In other words, saved ETag and Last-Modified values should be per-URL properties, not per-feed properties.
(Unfortunately this may not fit well with feed reader code structures, data storage, or uses of low-level HTTP request libraries that hide things like HTTP redirects from you.)
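As a rough illustration of the per-URL approach, here is a minimal Python sketch. Everything in it is hypothetical: `ValidatorStore`, `fetch`, and the `send(url, headers)` transport callable (which must not follow redirects itself and must return a status, a plain dict of response headers with canonical names, and a body) are names made up for this example, not any real feed reader's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urljoin

# Hypothetical transport: send(url, request_headers) -> (status, response_headers, body).
# It is assumed NOT to follow redirects on its own.
Send = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]

REDIRECTS = {301, 302, 303, 307, 308}


@dataclass
class Validators:
    etag: Optional[str] = None
    last_modified: Optional[str] = None


class ValidatorStore:
    """Remembers ETag and Last-Modified per URL, not per feed."""

    def __init__(self) -> None:
        self._by_url: Dict[str, Validators] = {}

    def conditional_headers(self, url: str) -> Dict[str, str]:
        saved = self._by_url.get(url)
        headers: Dict[str, str] = {}
        if saved is None:
            return headers  # nothing known for this URL: unconditional GET
        if saved.etag:
            headers["If-None-Match"] = saved.etag
        if saved.last_modified:
            headers["If-Modified-Since"] = saved.last_modified
        return headers

    def record(self, url: str, response_headers: Dict[str, str]) -> None:
        self._by_url[url] = Validators(
            etag=response_headers.get("ETag"),
            last_modified=response_headers.get("Last-Modified"),
        )


def fetch(feed_url: str, store: ValidatorStore, send: Send,
          max_redirects: int = 5) -> Tuple[str, int, bytes]:
    url = feed_url
    for _ in range(max_redirects + 1):
        # Conditional headers are rebuilt for each hop from that URL's own
        # saved values, so nothing is copied over from the previous request.
        status, headers, body = send(url, store.conditional_headers(url))
        if status in REDIRECTS and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        if status == 200:
            # Validators are saved under the URL that actually answered.
            store.record(url, headers)
        return url, status, body
    raise RuntimeError("too many redirects")
```

With this shape, a redirect from URL A to a never-before-seen URL B produces an unconditional GET for B, and whatever B returns never overwrites what is remembered for A.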
Pragmatically, you can probably get away with re-doing the conditional GET when you get a temporary HTTP redirect for a feed, with the feed's original saved ETag and Last-Modified information. There are three likely cases for a temporary HTTP redirection of a syndication feed that I can think of:
- You're receiving a generic HTTP redirection to some sort of error
page that isn't a valid syndication feed. Your syndication feed
fetcher isn't going to do anything with a successful fetch of it
(except maybe add an 'error' marker to the feed), so a conditional
GET that fools you with "nothing changed" is harmless.
- You're being redirected to an alternate source of the normal feed,
for example a feed that's normally dynamically generated might
serve a (temporary) HTTP redirect to a static copy under high
load. If the conditional GET matches the ETag (probably unlikely
in practice) or the Last-Modified (more plausible), then you almost
certainly have the most current version and are fine, and you've
saved the web server some load.
- You're being (temporarily) redirected to some kind of error feed, a valid syndication feed that contains one or more entries that are there to tell the person seeing them about a problem. Here, the worst thing that happens if your conditional GET fools you with "nothing has changed" is that the person reading the feed doesn't see the error entry (or entries).
The third case is a special variant of an unlikely general case where the normal URL and the redirected URL are both versions of the feed but each has entries that the other doesn't. In this general case, a conditional GET that fools you with a '304 Not Modified' will cause you to miss some entries. However, this should cure itself when the temporary HTTP redirect stops happening (or when a new entry is published to the temporary location, which should change its ETag and reset its Last-Modified date to more or less now).
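The pragmatic approach could be sketched roughly like this in Python. The names `pragmatic_fetch` and `send` are invented for the example; `send(url, headers)` stands in for whatever low-level HTTP call you have, and is assumed not to follow redirects itself and to return a status, a plain dict of response headers, and a body.

```python
from typing import Callable, Dict, Tuple
from urllib.parse import urljoin

# Hypothetical transport: send(url, request_headers) -> (status, response_headers, body).
Send = Callable[[str, Dict[str, str]], Tuple[int, Dict[str, str], bytes]]

TEMPORARY_REDIRECTS = {302, 303, 307}


def pragmatic_fetch(feed_url: str, feed_validators: Dict[str, str], send: Send,
                    max_redirects: int = 5) -> Tuple[str, int, bytes]:
    """Re-do the conditional GET at the redirect target with the feed's
    original If-None-Match / If-Modified-Since headers."""
    url = feed_url
    for _ in range(max_redirects + 1):
        status, headers, body = send(url, dict(feed_validators))
        if status in TEMPORARY_REDIRECTS and "Location" in headers:
            url = urljoin(url, headers["Location"])
            continue
        # The final URL is returned so the caller can tell whether the
        # answer came from the feed's own URL or from a redirect target.
        return url, status, body
    raise RuntimeError("too many redirects")
```

A caller that gets back a final URL different from `feed_url` knows the response came from a redirect target and can decide what, if anything, to remember from it.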
A feed reader that keeps a per-feed 'Last-Modified' value and updates it after following a temporary HTTP redirect is living dangerously. You may not have the latest version of the non-redirected feed, but the target of the HTTP redirection may be 'more recent' than it (that is, have a later Last-Modified) for various reasons. This is true even if the target is a valid feed; if it's not a valid feed, then blindly saving its ETag and Last-Modified is probably quite dangerous. When the temporary HTTP redirection goes away and the normal feed's URL resumes responding with the feed again, using the target's "Last-Modified" value for a conditional GET of the original URL could cause you to receive "304 Not Modified" until the feed is updated again (and its Last-Modified moves to be after your saved value), whenever that happens. Some feeds update frequently; others may only update days or weeks later.
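To make the failure concrete, here is a small Python sketch with made-up dates. The `status_for` function is a hypothetical stand-in for a server that answers If-Modified-Since with a time comparison ("not modified if the resource is no newer than the date you sent"); real servers vary in how they evaluate this header, so treat it as an assumption.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def status_for(resource_last_modified: str, if_modified_since: Optional[str]) -> int:
    """Hypothetical server-side check using a time comparison."""
    if if_modified_since is None:
        return 200
    modified = parsedate_to_datetime(resource_last_modified)
    since = parsedate_to_datetime(if_modified_since)
    return 304 if modified <= since else 200


# Made-up timeline for illustration.
saved_from_original = "Mon, 01 Jan 2024 00:00:00 GMT"         # last real fetch of the feed's URL
saved_from_redirect_target = "Wed, 03 Jan 2024 00:00:00 GMT"  # Last-Modified of the redirect target
original_now = "Tue, 02 Jan 2024 00:00:00 GMT"                # the feed was updated in between

# Using the value that belongs to the original URL: the update is seen.
print(status_for(original_now, saved_from_original))         # 200

# Using the redirect target's value against the original URL: the update is missed.
print(status_for(original_now, saved_from_redirect_target))  # 304
```

In the second call the reader keeps getting 304 until the original feed's Last-Modified moves past the value saved from the redirect target.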
Given this and the potential difficulties of even noticing HTTP redirects (if they're handled by some underlying library or tool), my view is that if a feed provides both an ETag and a Last-Modified, you should save and use only the ETag unless you're sure you're going to handle HTTP redirects correctly. An ETag could still get you into trouble if used across different URLs, but it's much less likely (see the discussion at the end of my entry about Last-Modified being specific to the URL).
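A minimal Python sketch of that preference follows. The function name and the `redirects_handled` flag are invented for the example, and falling back to Last-Modified when a feed provides no ETag at all is my assumption rather than something argued for above.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str],
                        redirects_handled: bool = False) -> Dict[str, str]:
    """Build conditional GET headers, preferring the ETag alone."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    # Send If-Modified-Since only if redirects are known to be handled
    # correctly, or (as an assumed fallback) if there is no ETag to use.
    if last_modified and (redirects_handled or not etag):
        headers["If-Modified-Since"] = last_modified
    return headers
```

With both values saved and the default `redirects_handled=False`, only `If-None-Match` is sent.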
(All of this is my view as someone providing syndication feeds, not someone writing syndication feed fetchers. There may be practical issues I'm unaware of, since the world of feeds is very large and it probably contains a lot of weird feed behavior (to go with the weird feed fetcher behavior).)