Downloading big rotated files
I've been improving Elevation's reproducibility a little. One of the steps in setting it up is to download an extract, both to import into the database and to fetch the DEM files that will be part of the background. The particular extract I'm using, Europe, is more than 17GiB in size, which means it takes a looong time to download. Thus, I would like to be able to continue the download if it has been interrupted.
The original script that was trying to do that uses `curl`. This version does not try to continue the download, which can easily be achieved by adding the `--continue-at -` option. The version that has it never hit the repo because of the following:
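As a sketch, resuming with curl looks like this; the URL and output file name are my guesses for illustration, not the ones in the actual script:

```shell
#!/bin/sh
# URL and OUT are assumptions for illustration, not from the real script.
URL='https://download.geofabrik.de/europe-latest.osm.pbf'
OUT='europe-latest.osm.pbf'

download() {
    # --continue-at - (short form: -C -) makes curl look at the size of
    # the partial $OUT and ask the server for the remaining bytes only.
    curl --continue-at - --output "$OUT" "$URL"
}
```

Calling `download` again after an interruption appends to `$OUT` instead of starting over.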
The problem arises when the file we want to download is rolled every day. This means that the contents of the file change from one day to the next, and we can't just continue from where we left off if that's the case; we must start all over [1].
One could think that `curl` has an option that looks like it handles that, `--time-cond`, which is what the script was trying to use. This option makes `curl` send the `If-Modified-Since` HTTP header, which allows the server to respond with a 304 (Not Modified) if the file is not newer than the provided date. The date `curl` provides is the one from the file referenced by that option, and I was giving it the same file the output goes to. I was using these options wrong; they were doing it the other way around: continuing if the file changed, or doing nothing if not.
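For contrast, what `--time-cond` is actually meant for is a freshness check: skip the download entirely when the remote copy is not newer than the local one. A sketch of that intended use, with file names of my own choosing:

```shell
#!/bin/sh
# -z/--time-cond with a file argument makes curl send
#   If-Modified-Since: <mtime of that file>
# so the server answers 304 and curl writes nothing when the remote
# copy is not newer -- a freshness check, not a resume mechanism.
refresh() {
    curl --time-cond europe-latest.osm.pbf \
         --output europe-latest.osm.pbf \
         "$1"
}
```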
So I sat down to try and tackle the problem. I know one can use a `HEAD` request to check (at least) two things: the resource's date and size (well, at least in the case of static files like this). So the original idea was to get the URL's date and size; if the date is newer than the local file's, I should restart the download from scratch; if not, and the size is bigger than the local file's, then continue; otherwise, assume the file has finished downloading and stop there.
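That three-way decision can be sketched like this; `decide` and its plain numeric arguments (epoch seconds and bytes) are my own simplification, with the remote values assumed to come from parsing a `HEAD` response:

```shell
#!/bin/sh
# decide() is a hypothetical helper: in the real flow the remote values
# would be parsed out of `curl --silent --head "$URL"` output
# (the Last-Modified and Content-Length headers).
decide() {
    remote_mtime=$1
    remote_size=$2
    local_mtime=$3
    local_size=$4
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        echo restart      # the remote file was rotated: start from scratch
    elif [ "$remote_size" -gt "$local_size" ]; then
        echo continue     # same file, partially downloaded: resume it
    else
        echo finished     # nothing left to download
    fi
}
```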
The last twist of the problem is that the only useful dates from the file were either its ctime or its mtime, but both change on every write to the file. This means that if I leave the script downloading the file, the file is rotated in the meanwhile, the download is interrupted, and I try again later, the file's c/mtime is newer than the URL's, even though it belongs to a file that is older than the URL's. So I had to add a parallel timestamp file that is created only when starting a download and never updated (until the next full download; the file is actually `touch`'ed), and it is its mtime that is used for comparing with the URL's.
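The stamp-file trick can be sketched as follows; the file names are mine, and the `touch -d` date parsing assumes GNU coreutils:

```shell
#!/bin/sh
OUT='europe-latest.osm.pbf'   # hypothetical output name
STAMP="$OUT.stamp"            # parallel timestamp file

start_full_download() {
    touch "$STAMP"    # records when this copy started; never updated later
    : > "$OUT"        # drop any stale partial download
}

remote_is_newer() {
    # $1 is the URL's Last-Modified date; GNU touch -d parses it.
    ref=$(mktemp)
    touch -d "$1" "$ref"
    [ "$STAMP" -ot "$ref" ]   # true when the stamp predates the URL's date
    status=$?
    rm -f "$ref"
    return $status
}
```

Because only `start_full_download` ever touches the stamp, later resumed writes to `$OUT` don't disturb the date being compared.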
Long story short, `curl`'s `--time-cond` and `--continue-at` options combined are not for this; a `HEAD` request helps a little bit, but rotation-while-downloading can further complicate things. One last feature one could ask of such a script would be to keep the old file while downloading the new one and rotate at the end, but I will leave that for when/if I really need it.
The new script is written in `ayrton`, because it's easier to handle execution output and dates in it than in `bash`. This also pushed me to make minor improvements to it, so expect a release soon.
[1] In fact, the only other options are to not do anything (but then we're left with an incomplete, useless file) or to try to find the old file. In the case of geofabrik, they keep the last week of daily rotations; the first day of each previous month back to the beginning of the year; then the first day of each year back to 2014. Good luck with that. ↩