Downloading big rotated files
I've been improving Elevation's reproducibility a little. One of the steps in setting it up is to download an extract, both to import into the database and to fetch the DEM files that will be part of the background. The particular extract I'm using, Europe, is more than 17GiB in size, which means it takes a looong time to download. Thus, I would like to be able to continue the download if it has been interrupted.
The original script that was trying to do that uses `curl`. This version does not try to continue the download, which can easily be achieved by adding the `--continue-at -` option. The version that has it never hit the repo because of the following:
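As a sketch, resuming with curl looks like this; the URL and output file name are my guesses for illustration, not the ones in the actual script:

```shell
#!/bin/sh
# URL and OUT are assumptions for illustration, not from the real script.
URL='https://download.geofabrik.de/europe-latest.osm.pbf'
OUT='europe-latest.osm.pbf'

download() {
    # --continue-at - (short form: -C -) makes curl look at the size of
    # the partial $OUT and ask the server for the remaining bytes only.
    curl --continue-at - --output "$OUT" "$URL"
}
```

Calling `download` again after an interruption appends to `$OUT` instead of starting over.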
The problem arises when the file we want to download is rolled every day. This means that the contents of the file change from one day to the next, and we can't just continue from where we left off if that's the case; we must start all over [1].
One could think that `curl` has an option that looks like it handles that, `--time-cond`, which is what the script was trying to use. This option makes `curl` send the `If-Modified-Since` HTTP header, which allows the server to respond with a 304 (Not Modified) if the file is not newer than the provided date. The date `curl` provides is the one from the file referenced by that option, and I was giving it the same file the output goes to. I was using these options wrong; they were doing it the other way around: continuing if the file changed, or doing nothing if not.
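For contrast, what `--time-cond` is actually meant for is a freshness check: skip the download entirely when the remote copy is not newer than the local one. A sketch of that intended use, with file names of my own choosing:

```shell
#!/bin/sh
# -z/--time-cond with a file argument makes curl send
#   If-Modified-Since: <mtime of that file>
# so the server answers 304 and curl writes nothing when the remote
# copy is not newer -- a freshness check, not a resume mechanism.
refresh() {
    curl --time-cond europe-latest.osm.pbf \
         --output europe-latest.osm.pbf \
         "$1"
}
```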
So I sat down to try and tackle the problem. I know one can use a `HEAD` request to check (at least) two things: the resource's date and size (well, at least in the case of static files like this). So the original idea was to get the URL's date and size; if the date is newer than the local file's, I should restart the download from scratch; if not, and the size is bigger than the local file's, then continue; otherwise, assume the file has finished downloading and stop there.
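That three-way decision can be sketched like this; `decide` and its plain numeric arguments (epoch seconds and bytes) are my own simplification, with the remote values assumed to come from parsing a `HEAD` response:

```shell
#!/bin/sh
# decide() is a hypothetical helper: in the real flow the remote values
# would be parsed out of `curl --silent --head "$URL"` output
# (the Last-Modified and Content-Length headers).
decide() {
    remote_mtime=$1
    remote_size=$2
    local_mtime=$3
    local_size=$4
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        echo restart      # the remote file was rotated: start from scratch
    elif [ "$remote_size" -gt "$local_size" ]; then
        echo continue     # same file, partially downloaded: resume it
    else
        echo finished     # nothing left to download
    fi
}
```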
The last twist of the problem is that the only useful dates from the file were either its ctime or its mtime, but both change on every write to the file. This means that if I leave the script downloading the file, the file is rotated in the meanwhile, the download is interrupted, and I try again later, the file's c/mtime is newer than the URL's, even though it belongs to a file that is older than the URL's. So I had to add a parallel timestamp file that is created only when starting a download and never updated (until the next full download; the file is actually `touch`'ed), and it is its mtime that is used for comparing with the URL's.
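The stamp-file trick can be sketched as follows; the file names are mine, and the `touch -d` date parsing assumes GNU coreutils:

```shell
#!/bin/sh
OUT='europe-latest.osm.pbf'   # hypothetical output name
STAMP="$OUT.stamp"            # parallel timestamp file

start_full_download() {
    touch "$STAMP"    # records when this copy started; never updated later
    : > "$OUT"        # drop any stale partial download
}

remote_is_newer() {
    # $1 is the URL's Last-Modified date; GNU touch -d parses it.
    ref=$(mktemp)
    touch -d "$1" "$ref"
    [ "$STAMP" -ot "$ref" ]   # true when the stamp predates the URL's date
    status=$?
    rm -f "$ref"
    return $status
}
```

Because only `start_full_download` ever touches the stamp, later resumed writes to `$OUT` don't disturb the date being compared.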
Long story short, `curl`'s `--time-cond` and `--continue-at` options combined are not for this; a `HEAD` request helps a little bit, but rotation-while-downloading can further complicate things. One last feature one could ask of such a script would be to keep the old file while downloading the new one and rotate at the end, but I will leave that for when/if I really need it.
The new script is written in `ayrton`, because it's easier to handle execution output and dates in it than in `bash`. This also pushed me to make minor improvements to it, so expect a release soon.
[1] In fact, the only other options are to not do anything (but then we're left with an incomplete, useless file) or to try to find the old file. In the case of geofabrik, they keep the last week of daily rotations; the first day of each previous month back to the beginning of the year; then the first day of each year back to 2014. Good luck with that. ↩