Reprojecting and splitting huge datasets
Another aspect I've been looking into while optimizing rendering speed is data sources that are not in the target projection. These force Mapnik to reproject them on the fly, once for each (meta)tile, when it would make more sense to have them already reprojected.
In my case I have three big datasets in EPSG:4258: elevation, hillshade and slope shade, all based on EEA's DEM files. The source tiles amount to almost 29GiB, and the elevation layer, being RGB, takes even more than that. So I set off to try to reproject the things.
My first, most obvious approach was to reproject each 5x5°, 18000x18000px tile and then derive the data I need, but I started to get gaps between the reprojected tiles. Notice that the original data is cleanly cut (5° x 3600"/° x 1px/" == 18000px), with no overlap between neighboring tiles.
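That first pass was, more or less, this (just a sketch: the directory names are made up, and webmerc stands for the full target SRS string that appears further down):

webmerc='+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 +x_0=0.0 +y_0=0.0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over'

# warp each 5x5° tile in isolation; the resampler has no data beyond the
# tile's own edges, so the warped tiles no longer line up seamlessly
for t in tiles/*.tif; do
    gdalwarp -t_srs "$webmerc" -r lanczos "$t" warped/$(basename "$t")
done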
The next approach was to merge them all in a .vrt file, then reproject chunks of it with gdalwarp. What I wanted as output was the same 5x5° tiles, reprojected, but with an extra pixel so they overlap. This last requirement was the problematic one. See, the final projection turns any square in the original projection into a tall rectangle, stretching more and more towards the poles. The closest I could get was to use the -ts option, but that meant I had no control over how many extra pixels I got in the vertical/latitude direction. My OCD started thrashing :) In fact, what worried me was that I wasn't sure how GDAL would handle the possible partial pixel: round it down (excluding it), round it up (completing it), or simply leave it with partial data, impacting the final rendering.
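In commands, this stage looked more or less like the following (again a sketch; the extent is an example one, webmerc is the SRS string from the previous sketch, and -te_srs lets me give the chunk's bounds in the source's degrees):

gdalbuildvrt EU-DEM.vrt tiles/*.tif

# one 5x5° chunk with one extra pixel of width; -ts with height 0 tells
# gdalwarp to compute the height itself, which is exactly the lack of
# control described above
gdalwarp -t_srs "$webmerc" -te_srs EPSG:4258 -te 5 45 10 50 \
    -ts 18001 0 -r lanczos EU-DEM.vrt chunk-005x045.tif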
Even Rouault pointed out to me that gdalwarp can do something fantastic: it can also generate a .vrt file, with all the parameters needed for the reprojection, so reading from that file automatically reprojects the original data. The resulting dataset is 288,000x325,220px (the original is 288,000x180,000px), so I'm definitely going to cut it into smaller tiles. After consulting an eight-ball, I decided to discard the idea of tiles with boundaries based on coordinates, which might not even make sense anymore, and to settle for pixel-based sizes, still with an extra pixel of overlap. The chosen size is 2**14+1, a.k.a. 16385. For this, gdal_translate is perfect.
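A quick check of how many tiles of that size are needed (and where the loop bounds below come from):

# ceil(288000 / 16384) columns, ceil(325220 / 16384) rows
echo $(( (288000 + 16383) / 16384 )) $(( (325220 + 16383) / 16384 ))
# prints: 18 20

So the column index runs from 0 to 17, and the rows would only need 0 to 19; the loops below actually let the row index reach 23, so I'd expect the last few gdal_translate invocations to fall outside the raster and simply fail, harmlessly.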
The final algorithm is like this:
gdalwarp -t_srs "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 \
    +x_0=0.0 +y_0=0.0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over" \
    -r lanczos -tr 30.92208077590933 -30.92208077590933 \
    -of VRT EU-DEM.vrt EU-DEM-corrected.vrt
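Reading from EU-DEM-corrected.vrt now transparently returns reprojected data; a quick look with gdalinfo confirms the size quoted above:

gdalinfo EU-DEM-corrected.vrt | grep 'Size is'
# Size is 288000, 325220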
The value for the -tr option is the pixel size in meters, the unit declared in the SRS. Notice that, as Mercator stretches towards the poles, this is the size at the origin of the projection; in this case, at 0° lat, 0° lon.
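Unless I'm mistaken, that number is simply one arcsecond of the original data measured at the Equator, i.e. the circumference of the sphere of radius 6378137m divided by 360x3600 arcseconds; bc agrees:

# 2*pi*6378137 / (360*3600); a(1) is bc's atan(1), i.e. pi/4
echo '8 * a(1) * 6378137 / (360 * 3600)' | bc -l
# prints 30.92208077590932..., the -tr value above modulo rounding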
Then the reprojection (simply by reading from the reprojecting dataset) and the cutting, in a couple of nested loops:
tile_size=$((2**14))

for i in $(seq 0 17); do
    for j in $(seq 0 5); do
        for k in $(seq 0 3); do
            l=$((4*$j+$k))

            gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZMA \
                -co LZMA_PRESET=9 \
                -srcwin $(($tile_size*$i)) $(($tile_size*$l)) \
                    $(($tile_size+1)) $(($tile_size+1)) \
                -of GTiff EU-DEM-corrected.vrt \
                $(printf "%03dx%03d-corrected.tif" $i $l) &
        done

        wait
    done
done
There's an extra loop so I can launch 4 workers at a time, because I have 4 cores. This doesn't keep the 4 cores busy 100% of the time (the ones that already finished sit idle until the others do), but it was getting awkward to express in a Makefile, and this is run only once.
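Were this something to run regularly, a pipeline like the following sketch (untested, same parameters as above) would keep all 4 cores busy by letting xargs do the scheduling instead of batching by hand:

for i in $(seq 0 17); do
    for l in $(seq 0 19); do
        echo $i $l
    done
done | xargs -n 2 -P 4 sh -c '
    gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZMA \
        -co LZMA_PRESET=9 \
        -srcwin $((16384*$1)) $((16384*$2)) 16385 16385 \
        -of GTiff EU-DEM-corrected.vrt \
        $(printf "%03dx%03d-corrected.tif" $1 $2)' sh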
Before deriving the rest of the data there's an extra step: removing the generated tiles that have no data at all. I do a similar thing with empty sea tiles in the rendering process. Notice also that the original data is not tile-complete for the covered region (79 tiles exist instead of the 160 that would cover it).
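The removal itself can be as simple as this sketch (not necessarily what I ended up doing; it relies on the tiles having a proper nodata value, because then gdalinfo -stats reports no statistics at all for a band with no valid pixels):

for f in ???x???-corrected.tif; do
    # no STATISTICS_MINIMUM line means no valid pixels in the tile
    if ! gdalinfo -stats "$f" | grep -q STATISTICS_MINIMUM; then
        rm -v "$f"
    fi
done
rm -f *-corrected.tif.aux.xml    # drop the sidecars -stats leaves behind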