Reprojecting and splitting huge datasets

Another aspect I've been looking into with respect to optimizing the rendering speed is data sources that are not in the target projection. Not being in the target projection forces Mapnik to reproject them on the fly, and this for each (meta)tile, whereas it would make more sense to have them already reprojected.

In my case I have 3 datasets, big ones, in EPSG:4258. The three datasets are elevation, hill and slope shade based on EEA's DEM files. The source tiles amount to almost 29GiB, and the elevation layer, being RGB, takes more than that. So I set off to try to reproject the things.

My first, more obvious approach was to reproject every 5x5°, 18000x18000px tile, then derive the data I need, but I started to get gaps between tiles. Notice that the original data is cleanly cut (5° x 3600"/° x 1px/" == 18000px), without any overlapping.

The next approach was to merge them all in a .vrt file, then reproject chunks of it with gdalwarp. What I wanted as output was the same 5x5° tiles, reprojected, but with an extra pixel, so they overlap. This last requirement was the problematic one. See, the final projection makes any square in the original projection a tall rectangle, stretching more and more towards the poles. The closest I could get was to use the -ts option, but that meant that I didn't get any control about how many extra pixels I got in the vertical/latitude direction. My OCD started thrashing :) In fact what happened was that I was not sure how GDAL would handle the possible partial pixel, whether rounding down (meaning excluding it), up (finishing it), or simply leaving the pixel with partial data and impacting the final rendering.

Even Rouault pointed to me that gdalwarp can do something fantastic: it can generate a .vrt file too with all the parameters needed for the reprojection, so reading from there was automatically reprojecting the original data. The resulting dataset is 288,000x325,220px (the original is 288,000x180,000px), so I'm definitely going to cut it down in small tiles. After consulting with a eight-ball, I decided to discard the idea of tiles with boundaries based on coordinates, which might not even make sense anymore, but settle for pixel based sizes, still with an extra pixel. The chosen size is 2**14+1 a.k.a. 16385. For this gdal_translate is perfect.

The final algorithm is like this:

gdalwarp -t_srs "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 \
    +x_0=0.0 +y_0=0.0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over" \
    -r lanczos -tr 30.92208077590933 -30.92208077590933 \
    -of VRT EU-DEM.vrt EU-DEM-corrected.vrt

The values for the -tr option is the pixel size in meters, which is the unit declared in the SRS. Notice that as Mercator stretches towards the poles, this is the size at the origin of the projection; in this case, at 0° lat, 0° lon.

Then the reprojection (by reading from the reprojecting dataset) and cut, in a couple of loops:

tile_size=$((2**14)); \
for i in $(seq 0 17); do
    for j in $(seq 0 5); do
        for k in $(seq 0 3); do
            l=$((4*$j+$k));
            gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZMA \
                -co LZMA_PRESET=9 \
                -srcwin $(($tile_size*$i)) $(($tile_size*$l)) \
                    $(($tile_size+1)) $(($tile_size+1)) \
                -of GTiff EU-DEM-corrected.vrt \
                $(printf "%03dx%03d-corrected.tif" $i $l) &
        done;
        wait;
    done;
done

There's an extra loop to be able to launch 4 workers at the same time, because I have 4 cores. This doesn't occupy the 4 cores a 100% of the time (cores that already finished stay idle until the other finished), but it was getting akward to express in a Makefile, and this is run only once.

Before deriving the rest of the data there's an extra step: removing those generated tiles that actually have no data. I do a similar thing with empty sea tiles in the rendering process. Notice also that the original data is not tile complete for the covered region (79 tiles instead of the 160 they should be).