Another aspect I've been looking into with respect to optimizing rendering speed is data sources that are not in the target projection. Not being in the target projection forces Mapnik to reproject them on the fly for each (meta)tile, when it would make more sense to have them reprojected once and for all.

In my case I have 3 datasets, big ones, in EPSG:4258: elevation, hill shade and slope shade, all based on EEA's DEM files. The source tiles amount to almost 29GiB, and the elevation layer, being RGB, takes more than that. So I set out to reproject the things.

My first, most obvious approach was to reproject every 5x5°, 18000x18000px tile, then derive the data I need, but I started to get gaps between tiles. Notice that the original data is cleanly cut (5° x 3600"/° x 1px/" == 18000px), without any overlapping.

The next approach was to merge them all into a .vrt file, then reproject chunks of it with gdalwarp. What I wanted as output was the same 5x5° tiles, reprojected, but with an extra pixel so they overlap. This last requirement was the problematic one. See, the final projection turns any square in the original projection into a tall rectangle, stretching more and more towards the poles. The closest I could get was to use the -ts option, but that meant I didn't get any control over how many extra pixels I got in the vertical/latitude direction. My OCD started thrashing :) In fact what happened was that I was not sure how GDAL would handle the possible partial pixel: rounding down (excluding it), rounding up (completing it), or simply leaving the pixel with partial data, impacting the final rendering.

Even Rouault pointed out to me that gdalwarp can do something fantastic: it can generate a .vrt file too, with all the parameters needed for the reprojection, so reading from it automatically reprojects the original data. The resulting dataset is 288,000x325,220px (the original is 288,000x180,000px), so I'm definitely going to cut it down into small tiles. After consulting an eight-ball, I discarded the idea of tile boundaries based on coordinates, which might not even make sense anymore, and settled for pixel-based sizes, still with an extra pixel. The chosen size is 2**14+1, a.k.a. 16385. For this gdal_translate is perfect.

The final algorithm is like this:

gdalwarp -t_srs "+proj=merc +a=6378137 +b=6378137 +lat_ts=0.0 +lon_0=0.0 \
    +x_0=0.0 +y_0=0.0 +k=1.0 +units=m +nadgrids=@null +wktext +no_defs +over" \
    -r lanczos -tr 30.92208077590933 -30.92208077590933 \
    -of VRT EU-DEM.vrt EU-DEM-corrected.vrt

The two values for the -tr option are the pixel size in meters, which is the unit declared in the SRS. Notice that, as Mercator stretches towards the poles, this is the size at the origin of the projection; in this case, at 0° lat, 0° lon.
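That magic number is just one source pixel (1 arcsecond; the source tiles have 3600px per degree) measured in Mercator meters at the Equator. A quick sanity check, using the sphere radius declared in the -t_srs above:

#! /usr/bin/python3
# one arcsecond of longitude at the Equator, in spherical Mercator meters
from math import pi

a = 6378137  # the +a from the proj string above
print (2*pi*a/360/3600)  # prints 30.922080775909325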

Then the reprojection (by reading from the reprojecting dataset) and the cut, in a couple of loops:

tile_size=$((2**14)); \
for i in $(seq 0 17); do
    for j in $(seq 0 5); do
        for k in $(seq 0 3); do
            l=$((4*$j+$k));
            gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZMA \
                -co LZMA_PRESET=9 \
                -srcwin $(($tile_size*$i)) $(($tile_size*$l)) \
                    $(($tile_size+1)) $(($tile_size+1)) \
                -of GTiff EU-DEM-corrected.vrt \
                $(printf "%03dx%03d-corrected.tif" $i $l) &
        done;
        wait;
    done;
done

There's an extra loop to be able to launch 4 workers at the same time, because I have 4 cores. This doesn't keep the 4 cores busy 100% of the time (cores that already finished stay idle until the others finish), but it was getting awkward to express in a Makefile, and this is run only once.
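An alternative that would keep all 4 cores busy is to emit one gdal_translate command per tile and let a scheduler pick them up. Something like this, assuming GNU xargs (a sketch of the idea, not what I actually ran):

tile_size=$((2**14))
for i in $(seq 0 17); do
    for l in $(seq 0 23); do
        echo "gdal_translate -co BIGTIFF=YES -co TILED=YES -co COMPRESS=LZMA \
            -co LZMA_PRESET=9 \
            -srcwin $(($tile_size*$i)) $(($tile_size*$l)) \
                $(($tile_size+1)) $(($tile_size+1)) \
            -of GTiff EU-DEM-corrected.vrt \
            $(printf '%03dx%03d-corrected.tif' $i $l)"
    done
done | xargs -P 4 -I CMD sh -c CMD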

Before deriving the rest of the data there's an extra step: removing those generated tiles that actually have no data. I do a similar thing with empty sea tiles in the rendering process. Notice also that the original data is not tile-complete for the covered region (79 tiles instead of the 160 there should be).


openstreetmap gis gdal

Posted Tue 23 Jan 2018 14:08:42 CET Tags: gis

I just uploaded my first semi-automated change. It was generated with my hack for generating centerlines for riverbank polygons. This time I expanded it to include a JOSM plugin which takes all the closed polygons from the selection and runs the algorithm on them, creating new lines. It still needs some polishing, like making sure they're riverbanks and copying useful tags to the new line, and probably running a simplifying algo at some point. Also, even simple-looking polygons might generate complex lines (in plural, and some of these lines could be spurious), so some manual editing might be required afterwards, especially connecting the new line to existing centerlines. Still, I think it's useful.

Like I mentioned last time, its setup is quite complex: the JOSM plugin calls a Python script that needs the Python module installed. That module, for lack of proper bindings for SFCGAL, depends on PostgreSQL+PostGIS (but we all have one, right? :-[ ), and connection strings are currently hardcoded. All that to say: it's quite hacky, not even alpha quality from the installation point of view.

Lastly, as imagico mentioned in the first post about this hack, the algorithms are not fast, and I already made my computer start thrashing the disk, swapping like Hell, because pg hit a big polygon and started using lots of RAM to calculate its centerline. At least this time I can see how complex the polygons are before handing them to the code. As an initial benchmark, the original data for that changeset (I later simplified it with JOSM's tool) took 0.063927s in pg+gis and 0.004737s in the Python code. More tests will come later.

Okay, one last thing: Java is hard for a Pythonista. At some point it took me 2h40 to write 60 lines of code, ~2m40 per line!


openstreetmap gis python

Posted Mon 29 Aug 2016 20:06:39 CEST Tags: gis

My latest Europe import was quite eventful. First, I ran out of space several times during the import itself, at indexing time. The good thing is that, if you manage to reclaim some space and read a little of the source code[1], you can replay the missing queries by hand and stop cursing. To be fair, osm2pgsql currently uses a lot of space in slim+flat-nodes mode: three tables, planet_osm_nodes, planet_osm_ways and planet_osm_rels, plus one file, the flat nodes one. Those are not deleted until the whole process has finished, but they're actually not needed after the processing phase. I started working on fixing that.

But that was not the most difficult part. The most difficult part was that I forgot, somehow, to add a column to the import.style file. Elevation, my own style, renders different icons for different types of castles (and forts too), just like the Historic Place map of the Hiking and Bridle map[2]. So today I sat down and tried to figure out how to reparse the OSM extract I used for the import to add this info.

The first step is to add the column to the tables. But first, which tables should be impacted? Well, the line I should have added to the import style is this:

node,way   castle_type  text         polygon

That says that this applies to nodes and ways. If the element is a way, polygon means osm2pgsql will try to convert it to a polygon and put it in the planet_osm_polygon table; if it's a node, it ends up in the planet_osm_point table. So we just add the column to those tables:

ALTER TABLE planet_osm_point   ADD COLUMN castle_type text;
ALTER TABLE planet_osm_polygon ADD COLUMN castle_type text;

Now, how to process the extract? Enter pyosmium. It's a Python binding for the osmium library with a stream-like type of processing, à la expat for XML. The interface is quite simple: one subclasses osmium.SimpleHandler, defines the element type handlers (node(), way() and/or relation()) and that's it! Here's the full code of the simple Python script I wrote:

#! /usr/bin/python3

import osmium
import psycopg2

conn= psycopg2.connect ('dbname=gis')
cur= conn.cursor ()

class CastleTypes (osmium.SimpleHandler):

    def process (self, thing, table):
        if 'castle_type' in thing.tags:
            try:
                name= thing.tags['name']
            # osmium/boost do not raise a KeyError here!
            # SystemError: <Boost.Python.function object at 0x1329cd0> returned a result with an error set
            except (KeyError, SystemError):
                name= ''
            print (table, thing.id, name)

            cur.execute ('''UPDATE '''+table+
                         ''' SET castle_type = %s
                            WHERE osm_id = %s''',
                         (thing.tags['castle_type'], thing.id))

    def node (self, n):
        self.process (n, 'planet_osm_point')

    def way (self, w):
        self.process (w, 'planet_osm_polygon')

    relation= way  # handle them the same way (*honk*)

ct= CastleTypes ()
ct.apply_file ('europe-latest.osm.pbf')

# don't forget to commit, otherwise psycopg2 rolls back all the UPDATEs
conn.commit ()

The only strange part of the API is that it doesn't seem to raise a KeyError when the tag does not exist, but a SystemError. I'll try to figure this out later. Also interesting is the large amount of unnamed elements with this tag that exist in the DB.
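In the meantime the workaround is easy, assuming pyosmium's TagList implements a dict-like get(), which it seems to:

# hedged workaround: ask for the tag with a default instead of indexing
name= thing.tags.get ('name', '')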


[1] I would love for GitHub to recognize something like https://github.com/openstreetmap/osm2pgsql/blob/master/table.cpp#table_t::stop and be directed to that method, because #Lxxx gets old pretty quick.

[2] I just noticed how much more complete those maps are. More ideas to use :)


openstreetmap gis python

Posted Tue 09 Aug 2016 17:45:56 CEST Tags: gis

These last two days I've been expanding osm-centerlines. Now it not only supports polygons more complex than a simple rectangle, but also ones that lead to 'branches' (unfortunately, most probably because the mapper either imported bad data or mapped it himself). Still, I tested it on very complex polygons and the result is not pretty. There's still lots of room for improvement.

Unluckily, it's not as standalone as it could be. The problem is that, so far, the algos force you to provide not only the polygon you want to process, but also its skeleton and medial. The code extends the medial using info extracted from the skeleton in such a way that the resulting medial ends on a segment of the polygon, hopefully the one(s) that cross from one riverbank to the other at the down- and upstream ends. Calculating the skeleton could be performed by CGAL, but the current Python binding doesn't include that function yet. As for the medial, SFCGAL (a C++ wrapper for CGAL) exports a function that calculates an approximate medial, but there seem to be no Python bindings for it yet.

So, a partial solution is to use PostGIS-2.2's ST_StraightSkeleton() and ST_ApproximateMedialAxis(); for that I added a function called skeleton_medial_from_postgis(). The parameters are a psycopg2 connection to a PostgreSQL+PostGIS database and the polygon you want to process as a shapely.geometry, and it returns the skeleton and the medial ready to be fed into extend_medials(). The result of that should be ready for mapping.
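For illustration, this is more or less how I expect it to be used; the module name and exact signatures here are my shorthand for what I just described, so take them with a grain of salt:

#! /usr/bin/python3
# hypothetical usage sketch; names and signatures assumed, not documented
import psycopg2
from shapely.geometry import Polygon

from osm_centerlines import skeleton_medial_from_postgis, extend_medials

conn= psycopg2.connect ('dbname=gis')
way= Polygon ([ (0, 0), (100, 0), (100, 10), (0, 10) ])  # a toy riverbank

skel, medial= skeleton_medial_from_postgis (conn, way)
centerline= extend_medials (way, skel, medial)  # should be ready for mapping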

So there's that. I'll be trying to improve it in the next days, and start looking into converting it into a JOSM plugin.


openstreetmap gis python

Posted Fri 29 Jul 2016 00:39:57 CEST Tags: gis

For a long time now I've been thinking about a problem: OSM data sometimes contains riverbanks that have no centerline. This means that someone mapped (part of) the coasts of a river (or stream!), but didn't care to add a line that would mark its centerline.

But this should be computationally solvable, right? Well, it's not that easy. See, for any given riverbank polygon in OSM's database, you have 4 types of segments: those representing the right and left riverbanks (two types), and the flow-in and flow-out segments, which link the banks upstream and downstream. With a little bit of luck there will be only one flow-in and one flow-out segment, but there are no guarantees here.

One method could try to identify these segments, then draw a line starting in the middle of the flow-in segment, calculating the middle by traversing both banks at the same time, and finally connect to the middle of the flow-out segment. Identifying the segments is hard by itself, and it is also possible that the result is not optimal, leading to a jagged line. I didn't try anything along those lines, but I could try some examples by hand...

Enter topology, the section of maths that deals with this kind of problem. The skeleton of a polygon is a set of lines that are equidistant to the borders of the polygon. One of the properties this set of lines provides is direction, which can be exploited to find the banks and try to apply the previous algorithm. But a skeleton has a lot of 'branches' that might confuse the algo. Going a little further, there's the medial axis, which in most cases can be considered a simplified skeleton, without most of the skeleton's branches.

Enter free software :) CGAL is a library that can compute a lot of topological properties. PostGIS is clever enough to leverage those algorithms and present, among others, the functions ST_StraightSkeleton() and ST_ApproximateMedialAxis(). With these two and the original polygon I plan to derive the centerline. But first, an image that will help explain it:

The green 'rectangle' is the original riverbank polygon. The thin black line is its skeleton; the medium red line is the medial. Notice how the medial and the center of the skeleton coincide. Then we have 4 branches forming a V shape, with the vertex of each V at an end of the medial and its other two ends coinciding with the ends of the flow-in and flow-out segments!

So the algorithm is simple: start with the medial; from its ends, find the branches in the skeleton that form each V; using the other two ends of those Vs, calculate the point right between them, and extend the medial to those points. This only calculates a centerline. The next step would be to give it a direction. For that I will need to see if there are any nearby lines that could be part of the river (that's what the centerline is for: to possibly extend existing rivers/centerlines), and use their direction for the new centerline.
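The extension step itself is just a little vector math; a minimal sketch with shapely, assuming the two outer ends of a V are already identified:

from shapely.geometry import LineString, Point

def extend_medial_end (medial, v_end_a, v_end_b):
    # the point right between the outer ends of the V...
    middle= Point ((v_end_a.x+v_end_b.x)/2, (v_end_a.y+v_end_b.y)/2)
    # ... prepended to the medial, extending it towards the flow segment
    return LineString ([ (middle.x, middle.y) ]+list (medial.coords))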

For the moment the algorithm only solves this simple case. A slightly more complex case is not that trivial, as skeletons and medials are returned as a MultiLineString with a line for each segment, meaning I will have to rebuild them into LineStrings before processing.
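Luckily shapely already has a function for that rebuilding:

from shapely.geometry import MultiLineString
from shapely.ops import linemerge

segments= MultiLineString ([ [ (0, 0), (1, 1) ], [ (1, 1), (2, 2) ] ])
print (linemerge (segments))  # LINESTRING (0 0, 1 1, 2 2)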

I put all the code online, of course :) Besides a preloaded PostgreSQL+PostGIS database with OSM data, you'll need python3-sqlalchemy, geoalchemy, python3-fiona and python3-shapely. The first two allow me to fetch the data from the db. Ah! By the way, you will need a couple of views:

CREATE VIEW planet_osm_riverbank_skel   AS SELECT osm_id, way, ST_StraightSkeleton (way)      AS skel   FROM planet_osm_polygon WHERE waterway = 'riverbank';
CREATE VIEW planet_osm_riverbank_medial AS SELECT osm_id, way, ST_ApproximateMedialAxis (way) AS medial FROM planet_osm_polygon WHERE waterway = 'riverbank';

Shapely allows me to manipulate the polygonal data, and fiona is used to save the results to a shapefile. This is the first time I ever use any of them (except SQLAlchemy), and it's nice that it's so easy to do all this in Python.


openstreetmap gis python

Posted Tue 26 Jul 2016 18:55:14 CEST Tags: gis

As I mentioned in one of my last posts about it, I started using SRTM 1 arc data for rendering my maps. So far, with the 3 arc dataset, I was able to generate a huge image by stitching the tiles with gdal_merge.py, then generating the 3 final images (height, slope and hill shade) plus one intermediary for the slope, all with gdaldem. Now this is no longer possible, as the new dataset is almost 10x the old one, so instead of going that way, I decided to try another one.

With gdalbuildvrt it is possible to generate an XML file that 'virtually' stitches images. This means that any attempt to access data through this file will actually make the library (libgdal) find the proper tile(s) and access them directly.
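For instance, building a virtual mosaic of all the hill shading tiles is a one-liner (file names assumed):

gdalbuildvrt hillshade.vrt *-hill.tif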

So now the problem becomes processing each tile individually and then virtually stitching them. The first part is easy, as I just need to do the same thing I was doing to the huge stitched image before. I also took the opportunity to use tiled files: instead of storing the image one scan line at a time (being 1 arc second resolution, each scan line has 3601 pixels; the extra one is for overlapping with the neighbors), they store it in 256x256 sub-tiles, possibly (that is, not tested) making rendering faster by clustering related data closer. The second step, with gdalbuildvrt, should also be easy.

The first block in the way is the fact that SRTM tiles above 50°N are only 1801 pixels wide, most probably because the full resolution makes no sense up there anyways. This meant that I had to preprocess those tiles so libgdal didn't have to do the interpolation at render time (in fact, it already has to do it once while rendering, using the lanczos scaling algorithm). This was done with gdalwarp.

The second one came from the slope and hill shading tiles. As the algorithm goes, it generates some 'fade out' values at the edges, and when libgdal was stitching them, I could see them as a line at the seams. This was fixed by passing -compute_edges to gdaldem.
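Putting both fixes together, the per-tile preprocessing looks more or less like this (file names and the exact upsampling target are illustrative, not my exact commands):

# upsample the narrow (1801px) tiles above 50°N to the common 3601px width
gdalwarp -ts 3601 3601 -r lanczos N51E006.hgt N51E006.tif
# shade with proper edges, stored as 256x256 sub-tiles
gdaldem hillshade -compute_edges -co TILED=YES N51E006.tif N51E006-hill.tif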

Finally, for some reason gdalbuildvrt was generating some very strange .vrt files. The format of these files is more or less the following:

  • For each band in the source tiles it creates a band in the result.
    • For each source tile, it describes:
      • The source file
      • The source band
      • The size and tiling of the source (3601², 256²)
      • The rectangle we want from the source (0², 3601²)
      • The rectangle in the result (x, y, 3601²)

The problem I saw was some weird declarations of the rectangles in the result, where the coordinates or the sizes didn't match what I expected. I will try to figure this out with the GDAL people in the following weeks, but first I want to make sure that the source tiles are easily downloadable (so far I have only found download options through USGS' EarthExplorer, which requires you to be logged in in order to download tiles; this means it is not very scriptable, so not very reproducible).

So for the moment I'm using my own .vrt file generator, completely not generic enough for release, but soon. I also took the opportunity to make the rectangles in the result non-overlapping, being just 3600² in size. I know that the generated file works because I'm also generating smaller samples of the resulting layers (again, height, slope and hill shading) for rendering smaller zoom levels.
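Just to give an idea, the generator doesn't need much more than this (a sketch with sizes and file names hardcoded for two tiles; the real script walks the whole directory, and a proper .vrt would also carry the SRS and GeoTransform):

#! /usr/bin/python3
# writes a minimal .vrt by hand: one band, one SimpleSource per tile,
# with non-overlapping 3600x3600 DstRects
tile_size= 3600
tiles= ( ('N44E007-height.tif', 0, 0), ('N44E008-height.tif', 1, 0) )  # (file, col, row)

print ('<VRTDataset rasterXSize="%d" rasterYSize="%d">' %
       (2*tile_size+1, tile_size+1))
print ('  <VRTRasterBand dataType="Int16" band="1">')
for file_name, col, row in tiles:
    print ('''    <SimpleSource>
      <SourceFilename relativeToVRT="1">%s</SourceFilename>
      <SourceBand>1</SourceBand>
      <SourceProperties RasterXSize="3601" RasterYSize="3601" DataType="Int16" BlockXSize="256" BlockYSize="256" />
      <SrcRect xOff="0" yOff="0" xSize="%d" ySize="%d" />
      <DstRect xOff="%d" yOff="%d" xSize="%d" ySize="%d" />
    </SimpleSource>''' % (file_name, tile_size, tile_size,
                          col*tile_size, row*tile_size, tile_size, tile_size))
print ('  </VRTRasterBand>')
print ('</VRTDataset>')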

The only remaining question about huge DEM datasets is contour generation. So far I had just generated contour lines for each tile and lived with the fact that they too look ugly at the seams.


gdal gis srtm

Posted Thu 30 Apr 2015 15:03:22 CEST Tags: gis

Since I started playing with rendering maps I have included some kind of elevation info for highlighting mountains. At the beginning it was just hillshading provided by some German guy (I don't have the reference on me right now), but after reading TileMill's terrain data guide, I started using DEMs to generate 4 different layers: elevation coloring, slope shading, hillshading and contour lines.

When I started I could find only three DEM sources: SRTM 3arc and ViewFinderPanoramas (1arc and 3arc). The second one tries to flatten plains (for instance, the Po's plain near where I live), where it generates some ugly looking terracing. As for the third one, when I downloaded the corresponding tile (they're supposed to be 1x1 degrees), its metadata reported an extent between 7 and 48 degrees east and between 36 and 54 degrees north, with a size of 147602x64801 pixels. I also remember stitching all the tiles covering Europe, just to get a nice 1x1 degrees hole in the North Adriatic sea. Not having much time to pre or post process the data, I decided to stick to SRTM.

Things changed at the end of last year. The US government decided to release 1arc, 30m global coverage (previously that resolution covered only the US). I started playing with the data mid January, only to find that it is not void-filled: these DEMs are derived from the homonymous Shuttle mission, which used radar to get the data. Radar gets very confused when water is involved; this is no problem on rivers, lakes or sea, where elevation is constant relative to the coast, but it's a problem on snow covered mountains, glaciers and even clouds. This means that the data has NODATA holes. The SRTM 3arc v4.1 I was using had these 'voids' filled; de Ferranti has painstakingly been filling these voids too by hand, using topographic maps as reference.

So I set out to fill these voids too. But first let's see what the original data looks like. All the images are for the area near Isola 2000, a ski station I go to often. The first image is how this looks on the SRTM 3arc v4.1:

This is a 4x4 grid of 256x256 pixel tiles (1024x1024 in total) at zoom level 13. The heights range from ~700m up to ~2600m, and the image combines all 4 layers. It already starts showing some roundness in the terrain, especially in ridges and mountain tops, and even at the bottom of the deep valleys.

For contrast, this is de Ferranti's data:

This is the first time I really take a look at the result; it doesn't seem to be much better than 3arc. Here's the terracing I mentioned:

For contrast, check what 1arc means:

From my point of view, the quality is definitely better. Peaks, crests and valleys are quite sharp. As for the mountain sides, they look rugged. My feeling is that this better reflects the nature of the terrain in question, but Christoph Hormann of Imagico.de views it as sampling noise. He has worked a lot on DEMs to generate very beautiful maps.

But then we have those nice blue lagoons courtesy of voids (the blue we can see is the water color I use in my maps). So, how to proceed?

The simplest way to fix this is covering the voids with averages calculated from the data at the seams of the voids. GDAL has a tool for that called, of course, gdal_fillnodata.py.
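The invocation is as simple as it gets (file names assumed; I believe I ran it with the default options):

gdal_fillnodata.py n44_e007_1arc_v3.tif n44_e007_1arc_v3-filled.tif

This is the outcome: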

At first this looks quite good, but once we start to zoom in (remember there are at least 5 more zoom levels), we start to see some regular patterns:

Another option is to use de Ferranti's data to fill the voids. For this we need to merge both datasets. One way to do it is using GDAL's gdalwarp tool. We create a file by piling up layers of data: first the most complete one, then the layers with holes (gdalwarp doesn't write the source's NODATA pixels, so the voids don't overwrite what's already there):

gdalwarp deFerrantis/N44E007.hgt mixed.tif
gdalwarp SRTM_1arc_v3/n44_e007_1arc_v3.tif mixed.tif

This looks like this:

I have to be honest, it doesn't look good. Both files declare the same extents and resolution (their metadata is similar, but the second file has more of it), but if you compare the renders for SRTM_1arc_v3 and de Ferranti's, you will notice that they don't seem to align properly.

The last simple option would be to upsample SRTM_3arc_v4.1 and then merge like before, but it took me a while to figure out the right parameters:

gdalwarp -te 6.9998611 43.9998611 8.0001389 45.0001389 -tr 0.000277777777778 -0.000277777777778 -rb SRTM_3as_v4.1/srtm_38_04.tif srtm_1as_v3-3as_v4.1.tif
Creating output file that is 3601P x 3601L.
Processing input file SRTM_3as_v4.1/srtm_38_04.tif.
Using internal nodata values (eg. -32768) for image SRTM_3as_v4.1/srtm_38_04.tif.
0...10...20...30...40...50...60...70...80...90...100 - done.
gdalwarp SRTM_1as_v3/n44_e007_1arc_v3.tif srtm_1as_v3-3as_v4.1.tif
Processing input file SRTM_1as_v3/n44_e007_1arc_v3.tif.
Using internal nodata values (eg. -32767) for image SRTM_1as_v3/n44_e007_1arc_v3.tif.
0...10...20...30...40...50...60...70...80...90...100 - done.

The complex part was the -te and -tr parameters. The 3 arc file covers a 5x5 degrees zone at 40-45°N, 5-10°E. The 1 arc file only covers 44-45°N, 7-8°E, so I need to cut the bigger one down. I use the -te option for that; according to the doc it's «[to] set georeferenced extents of [the] output file to be created», meaning the degrees of the W, S, E and N limits of the output file. Note that the actual parameters are half an arcsecond (half a pixel) off those limits; I took them from the original 1 arc file.

The second one is even more cryptic: «[to] set [the] output file resolution (in target georeferenced units)». This last word is the key: units. According to gdalinfo, both files have a certain pixel size in the units declared in the projection. Both declare UNIT["degree",0.0174532925199433]; “degree” has a clear meaning; the float beside it is the size of the unit in radians (π/180). So the parameters for -tr are how many units (degrees) a pixel represents (or, more accurately, how many degrees there are between one pixel's center and the next one's); 0.000277777777778 is just 1/3600, one arcsecond. Notice that the vertical value is negative; that's because raster images go from North to South, while degrees go the other way around (North and East degrees are positive, South and West negative). In any case, I also just copied the pixel size declared in the 1 arc file.

After all this dissertation about GeoTIFF metadata, finally the resulting DEM:

As I said earlier, I don't have much time to invest in this; my map is mostly for personal consumption and I can't put much time into it. So my conclusion is this: if I can manage the 9x data size, I think I'm going this way.


elevation gdal gis srtm

Posted Sat 07 Feb 2015 02:25:17 CET Tags: gis

One of the most cited ways to accelerate rendering maps with Mapnik is metatiling. This technique is simple: instead of rendering each individual 256x256 PNG file that normally backs a slippy map, render NxN times that (normally with N=8) and split it afterwards. mapnik-stylesheet's generate_tiles.py was not capable of doing this, but I hacked it a little bit and got it to do it, so after integrating openstreetmap-carto v2.14.1 and fixing some local bugs, I gave it another spin.
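The splitting step is cheap compared to rendering. My hack does the cut right after rendering (Mapnik can save views of the rendered image), but the idea fits in a few lines of PIL:

#! /usr/bin/python3
# cut an 8x8, 2048x2048px metatile into 64 individual 256x256px tiles
from PIL import Image

meta= Image.open ('metatile.png')
for x in range (8):
    for y in range (8):
        tile= meta.crop ((x*256, y*256, (x+1)*256, (y+1)*256))
        tile.save ('%d-%d.png' % (x, y))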

This time I also imported the whole Europe data, and expanded the terrain to cover anything above 60°N, taking the height files from de Ferranti's viewfinderpanoramas. This means that whatever rendering times I got this time are not completely comparable to what I obtained before, because I'm simply handling way more data, especially terrain data.

So, first, a short revisit to importing. Here's the summary:

mdione@diablo:/var/lib/data/mdione/src/projects/osm/data/osm$ osm2pgsql \
    --create --drop --database gis --slim --flat-nodes /home/mdione/src/projects/osm/nodes.cache \
    --cache 2048 --number-processes 4 --unlogged europe-latest.osm.pbf
osm2pgsql SVN version 0.82.0 (64bit id space)

Node-cache: cache=2048MB, maxblocks=262145*8192, allocation method=11
Mid: loading persistent node cache from /home/mdione/src/projects/osm/nodes.cache
Allocated space for persistent node cache file
Maximum node in persistent node cache: 0
Mid: pgsql, scale=100 cache=2048

Reading in file: europe-latest.osm.pbf
Processing: Node(1225816k 799.6k/s) Way(151242k 17.71k/s) Relation(1872470 338.91/s)  parse time: 15596s [4h20m]
Node stats: total(1225816996), max(2884224404) in 1533s [0h26m]
Way stats: total(151242682), max(284701738) in 8538s [2h22m]
Relation stats: total(1872475), max(3776697) in 5525s [1h32m]

Going over pending ways...
Maximum node in persistent node cache: 2884632575
        110980610 ways are pending
Using 4 helper-processes [but only 1 is used, the others failed, dunno why]
Mid: loading persistent node cache from /home/mdione/src/projects/osm/nodes.cache
Maximum node in persistent node cache: 2884632575
Process 0 finished processing 110980610 ways in 34202 sec [9h30m]
110980610 Pending ways took 34202s at a rate of 3244.86/s

Going over pending relations...
Maximum node in persistent node cache: 2884632575
        0 relations are pending
Using 4 helper-processes
Process 2 finished processing 0 relations in 2 sec
Process 3 finished processing 0 relations in 2 sec
Process 0 finished processing 0 relations in 3 sec
Process 1 finished processing 0 relations in 3 sec
0 Pending relations took 3s at a rate of 0.00/s

node cache: stored: 203128264(16.57%), storage efficiency: 75.67% (dense blocks: 138131, sparse nodes: 63494657), hit rate: 17.62%

Stopped table: planet_osm_rels in 1s
Stopped table: planet_osm_nodes in 1s
Stopped table: planet_osm_ways in 3s

All indexes on  planet_osm_roads created  in 4679s [1h18m]
All indexes on  planet_osm_point created  in 6093s [1h41m]
All indexes on  planet_osm_line created  in 9703s [2h42m]
All indexes on  planet_osm_polygon created  in 12735s [3h32m]

Osm2pgsql took 62780s overall [17h26m]

Only ~3h30m more than importing just a part of Europe; I think that's manageable. I still don't know what to make of the cache/storage efficiency line, but it's clearly related to the small size of the cache.

About the rendering, I only rendered up to zoom level 11, as I discussed before, but that limit was mostly due to the fact that going all the way down to 14 took too much time. After seeing how much less rendering took this time, I will most probably revert that change. Take a look at the graph:

This graph shows the average and total time for rendering up to zoom level 11 with single tiling (actually it's the old data I showed previously) and metatiling 8x8. The total time uses the logarithmic scale on the right. The average in the second case is the division between the total time and the amount of final individual tiles, which are produced by cutting the metatile into as many single tiles as the minimum between 2^z and 64, so zoom levels 0-2 don't produce spurious images. The times are amazing. Notice that from time to time the total time is bigger in the metatile version than in the single tile version, but there are two factors at play here. One is that, due to metatiling, I'm rendering tiles that I wouldn't with single tiling, because now I can only cut in increments of 8 instead of 1. Two, now I'm rendering the whole of Europe instead of just a big part of it. This makes the amount of tiles produced between 6 and 10 times bigger.

Still, take a look at the average rendering times for metatiling, and especially at the last column in the table, which is not graphed: it's the ratio between the average tile rendering times, meta to single. It's mostly below 10%, except for a couple of zoom levels that either produce few tiles (ZL 2, 16 tiles) or stay under 15% (ZL 4). This is ridiculously low, which means fast. I will first finish rendering my zone down to ZL 18 and then render the rest down to ZL 14 to have more complete, comparable graphs.


openstreetmap gis

Posted Tue 10 Jun 2014 20:55:23 CEST Tags: gis

In the last post I concluded that I will reimport Europe's data before applying a new version of my design. This also means that I will have to render everything from time to time. How long does this take? Let's answer that question.

Remember first that I didn't quite import all of Europe, just a rectangle with North UK, Portugal, Malta and Istanbul as limits, so I'll just render that part. Then comes the question of how deep to go. OSM's wiki has some information about tile usage that shows that only up to zoom level (ZL) 11 all the tiles were viewed at least once, and that after ZL 15 the percentage drops below 10. I struck my balance at 14, where there's a one in three chance of a tile being seen. Then I render specific regions down to ZL 18, mostly my current immediate area of influence and places I will visit.

So with the same bbox as the import step, I fired off the rendering process and left the machine mostly to itself. I say mostly because, in the end, it's my main machine, and I kept using it for hacking, mail, browsing, chat, etc., but nothing very CPU, RAM or disk intensive. I modified generate_tiles.py so it measures the time spent on each tile and logs that into a file. Then I ran another small Python script on it to get the minimum, average, maximum and total times, how many tiles were rendered and the space they take[1].
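That second script is nothing fancy; it was something along these lines (the log format is whatever my hacked generate_tiles.py writes, here assumed to be one "zoom seconds" pair per line):

#! /usr/bin/python3
# aggregate per-tile rendering times by zoom level
import sys
from collections import defaultdict

times= defaultdict (list)
for line in open (sys.argv[1]):
    zoom, seconds= line.split ()
    times[int (zoom)].append (float (seconds))

for zoom in sorted (times):
    ts= times[zoom]
    print (zoom, len (ts), min (ts), sum (ts)/len (ts), max (ts), sum (ts))

Here's a table and graph that summarize it all: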

The scale for all the lines is logarithmic, so all those linear progressions there are actually exponential. You can also see the cut at zoom level 14, and that I ended up rendering more or less as many tiles for ZLs 15 to 18 as for ZLs 9 to 13.

The first thing to notice is that the average curve is similar to, but considerably lower than, the one in my minimal benchmark. Most important, and something I didn't think of then, are the total time columns. I have 4 of them: 'total' is the total CPU time in seconds, then I have one for the same time in hours and another in days. As I actually have 4 CPUs and I run generate_tiles.py on all of them, I have an extra column that shows an approximate wall clock time. At the bottom row you can see totals for the times, tile count and space used.

Talking about space, there's another constraint there. See, I have a 64GiB Nokia N9, which is one of, if not the, intended targets for this map. Those 64GiB must be shared, mostly with music, so I don't get bored while driving (especially in the endemic traffic jams of this region). This means that having everything up to ZL 14 on my phone is unfeasible, but ZL 11 seems more reasonable. It also means I will have to make sure I have what I consider important info at least at that ZL, so it impacts the map design too.

So, 12.6 days and 84GiB later, I have my map. Cutting down ZLs 12 to 14 should save me some 8 days of rendering, most probably 7. I could also try some of the rendering optimizations I have found so far (in no particular order):

https://github.com/mapnik/mapnik/wiki/OptimizeRenderingWithPostGIS

https://github.com/mapnik/mapnik/wiki/Postgis-async

http://mike.teczno.com/notes/mapnik.html

http://blog.cartodb.com/post/20163722809/speeding-up-tiles-rendering

I was also thinking of doing depth-first rendering, so requests done for ZL N can be used to render ZL N+1 (or maybe the other way around). This seems like a non-obvious path, so I will probably only check it if I have lots of time, which I currently don't.


[1] Actually, the space used is calculated by du -sm *, so it's in MiB.


openstreetmap gis

Posted Thu 17 Apr 2014 19:38:32 CEST Tags: gis

First: one thing I didn't do in the previous post was to show the final tables and sizes. Here they are:

 Schema |        Name        | Type  | Owner  |  Size   | Description
--------+--------------------+-------+--------+---------+-------------
 public | geography_columns  | view  | mdione | 0 bytes |
 public | geometry_columns   | view  | mdione | 0 bytes |
 public | planet_osm_line    | table | mdione | 11 GB   |
 public | planet_osm_point   | table | mdione | 2181 MB |
 public | planet_osm_polygon | table | mdione | 23 GB   |
 public | planet_osm_roads   | table | mdione | 2129 MB |
 public | raster_columns     | view  | mdione | 0 bytes |
 public | raster_overviews   | view  | mdione | 0 bytes |
 public | spatial_ref_sys    | table | mdione | 3216 kB |

 Schema |            Name             | Type  | Owner  |       Table        |  Size   | Description
--------+-----------------------------+-------+--------+--------------------+---------+-------------
 public | planet_osm_line_index       | index | mdione | planet_osm_line    | 4027 MB |
 public | planet_osm_point_index      | index | mdione | planet_osm_point   | 1491 MB |
 public | planet_osm_point_population | index | mdione | planet_osm_point   | 566 MB  |
 public | planet_osm_polygon_index    | index | mdione | planet_osm_polygon | 8202 MB |
 public | planet_osm_roads_index      | index | mdione | planet_osm_roads   | 355 MB  |
 public | spatial_ref_sys_pkey        | index | mdione | spatial_ref_sys    | 144 kB  |

The first thing to notice is that neither the intermediate tables nor their indexes were created, and all the _pkey indexes are missing too.

What I did in my previous post was to say that I couldn't update because the intermediate tables were missing. That was actually my fault. I didn't read osm2pgsql's manpage carefully: it happens that the --drop option is not for dropping the tables before importing but for dropping the intermediate tables after import.

This means I had to reimport everything, and this time I made sure that I had the memory consumption log. But first, the final sizes:

 Schema |        Name        |   Type   | Owner  |    Size    | Description
--------+--------------------+----------+--------+------------+-------------
 public | contours           | table    | mdione | 21 GB      |
 public | contours_gid_seq   | sequence | mdione | 8192 bytes |
 public | geography_columns  | view     | mdione | 0 bytes    |
 public | geometry_columns   | view     | mdione | 0 bytes    |
 public | planet_osm_line    | table    | mdione | 11 GB      |
 public | planet_osm_nodes   | table    | mdione | 16 kB      |
 public | planet_osm_point   | table    | mdione | 2181 MB    |
 public | planet_osm_polygon | table    | mdione | 23 GB      |
 public | planet_osm_rels    | table    | mdione | 871 MB     |
 public | planet_osm_roads   | table    | mdione | 2129 MB    |
 public | planet_osm_ways    | table    | mdione | 42 GB      |
 public | raster_columns     | view     | mdione | 0 bytes    |
 public | raster_overviews   | view     | mdione | 0 bytes    |
 public | spatial_ref_sys    | table    | mdione | 3216 kB    |

 Schema |           Name           | Type  | Owner  |       Table        |  Size   | Description
--------+--------------------------+-------+--------+--------------------+---------+-------------
 public | contours_height          | index | mdione | contours           | 268 MB  |
 public | contours_pkey            | index | mdione | contours           | 268 MB  |
 public | contours_way_gist        | index | mdione | contours           | 1144 MB |
 public | planet_osm_line_index    | index | mdione | planet_osm_line    | 4022 MB |
 public | planet_osm_line_pkey     | index | mdione | planet_osm_line    | 748 MB  |
 public | planet_osm_nodes_pkey    | index | mdione | planet_osm_nodes   | 16 kB   |
 public | planet_osm_point_index   | index | mdione | planet_osm_point   | 1494 MB |
 public | planet_osm_point_pkey    | index | mdione | planet_osm_point   | 566 MB  |
 public | planet_osm_polygon_index | index | mdione | planet_osm_polygon | 8207 MB |
 public | planet_osm_polygon_pkey  | index | mdione | planet_osm_polygon | 1953 MB |
 public | planet_osm_rels_idx      | index | mdione | planet_osm_rels    | 16 kB   |
 public | planet_osm_rels_parts    | index | mdione | planet_osm_rels    | 671 MB  |
 public | planet_osm_rels_pkey     | index | mdione | planet_osm_rels    | 37 MB   |
 public | planet_osm_roads_index   | index | mdione | planet_osm_roads   | 358 MB  |
 public | planet_osm_roads_pkey    | index | mdione | planet_osm_roads   | 77 MB   |
 public | planet_osm_ways_idx      | index | mdione | planet_osm_ways    | 2161 MB |
 public | planet_osm_ways_nodes    | index | mdione | planet_osm_ways    | 52 GB   |
 public | planet_osm_ways_pkey     | index | mdione | planet_osm_ways    | 6922 MB |
 public | spatial_ref_sys_pkey     | index | mdione | spatial_ref_sys    | 144 kB  |

This time you'll probably notice a difference: there's this new contours table with a couple of indexes. This table contains the data I'll be using for drawing hypsometric lines (also known as contour lines) in my map. These 21GiB contain all the data from 0 to 4000+m in 50m increments for the whole of Europe and some parts of Africa and Asia, except for anything above 60°N, which means Iceland, most of Scandinavia and the North of Russia are out. At that size, I think it's a bargain.

As with jburgess' data, we have the intermediate data, and quite a lot of it. Besides the 21GiB extra for contours, we notably have 42+52+2+7GiB for ways. In practice this means that, besides some of my files, OSM+contour data uses almost all of the 220GiB of SSD space, so I'll just move all my stuff out of the SSD :( Another alternative would be to just reimport the whole data from time to time (once a month, or each time I update my rendering rules, which I plan to do based on openstreetmap-carto's releases, but not on each one of them).

During the import I logged the memory usage of the 10 most memory-hungry processes in the machine with this command:

( while true; do date -R; ps ax -o rss,vsize,pid,cmd | sort -rn | head; sleep 60; done ) | tee -a mem.log

Then I massaged that file with a little bit of Python and obtained a CSV file which I graphed with LibreOffice. I tried several formats and styles, but to make things readable I only graphed the sum of all the postgres processes and osm2pgsql.
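The massaging script was more or less this (reconstructed from memory; it relies on the log alternating a date -R line with the ps lines from the command above):

#! /usr/bin/python3
# sum RSS/VIRT per sample for the postgres processes and osm2pgsql,
# writing one CSV row per date line found in mem.log
import csv
import sys

writer= csv.writer (sys.stdout)
writer.writerow (('date', 'postgres_rss', 'postgres_virt',
                  'osm2pgsql_rss', 'osm2pgsql_virt'))

date= None
sums= [0, 0, 0, 0]
for line in open ('mem.log'):
    fields= line.split ()
    if not fields:
        continue
    if not fields[0].isdigit ():  # a date line; flush the previous sample
        if date is not None:
            writer.writerow ([date]+sums)
        date= line.strip ()
        sums= [0, 0, 0, 0]
    elif 'postgres' in line:
        sums[0]+= int (fields[0])
        sums[1]+= int (fields[1])
    elif 'osm2pgsql' in line:
        sums[2]+= int (fields[0])
        sums[3]+= int (fields[1])
if date is not None:
    writer.writerow ([date]+sums)

This is the final graph: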

Here you can see 4 lines, 2 for the sum of the postgres processes and 2 for osm2pgsql. The thick lines graph the RSS, which is the resident, real RAM usage of the processes. The corresponding thin lines show the VIRT size, which is the amount of memory malloc()'ed by the processes. As with any memory analysis under Linux, we have the problem that every process also reports the memory used by the libraries it loads, and if there are common libraries among them, these will be reported several times. Still, for the amounts of memory we're talking about here, we can say that's negligible against the memory used by the data.

In the graph we can clearly see the three phases of the import: first filling up the intermediate tables, then the real data tables themselves, then the indexing. The weird curve we can see in the middle phase for osm2pgsql could be due to unused memory being swapped out. Unluckily I didn't log the memory/swap usage to support this theory, so I'll take it into account for the next run, if there is one. In any case, the peak at the end of the second phase seems to also support the idea.

One thing that surprises me is the real amount of memory used by osm2pgsql. I told it to use 2GiB for cache, but at its peak it uses 3 times that amount, and all the time it has another 2GiB requested from the kernel. The middle phase is also hard on postgres, but it doesn't take that much during indexing; luckily, at that moment osm2pgsql has released everything, so most of the RAM is used as kernel cache.

13 paragraphs later, I finally write about the reason for this post: updating the database with daily diffs. As I already mentioned, the data as imported took almost all the available space, so I was very sensitive about the amount of space used by the diffs. But first, the sizes and times.

The file 362.osc.gz, provided by Geofabrik as the diff for Europe for Mar05, weighs almost 25MiB, but it's compressed XML inside. Luckily osm2pgsql can read these files directly. Here's the summary of the update:

$ osm2pgsql --append --database gis --slim --flat-nodes /home/mdione/src/projects/osm/nodes.cache --cache 2048 --number-processes 4 --unlogged --bbox -11.9531,34.6694,29.8828,58.8819 362.osc.gz
Node-cache: cache=2048MB, maxblocks=262145*8192, allocation method=11
Mid: loading persistent node cache from /home/mdione/src/projects/osm/nodes.cache
Maximum node in persistent node cache: 2701131775
Mid: pgsql, scale=100 cache=2048

Reading in file: 362.osc.gz
Processing: Node(882k 3.7k/s) Way(156k 0.65k/s) Relation(5252 25.50/s)  parse time: 688s [11m28]

Node stats: total(882823), max(2701909278) in 240s [4m00]
Way stats: total(156832), max(264525413) in 242s [4m02]
Relation stats: total(5252), max(3554649) in 206s [3m26]

Going over pending ways...
Maximum node in persistent node cache: 2701910015
        122396 ways are pending

Using 4 helper-processes
Process 3 finished processing 30599 ways in 305 sec [5m05]
Process 2 finished processing 30599 ways in 305 sec
Process 1 finished processing 30599 ways in 305 sec
Process 0 finished processing 30599 ways in 305 sec
122396 Pending ways took 307s at a rate of 398.68/s [5m07]

Going over pending relations...
Maximum node in persistent node cache: 2701910015
        9432 relations are pending

Using 4 helper-processes
Process 3 finished processing 2358 relations in 795 sec [13m15]
Process 0 finished processing 2358 relations in 795 sec
Process 1 finished processing 2358 relations in 795 sec
Process 2 finished processing 2358 relations in 810 sec [13m30]
9432 Pending relations took 810s at a rate of 11.64/s

node cache: stored: 675450(100.00%), storage efficiency: 61.42% (dense blocks: 494, sparse nodes: 296964), hit rate: 5.12%

Osm2pgsql took 1805s overall [30m05]

This time it's in the order of minutes instead of hours, but still, ~30m for only 25MiB seems a little bit too much. If I process the diff files daily, it would take ~15h a month, but spread in ~30m stretches each day. Also, that particular file was one of the smallest I have (between Mar03 and Mar17); most of the rest are above 30MiB, up to 38MiB each for Mar15 and Mar17. Given the space problems this causes, I might as well import before each rerender. Another thing to note is that the cache is quite useless here, the hit rate falling from ~20% to ~5%. I could try with smaller caches too. The processing speeds are way lower than at import time, but the small amount of data is what prevails here.

Sizes:

 Schema |        Name        |   Type   | Owner  |    Size    | Description
--------+--------------------+----------+--------+------------+-------------
 public | contours           | table    | mdione | 21 GB      |
 public | contours_gid_seq   | sequence | mdione | 8192 bytes |
 public | geography_columns  | view     | mdione | 0 bytes    |
 public | geometry_columns   | view     | mdione | 0 bytes    |
 public | planet_osm_line    | table    | mdione | 11 GB      |
 public | planet_osm_nodes   | table    | mdione | 16 kB      |
 public | planet_osm_point   | table    | mdione | 2184 MB    |
 public | planet_osm_polygon | table    | mdione | 23 GB      |
 public | planet_osm_rels    | table    | mdione | 892 MB     |
 public | planet_osm_roads   | table    | mdione | 2174 MB    |
 public | planet_osm_ways    | table    | mdione | 42 GB      |
 public | raster_columns     | view     | mdione | 0 bytes    |
 public | raster_overviews   | view     | mdione | 0 bytes    |
 public | spatial_ref_sys    | table    | mdione | 3224 kB    |

 Schema |           Name           | Type  | Owner  |       Table        |  Size   | Description
--------+--------------------------+-------+--------+--------------------+---------+-------------
 public | contours_height          | index | mdione | contours           | 268 MB  |
 public | contours_pkey            | index | mdione | contours           | 268 MB  |
 public | contours_way_gist        | index | mdione | contours           | 1144 MB |
 public | planet_osm_line_index    | index | mdione | planet_osm_line    | 4024 MB |
 public | planet_osm_line_pkey     | index | mdione | planet_osm_line    | 756 MB  |
 public | planet_osm_nodes_pkey    | index | mdione | planet_osm_nodes   | 16 kB   |
 public | planet_osm_point_index   | index | mdione | planet_osm_point   | 1494 MB |
 public | planet_osm_point_pkey    | index | mdione | planet_osm_point   | 566 MB  |
 public | planet_osm_polygon_index | index | mdione | planet_osm_polygon | 8210 MB |
 public | planet_osm_polygon_pkey  | index | mdione | planet_osm_polygon | 1955 MB |
 public | planet_osm_rels_idx      | index | mdione | planet_osm_rels    | 352 kB  |
 public | planet_osm_rels_parts    | index | mdione | planet_osm_rels    | 676 MB  |
 public | planet_osm_rels_pkey     | index | mdione | planet_osm_rels    | 38 MB   |
 public | planet_osm_roads_index   | index | mdione | planet_osm_roads   | 358 MB  |
 public | planet_osm_roads_pkey    | index | mdione | planet_osm_roads   | 78 MB   |
 public | planet_osm_ways_idx      | index | mdione | planet_osm_ways    | 2165 MB |
 public | planet_osm_ways_nodes    | index | mdione | planet_osm_ways    | 52 GB   |
 public | planet_osm_ways_pkey     | index | mdione | planet_osm_ways    | 6926 MB |
 public | spatial_ref_sys_pkey     | index | mdione | spatial_ref_sys    | 104 kB  |

3MiB more of points, 21+5+1MiB more of rels, 45+1MiB more of roads, 0+2+8MiB more of lines, 0+3MiB for polygons, 0+4+4MiB for ways. In total, some 97MiB more. I tried a VACUUM at the end, but no space was gained, and I don't have enough space for VACUUM FULL. As VACUUM does not defragment, a second and a third update should make use of the internal fragmentation. Let's see.

363.osc.gz is the smallest file I have, at ~22MiB. The times are internally different, but overall it looks proportional:

$ osm2pgsql --append --database gis --slim --flat-nodes /home/mdione/src/projects/osm/nodes.cache --cache 2048 --number-processes 4 --bbox -11.9531,34.6694,29.8828,58.8819 363.osc.gz
Maximum node in persistent node cache: 2701910015

Reading in file: 363.osc.gz
Processing: Node(750k 3.3k/s) Way(128k 0.44k/s) Relation(4264 15.73/s)  parse time: 792s

Node stats: total(750191), max(2703147051) in 230s
Way stats: total(128987), max(264655143) in 291s
Relation stats: total(4264), max(3556985) in 271s

Going over pending ways...
Maximum node in persistent node cache: 2703148031
        94490 ways are pending

Using 4 helper-processes
Process 0 finished processing 23623 ways in 238 sec
Process 2 finished processing 23622 ways in 238 sec
Process 1 finished processing 23623 ways in 238 sec
Process 3 finished processing 23622 ways in 239 sec
94490 Pending ways took 241s at a rate of 392.07/s

Going over pending relations...
Maximum node in persistent node cache: 2703148031
        8413 relations are pending

Using 4 helper-processes
Process 1 finished processing 2103 relations in 443 sec
Process 3 finished processing 2103 relations in 445 sec
Process 0 finished processing 2104 relations in 450 sec
Process 2 finished processing 2103 relations in 452 sec
8413 Pending relations took 453s at a rate of 18.57/s

node cache: stored: 576093(100.00%), storage efficiency: 60.50% (dense blocks: 437, sparse nodes: 252366), hit rate: 5.07%

Osm2pgsql took 1488s overall

The table sizes keep growing, as expected: OSM data does nothing but grow; my free space does nothing but shrink, currently at a mere 249MiB. Given that the intermediate tables are dropped at the end of the second import phase, it only makes sense to do full imports from time to time, right before updating the rendering rules. Minutely updates are not for me.


openstreetmap gis

Posted Fri 28 Mar 2014 18:32:27 CET Tags: gis