layout: true <div class="my-footer"> <span> <a href="https://www.github.com/MKyhos", style="color:white;"> <svg style="height:0.8em;top:.04em;position:relative;fill:white;" viewBox="0 0 496 512"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> https://www.github.com/MKyhos </a> </span> </div> --- class: middle background-image: url("http://www.johngordonart.com/wp-content/uploads/2015/02/Elephant-World-Drawing-v2-by-Artist-John-Gordon-1024x1024.jpg") background-position: bottom right background-size: 400px 400px # An Introduction to PostGIS .fancy2[...PostgreSQL going spatial] <p style="margin-top:4cm;"> CorrelCon 2020 <br> <a href="https://twitter.com/maltekyhos"> <svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @maltekyhos </a> </p> ??? Welcome Credits to Crunchy Data --- class: middle, inverse # Tools and Formats for Spatial Data * Getting an overview of tools and software * Do the file formats fit the requirements? --- The landscape (🥁) of spatial tools <img src="fig_spatial_software_hierarchy.svg" width="600" height="450" style="display: block; margin: auto;" /> ??? * Incomplete survey of tools for geo spatial data * we have heard a lot --- #### File format and infrastructure .pull-left[ Formats occuring in the wild: <img src="fig_spatial_data_types.svg" width="250" height="300" style="display: block; margin: auto;" /> ] .pull-right[ Possible drawbacks: what about * Versioning * Concurrent Access? * Reliability? * Scalability? * Compatibility? * Also in short, ACID? ] ??? ACID principles in information systems: * atomicity of transactions * consistency * isolation of users * durability --- class: middle Are we searching for a system that has the following assets? - Integration with various tools - Desktop applications - Server middleware - Web Applications - Programming languages - Functions and types - Speed! Scalability, security, reliability - Open Source Then we might turn to PostGIS! 🎉 ??? So the question is again, what do we desire? Which kind of system is needed? --- It even can easily be integrated in some generic infrastructure (Pousty, 2019): <img src="fig_example_network.svg" width="600" height="450" style="display: block; margin: auto;" /> ??? Lets suppose we have the following infrastructue: * Some Cluster with * Database instances * Rest Endpoints * Flat files, and server, * A firewall, * talking to data science team, and mobile app, We can simply add spatial capabilities by enabling postgis in the --- class: middle .fancy2[Take away:] PostGIS... - integrates well into existing, standard infrastructure - no need for specialized servers (e.g. GeoServer) - less training required (standard tools) - only additional team requirement: spatial specialists - open source! But also - might not be feasible for small one-off research projects - requires some knowledge in DB administration, and the command line - requires a server - technical overhead might be real! ??? So, some take aways so far: fit for purpose! --- class: middle, inverse # Turning to PostGIS --- **Differences, coming from PostgreSQL** | | Standard PostgreSQL | PostGIS | |-----------|---------------------------|---------------------------------| | Types | int, real, json, date | geometry, geography | | Indexes | B-Tree... | R-Tree, ... | | Functions | Avg(), Min(), json_each() | ST_Transform(), ST_Length(),... | **Differences, coming from R** * Setup and administration of some PG instance * Write SQL instead of R (but: functions have same ) * Unfortunately, `{dbplyr}` does not support PostGIS functions (yet). But `{sf}` works nice with PostGIS. We had a nice introduction to the **data model** already by Michael 🙏 * Point, Line, Polygon * MultiPoint, MultiLineString, MultiPolygon * GeometryCollection * MultiCurve * 3D Geometries: PolyHedralSurface Furthermore, raster grids and `\(S^2\)` geography types are supported! --- ## Tools <dl> <dt><b>Interfaces</b></dt> <dd><code>psql</code>: Commandline interface, check out <code>pgcli</code> as well!</dd> <dd><code>pgAdmin4</code>: GUI (open source) </dd> <dd><code>DataGrip</code>: GUI (commercial) </dd> <dd><code>DBeaver</code>: GUI (community version available) </dd> </dl> <dl> <dt><b>Data Import</b></dt> <dd><code>osm2pgsql</code>: import OpenStreetMap data</dd> <dd><code>shp2pgsql</code>: import Shapefiles</dd> <dd><code>ogr2ogr</code>: convert/import/export various types ( <code>geojson</code>, GeoPackage, ...)</dd> </dl> ??? For all tools, better documentation and tutorials are provided elsewhere. Just some examples for common tasks, so it is not sorcery at all. --- **Example:** Create extension: ```sql CREATE EXTENSION postgis; ``` **Example:** Create table with spatial column: ```sql CREATE TABLE census_grid ( grid_id_num int PRIMARY KEY, geom Geometry(Polygon, 25832) ); ``` **Example:** Import shapefile into running database: ```shell shp2pgsql \ -s 25833 \ Berlin_Ubahnhoefe_25833.shp public.berlin_metro_station | psql service=test-gis ``` --- class: middle, inverse # Demonstrative Examples Data, Berlin Scope: * Census 2011 Grid * OpenStreetMap Berlin * Berlin Subway stations & lines --- Retrieve information about geometry columns in a database: ```sql SELECT f_table_name, f_geometry_column, srid, type FROM geometry_columns; ``` <div class="knitsql-table"> <table> <caption>8 records</caption> <thead> <tr> <th style="text-align:left;"> f_table_name </th> <th style="text-align:left;"> f_geometry_column </th> <th style="text-align:right;"> srid </th> <th style="text-align:left;"> type </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> planet_osm_point </td> <td style="text-align:left;"> way </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> POINT </td> </tr> <tr> <td style="text-align:left;"> planet_osm_roads </td> <td style="text-align:left;"> way </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> LINESTRING </td> </tr> <tr> <td style="text-align:left;"> planet_osm_polygon </td> <td style="text-align:left;"> way </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> GEOMETRY </td> </tr> <tr> <td style="text-align:left;"> census_grid </td> <td style="text-align:left;"> geom </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> POLYGON </td> </tr> <tr> <td style="text-align:left;"> planet_osm_line </td> <td style="text-align:left;"> way </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> LINESTRING </td> </tr> <tr> <td style="text-align:left;"> berlin_metro_station </td> <td style="text-align:left;"> geom </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> MULTIPOINT </td> </tr> <tr> <td style="text-align:left;"> berlin_metro_lines </td> <td style="text-align:left;"> geom </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> MULTILINESTRING </td> </tr> <tr> <td style="text-align:left;"> clustered_bars </td> <td style="text-align:left;"> geom </td> <td style="text-align:right;"> 5243 </td> <td style="text-align:left;"> POINT </td> </tr> </tbody> </table> </div> --- ### Geographical Projections (`PROJ`) * We had a great introduction by Michael * Remember: use of appropriate projected reference system for distance calculation etc! * For web maps, usualy turn to ETRS 4326 (Web Mercator). * Transformation of coordinates: `ST_Transform(geom, SRID)` * Also, permanent update of spatial reference ID: ```sql ALTER TABLE planet_osm_polygon ALTER COLUMN way TYPE geometry(Geometry, 5243) USING ST_Transform(way, 5243); ``` --- ### Testing Relationships, Generate Geometries, etc. 1. Relationship (predicates) - `ST_Intersects()`, `ST_Contains()`, `ST_Touches()`, `ST_Overlaps()`, `ST_Within()`... 2. Geometry generation - `ST_Buffer()`, `ST_Union()`, `ST_Intersection()`, `ST_SymDifference()`, `ST_Difference()` 3. Many more! Often desired: a spatial join. `$$\text{SpatialJoin}(X, Y, \pi) = \{(o_x, o_y)~|~o_x \in X, o_y \in Y, \pi(o_x, o_y)\}$$` --- ### Example: Spatial Join Finding the subway station with most nearby restaurants in Berlin. ```sql SELECT m.nam as station_name, Count(*) AS nearby_restaurants FROM berlin_metro_station AS m JOIN planet_osm_point AS r ON ST_DWithin(m.geom, r.way, 250) WHERE r.amenity = 'restaurant' GROUP BY 1 ORDER BY 2 DESC LIMIT 5; ``` <div class="knitsql-table"> <table> <caption>5 records</caption> <thead> <tr> <th style="text-align:left;"> station_name </th> <th style="text-align:right;"> nearby_restaurants </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> U Oranienburger Tor </td> <td style="text-align:right;"> 110 </td> </tr> <tr> <td style="text-align:left;"> U Rosenthaler Platz </td> <td style="text-align:right;"> 91 </td> </tr> <tr> <td style="text-align:left;"> U Eberswalder Straße </td> <td style="text-align:right;"> 89 </td> </tr> <tr> <td style="text-align:left;"> U Stadtmitte </td> <td style="text-align:right;"> 50 </td> </tr> <tr> <td style="text-align:left;"> U Kurfürstendamm </td> <td style="text-align:right;"> 42 </td> </tr> </tbody> </table> </div> (Thanks to Paul Ramsey for pointing out a more efficient query!) ??? As said, with a spatial join one can integrate data based on geographic proximity, that are otherwise unrelated. --- ### Example: Advanced Spatial Join Per subway line: number of houses `\(\geq 13\)` appartments, maximal 50 meters away. ```sql SELECT l2.route, Sum(count) AS sum_houses_geq13_app FROM ( SELECT count, geom FROM census_dta AS d, census_grid AS g WHERE d.feature_id = 46 AND d.grid_id_num = g.grid_id_num ) AS census JOIN berlin_metro_lines AS l1 ON ST_DWithin(census.geom, l1.geom, 50) JOIN metro_lines_entity AS l2 ON Strpos(l1.nam, l2.route) > 0 GROUP BY 1 ORDER BY 2 DESC LIMIT 4; ``` <div class="knitsql-table"> <table> <caption>4 records</caption> <thead> <tr> <th style="text-align:left;"> route </th> <th style="text-align:right;"> sum_houses_geq13_app </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> U7 </td> <td style="text-align:right;"> 1161 </td> </tr> <tr> <td style="text-align:left;"> U2 </td> <td style="text-align:right;"> 866 </td> </tr> <tr> <td style="text-align:left;"> U6 </td> <td style="text-align:right;"> 690 </td> </tr> <tr> <td style="text-align:left;"> U9 </td> <td style="text-align:right;"> 501 </td> </tr> </tbody> </table> </div> --- ### Example: Point Clustering Eg, `\(k\)`-Means, or DBSCAN (Ester, Kriegel, Sander, Xu, and others, 1996). Let's find clusters of Berlin bars! 🍺 ```sql CREATE TABLE clustered_bars AS (SELECT name, ST_ClusterDBScan(way, eps:=500, minpoints:=5) OVER() AS c_id way AS geom FROM planet_osm_point WHERE amenity = 'bar'); ``` -- Further: create Voronoi polygons from from cluster centroids: ```sql WITH cluster_centroid AS ( SELECT c_id, ST_Centroid(ST_Collect(geom)) AS geom FROM clustered_bars GROUP BY c_id) SELECT ST_Transform((ST_Dump( ST_VoronoiPolygons(ST_Collect(geom))) ).geom, 4326) AS geom FROM cluster_centroid; ``` ---
--- class: middle, inverse # Performance * (Spatial) Indexing * Database Maintenance * Subdivide large * Spatial Data Clustering ??? In general, PostGIS can be very fast, compared to other software (e.g. ArcGIS), but it depends on the actual use case. Things to do: - Use Indexes (will be explained) - Maintain the database well, vacuuming etc. - Divide large polygons into smaller subpolygons - beneficial for indexes - beneficial for joins - spatial clustering of data in the physical storage. For all this, PostGIS provides tools. We will just have a look at Indexes, as they are very important. --- ## Indexing Structures .pull-left[ PostgreSQL natively comes with... - `\(B^+\)`-Tree - Generalized Inverted Index (`GIN`) - Generalized Inverted Search Tree (`GiST`) - Hash - Block Range Index (`BRIN`) - Space-partitioned GiST (`SP-GiST`) ] .pull-right[ In PostGIS, on top of `GiST`, we have - `\(R^+\)`-Tree: nodes contain maximum bounding rectangles/cubes (works till 10 dimensions). - Like other trees, with `\(N\)` stored data items the height is `\(\mathcal O (\log N)\)`. ] --- `\(R^+\)` Tree <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/R-tree.svg/896px-R-tree.svg.png" height="500" style="display: block; margin: auto;" /> .footnote[(Wikimedia Commons, 2010)] --- ```sql SELECT ST_Envelope(ST_Transform(way, 4326)) FROM planet_osm_line LIMIT 200; ```
--- ## Wait, there's (much) more! - `hstore`: - handling of OSM-like key-value datatype - one could also use `json`/`jsonb` instead - Procedural languages: - E.g. `PL/pgSQL`, `PL/Python`, `PL/R` - Adding conditional handling, plotting etc. to DB - `postgres_fdw`: - "Foreign data wrapper": access data across different PostgreSQL instances - `pg_routing`: Routing points --- class: middle, inverse, center # Questions? Thanks for tuning in! --- # Learning Ressources **Books:** * Obe, Regina and Leo Hsu (2015). *PostGIS in Action*. Manning. * Obe, Regina and Leo Hsu (2017). *PostgreSQL: Up & Running*. O'Reilly. * Mikiewicz, Mackiewicz, Nycz (2017). *Mastering PostGIS*. Packt. **Tutorials & Workshops:** * [Introduction to PostGIS](https://postgis.net/workshops/postgis-intro/) on [postgis.net](https://postgis.net) * [Interactive PostGIS Learning Portal](https://learn.crunchydata.com/postgis) by CrunchyData **Talks** (basically a super set of the slides you just saw) * Steve Pousty (2019) * Paul Ramsey (2019) The official documentation and everything mentioned in this talk can be found on [http://postgis.net/documentation/](http://postgis.net/documentation/) --- # References Ester, M., H. Kriegel, J. Sander, et al. (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise." In: _Kdd_. Vol. 96. 34. , pp. 226-231. Pousty, S. (2019). _Putting It All Together (presented at the STL PostGIS Day)_. https://youtu.be/RFhpxhXrdDg. Crunchy Data. Ramsey, P. (2019). _All Of PostGIS (presented at the STL PostGIS Day)_. https://youtu.be/g4DgAVCmiDE. Crunchy Data. Wikimedia Commons (2010). _R-Tree_. URL: [https://commons.wikimedia.org/wiki/File:R-tree.svg](https://commons.wikimedia.org/wiki/File:R-tree.svg). *** The elephant painting on the title page is due to John Gordon ([http://www.johngordonart.com](http://www.johngordonart.com)).