Introducing GeoPySpark, a Big Data Geospatial Library

By Jacob Bouffard

Elevator Pitch

GeoPySpark is a new Python library that allows users to analyze large amounts of geospatial data in a distributed environment. Backed by GeoTrellis, a Scala geospatial library, GeoPySpark achieves speed and scalability by taking advantage of its Scala backend while maintaining a Python interface.

Description

Introducing GeoPySpark, a Big Data Geospatial Library

About GeoPySpark

GeoPySpark is a Python library for processing large amounts of geospatial data in a distributed environment. A binding of GeoTrellis, a Scala geospatial library, GeoPySpark allows users to take advantage of the speed and scalability of Apache Spark while working with a Python interface.

About This Talk

This talk will introduce GeoPySpark and demonstrate its features through an example use case. The first part of the presentation will give a general overview of how GeoPySpark works and what it offers. In the second part, we will develop an example use case for GeoPySpark in a Jupyter notebook, backed by a cluster on Amazon's EMR.

This notebook will generate a friction surface, which will in turn enable a cost distance calculation between two travel waypoints. To visualize the intermediate steps and final result, the Jupyter notebook for this example will be a fork of Kitware's GeoNotebook, an application that displays layers on a map alongside the notebook.

Note: Due to the length of the talk and the amount of data being worked with, all of the following operations will be carried out beforehand.

When producing the friction layer, several variables need to be considered: elevation, land cover, hydrology, roads, and trails. All of these datasets cover the lower 48 states of the United States. See The Data section below for more information on the source and format of each dataset.

Once the data has been read in and formatted, the next phase will be to calculate the friction layer. The following steps will be performed in order to produce this layer:

  1. Calculate slope of the NED layer
  2. From the new slope layer, derive walking speeds using Tobler’s hiking function. This will become the base friction layer.
  3. The NLCD and NHD layers will have their values reclassified to corresponding values that represent how much friction there is when passing through a given cell.
  4. Roads and trails will be taken from an OSM ORC file and rasterized into a single layer.
  5. Local map algebra operations will be performed between the base friction layer and the reclassified NLCD and NHD layers. The resulting layer will contain adjusted Tobler values that account for elevation, land cover, and hydrology.
  6. The last step will be to perform a local max operation between the adjusted friction layer and the roads/trails layer. This will produce the final friction layer.
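To make step 2 concrete, the sketch below shows Tobler's hiking function in plain NumPy. This is an illustrative stand-alone calculation, not the GeoPySpark API; the function names here are my own, and in the talk the same math would be applied cell-wise to the slope layer.

```python
import numpy as np

def tobler_speed(slope):
    """Tobler's hiking function: walking speed in km/h given slope (rise/run).
    Peaks at 6 km/h on a slight (-5%) downhill grade."""
    return 6.0 * np.exp(-3.5 * np.abs(slope + 0.05))

def tobler_friction(slope):
    """Friction as the reciprocal of walking speed (hours per km):
    the slower the walking speed, the higher the cost of crossing a cell."""
    return 1.0 / tobler_speed(slope)

# On flat terrain the predicted speed is about 5.04 km/h;
# at the optimal -5% grade it is exactly 6 km/h.
print(tobler_speed(0.0), tobler_speed(-0.05))
```

Taking the reciprocal of the speed is what turns the hiking function into a base friction layer: each cell's value becomes the time cost of traversing it.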

Each step will have its data displayed on a map to better visualize the changes to the friction surface as these factors are taken into account.

The cost distance layer will be calculated from the final friction layer and the two points of interest. Once we have visualized the cost distance layer, the final step of this demo is to save the friction layer to a remote backend for future use; in this case, the layer will be saved to Amazon's S3 service.

The Data

Sources and formats of the data:

Notes


Having been involved with GeoPySpark since its inception, I have helped design and develop every aspect of the project. It is this intimate knowledge of the library that makes me a good candidate to speak about GeoPySpark.

Technical Requirements

In order to fully show GeoPySpark's abilities, I will need access to the internet during the presentation. If that cannot be provided, a smaller, scaled-down version of the demo will be given.