Hilmar Buchta

Hive 0.1x

Apache Hive doesn’t have support for geo spatial functions out of the box. However, it’s easy to extend the functionality by using external libraries, that can be called via user defined functions (UDFs). This post shows how to use the ESRI Hadoop libraries in Hive. I’m using the Hortonworks HDP sandbox here, so it’s easy to reproduce the examples from below.

 

Preparation

In order to use the libraries, they first need to be installed. I’m using the following script to download, build and deploy the libraries to the Hadoop file system. Please note that you may have to correct the build number of the jar files. You will also need maven to build the libraries, so you need to install this prior to running the following commands:

# create working directories
cd /tmp
mkdir gis
cd gis
mkdir framework
mkdir api

# download and build framework
cd framework
wget https://github.com/Esri/spatial-framework-for-hadoop/archive/master.zip
unzip master
cd spatial-framework-for-hadoop-master
mvn clean package
hadoop fs -put hive/target/spatial-sdk-hive-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put json/target/spatial-sdk-json-1.0.3-SNAPSHOT.jar /user/esri
hadoop fs -put hive/function-ddl.sql /user/esri

# download and build api
cd /tmp/gis/api
wget https://github.com/Esri/geometry-api-java/archive/master.zip
unzip master
cd geometry-api-java-master
mvn clean package
hadoop fs -put target/esri-geometry-api-1.2.jar /user/esri

# clean up
cd /tmp
rm -rf gis

If successful you should see the following files in your Hadoop file system, directory /user/esri:

image

 

Making the ESRI functions available in Hive

After building the ESRI libraries, we still need to declare the new user defined functions within Hive. First we start with the jar-files:

add jar hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/spatial-sdk-json-1.0.3-SNAPSHOT.jar;
add jar hdfs:///user/esri/esri-geometry-api-1.2.jar;

Running a “list jars” in Hive shows the libraries being loaded successfully:

image

Please note that you can add the add jar commands to the .hiverc file. You can find more details on that file in this blog post.

Next we do have to declare the spatial functions. You will find the full declaration for all functions in the function-ddl.sql file that I also placed in hdfs under /user/esri (as shown in the screenshot from above) using the build script. However, for this post, I’m only using a few functions with the following declarations:

create temporary function ST_Point as ‚com.esri.hadoop.hive.ST_Point‘;
create temporary function ST_LineString as ‚com.esri.hadoop.hive.ST_LineString‘;
create temporary function ST_Length as ‚com.esri.hadoop.hive.ST_Length‘;
create temporary function ST_GeodesicLengthWGS84 as ‚com.esri.hadoop.hive.ST_GeodesicLengthWGS84‘;
create temporary function ST_SetSRID as ‚com.esri.hadoop.hive.ST_SetSRID‘;
create temporary function ST_Polygon as ‚com.esri.hadoop.hive.ST_Polygon‘;
create temporary function ST_Intersects as ‚com.esri.hadoop.hive.ST_Intersects‘;

create temporary function ST_Polygon as ‚com.esri.hadoop.hive.ST_Polygon‘;
create temporary function ST_Intersects as ‚com.esri.hadoop.hive.ST_Intersects‘;

Starting with Hive 0.13 you can also declare these functions permanently by omitting the “temporary” keyword from the declarations from above. This has the advantage that you do not need to declare the functions for every session. You can also include the jar file in the create function statement which makes it easier to create a permanent declaration.

For example, for the definition of the ST_Point function you would write the following SQL statement:

create function ST_Point as ‚com.esri.hadoop.hive.ST_Point‘ using jar ‚hdfs:///user/esri/spatial-sdk-hive-1.0.3-SNAPSHOT.jar‘;

Using the ESRI functions in Hive

In order to demonstrate the ESRI geo spatial functions I’m using the Uber data set with 25,000 Uber routes from San Francisco.

The first rows of the data are looking like this:

image

Each trip has a id (1 to 25,000) with the latitudes and longitudes at each time. For example, using window functions we can easily calculate the total time for each trip (by using first_value to get the starting time and restricting the result set to the last entry per trip):

— get trip duration
select id, unix_timestamp(dt)-unix_timestamp(dt_first) trip_duration
from (
select
id,dt,
first_value(dt) over (partition by id order by dt) dt_first,
row_number() over (partition by id order by dt desc) lnr
from uber
) Sub
Where lnr=1;

image

 

Ok, so far we did not use any of the geo spatial functions we’ve declared above. Let’s say we’d like to calculate the distance and average speed for each trip. The idea is to calculate the distance between two GPS points along the trip. Therefore I’m using window functions again to retrieve the preceding GPS coordinates along the way as shown below:

select
id,
dt,
latitude,
longitude,
lag(latitude,1) over (partition by id order by dt) prev_latitude,
lag(longitude,1) over (partition by id order by dt) prev_longitude
from uber;

Looking at a very short trip shows how we now have the preceding coordinates in the same row.

image

In order to calculate the distance between the coordinates, we first create a line between the two coordinates using the following query (the blue part is exactly the query from above):

Select
id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from
(select id, dt, latitude, longitude,
lag(latitude,1) over (partition by id order by dt) prev_latitude,
lag(longitude,1) over (partition by id order by dt) prev_longitude
from uber
) Sub;

image

The line is not visualized in the query editor, so we only see its binary representation. The next step is a little bit tricky because we need to calculate the length of the line. Here, the function ST_GeodesicLengthWGS84 can be used as shown below (again the blue part is the previous query). This functions returns the distance in meters on a spheriod, however the correct SRID has to be set before:

select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance
from
(
Select id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
) Sub1

image

The only thing left is to aggregate distance and time difference. Since the time difference is in seconds and the distance is in meters, we can calculate speed in km/h as distance/1000 * (3600 / time_difference). This results in the following query (again only shown for trip no 44):

select id, min(dt) start_time, max(dt) end_time, Sum(Distance) distance, Sum(Distance)/1000 * (3600.0/Sum(time_passed)) avg_speed
from
(select id, dt, time_passed, ST_GeodesicLengthWGS84(ST_SetSRID(L,4326)) Distance from
(Select id,dt,
unix_timestamp(dt)-unix_timestamp(lag(dt,1) over (partition by id order by dt)) time_passed,
latitude, longitude,
ST_LineString(prev_longitude, prev_latitude, longitude, latitude) L
from (select id, dt, latitude, longitude, lag(latitude,1) over (partition by id order by dt) prev_latitude, lag(longitude,1) over (partition by id order by dt) prev_longitude from uber) Sub
) Sub1
) Sub2
group by id;

image

 

Ok, now this looks rather like a rather short trip, however it was still good to see the idea. Here are the waypoints from this short trip (PowerView):

image

 

Now let’s assume we’re interested in all the trips that go the the airport. In order to run this query, we need to define the area for the airport. I’m using Google Earth with this purpose here and simple draw a polygon for my target area:

image

You can then export the polygon to a .kml file. Since I have a very simple polygon here, I transferred the coordinates from the .kml-file manually into my geo spatial polygon type:

kml file ESRI polygon

<LinearRing>
<coordinates>
-122.3939778660149,37.61286698251174,0
-122.387217599775,37.61405599476924,0
-122.3886445577052,37.61737414129696,0
-122.3948268153237,37.61592559803017,0
-122.3939778660149,37.61286698251174,0
</coordinates>
</LinearRing>

ST_Polygon(
-122.3939778660149,37.61286698251174,
-122.387217599775,37.61405599476924,
-122.3886445577052,37.61737414129696,
-122.3948268153237,37.61592559803017,
-122.3939778660149,37.61286698251174

)

In order to find all trips that end at the airport we just need to get the end-point for each trip and use ST_Intersects to see if this endpoints is within the polygon. Here is the resulting query: 

select count(*) CntTripsToAirport
from (
select id, start_coords, ST_Point(longitude, latitude) end_coords
from (
   select
     id,dt,longitude, latitude,
     first_value(ST_Point(longitude, latitude)) over (partition by id order by dt) start_coords,
     row_number() over (partition by id order by dt desc) lnr 
  from uber 
  ) Sub
Where lnr=1
) trips
where
ST_Intersects(
  ST_Polygon(
   -122.392291,37.616543,
   -122.389115,37.616458,
   -122.389115,37.613552,
   -122.392119,37.613297,
   -122.392291,37.616543
)
, trips.end_coords);

This results in 26 trips:

image

Of course you can do a lot more using this data set and the geo spatial functions in Hive, so this post should just give you an idea. Also, you can use visualization tools like PowerView. For example, the following map shows the locations where most of the trips started (red means a lot of trips starting here):

image