Hilmar Buchta

SQL Server 2005-2016

Last year I wrote about importing data from OpenStreetMap using PowerQuery. You can find the blog post here. In that post I loaded a relatively small area as an example. The reason was that the XML DOM parser in PowerQuery loads the full XML document into memory before you can access it. If you need to process larger areas, the DOM parser approach won’t work. For example, in a recent project I had to load all addresses in Berlin. I took the OSM file from geofabrik.de. The file is bzip2-compressed down to 68 MB. Once downloaded, it expands to about 940 MB (yes, XML is very talkative and compresses very well…). At first, I tried to load the file using PowerQuery and the script from my blog post. But since the DOM parser creates a memory-consuming object model, the load failed after about 20 minutes, at roughly 30% progress, with an out-of-memory error (having used 25 GB).

So, if you need to load a large XML file, you’re well advised to use a different parsing approach. This blog post is about using a lightweight XML parser to read the file from above. For example, a SAX parser reads the file once from beginning to end (forward only), firing events as it goes. In C#, the XmlReader follows a similar approach, although it is a pull parser: your code asks for the next node instead of receiving events. Neither parser lets you query the XML file randomly (for example with XPath), insert elements, or move back and forth. What they do offer is a very efficient way to read the file.
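To make the forward-only model concrete, here is a minimal sketch that is not part of the SSIS package. It walks a document with XmlReader in a single pass and counts start tags by name. The class and method names are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class ForwardOnlyDemo
{
    // Counts start tags per element name in a single forward pass.
    // The reader never builds an object model of the whole document.
    public static Dictionary<string, int> CountElements(TextReader input)
    {
        var counts = new Dictionary<string, int>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                // Self-closing elements such as <tag .../> are reported
                // as Element nodes too, so they are counted here as well.
                if (reader.NodeType != XmlNodeType.Element) continue;

                int n;
                counts.TryGetValue(reader.Name, out n); // n stays 0 if the name is new
                counts[reader.Name] = n + 1;
            }
        }
        return counts;
    }
}
```

Calling ForwardOnlyDemo.CountElements(new StringReader(xml)) on a small test string is enough to try it out. Because only counters are kept, memory use should stay roughly constant no matter how large the input is.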

Let’s have a look at a typical node element from the OpenStreetMap OSM file:


<node id="256922190" visible="true" version="7" changeset="29687333" timestamp="2015-03-23T20:13:41Z" user="atpl_pilot"
uid="881429" lat="52.5379749" lon="13.2888659">
  <tag k="addr:city" v="Berlin"/>
  <tag k="addr:country" v="DE"/>
  <tag k="addr:housenumber" v="24"/>
  <tag k="addr:postcode" v="13627"/>
  <tag k="addr:street" v="Halemweg"/>
  <tag k="addr:suburb" v="Charlottenburg-Nord"/>
  <tag k="amenity" v="library"/>
  <tag k="layer" v="1"/>
  <tag k="name" v="Stadtteilbibliothek Halemweg"/>
  <tag k="ref:isil" v="none"/>
</node>

You can clearly see the geo coordinates (latitude and longitude) as well as the address (in key/value pairs below the node). I wasn’t interested in points of interest (POIs) but you can also see that the amenity key contains information about the point of interest. In this case, we have a library.

Since PowerQuery uses the DOM parser, and because I wanted the import process to run on a schedule, I used Integration Services (SSIS) to load the file. First, I created a database table like this:

CREATE TABLE [dbo].[OSMAddress](
    [latitude] [real] NULL,
    [longitude] [real] NULL,
    [street] [nvarchar](255) NULL,
    [housenumber] [nvarchar](20) NULL,
    [postcode] [nvarchar](20) NULL,
    [city] [nvarchar](255) NULL,
    [country] [nvarchar](2) NULL
)

Next, I used a very simple data flow to populate the table:

[Image: the SSIS data flow]

The main logic is contained in the script component. This is the code for the CreateNewOutputRows method in the script component (please note that this code has no error handling, for simplicity, and that it needs using System.Xml; at the top of the script):

public override void CreateNewOutputRows()
{
    float latitude = -1;
    float longitude = -1;

    String city = null;
    String country = null;
    String street = null;
    String housenumber = null;
    String postcode = null;

    using (XmlReader reader = XmlReader.Create(Variables.OpenStreetmapFile))
    {               
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element: 
                    if (reader.Name.Equals("node"))
                    {
                        if (reader.HasAttributes)
                        {
                            String lt = reader.GetAttribute("lat");
                            String lg = reader.GetAttribute("lon");

                            if (lt != null && lg != null)
                            {
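                                // OSM always writes '.' as the decimal separator, so parse independently of the regional settings
                                if (!(float.TryParse(lt, System.Globalization.NumberStyles.Float, System.Globalization.CultureInfo.InvariantCulture, out latitude)
                                      && float.TryParse(lg, System.Globalization.NumberStyles.Float, System.Globalization.CultureInfo.InvariantCulture, out longitude)))
                                    latitude=longitude=-1;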
                            }                                       
                        }
                    }
                    else if (reader.Name.Equals("tag"))
                    {
                        if (latitude > -1 && longitude > -1)
                        {
                            String k = reader.GetAttribute("k");
                            String v = reader.GetAttribute("v");
                            if (k!=null && v!=null) {
                                switch (k)
                                {
                                    case "addr:city":        city = v; break;
                                    case "addr:country":     country = v; break;
                                    case "addr:housenumber": housenumber = v; break;
                                    case "addr:postcode":    postcode = v; break;
                                    case "addr:street":     street =v; break;                                                   
                                }
                            }
                        }
                    }

                    break;
                
                case XmlNodeType.EndElement:
                    if (reader.Name.Equals("node"))
                    {
                        if (latitude > -1 && longitude > -1 && street != null && city!=null && housenumber!=null)
                        {
                            Output0Buffer.AddRow();
                            Output0Buffer.city = city.Substring(0, Math.Min(city.Length,255));
                            Output0Buffer.country = (country==null)?"":country.Substring(0, Math.Min(country.Length,2));
                            Output0Buffer.housenumber = housenumber.Substring(0, Math.Min(housenumber.Length,20));
                            Output0Buffer.latitude = latitude;
                            Output0Buffer.longitude = longitude;
                            Output0Buffer.postcode = (postcode==null)?"":postcode.Substring(0, Math.Min(postcode.Length,20));
                            Output0Buffer.street = street.Substring(0, Math.Min(street.Length,255));
                        }
                        latitude = longitude = -1;
                        street = postcode = housenumber = country = city = null;
                    }
                    break;
            }
        }
    }
}
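Two details of the coordinate handling are worth a closer look if you reuse the script for other regions. First, float.TryParse without an explicit culture uses the current regional settings, while OSM files always use a dot as the decimal separator. Under settings where the dot is a group separator (German, for example), a value may be misread. Second, the -1 sentinel only works for coordinates greater than -1, which holds for Berlin but not for the southern or western hemisphere. The following is a minimal sketch of a helper that avoids both. The class and method names are hypothetical and not part of the original package.

```csharp
using System.Globalization;

public static class OsmCoordinates
{
    // Parses OSM lat/lon attribute values independently of the machine's
    // regional settings and rejects values outside the valid ranges.
    // Returns false (with both outputs set to 0) if a value is missing or invalid.
    public static bool TryParse(string lat, string lon, out float latitude, out float longitude)
    {
        latitude = 0f;
        longitude = 0f;
        if (lat == null || lon == null) return false;

        float la, lo;
        if (!float.TryParse(lat, NumberStyles.Float, CultureInfo.InvariantCulture, out la)) return false;
        if (!float.TryParse(lon, NumberStyles.Float, CultureInfo.InvariantCulture, out lo)) return false;

        // "NaN" parses successfully under NumberStyles.Float, so exclude it explicitly.
        if (float.IsNaN(la) || float.IsNaN(lo)) return false;
        if (la < -90f || la > 90f || lo < -180f || lo > 180f) return false;

        latitude = la;
        longitude = lo;
        return true;
    }
}
```

With a helper like this, the script could keep a bool flag for "current node has valid coordinates" instead of comparing against -1.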

The package took about 10 seconds to extract all of the addresses from the OSM file into the database table, a quite impressive result compared to the 20 unsuccessful minutes from above. This clearly shows the advantage of streaming XML parsers like SAX or XmlReader when it comes to reading larger files. If you go for larger areas, it’s better to stream directly from the bzip2-compressed file instead of decompressing it first. For example, the complete OSM file for Germany is about 4 GB in size (bzip2-compressed) and expands to a single 48 GB XML file. I used SharpZipLib to decompress the file on the fly, which saves a lot of disk space and I/O. Using this approach, I created the following visualization showing the concentration of fuel stations in Germany:

[Image: concentration of fuel stations in Germany]
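Decompressing on the fly comes down to stacking streams: the XmlReader reads from a decompression stream, which in turn reads from the file. The sketch below shows the pattern under one loud assumption: it uses GZipStream from the standard library as a stand-in, so that it runs without an extra package. With SharpZipLib, the wrapper would instead be a BZip2InputStream (namespace ICSharpCode.SharpZipLib.BZip2) around the file stream. The class and method names are made up for illustration.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class CompressedXmlDemo
{
    // Counts <node> start tags while decompressing on the fly.
    // GZipStream is only a stand-in here; for a .osm.bz2 file the wrapper
    // would be SharpZipLib's BZip2InputStream around the same input stream.
    public static long CountNodes(Stream compressed)
    {
        long count = 0;
        using (var plain = new GZipStream(compressed, CompressionMode.Decompress))
        using (XmlReader reader = XmlReader.Create(plain))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "node")
                    count++;
            }
        }
        return count;
    }

    // Helper for trying the sketch without a file: gzips a string in memory.
    public static MemoryStream Compress(string xml)
    {
        var buffer = new MemoryStream();
        // leaveOpen: true keeps the MemoryStream usable after the gzip stream is closed.
        using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
        using (var writer = new StreamWriter(gzip))
        {
            writer.Write(xml);
        }
        buffer.Position = 0;
        return buffer;
    }
}
```

The decompressed XML never touches the disk; the reader pulls bytes through the decompression stream as it needs them, which is what saves the disk space and I/O.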

Of course, you could retrieve much more information from the OSM file than I did here. For example, you can read borders (city, state etc.), points of interest (restaurants, airports etc.), and sometimes even the location of trees. The file format is described here: http://wiki.openstreetmap.org/wiki/OSM_XML.