March 31, 2011

Update 2 on Neo4J Spatial and British Isles OSM Data import.

Well even 6GB of heap space isn't enough to fully import and re-index the British Isles Open Street Map Data set. I'm thinking that there will have to be some work on memory management somewhere. First thing I'm going to need to do is profile the import and work out where the memory is being used (I'm assuming it isn't a memory leak per se). Basically I want to determine whether it is because of a misconfiguration by me or there is a need to look at the memory usage in the OSMImporter class.

I'm also thinking that for reverse geo and local searches I really don't need the full OSM data set so a customised version of the OSM importer would be a good idea. I really don't need ways or the node data associated with them. I suspect a multi-pass approach would be a good idea. The first pass through the OSM data set determines which nodes are needed for the features that I want to import, the second pass imports the required nodes and their associated features, followed by a final re-indexing.

I've been in discussions with one of the people behind Neo4J Spatial, Craig Taverner, on the Neo4J User List. I found an issue with the bounding box used in the spatial search index visitor pattern. It's been quite illuminating to dig into this. I'd been thinking for a while that a number of the techniques that I encountered working with 3D graphics and scene graphs would be relevant and it definitely seems to be the case. I'm beginning to think that OpenGL or OpenCL backed indices could prove very useful in high throughput micro-batched situations where the bus latency can be hidden.

March 24, 2011

Update 1 on Neo4J Spatial and British Isles OSM Data import.

Successes and failures...

I managed to shepherd the import all the way through to import completion with 4096 MB of memory but unfortunately fell short when it came to the re-indexing. As I commented on my post I upped the memory to 6144 MB and restarted and it has again successfully imported and has been re-indexing overnight without falling over. I'm waiting to see how this goes.

During the past import I did a little more characterisation of the system utilisation and came up with the following: As reported before the node import process is single threaded and does not saturate my disk IO, what is more interesting is the way import which is heavily reliant on the already imported nodes. The way import saturated neither IO nor a single core on my CPU (again it appeared largely single threaded), I'm not sure what the problem is, it could be that the limit is the disk seek time / latency as the ways are composed of nodes already imported. I'm guessing that there would be a need to look at how the nodes are stored and whether any useful caching can be done to accelerate the way import.

The 'relation' import phase was too short for me to characterise.

The indexing phase does appear to be multi-threaded during a brief glance this morning before I left. using 2 cores quite consistently (4 cores are available).

Aside from the importing I've started playing with the API, using the Buckinhamshire data set that I imported on my laptop. So far not too impressed by performance, I used the code from some of the Neo4J spatial unit tests to do a way (highway) search from a given point. I chose coordinate in the middle of a lake (51.808721,-0.689735) and gradually expanded the search from 10m, 100m, 1000m, 10000m. quite slow but at least not an exponential increase in time - 14.4s, 85.8s, 88.7s, 86.1s.

Interesting to note that the OSM parts of the Neo4J Spatial codebase so far does not seem to have any convenience methods for area and boundary searches, only for way searches.

Here is my current (very rough) code below:
package com.presynt.neo4j;

import static org.junit.Assert.*;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import org.geotools.referencing.CRS;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Ignore;
import org.junit.Test;
import org.neo4j.gis.spatial.Layer;
import org.neo4j.gis.spatial.SpatialDatabaseService;
import org.neo4j.gis.spatial.SpatialTopologyUtils;
import org.neo4j.gis.spatial.osm.OSMImporter;
import org.neo4j.gis.spatial.osm.OSMLayer;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.kernel.EmbeddedGraphDatabase;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import javax.xml.stream.XMLStreamException;
import java.io.IOException;


public class TestNeo4JSpatial {

private static GraphDatabaseService graphDB;
private static SpatialDatabaseService spatialDB;

@Test
@Ignore
public void createSpatialDB() throws XMLStreamException, IOException {
OSMImporter importer = new OSMImporter("OSM-BUCKS");
BatchInserterImpl inserter = new BatchInserterImpl("/testDB");
importer.importFile(inserter ,"/OSMData/buckinghamshire.osm");//"british_isles.osm"); // "C:\\british_isles.osm");
inserter.getGraphDbService().shutdown();
graphDB = new EmbeddedGraphDatabase("/testDB");
importer.reIndex(graphDB, 100000);
graphDB.shutdown();
}

@Test
public void retrieveLayer(){
final Layer layer =
spatialDB.getLayer("OSM-BUCKS");
assertNotNull(layer);
assertEquals(OSMLayer.class, layer.getClass());
}

@Test
public void useLayer() {
final OSMLayer osmLayer = (OSMLayer)spatialDB.getLayer("OSM-BUCKS");
final GeometryFactory factory = osmLayer.getGeometryFactory();
System.out.println("Unit of measure: " + CRS.getEllipsoid(osmLayer.getCoordinateReferenceSystem()).getAxisUnit().toString());

final Point point = factory.createPoint(new Coordinate(51.808721,-0.689735));

// final Layer boundaryLayer = osmLayer.addSimpleDynamicLayer("boundary", "administrative");
// SearchContain searchContain = new SearchContain(point);
// boundaryLayer.getIndex().executeSearch(searchContain);
// for(SpatialDatabaseRecord record: searchContain.getResults()){
// System.out.println("Container:" + record);
// }

final Layer highwayLayer = osmLayer.addSimpleDynamicLayer("highway", null);
long startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 10.0);
System.out.println("10m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");
startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 10.0);
System.out.println("10m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");

startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 100.0);
System.out.println("100m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");
startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 100.0);
System.out.println("100m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");

startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 1000.0);
System.out.println("1000m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");
startTime = System.currentTimeMillis();
startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 1000.0);
System.out.println("1000m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");

SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 10000.0);
System.out.println("10000m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");
startTime = System.currentTimeMillis();
SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 10000.0);
System.out.println("10000m took:" + ((System.currentTimeMillis() - startTime) / 1000d) +"s");
startTime = System.currentTimeMillis();
// for(SpatialTopologyUtils.PointResult result : SpatialTopologyUtils.findClosestEdges(point, highwayLayer, 10000.0)){
// System.out.println(result);
// }
}

@BeforeClass
public static void initialiseDatabase() {
graphDB = new EmbeddedGraphDatabase("target/test-classes/spatialTestDB");
spatialDB = new SpatialDatabaseService(graphDB);
}

@AfterClass
public static void shutdownDatabase() {
graphDB.shutdown();
spatialDB = null;
graphDB = null;
}
}

March 21, 2011

Neo4J Spatial and Importing the British Isles Open Street Map Data

I know, I know. Overly ambitious and all that.

While we've been working on Presynt we've been having fun and games with Geo Data; both for Local Search and Reverse Geo lookups. We've also got a number of ambitious new Geo features planned for future versions that needs needs some form of Geo store.

So with that in mind I started playing with Neo4J Spatial and the Open Street Map data set.

It's been an interesting experience with some gotchas as Neo4J Spatial is still not fully ready for prime time, so I thought I would share my experience with those who are interested.

First off I thought I would try to import the full global data set from Open Street Map. Not necessarily a mistake, but a huge undertaking as we are talking about over 200GB of map data in raw OSM format, billions of data points and relationships.

The first gotcha is that if you wish to import the full OpenStreetMap data set I'd recommend starting with at least Neo4j version 1.3 Milestone 4 as that version was the first one to hugely expand the number of nodes supported from 4 billion to 128 billion.

Of course attempting this type of import I also started running into memory problems. I ended up running with -Xmx4096m and -XX:+UseConcMarkSweepGC. The concurrent mark and sweep garbage collector proved most resilient to the loads placed on it. The default parallel GC would fall over and cry quite regularly due to an issue with consuming too much CPU to too little effect.

In the end I discovered that the full global import would take too long to run and set my sights a little lower on just the British Isles data set which is all that Presynt needs right now.

In order to test my set up I decided to use the Buckinghamshire data set which is the smallest county data set in the UK and this highlighted another gotcha: Not all the dependencies required are actually included in the Neo4J POMs.

In the end I used the following dependencies:
  • org.neo4j:neo4j:1.3.M04
  • org.neo4j:neo4j-spatial:0.5-SNAPSHOT
  • org.geotools:gt-referencing:2.6.5
  • org.geotools:gt-main:2.6.5
  • org.geotools:gt-cql:2.6.5
  • org.geotools:gt-epsg-hsql:2.6.5
  • com.vividsolutions:jts:1.11
They may not all be necessary but they certainly work...

Once I had resolved these issues then I was able to start the full British Isles import. I was running it on a quad-core machine and found that the OSMImporter/BatchInserter combination (all default configuration mind) was using just one CPU. Looking at the OSMImporter codebase it became obvious that it was written to stream read the OSM xml data and synchronously write it out to Neo4J. I think that there may be some room to parallelise it as large amounts of the GPS node data is utterly independent and the XML data set is structured in such a way that it could be broken down into independent units of work. The only points where serialised behaviour are required are during the parsing of the XML, the construction of the Way and Relation data and possibly the writing to Neo4J.

The process seems to be largely CPU bound as I ran it on SATA II RAID 0 SSDs and their read and write capabilities were barely touched while the single CPU in use was pegged at 100% almost continuously.

The import of the British Isles dataset seems to take about 48 hours.

I used the vanilla OSMImporter and BatchInserter configurations as I am still a Neo4J Neophyte (bad pun intended ;-)) I will be reading up on alternative configurations to see if I can improve the performance. I will also look at the import of changesets so that I shouldn't need to do a full inport again.