I know, I know. Overly ambitious and all that.
While we've been working on
Presynt we've been having fun and games with Geo Data, both for Local Search and Reverse Geo lookups. We've also got a number of ambitious new Geo features planned for future versions that need some form of Geo store.
It's been an interesting experience with some gotchas, as Neo4j Spatial is still not fully ready for prime time, so I thought I would share my experience with those who are interested.
First off I thought I would try to import the full global data set from OpenStreetMap. Not necessarily a mistake, but a huge undertaking: we are talking about over 200GB of map data in raw OSM format and billions of data points and relationships.
The first gotcha is that if you wish to import the full OpenStreetMap data set I'd recommend starting with at least Neo4j 1.3 Milestone 4, as that was the first version to raise the supported node count from 4 billion to 128 billion.
Of course, attempting this type of import, I also ran into memory problems. I ended up running with -Xmx4096m and -XX:+UseConcMarkSweepGC. The concurrent mark-and-sweep garbage collector proved most resilient to the loads placed on it; the default parallel GC would fall over quite regularly, consuming too much CPU for too little effect.
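For reference, the launch line looked something like this (the jar name and main class are placeholders for whatever drives your own import, not anything from the Neo4j distribution):

```shell
# Hypothetical launch line -- osm-import.jar and com.example.OsmImport are placeholders
java -Xmx4096m -XX:+UseConcMarkSweepGC -cp osm-import.jar com.example.OsmImport british-isles.osm
```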
In the end I discovered that the full global import would take too long to run, so I set my sights a little lower: just the British Isles data set, which is all that Presynt needs right now.
In order to test my setup I decided to use the Buckinghamshire data set, the smallest county data set in the UK, and this highlighted another gotcha: not all the dependencies required are actually included in the Neo4j POMs.
In the end I used the following dependencies:
- org.neo4j:neo4j:1.3.M04
- org.neo4j:neo4j-spatial:0.5-SNAPSHOT
- org.geotools:gt-referencing:2.6.5
- org.geotools:gt-main:2.6.5
- org.geotools:gt-cql:2.6.5
- org.geotools:gt-epsg-hsql:2.6.5
- com.vividsolutions:jts:1.11
They may not all be necessary but they certainly work...
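With those on the classpath, the import itself roughly follows the two-phase shape shown in the neo4j-spatial examples of the time: a bulk load via the batch inserter, then a re-open and re-index. Method signatures varied between snapshots, so treat this as a sketch rather than gospel; the layer name, directory, and file name here are just examples.

```java
import java.util.HashMap;

import org.neo4j.gis.spatial.osm.OSMImporter;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.kernel.EmbeddedGraphDatabase;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public class BritishIslesImport {
    public static void main(String[] args) throws Exception {
        String dir = "target/osm-db";
        OSMImporter importer = new OSMImporter("british-isles");

        // Phase 1: stream the raw OSM XML in through the batch inserter
        BatchInserter inserter = new BatchInserterImpl(dir, new HashMap<String, String>());
        importer.importFile(inserter, "british-isles.osm", false);
        inserter.shutdown();

        // Phase 2: re-open as a normal database and build the indexes
        GraphDatabaseService db = new EmbeddedGraphDatabase(dir);
        importer.reIndex(db, 10000);
        db.shutdown();
    }
}
```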
Once I had resolved these issues I was able to start the full British Isles import. I was running it on a quad-core machine and found that the OSMImporter/BatchInserter combination (all default configuration, mind) was using just one CPU. Looking at the OSMImporter codebase it became obvious that it was written to stream-read the OSM XML data and synchronously write it out to Neo4j. I think there may be some room to parallelise it, as large amounts of the GPS node data are utterly independent and the XML data set is structured in such a way that it could be broken down into independent units of work. The only points where serialised behaviour is required are during the parsing of the XML, the construction of the Way and Relation data, and possibly the writing to Neo4j.
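The unit-of-work idea can be sketched with nothing but the standard concurrency utilities: one thread parses and fans the independent node records out to a pool, while a single "writer" drains the results in submission order, much as a batch inserter would. Everything here (the class name, the toUpperCase stand-in for real node processing) is hypothetical and not OSMImporter code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

public class ParallelImportSketch {
    public static List<String> process(List<String> rawNodes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        BlockingQueue<Future<String>> writeQueue = new LinkedBlockingQueue<>();

        // "Parser" side: each node record is an independent unit of work
        for (String raw : rawNodes) {
            writeQueue.add(pool.submit(() -> raw.toUpperCase())); // stand-in for node transformation
        }
        pool.shutdown();

        // Single "writer": drain futures in submission order, so the store
        // sees a serialised stream even though processing ran in parallel
        List<String> written = new ArrayList<>();
        for (int i = 0; i < rawNodes.size(); i++) {
            written.add(writeQueue.take().get());
        }
        return written;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(Arrays.asList("n1", "n2", "n3"))); // prints [N1, N2, N3]
    }
}
```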
The process seems to be largely CPU bound as I ran it on SATA II RAID 0 SSDs and their read and write capabilities were barely touched while the single CPU in use was pegged at 100% almost continuously.
The import of the British Isles dataset seems to take about 48 hours.
I used the vanilla OSMImporter and BatchInserter configurations as I am still a Neo4j neophyte (bad pun intended ;-)). I will be reading up on alternative configurations to see if I can improve the performance. I will also look at importing changesets so that I shouldn't need to do a full import again.