November 14, 2014

NoSQL vs Lucene as Database

Today I was looking at the products of SindiceTech specifically SIREn which is schemaless search engine for JSON. They seem to be using it for their knowledge graph representation. Now the question boils down to why even bother with Lucene? MongoDB is already out there with scalability and fast nested schemaless indexing and search, even if you want to consider using it for a knowledge graph or semantic web context. Here is a good comparison of Lucene vs MongoDB as schemaless data stores and search engine. At the end of the day, if lucene done right it can do anything MongoDB can.



A good answer is in this stackoverflow post.
With the NoSQL movement growing based on document-based databases, I've looked at MongoDB lately. I have noticed a striking similarity with how to treat items as "Documents", just like Lucene does (and users of Solr).
So, the question: Why would you want to use NoSQL (MongoDB, Cassandra, CouchDB, etc) over Lucene (or Solr) as your "database"?
What I am (and I am sure others are) looking for in an answer is some deep-dive comparisons of them. Let's skip over relational database discussions all together, as they serve a different purpose.
Lucene gives some serious advantages, such as powerful searching and weight systems. Not to mention facets in Solr (which Solr is being integrated into Lucene soon, yay!). You can use Lucene documents to store IDs, and access the documents as such just like MongoDB. Mix it with Solr, and you now get a WebService-based, load balanced solution.
You can even throw in a comparison of out-of-proc cache providers such as Velocity or MemCached when talking about similar data storing and scalability of MongoDB.
The restrictions around MongoDB reminds me of using MemCached, but I can use Microsoft's Velocity and have more grouping and list collection power over MongoDB (I think). Can't get any faster or scalable than caching data in memory. Even Lucene has a memory provider.
MongoDB (and others) do have some advantages, such as the ease of use of their API. New up a document, create an id, and store it. Done. Nice and easy.

Reponse:

This is a great question, something I have pondered over quite a bit. I will summarize my lessons learned:
  1. You can easily use Lucene/Solr in lieu of MongoDB for pretty much all situations, but not vice versa. Grant Ingersoll's post sums it up here.
  2. MongoDB etc. seem to serve a purpose where there is no requirement of searching and/or faceting. It appears to be a simpler and arguably easier transition for programmers detoxing from the RDBMS world. Unless one's used to it Lucene & Solr have a steeper learning curve.
  3. There aren't many examples of using Lucene/Solr as a datastore, but Guardian has made some headway and summarize this in an excellent slide-deck, but they too are non-committal on totally jumping on Solr bandwagon and "investigating" combining Solr with CouchDB.
  4. Finally, I will offer our experience, unfortunately cannot reveal much about the business-case. We work on the scale of several TB of data, a near real-time application. After investigating various combinations, decided to stick with Solr. No regrets thus far (6-months & counting) and see no reason to switch to some other.
Summary: if you do not have a search requirement, Mongo offers a simple & powerful approach. However if search is key to your offering, you are likely better off sticking to one tech (Solr/Lucene) and optimizing the heck out of it - fewer moving parts.
My 2 cents, hope that helped.

Here is another good comparison.


Also take a look at Apache Jena vs Neo4j for knowledge representation: triples vs graph

Another triple stores comparison: http://www.bioontology.org/wiki/images/6/6a/Triple_Stores.pdf

No comments:

Post a Comment