Enigmastation.com

Joseph Ottinger's blog

Considering Data Stores

Introduction

Computing, as an industry, is all about data. Our programs take in data from somewhere, do something to it, and then send the data somewhere else: a file system, a user, a database, perhaps as an email or some other message.

Because of the industry’s data-centric nature, our choice of data storage is critical, yet most architects see every problem as hammer-and-nail, applying the same data storage solution over and over again, regardless of whether it’s the best approach, simply because it’s the familiar approach.

What’s more, there are few resources that actually make any attempt to evaluate alternatives, with most resources emphasizing ways to lessen the impact of using the same old data store no matter what the application. “SOD it all,” is the common refrain, where “SOD” means “same-old-data store.”

This series is meant to provide at least a cursory evaluation of many data storage solutions for most architects. It will not be all-inclusive, and cannot provide application-specific data; what’s more, as an initial evaluation of the tools, there are going to be situations that are not optimally addressed.

Hopefully, the cursory nature of the evaluations will not offend; if the evaluation is wrong, please correct it by sending any comments to joeo@enigmastation.com, and together we’ll try to see how the evaluation is wrong, and more importantly, why it’s wrong. If you can include short, idiomatic, freely-licensed, and efficient code, that’d be great.

My goal is to use the publicly available documentation as much as possible; if it leads me astray, of course I accept responsibility, but perhaps we can see a way to prevent others from the same pitfalls.

The approach here is to consider the gestalt of a solution architecture, rather than focus on a specific solution itself, where possible.

Unfortunately, there’s no perfect one-size-fits-all solution. In the end, you have to evaluate what your requirements are, balancing those against your own expertise, to make a good choice, in the hopes of making the perfect choice.

This article series is meant to help you see what the possibilities are, instead of offering you a “best of breed” recommendation. Each product out there has strengths and weaknesses, and the variable nature of requirements means that your choice is really up to you.

We will look at each solution through a number of lenses, focusing on different attributes:

  • Popularity, which is, sadly, based on perception rather than hard data. Hard data would obviously be preferable, but gathering reliable hard data has proven to be near-impossible. Many companies that use “Solution X” simply refuse to say so, which makes Solution X seem less popular than it is, whereas every Tom, Dick, and Harry who uses “Solution Y” talks a lot about it, which conversely can make Solution Y seem more popular than it is. Thus: perception.
  • Ease of use.
  • A rough analysis of strengths and weaknesses.
  • Speed of data access. As any performance architect can (and will, or should) tell you, speed is fleeting and generally nontrivial to get “right.” There are so many variables involved in getting a data store to perform really well that it’s impossible to fairly evaluate data stores as apples-against-apples. As a result, perception plays in, and the goal is to look at speed from a newbie’s perspective: without in-depth tuning unless the tuning is so common that everyone knows to do it that way.
  • Quality of documentation.
  • Flexibility of generating large reports (which involves spanning lots of data.)

As stated before, any modifications should be sent to joeo@enigmastation.com. I’d love to see this as a general reference that everyone can benefit from, and the best way to make that happen is for us as a community to work together.

Core Things to Think About

CRUD

Crud. No, wait. I meant “CRUD,” for “create, read, update, and delete.” These are the four basic operations every traditional datastore has to provide to be worth anything. However, their meaning is affected by the SQL domain, so let’s clarify:

  • Create is “persisting a specific unique object to the data store.” A create operation is finished when another process is able to see the data item.
  • Read is “retrieve a specific unique object from the data store.” SQL screws this up somewhat, because its read operation is the same as a query operation. This is important later on.
  • Update is “modify a specific object or set of objects in the data store.”
  • Delete is “remove an object or set of objects from the data store.”

However, the statement that every data store must implement all four of these … isn’t true. A messaging server can be seen as just “create and read,” for example, and that’s a legitimate view of data – consider a messaging paradigm such as an ESB serving as a data store. (The use case would be to send a data item as a data request and receive a response that contains the data. MongoDB uses this kind of paradigm, actually.)

Create, read, delete is also viable: consider a UDP-based data store that responds to requests with data based on distance and a key. (Think memcached here, if you’re looking for a real-world version of this.)

However, most people want control of their data, and therefore they think in terms of how to accomplish all four tasks. How do I put something into the data store? How do I get it back out? How can I change it? How can I get rid of it?

The difficulty with which these four tasks are accomplished – and the nature of the tasks involved in accomplishing them – goes a long way toward defining what a good data store is for your project and team.

Transactions

Put simply, transactions are atomic units of work. If you write five items to a data store as one transaction (think line items on an order, for example, or connecting flights on a single ticket purchase), those five items must all succeed for any of them to succeed.

Transactions have a lot of facets, including speed, complexity, and propagation. They vary by data store, but they’re absolutely necessary. (… unless it’s a web service. Then all you have to do is support the “transaction API” which means you get to pretend to commit or rollback, as long as you support the call. WS-Transaction is a wonderful API that engenders no trust whatsoever.)

There are multiple types of transaction processing. Single-phase transactions (for lack of a better term) take place on one data store. Two-phase commits involve multiple data sources: two databases, two schemas, eighteen horses, and a goat.

Two-phase commits are a lot more common than one might hope. They’re also horribly complex to get right. One approach to two-phase commits: the transaction manager queues up all changes, acquiring locks on all changed elements, and then – when the transaction is committed – the transaction manager asks each data store if it can actually execute the changes it’s lined up. If they can, the transaction succeeds. If not, it fails.
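
The ask-then-commit flow described above can be sketched in a few lines. This is a toy model under heavy assumptions: the Participant interface, InMemoryStore, and TwoPhaseCoordinator are names invented for illustration, and the sketch leaves out locking, timeouts, and crash recovery, which is exactly where the real complexity lives.

```java
import java.util.ArrayList;
import java.util.List;

/** A resource that can take part in a two-phase commit (hypothetical interface). */
interface Participant {
    boolean prepare();   // phase 1: line up the changes and vote yes or no
    void commit();       // phase 2: make the prepared changes visible
    void rollback();     // phase 2 on failure: discard the prepared changes
}

/** Toy in-memory participant that simply records what happened to it. */
class InMemoryStore implements Participant {
    private final boolean canCommit;
    String state = "idle";

    InMemoryStore(boolean canCommit) {
        this.canCommit = canCommit;
    }

    @Override public boolean prepare() {
        state = canCommit ? "prepared" : "refused";
        return canCommit;
    }

    @Override public void commit() {
        state = "committed";
    }

    @Override public void rollback() {
        state = "rolled back";
    }
}

class TwoPhaseCoordinator {
    /** Returns true only if every participant voted yes and was then committed. */
    static boolean run(List<? extends Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                // A single "no" vote aborts everything prepared so far.
                // The refusing participant has nothing to undo in this toy model.
                for (Participant q : prepared) {
                    q.rollback();
                }
                return false;
            }
        }
        for (Participant p : prepared) {
            p.commit();
        }
        return true;
    }
}
```

Calling TwoPhaseCoordinator.run() with two willing stores commits both; swap one for a store that refuses, and the other is rolled back.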

A data store that cannot participate in two-phase commits is going to be functionally crippled in a traditional application. That’s not to say that they are absolutely mandated – you can play basketball with one arm, for example – but you really do want to have their support at some level.

Freaking CRUD

We already covered CRUD two sections ago, of course, but we’ve already seen that CRUD doesn’t necessarily take transactions into account. There’s one other operation that is so important that it deserves its own letter, but I couldn’t think of a good way to wedge it into “CRUD” without badly misspelling it.

It’s arguably more important than CRUD, too (in that CRUD doesn’t work well without it) so I’m going to go one step farther and give it its own word: Freaking, for “find.”

If you can’t find something, it’s not there. It may exist. It may not. But if you can’t quickly and easily figure out if it exists and how to find it, it might as well not exist, and that voids the whole purpose of using a data store in the first place.

So querying factors in pretty heavily: How can you find it, whatever it is? Can you count how many there are? Can you summarize or tabulate data? How quick is finding a set of items? How flexible is the query syntax? Can you find an item if it’s participating in a transaction? Can you even tell if you would have found a given item if it wasn’t in a transaction?

These may not all be important questions for your application, but they’re worth thinking about. You’re better off asking whether these questions are relevant than not asking them when you should have.

Benchmarks

It’s painfully easy for someone (an author of an article, perhaps?) to say that technology X does Y well, and Z poorly. Benchmarks aren’t much better, honestly.

  1. Benchmarks are easy to tune for a given technology. It’s easy for me to write, say, a JDBC DAO that caches in a certain way based on how the benchmark works. That would make that DAO perform very well, even though it’s a custom solution geared for that benchmark.
  2. Benchmarks rely very much on the developer’s skill. If I know a specific technology very well, and hardly know a competing technology, a benchmark based on the former will far outperform the latter. That’s hardly fair or realistic or useful, as people will most likely use technology they know anyway.
  3. Benchmarks don’t reflect real-world operations or configurations. An insurance company has a lot of data to move around, but a benchmark will most likely use only a tiny portion of that, at best, which changes how the benchmark works. A full insurance application, in binary form, might be 25,000 bytes. The benchmark might be twenty-five bytes. One might take twenty TCP/IP packets to transfer across the network, the other takes one. That can fundamentally change what a benchmark can mean for an insurance company, to say the least. Further, data structures factor very heavily. An ORM normally maps many-to-one relationships as different tuples; however, if those tuples are never used for selection, there can be a huge negative performance impact. A benchmark can’t do a good job of being generally specified in such a way that the results can legitimately be used.

However, benchmarks are still useful, by giving us a reference point to use, as long as the implementations are done by reasonably-skilled developers (or, rather, developers with typical skill) and actually compare equivalent operations as much as possible.

Benchmarks can highlight capabilities, too; done properly, you can see if a technology fails in a given area, or how, or possibly even see the technical shortcomings.

The Benchmark Used Here

This is definitely a benchmark. Therefore, use it as a guidepost, not as a factor for any decision you make. If you use it for decision-making, you’re buying into the “benchmarks are super-valuable” fantasy. It’s not true.

The benchmark here uses a DAO to perform a number of simple, concurrent operations, using a simple data model. There are seven relevant points about the benchmark.

  1. The data model was very simple: a single object, with an identifier, a “name” field, and a “description” field, both strings. No relationships of any kind. Therefore, the DAO measurements were focused on single pieces of data.
  2. The measurement was only on the DAO, using a dynamic proxy around the DAO to capture nanoseconds in the DAO.
  3. When an external process was used to store the data (an RDBMS, an external GigaSpaces container, a MongoDB server) the service was run locally, on the same machine as the benchmark. Individual core consumption was under 100%, which is acceptable.
  4. Where possible, indexes were used for each appropriate field (which means that all fields were indexed; only the ID was unique, however.)
  5. The process for most of the tests was to write ten thousand data items, tracking their keys; an additional two thousand false keys were generated, to create “misses.” The operations were then performed concurrently, via an ExecutorService with a fixed thread pool of five threads, and the best 10% and the worst 10% of the timing results were discarded. The exception to this process was the “write test,” which measured the write times, and therefore didn’t need to worry about false keys or cache misses.
  6. Measurements were in nanoseconds, but the results are presented in milliseconds, with varying degrees of precision. (For example, 7.30ms might be appropriate for one data store, whereas 0.013ms might be appropriate for another operation with a different data store.)
  7. The DAO operations were: write, read, delete, and query.
    1. Write accepted an object. If the object had an ID, that ID’s data in the data store was replaced; if it did not have an ID, the data was written and an ID assigned. The ID assignment process varied by the specific data store.
    2. Read accepted an object key and returned the matching object. If the ID did not exist in the database, null was returned.
    3. Delete was identical to read, except the ID was also removed from the data store (a destructive read, in other words.)
    4. Query was based on a “query-by-example” paradigm, and the query mechanism was expected to build whatever data it needed. In most cases, this worked fairly well. (In Db4O’s case, it was horrible … and incorrect, as far as I can tell. This will be explained in the Db4O section.)
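
To make the DAO contract concrete, here is a rough in-memory sketch of the four operations as described in the list above. It is not the benchmark's actual code: Item, ItemDao, and InMemoryItemDao are placeholder names, a ConcurrentHashMap stands in for a real data store, and the ID assignment (a simple counter) is just one of the ways a store might do it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** The data model as described: an identifier plus two strings. */
class Item {
    Long id;             // null until the store assigns one
    String name;
    String description;

    Item(Long id, String name, String description) {
        this.id = id;
        this.name = name;
        this.description = description;
    }
}

/** Hypothetical DAO contract mirroring the four operations. */
interface ItemDao {
    Item write(Item item);           // insert or replace, assigning an ID if absent
    Item read(long id);              // null on a miss
    Item delete(long id);            // destructive read; null on a miss
    List<Item> query(Item example);  // null fields in the example match anything
}

/** In-memory stand-in for a real data store. */
class InMemoryItemDao implements ItemDao {
    private final Map<Long, Item> items = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    @Override public Item write(Item item) {
        if (item.id == null) {
            item.id = nextId.getAndIncrement();
        }
        items.put(item.id, item);    // replaces any existing data for that ID
        return item;
    }

    @Override public Item read(long id) {
        return items.get(id);
    }

    @Override public Item delete(long id) {
        return items.remove(id);
    }

    @Override public List<Item> query(Item example) {
        List<Item> found = new ArrayList<>();
        for (Item candidate : items.values()) {
            if (matches(example.id, candidate.id)
                    && matches(example.name, candidate.name)
                    && matches(example.description, candidate.description)) {
                found.add(candidate);
            }
        }
        return found;
    }

    // Query-by-example rule: an unset (null) field places no constraint.
    private static boolean matches(Object wanted, Object actual) {
        return wanted == null || Objects.equals(wanted, actual);
    }
}
```

The query here is a full scan, which a real data store would replace with an index lookup; the point is only the shape of the contract.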

Remember, the benchmark is very simple and definitely not what any possible real world application would mirror – but the DAOs might serve as a basis for any of the datastores, if you like, and since they all do the same thing, you should be able to see what’s involved in using a given technology based on the DAOs.
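
The measurement approach described above (a dynamic proxy around the DAO capturing nanoseconds, five worker threads, and discarding the best and worst 10% of timings) can be sketched roughly as follows. This is not the benchmark's actual harness: KeyReader, TimingHarness, and the method names are invented for illustration, and the one-minute wait is an arbitrary choice.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical one-method DAO, just enough to have something to time. */
interface KeyReader {
    String read(long key);
}

class TimingHarness {
    /** Wraps target in a dynamic proxy that records each call's duration in nanoseconds. */
    static KeyReader timed(KeyReader target, List<Long> nanos) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause();  // rethrow what the DAO itself threw
            } finally {
                nanos.add(System.nanoTime() - start);
            }
        };
        return (KeyReader) Proxy.newProxyInstance(
                KeyReader.class.getClassLoader(), new Class<?>[] {KeyReader.class}, handler);
    }

    /** Drops the fastest 10% and slowest 10% of samples, then averages the rest, in ms. */
    static double trimmedMeanMillis(List<Long> nanos) {
        List<Long> sorted = new ArrayList<>(nanos);
        Collections.sort(sorted);
        int trim = sorted.size() / 10;
        List<Long> kept = sorted.subList(trim, sorted.size() - trim);
        long total = 0;
        for (long n : kept) {
            total += n;
        }
        return kept.isEmpty() ? 0.0 : (total / (double) kept.size()) / 1_000_000.0;
    }

    /** Runs the reads on a fixed pool of five threads; returns the trimmed mean in ms. */
    static double run(KeyReader dao, List<Long> keys) {
        List<Long> nanos = Collections.synchronizedList(new ArrayList<>());
        KeyReader timedDao = timed(dao, nanos);
        ExecutorService pool = Executors.newFixedThreadPool(5);
        for (long key : keys) {
            pool.submit(() -> timedDao.read(key));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reads", e);
        }
        return trimmedMeanMillis(nanos);
    }
}
```

Because the proxy sits outside the DAO, the timings include the reflective call overhead; that overhead is the same for every data store, so the comparison stays roughly even.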

Relational Data Stores

JDBC

The first data store for Java we’ll look at is the basis for most data stores: JDBC, the Java Database Connectivity API. This is basically a mechanism by which Java applications can issue SQL to a relational database.

Relational databases are the heart of the “SOD” approach, and when people talk about persistence, they usually mean that the data is being stored in a relational database somewhere. It’s probably a conservative estimate that 99% of the world’s data is stored in relational datastores – chances are it’s closer to the magical five nines (meaning “99.999%”) than it is to a mere 99%.

Most interactions with JDBC are streamed, which means that from the application’s perspective, a set of records being returned from a SQL query is held in memory one row at a time. This is actually very good, because it means your application can start processing data as soon as it comes from the database, as opposed to waiting until the entire dataset is ready.

Relational databases tend to be pretty fast for most query types, provided your query is simple enough. SQL is also good enough that you can usually make any query work well, especially if you can hand-tune it (and you really, really know what you’re doing.)

The good news is that JDBC is pretty simple (if slightly verbose), and relational databases are pretty ubiquitous. The bad news is that JDBC is fairly verbose, and relational databases tend to be fantastic at data retrieval and not so great at update transactions.

JDBC also has a, well, scarcity of data types. That doesn’t really sound so bad, except it means that you – the coder – tend to have to write conversions or validations yourself, even if the underlying database supports custom data types.

To understand more about this, think about a latitude/longitude pair. Latitude and longitude can be represented as a pair of floating point variables, so it makes sense that you could store a pair as, well, two floats. However, that puts a burden on the database: both variables must be present if either of them is, any ranges have to be coded either by the programmer or by the DBA (depending on the database’s capabilities), and it’s possible to get partial data (“give me only the latitude, not the longitude.”)
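
Here is a small sketch of the kind of conversion and validation code that ends up in the application as a result. LatLong and fromColumns() are hypothetical names, and the nullable Double parameters stand in for whatever two column values a query would hand back.

```java
/** Hypothetical value type: both coordinates travel together and are range-checked. */
final class LatLong {
    final double latitude;
    final double longitude;

    LatLong(double latitude, double longitude) {
        if (Double.isNaN(latitude) || latitude < -90.0 || latitude > 90.0) {
            throw new IllegalArgumentException("latitude out of range: " + latitude);
        }
        if (Double.isNaN(longitude) || longitude < -180.0 || longitude > 180.0) {
            throw new IllegalArgumentException("longitude out of range: " + longitude);
        }
        this.latitude = latitude;
        this.longitude = longitude;
    }

    /** Rebuilds the pair from two nullable column values, rejecting half-filled rows. */
    static LatLong fromColumns(Double lat, Double lon) {
        if (lat == null && lon == null) {
            return null;  // no position stored at all
        }
        if (lat == null || lon == null) {
            throw new IllegalStateException("partial coordinate: one column is null");
        }
        return new LatLong(lat, lon);
    }
}
```

None of this is hard, but every rule in it is one the database is not enforcing for you.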

Likewise, an address is typically split out into its components (two lines of street address, possibly an apartment number, city, state, and zip code – if you’re only considering addresses in the United States!) For querying on city, that’s fine, but it still puts a lot of burden on the developer.

As a contrast, an object database can store an Address data type, and can still query by city without having to treat it as an entirely separate column in the data store. What’s more, an object database is not unique in this regard.

The data types issue is also what led Java into the object-relational zone, what Ted Neward rather colorfully called the “Viet Nam of Computer Science.” More energy has been spent trying to bolt data types onto JDBC than almost any other problem space in Java (with web application frameworks running a close second, and perhaps griping about how haaaaard JavaEE is being a close third.)

Object-relational mapping is horribly useful, and it’s also a great way to suck the life right out of a promising application. Virtually everyone uses ORMs now instead of JDBC; they’re the other part of the “SOD it all” mentality.

Actually using JDBC directly from code is fairly quick in terms of code runtime; the worst aspect of using JDBC from a performance perspective is in actually acquiring a database connection. For this reason, most JDBC users rely on a database connection pool (JNDI for Java EE applications, or Commons-DBCP or something similar for standalone applications or developers who are unreasonably afraid of JNDI.)

From a coding standpoint, JDBC is verbose but simple. The basic structure of a JDBC query looks like this:
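
What follows is a minimal sketch of that structure, not the original listing. It assumes the Connection comes from a DataSource (a stand-in for “JNDI or some other mechanism”); the class name JdbcQuerySketch, the processAll() method, and the empty processRow() body are placeholders invented for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

class JdbcQuerySketch {
    private final DataSource dataSource;  // assumed to come from JNDI or a pool

    JdbcQuerySketch(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    Connection acquireConnection() throws SQLException {
        return dataSource.getConnection();
    }

    /** Runs a query inside one transaction, handing each row to processRow(). */
    int processAll(String sql) throws SQLException {
        Connection conn = null;
        PreparedStatement ps = null;
        ResultSet rs = null;
        int rows = 0;
        try {
            conn = acquireConnection();
            conn.setAutoCommit(false);       // start an explicit transaction
            ps = conn.prepareStatement(sql);
            rs = ps.executeQuery();
            while (rs.next()) {              // rows arrive one at a time
                processRow(rs);
                rows++;
            }
            conn.commit();
            return rows;
        } catch (SQLException e) {
            if (conn != null) {
                try {
                    conn.rollback();
                } catch (SQLException rollbackFailure) {
                    e.addSuppressed(rollbackFailure);
                }
            }
            throw e;
        } finally {
            // Close in reverse order of acquisition.
            close(rs);
            close(ps);
            close(conn);
        }
    }

    /** Placeholder: a real implementation would read columns and act on them. */
    void processRow(ResultSet rs) throws SQLException {
        // intentionally empty in this sketch
    }

    /** Closes the resource if it is non-null, swallowing any failure. */
    static void close(AutoCloseable resource) {
        if (resource == null) {
            return;
        }
        try {
            resource.close();
        } catch (Exception ignored) {
            // nothing useful can be done at this point
        }
    }
}
```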

In this code, acquireConnection() gets a Connection from JNDI or some other mechanism.

processRow(), which is really poorly named, handles a row from the ResultSet (it would probably normally accept specific values from the ResultSet rather than getting the ResultSet itself); since we’re using transactions, assume it does something strange, like update the database somehow.

The various close() methods would call the close() method on the closeable object assuming it’s not null or in an error state.

This code will generally run properly and fairly safely and quickly under most normal conditions, assuming the SQL query isn’t complex, and the update process isn’t slow, and the transaction doesn’t build up to thousands of queued updates… you get the idea.

The benchmark used Spring to manage JDBC operations, which simplified them dramatically. The DAO’s read operation looked like this:

Complex graphs (or relationships) between tables are easy to manage; transactions would factor in there, too, but they’re fairly straightforward. You’d simply update both tables in a given transaction, and commit when the data is right; any errors that the database can detect (through foreign key relationships) would throw an error and you’d be able to manage the problems from an exception handler.

Again, JDBC is verbose, but simple enough. In many ways, that’s good – it means that few coders will look at JDBC code and not be able to understand what’s going on, assuming your SQL isn’t incredibly complex. (It may be. Some SQL queries are maddening.)

As you can see, dependency injection frameworks like Spring can make much of the boilerplate code around JDBC disappear; the transaction support and template capabilities can make all of the error handling and connection acquisition code disappear.

JDBC Summary

JDBC is the lowest common denominator for relational database usage in Java, serving as the underlying data storage mechanism for almost every ORM (Hibernate, et al), and other data storage mechanisms too (for example, most Java Content Repository implementations in production use a database in the end unless the user’s insane. Don’t be insane, users.)

CRUD in JDBC is very straightforward, but suffers from repetitive and verbose patterns (see the code example, which could serve as a boilerplate for all JDBC operations.) The same goes for querying a SQL database. SQL can be hand-tuned by a competent DBA, and accessing the specific database’s capabilities is very easy (stored procedures, anyone?) Also, failures tend to be dramatic and simple, as the type mapping isn’t very flexible; do it wrong, and JDBC will let you know pretty quickly.

Transaction support is horribly common, being supported by the basic API.

Generating reports with JDBC is also pretty efficient, because the schema typically reflects what reporting engines need: rows and columns of related data that can easily be summarized.

Finding support for JDBC is pretty easy; the Java part isn’t that complex in the first place, and SQL is, shall we say, very well-known.

JDBC can be a performance bottleneck; as JDBC does no caching in and of itself, repeated queries for the same data will hit the database server over and over again. (Some databases will cache responses of common queries, but the data still has to be transmitted to the application.)

The benchmark data for JDBC was quite good:

Average write time 7.30ms
Average read time, keys in serial order 1.13ms
Average read time, keys in random order 1.09ms
Average delete time, keys in serial order 3.93ms
Average delete time, keys in random order 4.09ms
Average query time, value in serial order 1.16ms
Average query time, value in random order 1.12ms

Most Java developers no longer use JDBC directly (or don’t admit to it); most use an ORM instead.

Object-Oriented or Hierarchical Data Stores

The Viet Nam of Computer Science (AKA Hibernate and other ORMs)

The next step from JDBC is to use an Object-Relational Mapper. Spring has one that maps JDBC resultsets to objects, but that’s not really an ORM the way that the Java Persistence API is; JPA can represent object graphs built from a set of data tables, following and enforcing foreign keys and cascading updates without the coder having to do much SQL at all.

The most common ORM is Hibernate, which served as much of the basis for the Java Persistence API, which is a core part of EJB3. Other ORMs include implementations of the Java Data Objects API (JDO) as well as iBatis and others, and some would even say that entity beans – as part of the EJB1 and EJB2 specifications – would also be considered as ORMs, but EJB2 and EJB1 entity beans were such a headache to code that they’re hardly worth considering in the real world any more. If you’re developing a new application, avoid EJB1 and EJB2; if you’re maintaining an older application that uses them, really, consider updating.

Hibernate is (probably) the most popular implementation of JPA, but it’s not the only one, of course: alternatives include Toplink, EclipseLink, and OpenJPA, to name a few – and each of the JPA implementations has its own strengths and weaknesses which are beyond the scope of this article.

JPA works by mapping an object’s attributes and properties to database tables, whether through an XML file or via Java 5 annotations. (Annotations are easier. XML is more flexible. Pick your poison.) The mapping can include collections of data, as well as complex datatypes (remember the “Address” we talked about with JDBC?), and JPA provides a very flexible querying API that can descend the object graph to query specific fields, building the actual SQL query for you on the fly.

To use JPA, you acquire an EntityManager, which represents a series of operations with the database. The EntityManager is able to persist objects for which a mapping exists, and query such objects as well. (It can even construct objects that exist as results from queries, that don’t represent entities at all.)

A mapped object is referred to as being managed while it’s in the EntityManager’s session scope, and when it leaves that scope, it needs to become managed again to have changes reflected in the database.

Also, complex object graphs may need to have their fetch strategies tuned for the specific operation. A fetch strategy describes when data or other relationships get loaded – along with the object itself or not. To understand what this means, consider a customer/order/line item/inventory structure, where a customer has many orders, an order has many line items, a line item has one inventory item, and a given inventory item can have many line items.

When you fetch a customer, you don’t necessarily want all of that customer’s orders, line items, or inventory items. You just want the customer.

However, if you do want the orders, you don’t want to jump through a lot of hoops to get them; the key here is to tell JPA that the relationship is lazily-fetched (i.e., fetched on demand, when the set is accessed.) However, that means making sure that the session represented by the EntityManager is still active at the point at which you request the orders.
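
The session dependency is easier to see in a toy model. The sketch below is plain Java, not the JPA API: ToySession and LazyCustomer are invented names, and loadOrders() merely simulates the extra query a lazy relationship triggers on first access.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Toy stand-in for a persistence session; this is not the JPA API. */
class ToySession {
    private boolean open = true;
    int queriesRun = 0;

    void close() {
        open = false;
    }

    /** Simulates the extra query that loads a customer's orders. */
    List<String> loadOrders(String customer) {
        if (!open) {
            throw new IllegalStateException("session closed; cannot load orders lazily");
        }
        queriesRun++;
        return Arrays.asList(customer + "-order-1", customer + "-order-2");
    }
}

/** A customer whose orders are fetched on first access, then remembered. */
class LazyCustomer {
    final String name;
    private final Supplier<List<String>> loader;
    private List<String> orders;  // null until first access

    LazyCustomer(String name, ToySession session) {
        this.name = name;
        this.loader = () -> session.loadOrders(name);
    }

    List<String> getOrders() {
        if (orders == null) {
            orders = loader.get();  // the on-demand fetch happens here
        }
        return orders;
    }
}
```

Fetching the customer costs nothing extra; the first getOrders() call runs one query, and a first call made after the session has closed fails outright.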

It’s certainly not impossible, and in many ways, it’s not even difficult – but it’s enough of a problem that many web applications wrap requests in a filter that provides a session for each request, so the object graph is available no matter how or when it’s accessed in the request lifecycle.

ORMs are really quite handy, since Java is an ostensibly object-oriented language; an ORM means your data gets represented in a form that the language supports natively. That means that coders can easily understand the structure of data (by looking at the Java code, or possibly even your object model’s documentation), and since JPA is very common, it’s trivial to find someone who has an acceptable level of expertise with it, without even mentioning how many books focus on Hibernate or JPA.

The code for the general cases in ORMs tends to be very boilerplate, especially with Spring. The read operation in the DAO was encapsulated in a base DAO class, and looked like this:

The query for the DAO, however, was something monstrous. The benchmark actually determined the JPA provider, and if it was Hibernate, used Hibernate’s query-by-example feature; if not, it built a valid JPA query and used that instead:

Yikes. And this was with Spring, which actually managed the transaction for us.

If you know you’re using a specific JPA provider, by all means, use that provider’s query capabilities if you can. JPA also has named query support, which would have been far more efficient in terms of code size; what you would do there is determine which fields were set, and select a named query based on the fields provided. However, if your data model has a large number of attributes, that design becomes unwieldy very quickly; with only three properties, this DAO would have had eight queries to look up.
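
The arithmetic behind those eight queries is just two to the power of the number of optional fields. A rough sketch of the selection step, assuming a hypothetical naming scheme (“Item.findBy…”) that is not taken from the benchmark:

```java
class NamedQuerySelector {
    /**
     * Builds a hypothetical named-query name from whichever example fields are set.
     * Three optional fields give 2^3 = 8 distinct names.
     */
    static String queryNameFor(Long id, String name, String description) {
        StringBuilder sb = new StringBuilder("Item.findBy");
        if (id != null) {
            sb.append("Id");
        }
        if (name != null) {
            sb.append("Name");
        }
        if (description != null) {
            sb.append("Description");
        }
        if (id == null && name == null && description == null) {
            sb.append("Nothing");  // the "no criteria" case still needs a query
        }
        return sb.toString();
    }

    /** Number of named queries needed for a model with this many optional fields. */
    static long queriesNeeded(int fieldCount) {
        return 1L << fieldCount;
    }
}
```

With ten fields, queriesNeeded(10) is 1,024, which is why the design stops scaling almost immediately.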

Problems with ORMs: that silly dataset problem. If you retrieve all of the orders for our customer model, you will instantiate … all of the orders. (You should query for them instead of relying on the ORM to give you all relations. Querying gives you much more control.) Also, actually getting the mapping right is fairly easy if you’re mapping from Java to a relational structure, but much harder if you’re going to Java from a relational schema. (Tools exist to alleviate this somewhat, but the resulting object models tend to be unwieldy.)

Ideally, generating reports wouldn’t involve JPA at all. For reports, you’d almost certainly want to code directly to the relational engine rather than using your carefully-crafted object model.

The session model that JPA relies on can also be a barrier to entry at first. However, as stated, JPA is pretty ubiquitous now so a coder who isn’t familiar with the model probably should become familiar pretty soon. Sessions also hide the JDBC connection from the coder, and most JPA implementations nudge you in the direction of proper pooling. Therefore, connection pooling is still theoretically an issue for JPA, but realistically, it’s not.

The biggest problem ORMs have is that they’re part of the “SOD it all” mentality. They’re typically very powerful for most problems… but it’s also fairly easy to go off the reservation, so to speak, and find things you need for which ORMs are simply terrible.

Also, while the mapping capabilities of the available ORMs are very powerful, and the ORMs can generally create your database schema based on the object model, a good DBA will still probably need to be involved to manage partitioning, optimal indexing, possibly even index types and other such optimizations.

One piece of advice related to schema in an ORM would be to let the ORM generate a schema (maybe even saving off the data description language used to generate the schema), and then give the schema to a DBA who will hopefully look at how the database is used so he can optimize it.

Feel free to use an ORM – but be prepared to go direct to the database if you have performance-critical needs, or if you find that the ORM in question is simply not very good for what you’re trying to do.

Of course, instead of going direct to the database, you could use another type of data store altogether…

ORM Summary

ORMs are great for Java programmers when the schema remains fairly simple and the session visibility is consistent. From a developer’s standpoint, they’re very popular; help is easy to find, references are common, they’re fairly simple and noninvasive to your object model, and the skills are very transferable from project to project.

CRUD operations are fairly static, although managing sets in a session can be a problem; the only real difficulty here is “update,” but the engines are fairly well-documented so their behavior shouldn’t be a surprise.

ORMs also generate suboptimal schemas and queries, typically. This can be mitigated by tuning, but tuning is more rare than one would hope. Querying APIs are limited to something like SQL in the generic APIs, but individual products extend this support by adding query-by-example and other querying techniques. In general, generic query capabilities are okay, but if you’re willing to lose support for the specification in favor of a specific product you’re better off.

Transaction support is built in at the API level, much as it is for JDBC; this shouldn’t be much of a surprise, considering ORMs’ purpose as a mapping between an object model and a JDBC-capable database.

Generating sizable reports through the ORM is almost always an exercise in futility. Plus, generating reports and forms from data generated by an ORM can be harder than it should be, unless the schema’s been tuned or your object model is very simple.

If you have a problem with a given ORM, well… that’s why it’s been called the “Viet Nam of Computer Science.” You can generally find what you need, or figure it out, but if there’s a problem, you’re looking at so many moving parts that fit together that tracing can be … interesting, in the “not very good” way. On the other hand, ORMs are the most popular data storage mechanism in Java; chances are good that if you’re persistent in searching for an answer, it’s out there. Somewhere.

The benchmark data from Hibernate was pretty good:

Average write time 3.55ms
Average read time, keys in serial order 1.74ms
Average read time, keys in random order 1.62ms
Average delete time, keys in serial order 5.01ms
Average delete time, keys in random order 5.02ms
Average query time, value in serial order 2.46ms
Average query time, value in random order 2.37ms

ODBMS

Object databases are datastores that, well, contain native object representations rather than relations. They’re distant cousins of JavaSpaces, in that JavaSpaces export tuples as objects, but object databases tend to be stronger in queries than the JavaSpaces API.

DB4O

Using Db4O is fairly simple: you open a reference to a database (which can be a local database or a client/server connection), and use that reference to store, update, delete, or query for objects.

The cool thing about Db4O is that there is no translation between the data stored in the database and your application. Your actual data types go in and come out of the database. If you store a StreetAddress as an atomic unit, that’s exactly what you’ll get back out – and you can still query based on attributes of that StreetAddress.

Db4O supports our six operations. CRUD is supported much like it is in an ORM (except no mapping stage, as was mentioned), transactions are implicitly required by the API (opening the database starts a transaction, and commits create breakpoints that can be rolled back to)…

That leaves out one thing: Queries, our “Freakin’ CRUD” from the original six features. If there’s one area in which Db4O shines, it’s here: the query API is fantastic.

Combined with the implicit nature of transactions in Db4O, queries and results provide developers a flexibility that’s hard to imagine other data stores providing. (They do, of course, but Db4O still excels in this area. Keep in mind the performance issues that can occur with queries, though; see the benchmark data, below.)

Queries in Db4O can use query by example; they can also use a convenient callback matching mechanism (the “native query API”), and you can build a constraint mechanism with something called the SODA API.
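
The callback style of query is easy to picture without Db4O on the classpath. What follows is a rough sketch in plain Java, not the Db4O API: `TinyContainer`, its `store` and `query` methods, and the fields on `StreetAddress` are all made up for illustration. It also shows what an unoptimized callback query costs, since every stored object gets visited.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class NativeQuerySketch {
    /** Domain type named in the article; the two fields are assumed for illustration. */
    static final class StreetAddress {
        final String street;
        final String city;

        StreetAddress(String street, String city) {
            this.street = street;
            this.city = city;
        }
    }

    /** Hypothetical stand-in for an object container; this is not the Db4O API. */
    static final class TinyContainer {
        private final List<Object> stored = new ArrayList<>();

        void store(Object o) {
            stored.add(o);
        }

        /** Callback-style query: visit every stored object and keep the typed matches. */
        <T> List<T> query(Class<T> type, Predicate<? super T> match) {
            List<T> hits = new ArrayList<>();
            for (Object o : stored) {
                if (type.isInstance(o) && match.test(type.cast(o))) {
                    hits.add(type.cast(o));
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        TinyContainer db = new TinyContainer();
        db.store(new StreetAddress("1 Main St", "Indianapolis"));
        db.store(new StreetAddress("2 Elm St", "Raleigh"));
        db.store("not an address");

        // The query is ordinary, type-checked Java rather than a query string.
        List<StreetAddress> hits =
            db.query(StreetAddress.class, a -> a.city.equals("Indianapolis"));
        System.out.println(hits.size() + " match(es), first street: " + hits.get(0).street);
    }
}
```

The appeal is that the query is checked by the compiler and refactors along with the object model; the price is the full scan unless the data store can optimize the callback.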

Db4O also has one of the better tutorials available for a data store.

So what’s wrong with Db4O, since this review seems so glowing? Well, it’s not very good at reporting, for one thing. The query API, while excellent, is rather vertically focused on returning objects and not data (which makes sense for an object database, but many uses want data instead). The client/server mode has been iffy in experience (which is an anecdotal report, and probably isn’t fair to db4o, but … there it is.)

Lastly, the biggest drawback to Db4O (and every object database) is the lock-in to that specific data store. If you use JDBC, your SQL is generally portable to every SQL database. If you use JPA, there are multiple JPA engines available. If you use JavaSpaces, you have a specification that you can expect to have for reference; the same goes for JCR.

ODBMSes have their own APIs. This is a strength as long as their strengths help you and don’t hinder you.

The Db4O read(), which was fairly indicative of how most of the code looked, was written like this:

The benchmark and Db4O did not get along very well. I used an embedded Db4O container (the most likely use case), which means that basic operations were very fast, as long as they were based on object identity. Queries were horribly slow, and that’s due to the operating environment, not Db4O itself. There’s an optimizer for queries that I was unable to get running without contorting the JVM invocation, which I didn’t consider valid for the test, so queries had to walk the entire dataset in order to find matches.

Db4O has an external mode but that wasn’t tested; chances are good that an additional 1ms network penalty would factor in when an external client/server mode is used.

Average write time 0.07ms
Average read time, keys in serial order 0.02ms
Average read time, keys in random order 0.09ms
Average delete time, keys in serial order 0.10ms
Average delete time, keys in random order 0.29ms
Average query time, value in serial order See explanation
Average query time, value in random order See explanation

Object-oriented data stores are, for all intents and purposes, the lingua franca of Java persistence. JDBC provides access to the underlying database, but object-relational mappers and object databases both take Java’s native “data structures” – objects with properties and collections – and persist them.

This is incredibly powerful, even though there’s an impedance mismatch with ORMs – and they’re the most common data store for Java overall.

Document-oriented Data Stores

This is a loose description of data stores in which relations between objects aren’t enforced. You can have references to objects, but the objects may not be there – the ties between them are weak, although some of the data stores have mechanisms by which they can alter that somewhat.

The burden is on the programmer to enforce referential integrity, but the benefit is huge – raw, blazing speed and scalability. It’s definitely a tradeoff, but the benefits in speed are impressive – and even when they’re not, you’re getting features that other data stores don’t normally provide (JCR provides full-text search, for example, and an incredibly flexible query syntax, at the cost of writes being very, very slow.)

Normally, one uses these as intermediate data stores, not systems of record. Data used here will often find its way to a relational database for a data warehouse, either explicitly or implicitly (JCR, again, can use a relational database as a backing store, for example.)

Don’t let the lack of enforced referential integrity scare you off. It’s easier to compensate for than you might think – transactions in most of the document-oriented data stores are so fast that clashes are hard to replicate in code.

JavaSpaces

Before we dive in here, an important disclaimer needs to be put in: I am employed by GigaSpaces Technologies, a JavaSpaces vendor. I’m likely to be biased to some degree, but I promise I’ll try to be as honest and unbiased as possible.

By the way, JavaSpaces counts as document-oriented here only because it doesn’t enforce referential integrity. It’s perfectly capable of storing an object hierarchy.

Anyway: onward!

JavaSpaces is part of the JINI specification, designed to provide a place to store data and provide distributed processing capabilities. There are two primary projects that provide JavaSpaces implementations in the real world: GigaSpaces Technologies and Blitz. GigaSpaces is a commercial implementation, and Blitz is open source.

They implement different versions of the JavaSpaces specification, and have different philosophies. Since their philosophies are so different, we’ll treat them separately; otherwise, we’ll be reduced to reading a series of “Blitz does this; GigaSpaces does that” paragraphs.

JavaSpaces is best thought of as a sort of transactional messaging server where the messages live as long as you want them to, and delivery of those messages isn’t destructive unless you want it to be. This is actually a very powerful paradigm, but it comes at a cost.

Note that JavaSpaces are datastores! A message here is used as a semantic term, not a limiting factor, just like method calls in an object-oriented language are often called “messages.” A message in JavaSpaces is an object, and can be thought of as storing state for as long as the message exists.

JavaSpaces provides four basic operations: write, take, read, and notify.

Write sends a message into the space. A message can have a limited lifetime (“disappear after 4096 milliseconds”) or can be set to survive forever, barring the container being shut down or something else removing the message.

Read pulls a copy of the message from the space; it’s a nondestructive read of data.

Take removes a message from the space. Therefore, it’s a destructive version of read; once taken, a message is no longer in the space. A delete is a take that discards the data it retrieves.

Notify registers a callback for when data is available from the space.
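
The limited lifetime that write accepts can be sketched in a few lines. This is a toy model, not the JavaSpace interface: `LeasedSpace`, `liveCount`, and the `FOREVER` constant are names assumed for the sketch, and the clock is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LeaseSketch {
    /** Marker for "never expires"; the name is an assumption made for this sketch. */
    static final long FOREVER = Long.MAX_VALUE;

    /** Hypothetical in-memory space that only tracks message lifetimes. */
    static final class LeasedSpace {
        private static final class Slot {
            final Object message;
            final long expiresAt;

            Slot(Object message, long expiresAt) {
                this.message = message;
                this.expiresAt = expiresAt;
            }
        }

        private final List<Slot> slots = new ArrayList<>();

        /** Stores a message that lives for leaseMillis, counted from the supplied clock value. */
        void write(Object message, long leaseMillis, long nowMillis) {
            long expiresAt = (leaseMillis == FOREVER) ? FOREVER : nowMillis + leaseMillis;
            slots.add(new Slot(message, expiresAt));
        }

        /** Drops expired messages, then reports how many are still visible. */
        int liveCount(long nowMillis) {
            for (Iterator<Slot> it = slots.iterator(); it.hasNext();) {
                if (it.next().expiresAt <= nowMillis) {
                    it.remove();
                }
            }
            return slots.size();
        }
    }

    public static void main(String[] args) {
        LeasedSpace space = new LeasedSpace();
        space.write("short-lived", 4096, 0);
        space.write("permanent", FOREVER, 0);
        System.out.println(space.liveCount(4095)); // both messages are still live
        System.out.println(space.liveCount(4096)); // the 4096 ms lease has run out
    }
}
```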

Querying data – through take, read, and notify – is done with a query-by-example mechanism. Therefore, to find customer X, you’d create a customer object, fill in the unique identifier for that customer, and then read using that template as an example.
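
Query-by-example is simpler than it sounds. Here is a minimal sketch of the matching rule, assuming that a null field in the template means "anything goes" and a non-null field must be equal. `TinySpace` and `Customer` are hypothetical stand-ins written for this sketch rather than the JavaSpace interface, and unlike a real space, this one hands back the stored instance instead of a copy.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class TemplateSpaceSketch {
    /** Entry-style object: public, non-primitive fields and a public no-arg constructor. */
    public static class Customer {
        public Integer id;
        public String lastName;

        public Customer() {
        }

        public Customer(Integer id, String lastName) {
            this.id = id;
            this.lastName = lastName;
        }
    }

    /** Hypothetical in-memory stand-in for a space; not the JavaSpace interface. */
    public static class TinySpace {
        private final List<Object> entries = new ArrayList<>();

        public synchronized void write(Object entry) {
            entries.add(entry);
        }

        /** Non-destructive: returns a match and leaves it in the space. */
        public synchronized <T> T read(T template) {
            return find(template, false);
        }

        /** Destructive: returns a match and removes it from the space. */
        public synchronized <T> T take(T template) {
            return find(template, true);
        }

        @SuppressWarnings("unchecked")
        private <T> T find(T template, boolean remove) {
            for (Iterator<Object> it = entries.iterator(); it.hasNext();) {
                Object candidate = it.next();
                if (matches(template, candidate)) {
                    if (remove) {
                        it.remove();
                    }
                    return (T) candidate;
                }
            }
            return null; // nothing in the space matches the template
        }

        /** A null template field is a wildcard; a non-null field must be equal. */
        static boolean matches(Object template, Object candidate) {
            if (!template.getClass().isInstance(candidate)) {
                return false;
            }
            try {
                for (Field f : template.getClass().getFields()) {
                    if (Modifier.isStatic(f.getModifiers())) {
                        continue;
                    }
                    Object wanted = f.get(template);
                    if (wanted != null && !wanted.equals(f.get(candidate))) {
                        return false;
                    }
                }
                return true;
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        TinySpace space = new TinySpace();
        space.write(new Customer(1, "Smith"));
        space.write(new Customer(2, "Jones"));

        Customer template = new Customer(); // every field starts out as a wildcard
        template.id = 2;                    // ...except the one we care about
        System.out.println(space.read(template).lastName);
    }
}
```

The matching rule also explains the "no primitives" restriction on entries: an `int` field can never be null, so it could never act as a wildcard.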

However, there are … issues with this approach. For example, a search for users by last name would potentially have many results with that name; the JavaSpaces API itself actually has been updated to handle this. (It’s startling to think that initial revisions did not have proper capability to handle multiple objects coming back from the space.) The major providers of JavaSpaces implementations provide either the updated API (the JavaSpace05 interface) or a custom API that provides the same feature.

To map this into our six functional areas for datastores: CRUD is… mostly supported. Updates are actually destructive reads, followed by writes (hopefully wrapped in a transaction). Queries vary by product, but it’s really quite limited: Query-by-example is really the only specification-supported technique, although products can and do provide other mechanisms. Transaction support is mandated through the JINI specification and is a core part of the API, although it’s more difficult to use than JDBC’s transaction API.
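
The "update is a take followed by a write" idea looks roughly like this. It is a sketch under loud assumptions: `TinySpace` is a hypothetical map-backed stand-in keyed by id, and a plain Java lock plays the part of the transaction that would keep other clients from seeing the gap between the take and the write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TakeThenWriteUpdate {
    /** Hypothetical stand-in for a space that only offers write, read, and take. */
    static final class TinySpace {
        private final Map<Integer, String> entries = new HashMap<>();

        synchronized void write(int id, String value) {
            entries.put(id, value);
        }

        synchronized String read(int id) {
            return entries.get(id);
        }

        synchronized String take(int id) {
            return entries.remove(id);
        }
    }

    /**
     * "Update" built from take + write. The lock on the space stands in for the
     * transaction that would normally wrap the two operations.
     */
    static boolean update(TinySpace space, int id, UnaryOperator<String> change) {
        synchronized (space) {
            String current = space.take(id); // destructive read
            if (current == null) {
                return false;                // nothing to update
            }
            space.write(id, change.apply(current));
            return true;
        }
    }

    public static void main(String[] args) {
        TinySpace space = new TinySpace();
        space.write(7, "draft");
        update(space, 7, String::toUpperCase);
        System.out.println(space.read(7));
    }
}
```

Without the surrounding lock (or transaction), a reader arriving between the take and the write would find nothing there, which is exactly the sort of clash the text warns about.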

Please note that this is a very cursory overview of JavaSpaces, and can hardly do justice to the specification.

Blitz

Blitz is an implementation of JavaSpaces and little else, and includes the JavaSpace05 interface. It’s very fast, and easy to install and administer.

Installing is simply a matter of downloading the installer and running one of the provided batch files. Applications use a simple lookup process following one of these examples, and voila! The Spaces API is at your disposal.

As an implementation of the pure JavaSpace API, messages must conform to the JINI specification, which means: no primitives, and all data exposed is via public fields. However, the objects don’t have to be entirely anemic; they can provide methods and include behavior.

Since the JavaSpace contains full Java objects, Blitz supports master/worker and distributed processing very well. It also tends to be very fast at data retrieval, which makes it excellent as a sort of cache for data, in front of a relational datastore.

However, persistence of the space is really up to the user, so if you need cache write-through, you’re likely to write it yourself. This isn’t hard at all, but it’s still something to be aware of. Blitz does not extend the JavaSpaces API in any real way, although Blitz’ author is a fantastic resource for JavaSpaces users.
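
For a sense of how little code hand-rolled write-through needs, here is a minimal sketch. Nothing in it is Blitz-specific: `BackingStore`, `WriteThroughCache`, and `InMemoryStore` are hypothetical names, and a map plays the role of the relational datastore behind the cache.

```java
import java.util.HashMap;
import java.util.Map;

public class WriteThroughSketch {
    /** Hypothetical system of record, e.g. a DAO in front of a relational database. */
    interface BackingStore<K, V> {
        void save(K key, V value);

        V load(K key);
    }

    /** Fast in-memory layer that pushes every write through to the backing store. */
    static final class WriteThroughCache<K, V> {
        private final Map<K, V> cache = new HashMap<>();
        private final BackingStore<K, V> store;

        WriteThroughCache(BackingStore<K, V> store) {
            this.store = store;
        }

        synchronized void write(K key, V value) {
            store.save(key, value); // persist first so the cache never holds unsaved data
            cache.put(key, value);
        }

        synchronized V read(K key) {
            V value = cache.get(key);
            if (value == null) {    // miss: fall back to the system of record
                value = store.load(key);
                if (value != null) {
                    cache.put(key, value);
                }
            }
            return value;
        }
    }

    /** Map-backed store, used only to make the sketch runnable. */
    static final class InMemoryStore<K, V> implements BackingStore<K, V> {
        final Map<K, V> rows = new HashMap<>();

        public void save(K key, V value) {
            rows.put(key, value);
        }

        public V load(K key) {
            return rows.get(key);
        }
    }

    public static void main(String[] args) {
        InMemoryStore<String, String> database = new InMemoryStore<>();
        WriteThroughCache<String, String> cache = new WriteThroughCache<>(database);
        cache.write("customer:1", "Smith");
        System.out.println(database.rows.get("customer:1")); // the write reached the store
    }
}
```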

GigaSpaces

GigaSpaces is a JavaSpaces implementation driven by commercial and pragmatic interests. It differs from Blitz in many, many ways, some of which I’ll enumerate hopefully without bias:

  • It uses the older JavaSpace interface, while extending it with a GigaSpace interface that provides many of the same capabilities as the JavaSpace05 interface.
  • It’s designed to be clustered, and this clustering support implies a distributed architecture that, if followed, can be both very invasive and incredibly efficient.
  • It provides other data access mechanisms, such as a map interface, JMS, and JDBC. As a result, it has a number of rich querying techniques available.
  • It provides a Spring layer, called OpenSpaces, that makes writing callback handlers very convenient (and declarative).
  • The clustered application architecture provides for a lot of failover capabilities, and the ability to collocate processing and data means that data access tends to happen at the speed of accessing RAM.
  • It provides for built-in persistence of the space, surviving restarts.

Running GigaSpaces involves three types of applications: a Lookup Service (“LUS”), a GigaSpaces Manager (the “GSM”) and a set of GigaSpaces Containers (“GSCs”). A lookup service keeps track of the other components in the system. A GSM distributes applications among the available GSCs, and a GSC uses the GSM to determine to which processes it should sync data.

A GSC can contain JavaSpaces and what GigaSpaces terms “processing units.” A processing unit has direct access to the space and is designed to handle any processing that it can reach.

Designing an object model can be far easier in GigaSpaces than in the standard JINI model, using annotations to indicate whether attributes are to be persisted and indexed. You can also indicate a routing field, so that a specific container gets matching data routed to it, which means that if you’re using the partitioned processing unit model, distribution happens almost automagically.
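
To make the routing idea concrete: one common way to route data to a partition is to hash the routing value and take it modulo the number of partitions. The sketch below assumes that scheme purely for illustration; the text doesn’t say how GigaSpaces actually computes it, and `partitionFor` is a made-up helper, not part of any GigaSpaces API.

```java
public class RoutingSketch {
    /** Picks a partition from the routing value's hash; floorMod keeps the result non-negative. */
    static int partitionFor(Object routingValue, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        return Math.floorMod(routingValue.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 4;
        // The same routing value always lands on the same partition, so related
        // data ends up next to the code that processes it.
        System.out.println(partitionFor("customer-42", partitions));
        System.out.println(partitionFor("customer-42", partitions));
        System.out.println(partitionFor(Integer.valueOf(-7), partitions));
    }
}
```

The important property is determinism: everything that shares a routing value ends up in the same container, which is what lets processing happen next to the data.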

Transactions in a processing unit tend to be very, very quick – as the distribution mechanism means that data lives in the same VM as the processing algorithms (giving you access to the data at the speed of your RAM, rather than exposing your data to a network.) This is a very powerful feature, but does require converting your application to something rather unlike the traditional client/server model.

One of the neat things about GigaSpaces – and I’m biased through experience – is that most of the limitations of JavaSpaces have workarounds in the GigaSpaces API.

JavaSpaces Summary

From an architectural standpoint, JavaSpaces provide very rapid access to data, but the form of the data will be changed in the process of making it appropriate for JavaSpaces. An ORM is designed to make that transition fairly easy, at the cost of data retrieval efficiency; JavaSpaces data, once modified, can be accessed easily and quickly, but the transition isn’t entirely easy.

Since the support for CRUD is actually slightly limited (JavaSpaces supports CRD, not CRUD) and queries can be limited depending on your data store provider, JavaSpaces has to be seen as weak in queries, even if specific products are much, much stronger.

If your application is heavy on reporting, JavaSpaces doesn’t easily support indexed data (you have to order it yourself, storing the order in the Space or ordering it after retrieval), unless you use GigaSpaces’ alternate query APIs. Therefore, you typically end up writing the data out to a warehouse-type datastore (i.e., a relational database) for reporting.

Sets of data – lists – can be very efficient in JavaSpaces but again, supporting efficiency typically means altering your data model fairly dramatically. For example, despite JavaSpaces being able to store full objects, if you want to query a set of objects in a container, they usually need to be broken out into external objects.

JavaSpaces can also be used as caches behind an ORM like Hibernate, which might be a quick and easy way to get a performance boost.

With such a simple API, most issues surrounding JavaSpaces applications revolve around how to build your object model (with recommendations pretty much all leaning to “be very very simple”) or container configuration. Because the userbase isn’t gigantic, your best support is going to be commercial. That said, its users (including me) are very enthusiastic about the technology.

The benchmark used GigaSpaces, in two modes: an “embedded” container and an external container. The embedded container is the “normal” way GigaSpaces works; you put your code into the container and it runs alongside your data. The scary thing here is that, depending on your requirements, you can speed the embedded numbers up by quite a bit as well, even though the results are sometimes in tenths of milliseconds.

Embedded GigaSpace
Average write time 0.09ms
Average read time, keys in serial order 0.05ms
Average read time, keys in random order 0.04ms
Average delete time, keys in serial order 0.07ms
Average delete time, keys in random order 0.07ms
Average query time, value in serial order 0.04ms
Average query time, value in random order 0.04ms

External GigaSpace
Average write time 2.73ms
Average read time, keys in serial order 1.51ms
Average read time, keys in random order 1.16ms
Average delete time, keys in serial order 1.27ms
Average delete time, keys in random order 1.16ms
Average query time, value in serial order 1.33ms
Average query time, value in random order 1.26ms

Java Content Repository

Technically, this isn’t a datastore any more than JDBC is: it’s a specification. Implementations range from exposing the specification itself to providing a content management layer that abstracts away JCR, which is itself an abstraction.

The starting point for implementations of JCR is Jackrabbit, from Apache. It’s very difficult to project which actual content repositories are “more popular,” because vendors have their own data and users aren’t talking enough.

JCR doesn’t manage data in traditional row/column format. Instead, JCR manages content; an object has no meaning for JCR, but an object’s properties do.

A convenient mindset to have when using JCR is that of all data being managed in XML. The data is not in XML normally, but that’s an easy and proper abstraction. Queries can use XPath (or a form of SQL); data is managed in terms of nodes, child nodes, and node attributes.

Data in JCR can be structured or unstructured. Structured data means that a valid Customer node has to have a name and an address; an address could be structured such that it requires a street address, a city, a state, and a postal code (assuming American addresses, of course.)

Unstructured data is, of course, unstructured. An unstructured customer node (if there is such a thing) could contain other customers, for example, even though that might be considered an error in data structure.

Both approaches have strengths, but most people who need structured data use a relational database directly for that, while using JCR to store typically more free-form data and images. This is part of what makes JCR a document-oriented data store – a structured node enforces referential integrity, while an unstructured node does not.

JCR can also version data, making histories easy to manage.

Querying JCR is very flexible; you can query by type or XPath (“//usa/in/marion/customers[@name='interactions']”, to get “interactions” from Indianapolis – or, more generically, “//*/customers[@name='interactions']”.) Wildcards are supported – and, most usefully, full-text search is supported. (In JCR 2.0, recently completed as a specification, the queries changed quite a bit, preferring an object model query or a variant of SQL.)
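
If the “think of it as XML” mindset is hard to picture, the two XPath expressions above can be tried against an actual XML document using nothing but the JDK. To be clear, this is not the JCR API; the document content and the `count` helper are invented for the sketch, which only shows how such paths select nodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathMindsetSketch {
    /** Invented content tree: country, state, county, then customer nodes. */
    static final String CONTENT =
        "<root><usa><in><marion>"
        + "<customers name='interactions'/>"
        + "<customers name='other'/>"
        + "</marion></in></usa></root>";

    /** Parses the content and counts the nodes the XPath expression selects. */
    static int count(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(CONTENT.getBytes(StandardCharsets.UTF_8)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Fully specified path, then the more generic form from the text.
        System.out.println(count("//usa/in/marion/customers[@name='interactions']"));
        System.out.println(count("//*/customers[@name='interactions']"));
        // Dropping the attribute predicate selects every customers node.
        System.out.println(count("//customers"));
    }
}
```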

JCR implementations in practice have been very fast. The layers above JCR sometimes have not been, sadly, but that’s a factor of how the API is used more than a problem with the API itself.

As for the six features of a datastore: JCR tends to make CRUD operations very verbose. Here’s an example of the DAO read() operation:

Queries were much like the JPA queries, albeit more generic:

Again, the JCR test had some issues running, so this code isn’t likely to be perfect – but it should be fairly representative of the kind of code you’re looking into for the data layer in a JCR application. Writes can version data (and queries can use it, although this code ignores versions); transactions are managed in the form of locks. (The locking mechanism is what affected the benchmark; the data structure isn’t good for a lot of flat data, and the locks would have affected the entire dataset when in use.)

Where JCR really shines is in querying data; it’s the only data store type we’ve covered that almost mandates full-text search support.

JCR Summary

JCR is a strange API for most developers, mandating familiarity with XPath and XML for efficiency, and adding bits and bobs on top of those abstractions. For those who aren’t actually familiar with it, it’s going to be a strange ride. That said, it’s actually pretty common. Implementations are easy to find and most seem pretty capable.

JCR can retrieve data very quickly thanks to how it’s all structured in the end; XPath is easy to master enough for common use, so finding data (even old versions) and updating is usually pretty trivial. The support of structured and unstructured data offers a lot of flexibility for application programmers (provided you can figure out the structured schema), and the full-text capability is a huge win for JCR.

However, for big reporting jobs, JCR isn’t really appropriate, and it doesn’t normally expose the underlying data in human-readable format. Also, the XML-like nature of content means that your data representations can be very wide, which can be very inefficient, and updating a JCR node can be rather verbose compared to most other mechanisms.

Support for JCR is a little spotty. There are a lot of questions around how to build queries correctly, and sadly, not a lot of good answers yet. That will probably change, but for now, support isn’t extremely good.

The benchmark, however, was … problematic. JCR was able to read very very quickly, and search quickly as well, but writes were nearly intolerable, and difficult to get working concurrently. Part of the problem is that JCR isn’t designed for massive writes like the benchmark does, so it’s a little unfair to expect it to perform well in the write tests. Plus, writing in JCR involves a lot more than it does in the other data stores; if we wanted to, we could run a full-text search on our meager test data.

Memcached

Memcached is an external service (much like a relational database is) that is generally aimed at being a distributed, updated cache. It has many of the same issues that a clustered JavaSpace would have (i.e., not designed for reporting, potential transaction issues if objects participating in a single transaction are distributed across nodes), except the querying API is simpler: it’s just a map. On the other hand, it’s trivial to set up, and simple is sufficient for many needs.

From a developer’s standpoint, if this is used in any complex scenario, chances are you’ll spend a decent amount of time creating artificial hash keys to store data, or you’ll alter your data model – again, much like JavaSpaces.
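
An “artificial hash key” usually just means gluing enough context into one string to find the value again later. A small sketch follows, with a plain `HashMap` standing in for the cache client; the `key` helper and the colon-separated format are conventions assumed for this example, not anything memcached requires.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeKeySketch {
    /** Builds a flat key such as "customer:42:address" out of type, id, and attribute. */
    static String key(String type, long id, String attribute) {
        return type + ":" + id + ":" + attribute;
    }

    public static void main(String[] args) {
        // A plain map stands in for the cache client: get and set by string key, nothing more.
        Map<String, String> cache = new HashMap<>();
        cache.put(key("customer", 42, "name"), "Interactions");
        cache.put(key("customer", 42, "address"), "1 Main St");

        // There is no "find by customer" query; the caller rebuilds each key it wants.
        System.out.println(cache.get(key("customer", 42, "address")));
    }
}
```

Anything that would have been a query elsewhere (all addresses in a city, say) has to be precomputed and stored under its own key, which is where the data model starts to bend.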

Java coders would most typically use memcached as a cache (again, shades of JavaSpaces – or vice versa) behind an ORM like Hibernate; you may want to consider native Java caches like Ehcache or Coherence, or even a distributed heap like Terracotta DSO instead of memcached. Most of these products have configurations documented for the popular ORMs.

The primary difference between memcached and a distributed JavaSpace is that memcached is more upfront about synchronization issues, and the simpler API means a simpler initial implementation.

Memcached was not benchmarked for this article.

MongoDB

MongoDB is a document database that on the surface looks and acts a lot like the Java Content Repository. It doesn’t store objects as much as it stores documents in a hierarchy of nodes.

It’s different from JCR in a number of crucial ways: the API is dramatically different (and much simpler), and it’s designed around JSON representation, which means MongoDB has far less of a Java focus than JCR does.

The API follows the same sort of “acquire connection, work with data” pattern that JDBC, et al, follow. MongoDB runs in its own server (written in C++, not designed to be embedded any more than, say, MySQL is), so the first calls made in a mongo client acquire references to the external server and then acquire references to an external database – which is represented as a set of collections of data.

Like JCR and JavaSpaces, MongoDB allows referential integrity, but doesn’t enforce it. This can be a strength; your objects can be sent back and forth in an incomplete state (i.e., as they’re being built by an application) instead of requiring placeholders for incomplete data.

MongoDB supports the CRUD operations fairly well, although queries can be “interesting” to work with, because they’re fundamentally different from most of the other APIs we’ve looked at.

Finding a specific document (remember, document database, not object or row storage) is easy enough, but queries involve building a JSON object and sending it to the database, yielding a response that contains matches.
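
The “query is itself a document” idea can be modeled with plain maps. This sketch does not use the MongoDB driver: `doc`, `matches`, and `find` are hypothetical helpers, and the only rule implemented is simple field equality, which is just the most basic kind of query document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class QueryDocumentSketch {
    /** Builds a document from alternating key/value arguments. */
    static Map<String, Object> doc(Object... keysAndValues) {
        if (keysAndValues.length % 2 != 0) {
            throw new IllegalArgumentException("need key/value pairs");
        }
        Map<String, Object> d = new LinkedHashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            d.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return d;
    }

    /** A document matches when every field named in the query holds an equal value. */
    static boolean matches(Map<String, Object> query, Map<String, Object> document) {
        for (Map.Entry<String, Object> wanted : query.entrySet()) {
            if (!document.containsKey(wanted.getKey())
                    || !Objects.equals(wanted.getValue(), document.get(wanted.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Applies the query document to a collection and returns every match. */
    static List<Map<String, Object>> find(List<Map<String, Object>> collection,
                                          Map<String, Object> query) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> document : collection) {
            if (matches(query, document)) {
                hits.add(document);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> customers = new ArrayList<>();
        customers.add(doc("name", "Smith", "city", "Raleigh"));
        customers.add(doc("name", "Jones", "city", "Indianapolis", "vip", true));

        // Documents in one collection need not share a shape; the query only
        // cares about the fields it names.
        System.out.println(find(customers, doc("city", "Indianapolis")));
    }
}
```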

Reads and writes use a document paradigm to interact with the database. Here’s the read() method:

Queries weren’t much more complex:

Transactions are supported, but two-phase commits are not; it’s feasible that an enterprising Java coder could write a JTA wrapper around MongoDB, but it’s not likely.

MongoDB is a “poor man’s JCR,” in a world where JCR tends to be massive overkill for unstructured data. Since MongoDB “speaks” JSON natively, it’s really, really good for sending data to web-based clients and clients written in non-Java languages, as the Java API is verbose compared to other languages.

The lack of two-phase commits in transactions could potentially be a problem, but realistically, it’s not; MongoDB’s usage and focus tends to be in very short-lived transactions such that clashes would be rare in the domain anyway.

The selling point for MongoDB is that it is fast. It was easily the fastest external datastore in the entire test, with the only faster entries being embedded Db4O (as an in-memory database with occasional synchronization to disk) and the embedded GigaSpace (where the embedded GigaSpace actually provided more persistence features, transactions, and more room for enhancement).

That said, here are the numbers:

Average write time 0.016ms
Average read time, keys in serial order 0.28ms
Average read time, keys in random order 0.27ms
Average delete time, keys in serial order 0.77ms
Average delete time, keys in random order 0.83ms
Average query time, value in serial order 0.36ms
Average query time, value in random order 0.39ms

Conclusions

Datastores Not Covered

Oh, there are so many! Some weren’t covered since their scopes were too limited or they really didn’t fit the definition of datastores: actual XML as data storage (use JCR instead!), Java serialization (which works well but isn’t an actual flexible datastore), Perst (which is really an embedded ODBMS, and therefore would be roughly equivalent to the embedded db4o instance). Some were viable but weren’t covered due to lack of space or experience: Prevayler, iBatis, REST (which is a specification, not a product), and Hadoop (as well as Cassandra, and the other BigTable-like variants.)

This is definitely not to imply that these aren’t worthy! However, they tend to be rather specialized, even more so than some of the datastores included. That said, you may want to check out the BigTable datastores – they’re very popular in cloud applications because they distribute data naturally.

Making a Choice

So which data storage mechanism is good for what?

JDBC is great for simple applications that don’t need caching, or cases where you need to-the-metal data manipulation. Data types tend to be lacking. Queries are verbose and simple but well-known. If you’re generating reports from a data-warehousing type application, this is the strongest API.

ORMs like JPA are the workhorse of data storage for Java; they stand between JDBC and the object-oriented data model Java encourages. They support caching data in many cases. Queries follow the JDBC model for the most part, and they often support data storage neutrality (for when you develop on H2 but want to use Oracle 9 for production). These are the gold standard for data storage in Java. Reporting is rather weak.

JavaSpaces are ideal for distributed processing applications. Queries tend to be simple (although GigaSpaces certainly provides a wide array of complex query scenarios). Transactions with JavaSpaces tend to be very, very quick compared to transactions with RDBMSes. The programming model takes some getting used to, though, and reporting is not very strong.

JCR is designed for content management, and it shows. For content apps like online magazines, blogs, catalogues, or even semantic data, JCR is difficult to beat as an actual storage mechanism. However, the API is very verbose; JCR really wants to have an abstraction above it to hide the gory details from an application programmer. Queries with JCR are generally more powerful than any other data storage mechanism, even though they are expressed in only two forms: a SQL variant and XPath. Designing a reporting application around JCR sounds like a very special kind of hell.

Memcached is ideal for simpler queries than JavaSpaces provides, but it’s also a naturally distributed datastore. It’s probably best suited for caching, but caching apps have a lot of benefits, so it’s not fair to call it just a cache. Reporting doesn’t enter into the picture.

ODBMSes are too different from each other to easily classify, but generally the querying capabilities are very strong (even though the Db4O queries were problematic for my benchmark.) Reporting is likely to mirror the ORM experience: not very pleasant. These tend to be very fast and small, as they’re often included in embedded applications.

14 Comments

  1. “JCR really wants to have an abstraction above it to hide the gory details from an application programmer”

    I strongly disagree. It might be a little bit complex at the beginning, but all the abstraction layers on top of JCR tend to create much more problems. Application developers, once they learned JCR of course, should use the api directly.

  2. Alex :

    “JCR really wants to have an abstraction above it to hide the gory details from an application programmer”

    I strongly disagree. It might be a little bit complex at the beginning, but all the abstraction layers on top of JCR tend to create much more problems. Application developers, once they learned JCR of course, should use the api directly.

    That’s fair. However, I’d point to Sling and other such projects as proof positive for the assertion; similar analogies can be found in Apache’s commons-email and JavaMail.

    You can “do everything” with JavaMail, and people horribly familiar with JavaMail can do everything with it, but most people would prefer not to.

  3. Reports with ORMs is pretty easy, or at least with Hibernate it is.
    1. Type safe “Report” objects
    2. Collections/Arrays of objects

    I use this with BIRT. Works very well. I am told it can be done with Crystal too.

  4. Another good one in the object database space is the Versant Object Database.

    These are the guys who develop and support db4o, but the Versant database is intended for larger scale, distributed systems.

    One of your main criticisms of the object database is reporting and query. The Versant object database has a jdbc/odbc interface which will allow you to hook it up to something like Crystal Reports for reporting.

    You get the cool stuff you found in db4o, plus you get great support for larger scale databases and bells and whistles like synchronous fault tolerance, online reorg, clustering, etc.

  5. Hi,
    I am looking for a reporting framework that works by querying JCR instead of querying the database. I was wondering if there was an existing framework which would help create reports by querying the JCR repository to provide the results (maybe by using XPath).

    I was looking into the open source BIRT Reporting Framework, but it doesn’t look like it can be used to query JCR.

    Please do let me know if there is any existing framework or if you have any info on using BIRT with Day JCR.

    Thank you.

    1. Dominique De Vito :
      Indeed, JCR is too complex to be widely adopted, while NoSQL databases are much more popular. So, MongoDB, being a simpler JCR, may be a better choice than JCR.
      Similar questions could be raised between JavaSpaces and NoSQL.
      See http://www.jroller.com/dmdevito/entry/thinking_about_nosql_databases_classification

      You cannot make statements like this. Look at ModeShape, it implemented JCR and on top of it there is really cool Graph API, Sequencers for automatic text/metadata processing and nodes population, Connectors, fulltext search and indexing etc. … I’d really like to see a layer on top of MongoDB that would provide something like this…

      1. Well, in all fairness, he used “may be” and not “is”.

        There is no absolute for any of these; I was more trying to illustrate the possibilities and some strengths and weaknesses of each, rather than make a single, final conclusion.

        If I was going to make a single conclusion I’d just say “use GigaSpaces!” and have done. :) But that’s not intellectually honest, nor is it very convincing.

Enigmastation.com © 2014