Wednesday, August 4, 2010

A different model for building applications

I have been working on a prototype application for a large client:

The Problem
  1. A large number of users would like to access their specific data to make queries.
  2. Each users data is held in a separate SQL database.
  3. The current app only has minimal indexes, and adding new indexes is limited by the engineers managing the SQL cluster.
  4. The new app would like to make much more complex queries than those currently supported by indexing.
  5. Future apps would like to update the schema, and queries supported.
  6. Most data sets are in the order of MB.
  7. Users are only active part time.
A Crazy Solution

Stream the data sets into an in memory schema-less database on demand, support a rich brute force SQL like query language, and add the notion that a query could be incomplete if the streaming wasn't complete.

Cross ones fingers, and hope that as the data sets grow, so does the RAM and CPU speed of the machines processing the queries.

A Working Prototype

So its working, and I have been building the start of a test application on top of the stack: I call it a stack because I'm currently running three different servers (I will add a couple more) to host my simple application.
  1. Emulation layer that emulates the simplest feature sets of the current SQL cluster. Access to the real cluster is restricted because of security, and also my connection bandwidth to the cluster is restricted.
  2. Query Engine: reads the data from the cluster, and process SQL like queries over it, and returns a result set. Only SELECT is supported, but output control, WHERE filter, GROUP, ORDER, LIMIT and OFFSET are all supported. there is currently no caching of query sets, or compiled querie.
  3. Application server, provides abstract interface between application and query engine, will allow me to write a normal browser application, and a custom iPhone application.
Common Questions:

Why not index the existing data? 
Tests show the current application on the SQL cluster is already stretching its performance.

Why not cache the data in another SQL cluster, with more indexes?
It was found that the performance of most database systems don't have the write performance to efficiently stream the data on demand from the main SQL cluster. Database load times were in the order of minutes.


The prototype solution can move most data sets in seconds, while using existing databases was taking minutes. For an on demand system it was felt important to have the user data ready as quickly as possible.
 
Why not use off the shelf database running in memory?
The data currently lives in one table, but as the applications expand the schema for each object in the table is going to change, so going with a normal table approach would lead to an ever expanding main table, or make writing queries much harder.

The Query Engine is a test for building a very large scale data assimilation platform, with query capabilities. At its core there is an expression compiler that generates msil (for the clr runtime).

No comments:

Post a Comment