Archive for the ‘Language’ Category

Flexible Schemas and PostgreSQL

Thursday, May 6th, 2010

First, what is a “flexible schema”? It’s hard to pin down an exact definition, but it’s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles.

That’s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of development. For example, you may decide to track when a user last did something on the website, so that you can adapt news and notices for those users (e.g. “Did you know that we added feature XYZ since you last visited?”). Developers have a need to produce a prototype quickly to work out the edge cases (do we update that timestamp for all actions, or only certain ones?), and probably a need to put it in production so that the users can benefit sooner.

A common worry is that ALTER TABLE will be a major performance problem. That’s sometimes a problem, but in PostgreSQL, you can add a column to a table in constant time (not dependent on the size of the table) in most situations. I don’t think this is a good reason to avoid ALTER TABLE, at least in PostgreSQL (other systems may impose a greater burden).

There are good reasons to avoid ALTER TABLE, however. We’ve only defined one use case for this new “last updated” field, and it’s a fairly loose definition. If we use ALTER TABLE as a first reaction for tracking any new application state, we’d end up with lots of columns with overlapping meanings (all subtly different), and it would be challenging to keep them consistent with each other. More importantly, adding new columns without thinking through the meaning and the data migration strategy will surely cause confusion and bugs. For example, if you see the following table:

    CREATE TABLE users
    (
      name         TEXT,
      email        TEXT,
      ...,
      last_updated TIMESTAMPTZ
    );

you might (reasonably) assume that the following query makes sense:

    SELECT * FROM users
      WHERE last_updated < NOW() - '1 month'::INTERVAL;

Can you spot the problem? Old user records (before the ALTER TABLE) will have NULL for last_updated timestamps, and will not satisfy the WHERE condition even though they intuitively qualify. There are two parts to the problem:

  1. The presence of the last_updated field fools the author of the SQL query into making assumptions about the data, because it seems so simple on the surface.
  2. NULL semantics allow the query to be executed even without complete information, leading to a wrong result.

Let’s try changing the table definition:

    CREATE TABLE users
    (
      name       TEXT,
      email      TEXT,
      ...,
      properties HSTORE
    );

HSTORE is a set of key/value pairs. Some tuples might have the last_updated key in the properties attribute, and others may not. This accomplishes two things:

  1. There’s no need for ALTER TABLE or cluttering of the namespace with a lot of nullable columns.
  2. The name “properties” is vague enough that query writers would (hopefully) be on their guard, understanding that not all records will share the same properties.

You could still write the same (wrong) query against the second table with minor modification. Nothing has fundamentally changed. But we are using a different development strategy that’s easy on application developers during rapid development cycles, yet does not leave a series of pitfalls for users of the data. When a certain property becomes universally recorded and has a concrete meaning, you can plan a real data migration to turn it into a relation attribute instead.

Now, we need some guiding principles about when to use a complex type to represent complex information, and when to use separate columns in the table. To maximize utility and minimize confusion, I believe the best guiding principle is the meaning of the data you’re storing across all tuples. When defining the attributes of a relation, if you find yourself using vague nouns such as “properties,” or resorting to complex qualifications (lots of “if/then” branching in your definition), consider less constrained data types like HSTORE. Otherwise, it’s best to nail down the meaning in terms of appropriate nouns, which will help keep the DBMS smart and queries simple (and correct). See Choosing Data Types and further guidance in reference [1].

I believe there are three reasons why application developers feel that relational schemas are “inflexible”:

  1. A reliance on NULL semantics to make things “magically work,” when in reality, it just makes queries succeed that should fail. See my previous posts: None, nil, Nothing, undef, NA, and SQL NULL and What is the deal with NULLs?.
  2. The SQL database industry has avoided interesting types, like HSTORE, for a long time. See my previous post: Choosing Data Types.
  3. ORMs make a fundamental false equivalence between an object attribute and a table column. There is a relationship between the two, of course; but they are simply not the same thing. This is a direct consequence of “The First Great Blunder”[2].

EDIT: I found a more concise way to express my fundamental point — During the early stages of application development, we only vaguely understand our data. The most important rule of database design is that the database should represent reality, not what we wish reality was like. Therefore, a database should be able to express that vagueness, and later be made more precise when we understand our data better. None of this should be read to imply that constraints are less important or that we need not understand our data. These ideas mostly apply only at very early stages of development, and even then, prudent use of constraints often makes that development much faster.

[1] Date, C.J.; Darwen, Hugh (2007). Databases, Types, and the Relational Model. pp. 377-380 (Appendix B, “A Design Dilemma”).

[2] Date, C.J. (2000). An Introduction To Database Systems, p. 865.

Temporal PostgreSQL Roadmap

Tuesday, March 9th, 2010

Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don’t provide us with the basic tools to do it. Every general-purpose DBMS falls short both in terms of usability and performance when trying to manage temporal data.

What is already done?

(more…)

Scalability and the Relational Model

Sunday, March 7th, 2010

The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.

So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.

(more…)

Temporal Keys, Part 2

Sunday, November 8th, 2009

In the last article, I argued that:

  • A schedule conflict is a typical business problem.
  • The later you try to resolve a schedule conflict, the more costly it is to resolve.
  • In particular, there is a big jump in the cost the moment after conflicting data is recorded.
  • Therefore, it’s best for the DBMS itself to enforce the constraint, because only the DBMS can avoid the conflict effectively before the conflict is recorded.

Then, I opened up a discussion to see how people are dealing with these schedule conflicts. In the comments I received at the end of the article, as well as other anecdotes from conferences, user groups, mailing lists, and my own experience, the solutions fall into a few categories:

(more…)

PostgreSQL WEST and Temporal Databases

Monday, October 12th, 2009

I’ve been interested in temporal data and relational databases for quite some time. There are going to be at least two people talking about temporal data at PostgreSQL WEST in Seattle: Scott Bailey and me. See the talk descriptions.

In the past, I’ve worked on a temporal extension to PostgreSQL that implements the PERIOD data type. This is a data type that offers both a definite beginning and a definite end time, which is important for describing things that happen over a period of time, rather than instantaneously. Trying to use separate attributes for “start” and “end” is bad for a number of reasons, and will certainly be addressed in a subsequent blog entry. For now, I’ll just say that I believe the PERIOD data type is fundamentally important for handling all kinds of time data, which I believe is a common problem.

At WEST, I’ll be presenting my progress on temporal keys. Temporal keys are used to prevent overlapping periods of time — a schedule conflict — by using an index and following the same concurrent behavior as UNIQUE with minimal performance cost (one extra index search, to be precise).

Temporal keys cannot be expressed in PostgreSQL 8.4, unless you resort to triggers and a full table lock (ouch!). So, additional backend support is required. This is accomplished in my patch for operator exclusion constraints, which are a more general way of using arbitrary operators and index searches to enforce a constraint. I plan to do what’s required for the patch to be accepted in PostgreSQL 8.5.

Temporal modeling is a common problem. It seems like almost every PostgreSQL conference has had at least one talk on the matter, so we know there is some demand for improvement. If you’re interested, I hope you come to WEST and chat with Scott or I, and let’s see if we can come up with some real solutions.

What is the deal with NULLs?

Sunday, August 2nd, 2009

A recent thread on pgsql-hackers warrants some more extensive discussion. In the past, I’ve criticized NULL semantics, but in this post I’d just like to explain some corner cases that I think you’ll find interesting, and try to straighten out some myths and misconceptions.

First off, I’m strictly discussing SQL NULL here. SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.

(more…)

ruby-pg is now the official postgres ruby gem

Friday, December 14th, 2007
ruby-pg is now the official rubyforge project for the “postgres” ruby
gem. See the project here:

http://www.rubyforge.org/projects/ruby-pg

or install the gem directly:

# gem install –remote postgres

(more…)

On ORMs and "Impedence Mismatch"

Monday, December 3rd, 2007

The solution ORMs provide is to “map” an object class, which is a type (or domain), onto a table, which is a relation variable (a.k.a. relvar). This supposedly abstracts away an “impedance mismatch” between the two. ORMs are already off to a bad start, mapping a type to a variable, but I’ll continue.

The real impedance mismatch is a set of very fundamental differences between application data and data in a well-designed relational database.

(more…)