<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Experimental Thoughts</title>
	<atom:link href="http://thoughts.j-davis.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://thoughts.j-davis.com</link>
	<description>Ideas on Databases, Logic, and Language by Jeff Davis</description>
	<lastBuildDate>Fri, 07 Oct 2011 03:05:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>SQL: the successful cousin of Haskell</title>
		<link>http://thoughts.j-davis.com/2011/09/25/sql-the-successful-cousin-of-haskell/</link>
		<comments>http://thoughts.j-davis.com/2011/09/25/sql-the-successful-cousin-of-haskell/#comments</comments>
		<pubDate>Sun, 25 Sep 2011 07:10:29 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=472</guid>
		<description><![CDATA[Haskell is a very interesting language, and shows up on sites like http://programming.reddit.com frequently. It&#8217;s somewhat mind-bending, but very powerful and has some great theoretical advantages over other languages. I have been learning it on and off for some time, never really getting comfortable with it but being inspired by it nonetheless. But discussion on [...]]]></description>
			<content:encoded><![CDATA[<p>Haskell is a very interesting language, and shows up on sites like <a href="http://programming.reddit.com">http://programming.reddit.com</a> frequently. It&#8217;s somewhat mind-bending, but very powerful and has some great theoretical advantages over other languages. I have been learning it on and off for some time, never really getting comfortable with it but being inspired by it nonetheless.</p>
<p>But discussion on sites like reddit usually falls a little flat when someone asks a question like:</p>
<blockquote><p>If haskell has all these wonderful advantages, what amazing applications have been written with it?</p></blockquote>
<p>The responses to that question usually aren&#8217;t very convincing, quite honestly.</p>
<p>But what if I told you there was a wildly successful language, in some ways the <em>most</em> successful language ever, and it could be characterized by:</p>
<ul>
<li>lazy evaluation</li>
<li>declarative</li>
<li>type inference</li>
<li>immutable state</li>
<li>tightly controlled side effects</li>
<li>strict static typing</li>
</ul>
<p>Surely that would be interesting to a Haskell programmer? Of course, I&#8217;m talking about SQL.</p>
<p><span id="more-472"></span>Now, it&#8217;s all falling into place. All of those theoretical advantages become practical when you&#8217;re talking about managing a lot of data over a long period of time, and trying to avoid making any mistakes along the way. Really, that&#8217;s what relational database systems are all about.</p>
<p>I speculate that SQL is <em>so</em> successful and pervasive that it stole the limelight from languages like haskell, because the tough problems that haskell would solve are <em>already solved</em> in so many cases. Application developers can hack up a SQL query and run it over 100M records in 7 tables, glance at the result, and turn it over to someone else with near certainty that it&#8217;s the right answer! Sure, if you have a poorly-designed schema and have all kinds of special cases, then the query might be wrong too. But if you have a mostly-sane schema and mostly know what you&#8217;re doing, you hardly even need to check the results before using the answer.</p>
<p>In other words, if the query compiles, and the result looks anything like what you were expecting (e.g. the right basic structure), then it&#8217;s probably correct. Sound familiar? That&#8217;s exactly what people say about haskell.</p>
<p>It would be great if haskell folks would get more involved in the database community. It looks like a lot of useful knowledge could be shared. Haskell folks would be in a better position to find out how to apply theory where it has already proven to be successful, and could work backward to find other good applications of that theory.</p>
<p>Competing directly in the web application space against languages like ruby and javascript is going to be an uphill battle even if haskell is better in that space. I&#8217;ve worked with some very good ruby developers, and I honestly couldn&#8217;t begin to tell them where haskell might be a practical advantage for web application development. Again, I don&#8217;t know much about haskell aside from the very basics. But if someone like me who is interested in haskell and made some attempt to understand it and read about it still cannot articulate a practical advantage, clearly there is some kind of a problem (either messaging or technical). And that&#8217;s a huge space for application development, so that&#8217;s a serious concern.</p>
<p>However, the data management space is also huge &#8212; a large fraction of those applications exist primarily to collect data or present data. So, if haskell folks could work with the database community to advance data management, I believe that would inspire a lot of interesting development.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2011/09/25/sql-the-successful-cousin-of-haskell/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Database for a Zoo: the problem and the solution</title>
		<link>http://thoughts.j-davis.com/2011/09/21/database-for-a-zoo-the-problem-and-the-solution/</link>
		<comments>http://thoughts.j-davis.com/2011/09/21/database-for-a-zoo-the-problem-and-the-solution/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 07:00:52 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Logic]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=315</guid>
		<description><![CDATA[Let&#8217;s say you&#8217;re operating a zoo, and you have this simple constraint: You can put many animals of the same type into a single cage; or distribute them among many cages; but you cannot mix animals of different types within a single cage. This rule prevents, for example, assigning a zebra to live in the [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s say you&#8217;re operating a zoo, and you have this simple constraint:</p>
<blockquote><p>You can put many animals of the same type into a single cage; or distribute them among many cages; but you cannot mix animals of different types within a single cage.</p></blockquote>
<p>This rule prevents, for example, assigning a zebra to live in the same cage as a lion. Simple, right?</p>
<p>How do you enforce it? Any ideas yet? Keep reading: I will present a solution that uses a generalization of the standard UNIQUE constraint.</p>
<p>(Don&#8217;t dismiss the problem too quickly. As with most simple-sounding problems, it&#8217;s a fairly general problem with many applications.)</p>
<p><span id="more-315"></span>First of all, let me say that, in one sense, it&#8217;s easy to solve: see if there are any animals already assigned to the cage, and if so, make sure they are the same type. That has two problems:</p>
<ol>
<li>You have to remember to do that each time. It&#8217;s extra code to maintain, possibly an extra round-trip, slightly annoying, and won&#8217;t work unless <em>all</em> access to the database goes through that code path.</li>
<li>More subtly, the pattern <em>read, decide what to write, write</em> is prone to race conditions when another process writes after you read and before you write. Without excessive locking, solving this is hard to get right &#8212; and likely to pass tests during development before failing in production.</li>
</ol>
<p><em>[ Aside: if you use <a href="http://www.postgresql.org/docs/current/static/transaction-iso.html#XACT-SERIALIZABLE">true serializability</a> in PostgreSQL 9.1, that completely solves problem #2, but problem #1 remains. ]</em></p>
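<p>To make problem #2 concrete, here is a sketch of the race as an interleaving of two sessions. It assumes a hypothetical <code>zoo(animal_name, animal_type, cage)</code> table with no constraint on it yet, the default READ COMMITTED isolation level, and a cage 1 that starts out empty; the animal names are made up for illustration:</p>
<pre>-- Session A: check the cage, see that it is empty
BEGIN;
SELECT DISTINCT animal_type FROM zoo WHERE cage = 1;   -- no rows

-- Session B: does the same check, also sees an empty cage
BEGIN;
SELECT DISTINCT animal_type FROM zoo WHERE cage = 1;   -- no rows
INSERT INTO zoo VALUES ('Larry', 'lion', 1);
COMMIT;

-- Session A: acts on its now-stale read
INSERT INTO zoo VALUES ('Zap', 'zebra', 1);
COMMIT;   -- a zebra and a lion now share cage 1</pre>
<p>Each session followed the rule as far as it could tell, yet the combined result breaks it. A test suite that runs one session at a time would never see this.</p>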
<p>Those are exactly the kinds of problems that a DBMS is meant to solve. But what to do? Unique indexes don&#8217;t seem to solve the problem very directly, and neither do foreign keys. I believe that they can be combined to solve the problem by using two unique indexes, a foreign key, and an extra table, but that sounds painful (perhaps someone else has a simpler way to accomplish this with SQL standard features?). Row locking and triggers might be an alternative, but also not a very clean solution.</p>
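<p>One way that combination of standard features might look is sketched below. This is only one possible reading of the idea, and the table names <code>cage_type</code> and <code>zoo_fk</code> are assumptions for illustration:</p>
<pre>-- Extra table: each cage is registered with exactly one animal type
CREATE TABLE cage_type
(
  cage        INTEGER PRIMARY KEY,      -- unique index #1: one row per cage
  animal_type TEXT NOT NULL,
  UNIQUE      (cage, animal_type)       -- unique index #2: target of the foreign key
);

CREATE TABLE zoo_fk
(
  animal_name TEXT UNIQUE,
  animal_type TEXT,
  cage        INTEGER,
  -- an animal may only be placed in a cage registered for its type
  FOREIGN KEY (cage, animal_type) REFERENCES cage_type (cage, animal_type)
);</pre>
<p>The cost is that a cage has to be registered in <code>cage_type</code> before any animal can be assigned to it, and changing what a cage holds means touching two tables.</p>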
<p>A better solution exists in PostgreSQL 9.1 using <a href="http://www.postgresql.org/docs/current/static/sql-createtable.html#SQL-CREATETABLE-EXCLUDE">Exclusion Constraints</a> (Exclusion Constraints were introduced in 9.0, but this solution requires the slightly-more-powerful version in 9.1). If you have never seen an Exclusion Constraint before, I suggest reading <a href="http://thoughts.j-davis.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/">a previous post of mine</a>.</p>
<p>Exclusion Constraints have the following semantics (copied from documentation link above):</p>
<blockquote><p>The <tt>EXCLUDE</tt> clause defines an exclusion constraint, which guarantees that if any two rows are compared on the specified column(s) or expression(s) using the specified operator(s), not all of these comparisons will return <tt>TRUE</tt>. If all of the specified operators test for equality, this is equivalent to a <tt>UNIQUE</tt> constraint&#8230;</p></blockquote>
<p>First, as a prerequisite, we need to install <code>btree_gist</code> into our database (make sure you have the contrib package itself installed first):</p>
<pre>CREATE EXTENSION btree_gist;</pre>
<p>Now, we can use an exclude constraint like so:</p>
<pre>CREATE TABLE zoo
(
  animal_name TEXT,
  animal_type TEXT,
  cage        INTEGER,
  UNIQUE      (animal_name),
  EXCLUDE USING gist (animal_type WITH &lt;&gt;, cage WITH =)
);</pre>
<p>Working from the definition above, what does this exclusion constraint mean? If any two tuples in the relation are ever compared (let&#8217;s call these TupleA and TupleB), then the following will <em><strong>never</strong></em> evaluate to TRUE:</p>
<pre>TupleA.animal_type &lt;&gt; TupleB.animal_type AND
TupleA.cage        =  TupleB.cage</pre>
<p><em>[ Observe how this would be equivalent to a UNIQUE constraint if both operators were "=". The trick is that we can use a different operator -- in this case, "&lt;&gt;" (not equals). ]</em></p>
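<p>Another way to read the constraint is as a query that must always come back empty. A sketch against the <code>zoo</code> table above (the aliases <code>a</code> and <code>b</code> are just illustrative):</p>
<pre>-- Find every pair of rows that the exclusion constraint forbids
SELECT a.animal_name, b.animal_name, a.cage
FROM zoo a
JOIN zoo b
  ON  a.animal_type &lt;&gt; b.animal_type
  AND a.cage        =  b.cage;</pre>
<p>With the exclusion constraint in place, this should return zero rows no matter what has been inserted.</p>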
<p>Results:</p>
<pre>=&gt; insert into zoo values('Zap', 'zebra', 1);
INSERT 0 1
=&gt; insert into zoo values('Larry', 'lion', 2);
INSERT 0 1
=&gt; insert into zoo values('Zachary', 'zebra', 1);
INSERT 0 1
=&gt; insert into zoo values('Zeta', 'zebra', 2);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(zebra, 2) conflicts with existing key (animal_type, cage)=(lion, 2).
=&gt; insert into zoo values('Zeta', 'zebra', 3);
INSERT 0 1
=&gt; insert into zoo values('Lenny', 'lion', 2);
INSERT 0 1
=&gt; insert into zoo values('Lance', 'lion', 1);
ERROR:  conflicting key value violates exclusion constraint "zoo_animal_type_cage_excl"
DETAIL:  Key (animal_type, cage)=(lion, 1) conflicts with existing key (animal_type, cage)=(zebra, 1).
=&gt; select * from zoo order by cage;</pre>
<pre> animal_name | animal_type | cage
-------------+-------------+------
 Zap         | zebra       |    1
 Zachary     | zebra       |    1
 Larry       | lion        |    2
 Lenny       | lion        |    2
 Zeta        | zebra       |    3
(5 rows)</pre>
<p>And that is precisely the constraint that we need to enforce!</p>
<ol>
<li>The constraint is <em>declarative</em> so you don&#8217;t have to deal with different access paths to the database or different versions of the code. Merely the fact that the constraint exists means that PostgreSQL will <em>guarantee it.</em></li>
<li>The constraint is also immune from race conditions &#8212; as are all EXCLUDE constraints &#8212; because again, PostgreSQL <em>guarantees it.</em></li>
</ol>
<p>Those are nice properties to have and, if used properly, will reduce overall application complexity and improve robustness.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2011/09/21/database-for-a-zoo-the-problem-and-the-solution/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Building SQL Strings Dynamically, in 2011</title>
		<link>http://thoughts.j-davis.com/2011/07/09/building-sql-strings-dynamically-in-2011/</link>
		<comments>http://thoughts.j-davis.com/2011/07/09/building-sql-strings-dynamically-in-2011/#comments</comments>
		<pubDate>Sat, 09 Jul 2011 16:57:50 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[NULL]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=403</guid>
		<description><![CDATA[I saw a recent post Avoid Smart Logic for Conditional WHERE Clauses which actually recommended, &#8220;the best solution is to build the SQL statement dynamically—only with the required filters and bind parameters&#8221;. Ordinarily I appreciate that author&#8217;s posts, but this time I think that he let confusion run amok, as can be seen in a thread on [...]]]></description>
			<content:encoded><![CDATA[<p>I saw a recent post <em><a href="http://use-the-index-luke.com/sql/where-clause/obfuscation/smart-logic">Avoid Smart Logic for Conditional WHERE Clauses</a></em> which actually recommended, &#8220;the best solution is to build the SQL statement dynamically—only with the required filters and bind parameters&#8221;. Ordinarily I appreciate that author&#8217;s posts, but this time I think that he let confusion run amok, as can be seen in a <a href="http://www.reddit.com/r/programming/comments/ij0px/the_smartest_way_to_make_sql_slow/">thread on reddit</a>.</p>
<p>To dispel that confusion: parameterized queries don&#8217;t have any plausible downsides; always use them in applications. Saved plans have trade-offs; use them sometimes, and only if you understand the trade-offs.</p>
<p>When query parameters are conflated with saved plans, it creates FUD about SQL systems because it mixes the fear around SQL injection with the mysticism around the SQL optimizer. Such confusion about the layers of a SQL system is a big part of the reason that some developers move to the deceptive simplicity of NoSQL systems (I say &#8220;deceptive&#8221; here because it often just moves an even greater complexity into the application &#8212; but that&#8217;s another topic).</p>
<p>The confusion started with this query from the original article:</p>
<p><span id="more-403"></span></p>
<pre>SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees
WHERE ( subsidiary_id    = :sub_id OR :sub_id IS NULL )
  AND ( employee_id      = :emp_id OR :emp_id IS NULL )
  AND ( UPPER(last_name) = :name   OR :name   IS NULL )</pre>
<p>[ Aside: In PostgreSQL those parameters should be $1, $2, and $3; but that's not relevant to this discussion. ]</p>
<p>The idea is that one such query can be used for several types of searches. If you want to ignore one of those WHERE conditions, you just pass a NULL as one of the parameters, and it makes one side of the OR always TRUE, thus the condition might as well not be there. So, each condition can either be there and have one argument (restricting the results of the query), or be ignored by passing a NULL argument; thus effectively giving you 8 queries from one SQL string. By eliminating the need to use different SQL strings depending on which conditions you want to use, you reduce the opportunity for error.</p>
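<p>To illustrate, here are two of the eight variants written out by hand. The literal <code>20</code> is an arbitrary example value:</p>
<pre>-- :sub_id = 20, :emp_id = NULL, :name = NULL behaves like:
SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees
WHERE subsidiary_id = 20;

-- all three parameters NULL behaves like:
SELECT first_name, last_name, subsidiary_id, employee_id
FROM employees;</pre>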
<p>The article, however, argues that this kind of query is a problem. Its reasoning goes something like this:</p>
<ol>
<li>Using bind parameters forces the plan to be saved and reused for multiple queries.</li>
<li>When a plan is saved for multiple queries, the planner doesn&#8217;t have the actual argument values.</li>
<li>Because the planner doesn&#8217;t have the actual argument values, the &#8220;x IS NULL&#8221; conditions aren&#8217;t constant at plan time, and therefore the planner isn&#8217;t able to simplify the conditions (e.g., if one condition is always TRUE, just remove it).</li>
<li>Therefore it makes a bad plan.</li>
</ol>
<p>However, #1 is simply untrue, at least in PostgreSQL. PostgreSQL <em>can</em> save the plan, but you don&#8217;t have to. See the documentation for <a href="http://www.postgresql.org/docs/9.1/static/libpq-exec.html#LIBPQ-PQEXECPARAMS">PQexecParams</a>. Here&#8217;s an example in ruby using the &#8220;pg&#8221; gem (EDIT: Note: this does not use any magic query-building behind the scenes, it uses a protocol level feature in the PostgreSQL server to bind the arguments):</p>
<pre>require 'rubygems'
require 'pg'

conn = PGconn.connect("dbname=postgres")

conn.exec("CREATE TABLE foo(i int)")
conn.exec("INSERT INTO foo SELECT generate_series(1,10000)")
conn.exec("CREATE INDEX foo_idx ON foo (i)")
conn.exec("ANALYZE foo")

# Query using parameters. Planner sees the real arguments, so it will
# make the same plan as if you inlined them into the SQL string. In
# this case, 3 is not NULL, so it is simplified to just "WHERE i = 3",
# and it will choose to use an index on "i" for a fast search.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [3])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Now, the argument is NULL, so the condition is always true, and
# removed completely. It will surely choose a sequential scan.
res = conn.exec("explain SELECT * FROM foo WHERE i = $1 OR $1 IS NULL", [nil])
res.each{ |r| puts r['QUERY PLAN'] }
puts

# Saves the plan. It doesn't know whether the argument is NULL or not
# yet (because the arguments aren't provided yet), so the plan might
# not be good.
conn.prepare("myplan", "SELECT * FROM foo WHERE i = $1 OR $1 IS NULL")

# We can execute this with:
res = conn.exec_prepared("myplan",[3])
puts res.to_a.length
res = conn.exec_prepared("myplan",[nil])
puts res.to_a.length

# But to see the plan, we have to use the SQL string form so that we
# can use EXPLAIN. This plan should use an index, but because we're
# using a saved plan, it doesn't know to use the index. Also notice
# that it wasn't able to simplify the conditions away like it did for
# the sequential scan without the saved plan.
res = conn.exec("explain execute myplan(3)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

# ...and use the same plan again, even with different argument.
res = conn.exec("explain execute myplan(NULL)")
res.each{ |r| puts r['QUERY PLAN'] }
puts

conn.exec("DROP TABLE foo")</pre>
<p>See? If you know what you are doing, and want to save a plan, then save it. If not, do the simple thing, and PostgreSQL will have the information it needs to make a good plan.</p>
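<p>The saved-plan half of that script can also be tried from <code>psql</code> without any client library. This is a sketch that assumes the <code>foo</code> table from the script above still exists (the script drops it at the end, so recreate it first). Note that newer PostgreSQL releases (9.2 and later, as far as I know) may re-plan a prepared statement using the actual argument values, so the plans you see could differ by version:</p>
<pre>-- Save a plan without supplying any argument values
PREPARE myplan(int) AS
  SELECT * FROM foo WHERE i = $1 OR $1 IS NULL;

-- Inspect the plan used for a non-NULL and a NULL argument
EXPLAIN EXECUTE myplan(3);
EXPLAIN EXECUTE myplan(NULL);

-- Throw the saved plan away
DEALLOCATE myplan;</pre>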
<p>My next article will be a simple introduction to database system architecture that will hopefully make SQL a little less mystical.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2011/07/09/building-sql-strings-dynamically-in-2011/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Why PostgreSQL Already Has Query Hints</title>
		<link>http://thoughts.j-davis.com/2011/02/05/why-postgresql-already-has-query-hints/</link>
		<comments>http://thoughts.j-davis.com/2011/02/05/why-postgresql-already-has-query-hints/#comments</comments>
		<pubDate>Sat, 05 Feb 2011 18:33:25 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Logic]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=388</guid>
		<description><![CDATA[This is a counterpoint to Josh&#8217;s recent post: Why PostgreSQL Doesn&#8217;t Have Query Hints. I don&#8217;t really disagree, except that I think that there are many different definitions of &#8220;hints&#8221; floating around, leading to a lot of confusion. I could subtitle this post &#8220;More Terminology Confusion&#8221; after my previous entry. So, let&#8217;s pick a reasonable [...]]]></description>
			<content:encoded><![CDATA[<p>This is a counterpoint to Josh&#8217;s recent post: <a href="http://it.toolbox.com/blogs/database-soup/why-postgresql-doesnt-have-query-hints-44121?rss=1">Why PostgreSQL Doesn&#8217;t Have Query Hints</a>. I don&#8217;t really disagree, except that I think that there are many different definitions of &#8220;hints&#8221; floating around, leading to a lot of confusion. I could subtitle this post &#8220;More Terminology Confusion&#8221; after my <a href="http://thoughts.j-davis.com/2007/12/11/terminology-confusion/">previous entry</a>.</p>
<p>So, let&#8217;s pick a reasonable definition: &#8220;hints are some mechanism to influence the SQL planner to choose a better plan&#8221;. Why did I choose that definition? Because it&#8217;s the actual use case. If a user encounters a bad plan, or an unstable plan, they need a way to get it to choose a better plan. There&#8217;s plenty of room to argue about the right way to do that and the wrong way, but almost every DBMS allows some form of hints. Including PostgreSQL.</p>
<p><span id="more-388"></span></p>
<p>Here are a few planner variables you can tweak (out of many):</p>
<ul>
<li><code>enable_seqscan</code></li>
<li><code>enable_mergejoin</code></li>
<li><code>enable_indexscan</code></li>
</ul>
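<p>These are ordinary settings, so one cautious way to try one is inside a transaction with <code>SET LOCAL</code>, which reverts when the transaction ends. The table <code>foo</code> with an indexed column <code>i</code> is an assumption for illustration:</p>
<pre>BEGIN;
SET LOCAL enable_seqscan = off;   -- discourage sequential scans in this transaction only
EXPLAIN SELECT * FROM foo WHERE i = 3;
ROLLBACK;                         -- the setting reverts here</pre>
<p>Turning a plan type &#8220;off&#8221; does not forbid it outright; the planner treats it as very expensive and will still use it if there is no alternative.</p>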
<p>Not specific enough for you? Well, you can try <a href="http://www.sai.msu.su/~megera/wiki/plantuner">plantuner</a> to pick or forbid specific indexes.</p>
<p>Want to enforce join order? Try setting <code>from_collapse_limit</code>.</p>
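<p>A sketch of what that might look like, using three hypothetical tables <code>t1</code>, <code>t2</code>, and <code>t3</code>. The companion setting <code>join_collapse_limit</code> is not mentioned above; it is included here because it is the one that applies to explicit <code>JOIN</code> syntax:</p>
<pre>SET from_collapse_limit = 1;   -- do not merge subqueries in FROM into the outer query
SET join_collapse_limit = 1;   -- keep explicit JOINs in the order written

EXPLAIN
SELECT *
FROM t1
JOIN t2 ON t2.t1_id = t1.id
JOIN t3 ON t3.t2_id = t2.id;

-- Return both settings to their defaults for the rest of the session
RESET from_collapse_limit;
RESET join_collapse_limit;</pre>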
<p>Want to get even more specific? You can set the selectivity of individual operators.</p>
<p>There is a philosophical difference between PostgreSQL&#8217;s approach and that of many other systems. In PostgreSQL, it is encouraged to specify costs and selectivities more than exact plans. There are good reasons for that, such as the sheer number of possible plans for even moderately complex queries (as Josh points out). Additionally, specifying exact plans tends to lead you into exactly the type of trouble you are trying to avoid by specifying hints in the first place &#8212; after input cardinalities change, the previous plan may now be a very poor one.</p>
<p>PostgreSQL clearly has a set of mechanisms that could be called &#8220;hints&#8221;. It turns out that there are actually quite a lot of ways to control the plan in PostgreSQL; but they generally aren&#8217;t recommended except as a solution to a specific problem someone posts to the <a href="http://archives.postgresql.org/pgsql-performance/">performance list</a>. That is part of the PostgreSQL culture: a bit like getting a prescription from a doctor, so that the doctor can see the whole picture, help you look for alternative solutions, and weigh the side effects of the treatment against the benefits. I&#8217;m exaggerating, of course &#8212; these tweaks are documented (well, most of them), and anyone can use them; you just won&#8217;t hear them shouted from the rooftops as recommendations.</p>
<p>Except in this post, I suppose, which you should use at your own risk.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2011/02/05/why-postgresql-already-has-query-hints/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Big Company Uses Product XYZ</title>
		<link>http://thoughts.j-davis.com/2010/11/11/big-company-uses-product-xyz/</link>
		<comments>http://thoughts.j-davis.com/2010/11/11/big-company-uses-product-xyz/#comments</comments>
		<pubDate>Thu, 11 Nov 2010 18:22:16 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=362</guid>
		<description><![CDATA[Joshua Drake&#8217;s recent article makes some interesting points, but there&#8217;s one thing in particular I find missing among many of these discussions. From the article: It appeared they felt we should be impressed that Facebook runs on MySQL not PostgreSQL. &#8230; The problem I have, is that Facebook data is worthless. All of the concentration [...]]]></description>
			<content:encoded><![CDATA[<p>Joshua Drake&#8217;s recent <a href="http://www.commandprompt.com/blogs/joshua_drake/2010/11/mysql_the_elephant_in_the_room_facebook_oh_and_me/">article</a> makes some interesting points, but there&#8217;s one thing in particular I find missing among many of these discussions. From the article:</p>
<blockquote><p>It appeared they felt we should be impressed that Facebook runs on MySQL not PostgreSQL. &#8230; The problem I have, is that Facebook data is worthless.</p></blockquote>
<p>All of the concentration is on the company, and whether their use case matters (of course it does, at least to them and their customers). But phrases like &#8220;runs on&#8221; and &#8220;uses&#8221; are used too loosely, in my opinion.</p>
<p>Even with celebrity endorsements &#8212; for example, a basketball player endorsing shoes &#8212; at least they use shoes in roughly the same manner as you might. The shoes might not help you play basketball in any appreciable way, but at least &#8220;use&#8221; means the same for both the basketball player and you.</p>
<p><span id="more-362"></span></p>
<p>However, do you think that running a query at <em>[insert big company here]</em> involves just using the &#8220;mysql&#8221; client, logging in, and running any ad-hoc query you want? I doubt it. I suspect that the data is always spread around in complex ways with complex caches, and there&#8217;s a lot of custom supporting code to get the right information from the right cache at the right time. For every new query, they can unleash a team of very good engineers to build the necessary caches, provision the necessary servers, distribute data to the right places, write the code to populate and read the caches appropriately, and integrate it into the general data-movement architecture.</p>
<p>If your environment looks like that, then a lot of the little problems go away. One might complain that Slony is hard to set up; but in an environment like the one above, it&#8217;s insignificant. If there&#8217;s some missing feature, you can write it. If something is bothering you, you can fix it. People do that all the time with PostgreSQL, and many of those things get released in the community version. For MySQL, they tend to build up as &#8220;patch sets&#8221; (or forks, some might call them). I suspect that PostgreSQL gets more contributions because it does everything possible to make the process of community contribution smooth &#8212; clean code, no copyright assignment requirement, well-defined &#8220;commit fests&#8221;, community review, and a diverse group of core members, committers, and contributors. PostgreSQL also has a rock-solid foundation, giving developers more confidence to build the features they need without destabilizing the product.</p>
<p>If your environment doesn&#8217;t look like that, and you just want to use the product directly, then <em>take advantage of that</em>. Use the product that makes your life easier, helps you catch errors before they become problems, and keeps your data safe. By the time you scale up, you will be using the DBMS in such a radically different way that it almost doesn&#8217;t matter what DBMS you started with.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/11/11/big-company-uses-product-xyz/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Exclusion Constraints are generalized SQL UNIQUE</title>
		<link>http://thoughts.j-davis.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/</link>
		<comments>http://thoughts.j-davis.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/#comments</comments>
		<pubDate>Sat, 25 Sep 2010 20:37:22 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Temporal]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=321</guid>
		<description><![CDATA[Say you are writing an online reservation system. The first requirement you&#8217;ll encounter is that no two reservations may overlap (i.e. no schedule conflicts). But how do you prevent that? It&#8217;s worth thinking about your solution carefully. My claim is that no existing SQL DBMS has a good solution to this problem before PostgreSQL 9.0, [...]]]></description>
			<content:encoded><![CDATA[<p>Say you are writing an online reservation system. The first requirement you&#8217;ll encounter is that no two reservations may overlap (i.e. no schedule conflicts). But how do you prevent that?</p>
<p>It&#8217;s worth thinking about your solution carefully. My claim is that no existing SQL DBMS has a good solution to this problem before <a title="PostgreSQL 9.0 released" href="http://www.postgresql.org/about/news.1235">PostgreSQL 9.0</a>, which has just been released. This new release includes a feature called <a title="Exclusion Constraints" href="http://www.postgresql.org/docs/9.0/static/ddl-constraints.html#DDL-CONSTRAINTS-EXCLUSION">Exclusion Constraints</a> (authored by me), which offers a good solution to a class of problems that includes the &#8220;schedule conflict&#8221; problem.</p>
<p>I previously wrote a two part series (<a title="Temporal Keys, Part 1" href="http://thoughts.j-davis.com/2009/11/01/temporal-keys-part-1/">Part 1</a> and <a title="Temporal Keys, Part 2" href="http://thoughts.j-davis.com/2009/11/08/temporal-keys-part-2/">Part 2</a>) on this topic. Chances are that you&#8217;ve run into a problem similar to this at one time or another, and these articles will show you the various solutions that people usually employ in the real world, and the serious problems and limitations of those approaches.</p>
<p>The rest of this article will be a brief introduction to Exclusion Constraints to get you started using a much better approach.</p>
<p><span id="more-321"></span></p>
<p>First, install PostgreSQL 9.0 (the installation instructions are outside the scope of this article), and launch psql.</p>
<p>Then, install two modules: &#8220;<a title="Temporal PostgreSQL" href="http://pgfoundry.org/projects/temporal">temporal</a>&#8221; (which provides the PERIOD data type and associated operators) and &#8220;<a title="Btree GiST" href="http://www.postgresql.org/docs/9.0/static/btree-gist.html">btree_gist</a>&#8221; (which provides btree functionality via GiST).</p>
<p>Before installing these modules, make sure that PostgreSQL 9.0 is installed and that the 9.0 <code>pg_config</code> is in your <code>PATH</code> environment variable. Also, <code>$SHAREDIR</code> means the directory listed when you run <code>pg_config --sharedir</code>.</p>
<p>To install Temporal PostgreSQL:</p>
<ol>
<li><a title="Download Temporal PostgreSQL" href="http://pgfoundry.org/frs/?group_id=1000288&amp;release_id=1548">download the tarball</a></li>
<li>Unpack the tarball, go into the directory, and type &#8220;<code>make install</code>&#8221;</li>
<li>In <code>psql</code>, type: <code>\i $SHAREDIR/contrib/period.sql</code></li>
</ol>
<p>To install BTree GiST (these directions assume you installed from source; if you installed from packages, something like Ubuntu&#8217;s &#8220;<code>postgresql-contrib</code>&#8221; package may help here):</p>
<ol>
<li>Go to the PostgreSQL source &#8220;<code>contrib</code>&#8221; directory, go to <code>btree_gist</code>, and type &#8220;<code>make install</code>&#8221;.</li>
<li>In <code>psql</code>, type: <code>\i $SHAREDIR/contrib/btree_gist.sql</code></li>
</ol>
<p>Now that you have those modules installed, let&#8217;s start off with some basic Exclusion Constraints:</p>
<pre>DROP TABLE IF EXISTS a;
CREATE TABLE a(i int);
ALTER TABLE a ADD EXCLUDE (i WITH =);</pre>
<p>That is identical to a UNIQUE constraint on <code>a.i</code>, except that it uses the Exclusion Constraints mechanism; it even uses a normal BTree to enforce it. The performance will be slightly worse because of some micro-optimizations for UNIQUE constraints, but only slightly, and the performance characteristics should be the same (it&#8217;s just as scalable). Most importantly, it behaves the same under high concurrency as a UNIQUE constraint, so you don&#8217;t have to worry about excessive locking. If one person inserts <code>5</code>, that will prevent other transactions from inserting <code>5</code> concurrently, but will not interfere with a transaction inserting <code>6</code>.</p>
<p>Let&#8217;s take apart the syntax a little. The normal BTree is the default, so that&#8217;s omitted. The <code>(i WITH =)</code> is the interesting part, of course. It means that one tuple <code>TUP1</code> conflicts with another tuple <code>TUP2</code> if <code>TUP1.i = TUP2.i</code>. No two tuples may exist in the table if they conflict. In other words, there are no two tuples <code>TUP1</code> and <code>TUP2</code> in the table, such that <code>TUP1.i = TUP2.i</code>. That&#8217;s the very definition of UNIQUE, so that shows the equivalence. NULLs are always permitted, just like with UNIQUE constraints.</p>
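<p>As a quick check of that equivalence, here is a small sketch against the table <code>a</code> defined above. The values are arbitrary and chosen only for illustration:</p>
<pre>INSERT INTO a VALUES (5);
INSERT INTO a VALUES (6);    -- ok: 6 does not conflict with 5
INSERT INTO a VALUES (5);    -- ERROR: conflicts with the existing 5
INSERT INTO a VALUES (NULL);
INSERT INTO a VALUES (NULL); -- ok: NULLs never conflict</pre>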
<p>Now, let&#8217;s see if they hold up for multi-column constraints:</p>
<pre>DROP TABLE IF EXISTS a;
CREATE TABLE a(i int, j int);
ALTER TABLE a ADD EXCLUDE (i WITH =, j WITH =);</pre>
<p>The conditions for a conflicting tuple are ANDed together, just like UNIQUE. So now, in order for two tuples to conflict, <code>TUP1.i = TUP2.i AND TUP1.j = TUP2.j</code>. This is strictly more permissive than the single-column constraint, because a conflict requires both conditions to be met. Therefore, this is identical to a UNIQUE constraint on <code>(a.i, a.j)</code>.</p>
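<p>A short sketch of what that means in practice, using the two-column table <code>a</code> from above (again, the values are only illustrative):</p>
<pre>INSERT INTO a VALUES (1, 1);
INSERT INTO a VALUES (1, 2); -- ok: i matches, but j differs
INSERT INTO a VALUES (2, 1); -- ok: j matches, but i differs
INSERT INTO a VALUES (1, 1); -- ERROR: both i and j match an existing tuple</pre>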
<p>What can we do that UNIQUE can&#8217;t? Well, for starters we can use something other than a normal BTree, such as Hash or GiST (for the moment, GIN is not supported, but that&#8217;s only because GIN doesn&#8217;t support the full index AM API; <code>amgettuple</code> in particular):</p>
<pre>DROP TABLE IF EXISTS a;
CREATE TABLE a(i int, j int);
ALTER TABLE a ADD EXCLUDE USING gist (i WITH =, j WITH =);
-- alternatively using hash, which doesn't support
-- multi-column indexes at all
ALTER TABLE a ADD EXCLUDE USING hash (i WITH =);</pre>
<p>So now we can do UNIQUE constraints using hash or gist. But that&#8217;s not a real benefit, because a normal btree is probably the most efficient way to support that, anyway (Hash may be in the future, but for the moment it doesn&#8217;t use WAL, which is a major disadvantage).</p>
<p>The difference really comes from the ability to change the operator to something other than &#8220;<code>=</code>&#8221;. It can be any operator that is:</p>
<ul>
<li>Commutative</li>
<li>Boolean</li>
<li>Searchable by the given index access method (e.g. btree, hash, gist).</li>
</ul>
<p>For BTree and Hash, the only operator that meets those criteria is &#8220;=&#8221;. But many data types (including <code>PERIOD</code>, <code>CIRCLE</code>, <code>BOX</code>, etc.) support lots of interesting operators that are searchable using GiST. For instance, &#8220;overlaps&#8221; <code>(&amp;&amp;)</code>.</p>
<p>Ok, now we are getting somewhere. It&#8217;s impossible to specify the constraint that no two tuples contain values that overlap with each other using a UNIQUE constraint; but it is possible to specify such a constraint with an Exclusion Constraint! Let&#8217;s try it out.</p>
<pre>DROP TABLE IF EXISTS b;
CREATE TABLE b (p PERIOD);
ALTER TABLE b ADD EXCLUDE USING gist (p WITH &amp;&amp;);
INSERT INTO b VALUES('[2009-01-05, 2009-01-10)');
INSERT INTO b VALUES('[2009-01-07, 2009-01-12)'); -- causes ERROR</pre>
<p>Now, try out various combinations (including COMMITs and ABORTs), and try with concurrent sessions also trying to insert values. You&#8217;ll notice that potential conflicts cause transactions to wait on each other (like with UNIQUE) but non-conflicting transactions proceed unhindered. A lot better than <code>LOCK TABLE</code>, to say the least.</p>
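<p>One way to see the waiting behavior is to run two <code>psql</code> sessions side by side against the table <code>b</code> from above. This is only a sketch; the session labels and the February dates are made up for illustration, and the statements are meant to be typed in the order shown:</p>
<pre>-- session 1
BEGIN;
INSERT INTO b VALUES('[2009-02-01, 2009-02-05)');

-- session 2
BEGIN;
INSERT INTO b VALUES('[2009-03-01, 2009-03-05)'); -- no overlap: returns immediately
INSERT INTO b VALUES('[2009-02-03, 2009-02-07)'); -- overlaps session 1: waits

-- session 1
ROLLBACK; -- session 2's INSERT now succeeds
          -- (after a COMMIT instead, it would fail with an ERROR)</pre>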
<p>To be useful in a real situation, let&#8217;s make sure that the semantics extend nicely to a more complete problem. In reality, you generally have several exclusive resources in play, such as people, rooms, and time. But out of those, &#8220;overlaps&#8221; really only makes sense for time (in most situations). So we need to mix these concepts a little.</p>
<pre>CREATE TABLE reservation(room TEXT, professor TEXT, during PERIOD);

-- enforce the constraint that the room is not double-booked
ALTER TABLE reservation
    ADD EXCLUDE USING gist
    (room WITH =, during WITH &amp;&amp;);

-- enforce the constraint that the professor is not double-booked
ALTER TABLE reservation
    ADD EXCLUDE USING gist
    (professor WITH =, during WITH &amp;&amp;);</pre>
<p>Notice that we actually need to enforce two constraints, which is expected because there are two time-exclusive resources: professors and rooms. Multiple constraints on a table are ORed together, in the sense that an ERROR occurs if any constraint is violated. For the academic readers out there, this means that exclusion constraint conflicts are specified in <a href="http://en.wikipedia.org/wiki/Disjunctive_normal_form">disjunctive normal form</a> (consistent with UNIQUE constraints).</p>
<p>The semantics of Exclusion Constraints extend in a clean way to support this mix of atomic resources (rooms, people) and resource ranges (time). Try it out, again with a mix of concurrency, commits, aborts, conflicting and non-conflicting reservations.</p>
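<p>For a starting point, here is a sketch of a few inserts against the <code>reservation</code> table above. The room and professor names are invented for illustration, and the comments describe the outcome I would expect from each constraint:</p>
<pre>INSERT INTO reservation
    VALUES ('Room 101', 'Smith', '[2010-09-01, 2010-09-03)');

-- ok: different room and different professor, same time
INSERT INTO reservation
    VALUES ('Room 102', 'Jones', '[2010-09-01, 2010-09-03)');

-- ERROR: Room 101 is already booked for an overlapping period
INSERT INTO reservation
    VALUES ('Room 101', 'Brown', '[2010-09-02, 2010-09-04)');

-- ERROR: Smith is already booked for an overlapping period
INSERT INTO reservation
    VALUES ('Room 103', 'Smith', '[2010-09-02, 2010-09-04)');</pre>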
<p>Exclusion constraints allow solving this class of problems quickly (in a couple lines of SQL) in a way that&#8217;s safe, robust, generally useful across many applications in many situations, and with higher performance and better scalability than other solutions.</p>
<p>Additionally, Exclusion Constraints support all of the advanced features you&#8217;d expect from a system like PostgreSQL: deferrability, applying the constraint to only a subset of the table (allows a WHERE clause), or using functions/expressions in place of column references.</p>
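<p>A minimal sketch of those three features follows. The table <code>reservation_adv</code> and its <code>cancelled</code> column are hypothetical, introduced here only to have something to constrain:</p>
<pre>CREATE TABLE reservation_adv(
    room TEXT,
    professor TEXT,
    during PERIOD,
    cancelled BOOLEAN NOT NULL DEFAULT false
);

-- partial constraint: cancelled reservations are ignored
ALTER TABLE reservation_adv
    ADD EXCLUDE USING gist
    (room WITH =, during WITH &amp;&amp;)
    WHERE (NOT cancelled);

-- expression in place of a column (case-insensitive name match),
-- with the check deferred until the end of the transaction
ALTER TABLE reservation_adv
    ADD EXCLUDE USING gist
    ((lower(professor)) WITH =, during WITH &amp;&amp;)
    DEFERRABLE INITIALLY DEFERRED;</pre>
<p>Note the extra parentheses around <code>lower(professor)</code>; an expression needs them, while a plain column name does not.</p>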
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/09/25/exclusion-constraints-are-generalized-sql-unique/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Joining Aster Data</title>
		<link>http://thoughts.j-davis.com/2010/05/06/joining-aster-data/</link>
		<comments>http://thoughts.j-davis.com/2010/05/06/joining-aster-data/#comments</comments>
		<pubDate>Thu, 06 May 2010 17:53:05 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[PostgreSQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=292</guid>
		<description><![CDATA[On Monday, May 10th I will be joining Aster Data. I am very excited to work on some ambitious new projects there. Aster is a heavy user of PostgreSQL, so I will (of course) continue to actively participate in the community. Friday was my last day at Truviso. I enjoyed working there very much, and [...]]]></description>
			<content:encoded><![CDATA[<p>On Monday, May 10th I will be joining <a title="Aster Data" href="http://asterdata.com">Aster Data</a>. I am very excited to work on some ambitious new projects there. Aster is a heavy user of PostgreSQL, so I will (of course) continue to actively participate in the community.</p>
<p>Friday was my last day at <a title="Truviso" href="http://truviso.com">Truviso</a>. I enjoyed working there very much, and it was a rewarding experience. I wish all of my former colleagues well &#8212; I&#8217;m sure our paths will cross in the future.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/05/06/joining-aster-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Flexible Schemas and PostgreSQL</title>
		<link>http://thoughts.j-davis.com/2010/05/06/flexible-schemas-and-postgresql/</link>
		<comments>http://thoughts.j-davis.com/2010/05/06/flexible-schemas-and-postgresql/#comments</comments>
		<pubDate>Thu, 06 May 2010 17:42:28 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[NULL]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=267</guid>
		<description><![CDATA[First, what is a &#8220;flexible schema&#8221;? It&#8217;s hard to pin down an exact definition, but it&#8217;s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles. That&#8217;s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of [...]]]></description>
			<content:encoded><![CDATA[<p>First, what is a &#8220;flexible schema&#8221;? It&#8217;s hard to pin down an exact definition, but it&#8217;s used to mean a data model that permits changes in application data structures without needing to migrate old data or incur other administrative hassles.</p>
<p>That&#8217;s a worthwhile goal. Applications often grow organically, especially in the early, exploratory stages of development. For example, you may decide to track when a user last did something on the website, so that you can adapt news and notices for those users (e.g. &#8220;Did you know that we added feature XYZ since you last visited?&#8221;). Developers have a need to produce a prototype quickly to work out the edge cases (do we update that timestamp for all actions, or only certain ones?), and probably a need to put it in production so that the users can benefit sooner.</p>
<p>A common worry is that <code>ALTER TABLE</code> will be a major performance problem. That&#8217;s sometimes a problem, but in PostgreSQL, you can add a column to a table in constant time (not dependent on the size of the table) in most situations. I don&#8217;t think this is a good reason to avoid <code>ALTER TABLE</code>, at least in PostgreSQL (other systems may impose a greater burden).</p>
<p>There are good reasons to avoid <code>ALTER TABLE</code>, however. We&#8217;ve only defined one use case for this new &#8220;last updated&#8221; field, and it&#8217;s a fairly loose definition. If we use <code>ALTER TABLE</code> as a first reaction for tracking any new application state, we&#8217;d end up with lots of columns with overlapping meanings (all subtly different), and it would be challenging to keep them consistent with each other. More importantly, adding new columns without thinking through the meaning and the data migration strategy will surely cause confusion and bugs. For example, if you see the following table:</p>
<pre>    CREATE TABLE users
    (
      name         TEXT,
      email        TEXT,
      ...,
      last_updated TIMESTAMPTZ
    );
</pre>
<p>you might (reasonably) assume that the following query makes sense:</p>
<pre>    SELECT * FROM users
      WHERE last_updated &lt; NOW() - '1 month'::INTERVAL;</pre>
<p>Can you spot the problem? Old user records (before the <code>ALTER TABLE</code>) will have <code>NULL</code> for <code>last_updated</code> timestamps, and will not satisfy the <code>WHERE</code> condition even though they intuitively qualify. There are two parts to the problem:</p>
<ol>
<li>The presence of the <code>last_updated</code> field fools the author of the SQL query into making assumptions about the data, because it seems so simple on the surface.</li>
<li>NULL semantics allow the query to be executed even without complete information, leading to a wrong result.</li>
</ol>
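<p>To make the gap visible, the query has to say what a missing timestamp means. As a sketch, assuming that &#8220;never recorded&#8221; should count as &#8220;not updated in the last month&#8221; (an assumption the application would have to confirm):</p>
<pre>    SELECT * FROM users
      WHERE last_updated IS NULL
         OR last_updated &lt; NOW() - '1 month'::INTERVAL;</pre>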
<p>Let&#8217;s try changing the table definition:</p>
<pre>    CREATE TABLE users
    (
      name       TEXT,
      email      TEXT,
      ...,
      properties HSTORE
    );
</pre>
<p><a title="HSTORE" href="http://www.postgresql.org/docs/8.4/static/hstore.html">HSTORE</a> is a set of key/value pairs. Some tuples might have the <code>last_updated</code> key in the properties attribute, and others may not. This accomplishes two things:</p>
<ol>
<li>There&#8217;s no need for ALTER TABLE or cluttering of the namespace with a lot of nullable columns.</li>
<li>The name &#8220;properties&#8221; is vague enough that query writers would (hopefully) be on their guard, understanding that not all records will share the same properties.</li>
</ol>
<p>You could still write the same (wrong) query against the second table with minor modification. Nothing has fundamentally changed. But we are using a different development strategy that&#8217;s easy on application developers during rapid development cycles, yet does not leave a series of pitfalls for users of the data. When a certain property becomes universally recorded and has a concrete meaning, you can plan a real data migration to turn it into a relation attribute instead.</p>
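<p>As a sketch of that &#8220;minor modification,&#8221; assuming the timestamp is stored as text under a <code>last_updated</code> key (a storage choice I am assuming for the example), the same query against the second table might look like this:</p>
<pre>    -- still silently skips tuples without the key
    SELECT * FROM users
      WHERE (properties -&gt; 'last_updated')::TIMESTAMPTZ
            &lt; NOW() - '1 month'::INTERVAL;

    -- how many tuples have no such property at all?
    SELECT count(*) FROM users
      WHERE properties IS NULL
         OR NOT (properties ? 'last_updated');</pre>
<p>The second query is the kind of check that the vaguer column name is meant to prompt.</p>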
<p>Now, we need some guiding principles about when to use a complex type to represent complex information, and when to use separate columns in the table. To maximize utility and minimize confusion, I believe the best guiding principle is the meaning of the data you&#8217;re storing across <em>all</em> tuples. When defining the attributes of a relation, if you find yourself using vague nouns such as &#8220;properties,&#8221; or resorting to complex qualifications (lots of &#8220;if/then&#8221; branching in your definition), consider less constrained data types like <a title="HSTORE" href="http://www.postgresql.org/docs/8.4/static/hstore.html">HSTORE</a>. Otherwise, it&#8217;s best to nail down the meaning in terms of appropriate nouns, which will help keep the DBMS smart and queries simple (and correct). See <a title="Choosing Data Types" href="../2009/09/30/choosing-data-types/">Choosing Data Types</a> and further guidance in reference [1].</p>
<p>I believe there are three reasons why application developers feel that relational schemas are &#8220;inflexible&#8221;:</p>
<ol>
<li>A reliance on NULL semantics to make things &#8220;magically work,&#8221; when in reality, it just makes queries succeed that should fail. See my previous posts: <a title="None, nil, Nothing, undef, NA, and SQL NULL" href="http://thoughts.j-davis.com/2008/08/13/none-nil-nothing-undef-na-and-sql-null/">None, nil, Nothing, undef, NA, and SQL NULL</a> and <a title="What is the deal with NULLs?" href="http://thoughts.j-davis.com/2009/08/02/what-is-the-deal-with-nulls/">What is the deal with NULLs?</a>.</li>
<li>The SQL database industry has avoided interesting types, like <a title="HSTORE" href="http://www.postgresql.org/docs/8.4/static/hstore.html">HSTORE</a>, for a long time. See my previous post: <a title="Choosing Data Types" href="http://thoughts.j-davis.com/2009/09/30/choosing-data-types/">Choosing Data Types</a>.</li>
<li>ORMs make a fundamental false equivalence between an object attribute and a table column. There is a relationship between the two, of course; but they are simply not the same thing. This is a direct consequence of &#8220;The First Great Blunder&#8221;[2].</li>
</ol>
<p><strong>EDIT: </strong>I found a more concise way to express my fundamental point &#8212; During the early stages of application development, we only vaguely understand our data. The most important rule of database design is that the database should represent reality, not what we wish reality was like. Therefore, a database should be able to express that vagueness, and later be made more precise when we understand our data better. None of this should be read to imply that constraints are less important or that we need not understand our data. These ideas mostly apply only at very early stages of development, and even then, prudent use of constraints often makes that development much faster.</p>
<p>[1] Date, C.J.; Darwen, Hugh (2007). <em>Databases, Types, and the Relational Model</em>. pp. 377-380 (Appendix B, &#8220;A Design Dilemma&#8221;).</p>
<p>[2] Date, C.J. (2000). <em>An Introduction To Database Systems</em>, p. 865.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/05/06/flexible-schemas-and-postgresql/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Temporal PostgreSQL Roadmap</title>
		<link>http://thoughts.j-davis.com/2010/03/09/temporal-postgresql-roadmap/</link>
		<comments>http://thoughts.j-davis.com/2010/03/09/temporal-postgresql-roadmap/#comments</comments>
		<pubDate>Wed, 10 Mar 2010 04:49:06 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Logic]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Temporal]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=254</guid>
		<description><![CDATA[Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don&#8217;t provide us with the basic tools to do it. Every general-purpose DBMS falls short both in terms of usability and performance when trying to manage temporal data. What is already [...]]]></description>
			<content:encoded><![CDATA[<p>Why are temporal extensions in PostgreSQL important? Quite simply, managing time data is one of the most common requirements, and current general-purpose database systems don&#8217;t provide us with the basic tools to do it. Every general-purpose DBMS falls short both in terms of usability and performance when trying to manage temporal data.</p>
<p>What is already done?</p>
<p><span id="more-254"></span></p>
<ul>
<li><a href="http://pgfoundry.org/projects/temporal">PERIOD data type</a>, which can represent anchored intervals of time; that is, a chunk of time with a definite beginning and a definite end (in contrast to a SQL INTERVAL, which is not anchored to any specific beginning or end time).
<ul>
<li>Critical for usability because it acts as a <em>set</em> of time, so you can easily test for containment and other operations without using awkward constructs like BETWEEN or lots of comparisons (and keeping track of inclusivity/exclusivity of boundary points).</li>
<li>Critical for performance because you can index the values for efficient &#8220;contains&#8221; and &#8220;overlaps&#8221; queries (among others).</li>
</ul>
</li>
</ul>
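<p>To illustrate the usability point, here is a sketch comparing the two styles. Both tables (<code>booking_cols</code> with separate <code>start_time</code> and <code>end_time</code> columns, and <code>booking</code> with a single PERIOD column named <code>during</code>) are hypothetical:</p>
<pre>-- two timestamp columns: the overlap test is spelled out by hand,
-- and the choice of &lt; versus &lt;= encodes the boundary rules
SELECT * FROM booking_cols
  WHERE start_time &lt; '2010-03-12'
    AND end_time &gt; '2010-03-10';

-- one PERIOD column: the boundaries are part of the value
SELECT * FROM booking
  WHERE during &amp;&amp; '[2010-03-10, 2010-03-12)'::period;</pre>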
<ul>
<li>Temporal Keys (implemented as Exclusion Constraints, which will be available in the next release of PostgreSQL, 9.0), which can enforce the constraint that no two periods of time (usually for a given resource, like a person) overlap. See the <a href="http://developer.postgresql.org/pgdocs/postgres/sql-createtable.html">documentation</a> (look for the word &#8220;EXCLUDE&#8221;), and see my previous articles (<a href="../2009/11/01/temporal-keys-part-1/">part 1</a> and <a href="../2009/11/08/temporal-keys-part-2/">part 2</a>) on the subject.
<ul>
<li>Critical for usability to avoid procedural, error-prone hacks to enforce the constraint with triggers or by splitting time into big chunks.</li>
<li>Critical for performance because it performs comparably to a UNIQUE index, unlike the other procedural hacks which are generally too slow to use for most real systems.</li>
</ul>
</li>
</ul>
<p>What needs to be done?</p>
<ul>
<li>Range Types &#8212; Aside from PERIOD, which is based on TIMESTAMPTZ, it would also be useful to have very similar types based on, for example, DATE. It doesn&#8217;t stop there, so the natural conclusion is to generalize PERIOD into &#8220;range types&#8221; which could be based on almost any subtype.</li>
<li>Range Keys, Foreign Range Keys &#8212; If Range Types are known to the Postgres engine, that means that we can have syntactic sugar for range keys (like temporal keys, except for any range type), etc., that would internally use Exclusion Constraints.</li>
<li>Range Join &#8212; If Range Types are known to the Postgres engine, there could be syntactic sugar for a &#8220;range join,&#8221; that is, a join based on &#8220;overlaps&#8221; rather than &#8220;equals&#8221;. More importantly, there could be a new join type, a Range Merge Join, that could perform this join efficiently (without a Range Merge Join, a range join would always be a nested loop join).</li>
<li>Simple table logs &#8212; The ability to easily create an effective &#8220;audit log&#8221; or similar trigger-based table log, that can record changes and be efficiently queried for historical state or state changes.</li>
</ul>
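<p>For reference, a range join can already be written today as an ordinary join on &#8220;overlaps&#8221;; what is missing is the syntactic sugar and an efficient join algorithm. A sketch, assuming two hypothetical tables <code>reservation</code> and <code>maintenance</code> that each have a <code>room</code> column and a PERIOD column named <code>during</code>:</p>
<pre>-- which reservations collide with scheduled maintenance?
SELECT r.room, r.during AS reserved, m.during AS maintenance
  FROM reservation r
  JOIN maintenance m
    ON r.room = m.room
   AND r.during &amp;&amp; m.during;</pre>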
<p>I&#8217;ll be speaking on this subject (specifically, the new Exclusion Constraints feature) in the upcoming <a href="http://postgresqlconference.org">PostgreSQL Conference EAST 2010</a> (my <a href="http://postgresqlconference.org/2010/east/talks/not_just_unique_exclusion_constraints">talk description</a>) in Philadelphia later this month and <a href="http://pgcon.org">PGCon 2010</a> (my <a href="http://www.pgcon.org/2010/schedule/events/201.en.html">talk description</a>) in Ottawa this May. In the past, these conferences and others have been a great place to get ideas and help me move the temporal features forward.</p>
<p>The existing features have been picking up a little steam lately. The <a href="http://lists.pgfoundry.org/pipermail/temporal-general/">temporal-general mailing list</a> has some traffic now &#8212; fairly low, but enough that others contribute to the discussions, which is a great start. I&#8217;ve also received some great feedback from a number of people, including the folks at <a href="http://pgexperts.com">PGX</a>. There&#8217;s still a ways to go before we have all the features we want, but progress is being made.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/03/09/temporal-postgresql-roadmap/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Scalability and the Relational Model</title>
		<link>http://thoughts.j-davis.com/2010/03/07/scalability-and-the-relational-model/</link>
		<comments>http://thoughts.j-davis.com/2010/03/07/scalability-and-the-relational-model/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 21:37:24 +0000</pubDate>
		<dc:creator>Jeff Davis</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Language]]></category>
		<category><![CDATA[Logic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://thoughts.j-davis.com/?p=242</guid>
		<description><![CDATA[The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations &#8212; but it&#8217;s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper. So, right away, it&#8217;s meaningless to talk about [...]]]></description>
			<content:encoded><![CDATA[<p>The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations &#8212; but it&#8217;s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.</p>
<p>So, right away, it&#8217;s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn&#8217;t solve that scalability problem &#8212; it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.</p>
<p><span id="more-242"></span>There may exist scalability problems in:</p>
<ul>
<li>SQL, which defines requirements outside the scope of the relational model, such as ACID properties and transactional semantics.</li>
<li>Traditional architectures and implementations of SQL, such as the &#8220;table is a file&#8221; equivalence, lack of sophisticated types, etc.</li>
<li>Particular implementations of SQL &#8212; e.g. &#8220;MySQL can&#8217;t do it, so the relational model doesn&#8217;t scale&#8221;.</li>
</ul>
<p>Why are these distinctions important? As with many debates, <a title="Terminology Confusion" href="http://thoughts.j-davis.com/2007/12/11/terminology-confusion/">terminology confusion</a> is at the core, and prevents us from dealing with the problems directly. If SQL is defined in a way that causes scalability problems, we need to identify precisely those requirements that cause a problem, so that we can proceed forward without losing all that has been gained. If the traditional architectures are not suitable for some important use-cases, they need to be adapted. If some particular implementations are not suitable, developers need to switch or demand that they be made competitive.</p>
<p>The NoSQL movement (or at least the hype surrounding it) is far too disorganized to make real progress. Usually, incremental progress is best; and sometimes a fresh start is best, after drawing on years of lessons learned. But it&#8217;s never a good idea to start over with complete disregard for the past. For instance, an <a href="http://about.digg.com/blog/looking-future-cassandra">article from Digg</a> starts off great:</p>
<blockquote><p>The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes.</p></blockquote>
<p>That&#8217;s good because he blames it on the <em>mindset</em>, not the <em>model</em>, and then identifies a specific problem. But then the article completely falls flat:</p>
<blockquote><p>Computing the intersection with a JOIN is much too slow in MySQL, so we have to do it in PHP.</p></blockquote>
<p>A join is faster in PHP than MySQL? Why bother even discussing SQL versus NoSQL if your particular implementation of SQL &#8212; MySQL &#8212; can&#8217;t even do a hash join, the exact operation that you need? Particularly when almost every other implementation can (including PostgreSQL)? That kind of reasoning won&#8217;t lead to solutions.</p>
<p>So, where do we go from here?</p>
<ol>
<li>Separate the SQL model from the other requirements (some of which may limit scalability) when discussing improvements.</li>
<li>Improve the SQL model (my readers know that I&#8217;ve criticized SQL&#8217;s logical problems many times in the past).</li>
<li>Improve the implementations of SQL, particularly how tables are physically stored.</li>
<li>If you&#8217;re particularly ambitious, come up with a relational alternative to SQL that takes into account what&#8217;s been learned after decades of SQL and can become the next general-purpose DBMS language.</li>
</ol>
<p>EDIT 2010-03-09: I should have cited Josh Berkus&#8217;s talk on <a href="http://www.pgexperts.com/document.html?id=40">Relational vs. Non-Relational</a> (complete list of <a href="http://www.pgexperts.com/presentations.html">PGX talks</a>), which was part of the inspiration for this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://thoughts.j-davis.com/2010/03/07/scalability-and-the-relational-model/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>

