
theory.pm

Whither Impala Fault Tolerance?

§

Justin Erickson on the Cloudera Blog

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

Impala is a pretty nice-looking SQLish query engine that runs on Hadoop. It provides the same basic interface as Hive, but circumvents MapReduce to access data directly from data nodes in parallel. But, in my opinion, to be useful as a real-time query engine, it needs fault tolerance. From the Impala FAQ, under the list of unsupported features:

Fault tolerance for running queries (not currently). In the current release, Impala aborts the query if any host on which the query is executing fails. In the future, we will consider adding fault tolerance to Impala, so that a running query would complete even in the presence of host failures.

Sounds like an unfortunate issue. Since it’s pretty typical for data nodes to go down, this seems like an essential feature. Products like CitusDB offer fault tolerance:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node’s work can evenly be distributed among the remaining nodes in the cluster.

That sounds exactly right. I’m excited about Citus, and if it adds solid support for more data formats, such as ORC and Parquet, it may well be the way to go. But Impala will be a nice alternative if it can get fault tolerance figured out. I’m disappointed it’s not on the roadmap.

Lexical Subroutines

§

Ricardo Signes:

One of the big new experimental features in Perl 5.18.0 is lexical subroutines. In other words, you can write this:

my sub quickly { ... }
my @sorted = sort quickly @list;

my sub greppy (&@) { ... }
my @grepped = greppy { ... } @input;

These two examples show cases where lexical references to anonymous subroutines would not have worked. The first argument to sort must be a block or a subroutine name, which leads to awful code like this:

sort { $subref->($a, $b) } @list

With our greppy, above, we get to benefit from the parser-affecting behaviors of subroutine prototypes.

My favorite tidbit about this feature? Because lexical subs are lexical, and method-dispatch is package-based, lexical subs are not subject to method lookup and dispatch! This just might alleviate the confusion of methods and subs, as chromatic complained about just yesterday. Probably doesn’t solve the problem for imported subs, though.
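Here’s a quick sketch of that dispatch behavior, assuming Perl 5.18 with the experimental feature enabled; the Counter package and its increment helper are my own invention for illustration:

Lexical Sub vs. Method Dispatch
use v5.18;
use feature 'lexical_subs';
no warnings 'experimental::lexical_subs';

package Counter {
    sub new { bless { n => 0 }, shift }

    # Lexical helper: callable by name in this scope, but it never
    # lands in the package symbol table, so method dispatch can't see it.
    my sub increment { $_[0]{n}++ }

    sub tick { my $self = shift; increment($self); $self->{n} }
}

my $c = Counter->new;
say $c->tick;                       # 1: tick() calls the lexical helper
say eval { $c->increment } // $@;   # Can't locate object method "increment"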

Multirow Database Updates

§

William Blunn:

So, given a list of updates to apply we could effect them using the following steps:

  1. Use CREATE TEMPORARY TABLE to create a temporary table to hold the updates
  2. Use INSERT to populate the temporary table with the updates
  3. Use UPDATE … FROM to update the target table using updates in the temporary table
  4. Use DROP TABLE to drop the temporary table

So in the example above we can reduce five statements to four. This isn’t a significant improvement in this case. But now the number of statements is no longer directly dependent on the number of rows requiring updates.

Even if we wanted to update a thousand rows with different values, we could still do it with four statements.
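For concreteness, here are those four statements in Postgres syntax, run against the same staff table used in the examples below (the column types are my guesses):

Temporary Table Multirow Update
CREATE TEMPORARY TABLE up (name TEXT, salary INTEGER);

INSERT INTO up (name, salary) VALUES
    ('Jane',  1200),
    ('Frank', 1100),
    ('Susan', 1175),
    ('John',  1150);

UPDATE staff
   SET salary = up.salary
  FROM up
 WHERE staff.name = up.name;

DROP TABLE up;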

Or you could just use one statement. Here’s how to do it with a CTE on PostgreSQL 9.2 and higher:

PostgreSQL Multirow Update with CTE
WITH up(name, salary) AS ( VALUES
     ('Jane',  1200),
     ('Frank', 1100),
     ('Susan', 1175),
     ('John',  1150)
)
UPDATE staff
   SET salary = up.salary
  FROM up
 WHERE staff.name = up.name;

Still on PostgreSQL 9.1 or lower? Use a subselect in the FROM clause instead:

PostgreSQL Multirow Update with Subselect
UPDATE staff
   SET salary = up.salary
   FROM (VALUES
       ('Jane',  1200),
       ('Frank', 1100),
       ('Susan', 1175),
       ('John',  1150)
   ) AS up(name, salary)
 WHERE staff.name = up.name;

Stuck with MySQL or Oracle? Use a UNION query as a derived table:

MySQL & Oracle Multirow Update
UPDATE staff, (
         SELECT 'Jane' AS name, 1200 AS salary
   UNION SELECT 'Frank',        1100
   UNION SELECT 'Susan',        1175
   UNION SELECT 'John',         1150
) AS up
   SET staff.salary = up.salary
 WHERE staff.name = up.name;

Using SQLite? Might make sense to use a temporary table for thousands or millions of rows. But for just a few, use a CASE expression:

SQLite Multirow Update
UPDATE staff
   SET salary = CASE name
       WHEN 'Jane'  THEN 1200
       WHEN 'Frank' THEN 1100
       WHEN 'Susan' THEN 1175
       WHEN 'John'  THEN 1150
   END
 WHERE name IN ('Jane', 'Frank', 'Susan', 'John');

If you need to support multiple database architectures, sure, use something like DBIx::MultiRow to encapsulate things. But if, like most of us, you’re on one database for an app, I can’t stress strongly enough how well it pays to get to know your database.

Sqitch Templating

§

Last week saw the v0.980 release of Sqitch, a database change management system. The headline feature in this version is support for MySQL 5.6.4 or higher. Why 5.6.4 rather than 5.1 or even 5.5? Mainly because 5.6.4 finally added support for fractional seconds in DATETIME columns (details in the release notes). This feature is essential for Sqitch, because changes often execute within a second of each other, and the deploy time is included in the log table’s primary key.
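To get fractional seconds, the precision must be declared on the column; it’s not the default. A sketch (an illustrative table, not Sqitch’s actual registry schema):

MySQL Fractional-Second DATETIME
-- Requires MySQL 5.6.4 or higher; earlier versions cannot store
-- fractional seconds, so two changes deployed within the same
-- second would collide in a DATETIME primary key.
CREATE TABLE deploy_log (
    change_name  VARCHAR(255) NOT NULL,
    committed_at DATETIME(6)  NOT NULL, -- microsecond precision
    PRIMARY KEY (committed_at)
);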

With the requirement for fractional seconds satisfied by 5.6.4, there was nothing to prevent usage of SIGNAL, added in 5.5, to mimic check constraints in a trigger. This brings the Sqitch MySQL implementation into line with what was already possible in the Postgres, SQLite, and Oracle support. Check out the tutorial and the accompanying Git repository to get started managing your MySQL databases with Sqitch.
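The SIGNAL pattern looks something like this sketch, which borrows the staff table from the multirow update examples above; it shows the general technique, not Sqitch’s actual triggers:

MySQL SIGNAL as Check Constraint
DELIMITER //

CREATE TRIGGER staff_salary_check
BEFORE INSERT ON staff
FOR EACH ROW
BEGIN
    -- Mimic CHECK (salary >= 0), which MySQL parses but ignores.
    IF NEW.salary < 0 THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'salary must be non-negative';
    END IF;
END;
//

DELIMITER ;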

The MySQL support might be the headliner, but the change in v0.980 I’m most excited about is improved template support. Sqitch executes templates to create the default deploy, revert, and verify scripts, but up to now they have not been easy to customize. With v0.980, you can create as many custom templates as you like, and use them as appropriate.
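For a sense of what a custom template might look like, here’s a sketch modeled on the stock Postgres deploy template (Sqitch processes templates with Template::Tiny; double-check variable names like change and requires against the defaults shipped with your install):

Custom Deploy Template Sketch
-- Deploy [% change %]
[% FOREACH item IN requires -%]
-- requires: [% item %]
[% END -%]

BEGIN;

-- XXX Add DDLs here.

COMMIT;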

Understanding Window Functions

§

Dimitri Fontaine:

There was SQL before window functions and SQL after window functions: that’s how powerful this tool is. Being that of a deal breaker unfortunately means that it can be quite hard to grasp the feature. This article aims at making it crystal clear so that you can begin using it today and are able to reason about it and recognize cases where you want to be using window functions.

Great intro to a powerful feature.
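For a quick taste of the feature (my example, not one from the article), here’s a query that ranks each row within the result set while also computing an aggregate over the whole set, with no GROUP BY collapsing; it reuses the staff table from the multirow update examples above:

Window Function Example
SELECT name, salary,
       rank() OVER (ORDER BY salary DESC) AS salary_rank,
       avg(salary) OVER ()                AS avg_salary
  FROM staff;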

Data Deployment with Sqitch

§

I’ve been thinking about data migrations. I love how well Sqitch works for schema changes, but so far have avoided data changes. Some data ought to be managed by the deployment process, rather than by end-user applications. Lists of countries, for example. Yet none of our Sqitch-managed databases include INSERTs, UPDATEs, or DELETEs in deploy scripts. Why not? Two reasons:

  1. These are mainly Postgres ports of existing Oracle databases. As such, I’ve written independent migration scripts that use oracle_fdw to copy data from Oracle. It made no sense to commit hard-coded changes to the deploy script, even for static data, as it was all to be copied from the old production Oracle databases — often months after I wrote the migrations.

  2. These projects include extensive pgTAP unit tests that expect to run many times against an empty database with no side effects. Having different data in testing than in production increases the likelihood of unforeseen behavioral variations. Better to expect no data in tests, freeing them to focus on units of behavior without regard to preexisting data.

Now that we have multiple Sqitch-deployed databases in production, the time has come to address these issues.

Deploy Hooks for External Sources

I propose to resolve the one-time migration requirement with deploy hooks. The idea is similar to Git hooks: before or after any sqitch deploy, one or more hook scripts can run. The impetus was to ensure some higher level of consistency after every deploy. For example, a post-deploy hook might grant privileges on all tables in a database. Another might run VACUUM; ANALYZE;.
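If the feature pans out, such a hook could be nothing more than a SQL script along these lines (hypothetical, since hooks don’t exist yet; the reporting role is invented for the example):

Hypothetical Post-Deploy Hook
-- Re-grant read access on any newly created tables, then
-- freshen the planner statistics.
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;
VACUUM;
ANALYZE;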

But we could also use it for one-time data migrations. An option to the deploy command would disable them, which would be useful for development and test databases. Once the migration has been run in production, we just delete the migration hook scripts from the project. Sqitch won’t record hook executions, so adding or removing them will be no problem.

I like this approach, as Sqitch will automatically run migration scripts, but hooks will not change the interface of Sqitch itself. And it’s more generally useful. Hell, I might want a deploy hook script that sends an email notification announcing a deployment (though that might require adding support for non-SQL scripts). There are all kinds of things for which hooks will prove useful.

Changes for Static Data Maintenance

For data that must be tied to the deployment process, there are two complications that have prevented me from simply managing them in normal Sqitch changes:

  1. There might be side-effects to the deployment. For example, a foreign key constraint to the users table, to identify the users who added rows to the database. But in a new database, perhaps there are no users — and Sqitch doesn’t create them, the app does. Chicken, meet egg.

  2. The initial data set might be derived from some external source, such as another database. Consequently, none of that data was defined in Sqitch deploy scripts. This situation complicates future updates of the data. We can add data via Sqitch in the future, but then we don’t have the canonical list of all rows that should exist in all such databases.

However, I can think of no alternative that does not over-complicate Sqitch itself. I considered adding another change-related script type, named “update”, to complement the existing deploy, verify, and revert scripts. But oftentimes a data change would not be tied to a schema change, so the corresponding deploy script would be a no-op. Seems silly.

I also considered adding a completely separate command specifically for deploying data changes. Yet these data migrations are exactly like schema changes: Sqitch must execute them in the proper order relative to other changes, record successful or failed deployment, and be able to revert them when required. The only difference is what’s defined in them: data modification rather than definition.

Through several drafts of this post, I have come around to the idea that I should change nothing in Sqitch with regard to data deployments. A better solution than the above, I believe, is organizational.

Data Deploy Best Practice

Let the best practice for data deploys be this: they should be contained in Sqitch changes, but such changes should contain only data modifications. No change script should both define a table and insert its initial rows. Keep the table and its data in separate changes — keep DML separate from DDL.

For our list of countries, we might have a change named “countries”, which creates the countries table, and another, named “country_data”, which inserts the appropriate data into that table. Where necessary and appropriate, these changes may use conditional paths to bring the data up-to-date and in sync across deployments.
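Sketched out, the pair of deploy scripts might look like this (the column layout and country rows are illustrative; the deploy/ file names follow Sqitch’s defaults):

Separate DDL and DML Changes
-- deploy/countries.sql: DDL only.
CREATE TABLE countries (
    code CHAR(2) PRIMARY KEY,
    name TEXT    NOT NULL
);

-- deploy/country_data.sql: DML only, in its own change that
-- depends on the countries change.
INSERT INTO countries (code, name) VALUES
    ('AD', 'Andorra'),
    ('AE', 'United Arab Emirates'),
    ('AF', 'Afghanistan');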

Conditions must deal with side effects, such as foreign key constraints. Where possible, such side effects ought to be removed from deployment-managed data. For tracking the user or users who added data to a database, for example, one can use the tools of the source code repository (git log, git blame) to assign blame. Other side effects may be more necessary, but to the extent possible, deployed data should be independent.

Naturally, unit tests must expect static data to be present, and be updated when that data changes. We are, after all, talking about infrequently updated data. Frequently updated data should have separate interfaces provided by applications to change it. Otherwise, how static is it, really?

TPF To Revamp Grants

§

Alberto Simões:

Nevertheless, this lack of “lower than $3000” grant proposals, and the fact that lot of people have been discussing (and complaining) about this value being too low, the Grants Committee is starting a discussion on rewriting and reorganizing the way it works. Namely, in my personal blog I opened a discussion about the Grants Committee some time ago, and had plenty of feedback, that will be helpful for our internal discussion.

This is great news. I would love to see more and more ambitious grant proposals, as well as awards people can subsist on. I look forward to seeing the new rules.

Mopping the Moose

§

Stevan Little:

I spent much of last week on vacation with the family so very little actual coding got done on the p5-mop, but instead I did a lot of thinking. My next major goal for the p5-mop is to port a module written in Moose, in particular, one that uses many different Moose features. The module I have chosen to port is Bread::Board and I chose it for two reasons; first, it was the first real module that I wrote using Moose and second, it makes heavy use of a lot of Moose’s features.

I’m so happy to see Stevan making progress on the Perl 5 MOP again.

A Perl Blog

§

I have been unsatisfied with Just a Theory for some time. I started that blog in 2004 more or less for fun, thinking it would be my permanent home on the internet. And it has been. But the design, while okay in 2004, is just awful by today’s standards. A redesign is something I have planned to do for quite some time.

I had also been thinking about my audience. Or rather, audiences. I’ve blogged about many things, but while a few dear family members might want to read everything I ever post, most folks, I think, are interested in only a subset of topics. Readers of Just a Theory came for posts about Perl, or PostgreSQL, or culture, travel, or politics. But few came for all those topics, in my estimation.

More recently, a whole bunch of top-level domains have opened up, often with the opportunity for anyone to register them. I was lucky enough to snag theory.pm and theory.pl, thinking that perhaps I would create a site just for blogging about Perl. I also nabbed theory.so, which I might dedicate to database-related blogging, and theory.me, which would be my personal blog (travel, photography, cultural essays, etc.).

And then there is Octopress. A blogging engine for hackers. Perfect for me. Hard to imagine something more appropriate (unless it was written in Perl). It seemed like a good opportunity to partition my online blogging.

So here we are with my first partition. theory.pm is a Perl blog. Seemed like the perfect name. I fiddled with it off and on for a few months, often following Matt Gemmell’s Advice, and I’m really happy with it. The open-source fonts Source Sans Pro and Source Code Pro, from Adobe, look great. The source code examples are beautifully marked up and displayed using the Solarized color scheme (though presentation varies in feed readers). Better still, it’s equally attractive and readable on computers, tablets and phones, thanks to the foundation laid by Aron Cedercrantz’s BlogTheme.

I expect to fork this code to create a database blog soon, and then perhaps put together a personal blog. Maybe the personal blog will provide link posts for posts on the other sites, so that if anyone really wants to read everything, they can. I haven’t decided yet.

In the meantime, now that I have a dedicated Perl blog, I guess I’ll have to start writing more Perl-related stuff. I’m starting with some posts about the state of exception handling in Perl 5, the first of which is already up. Stay tuned for more.

Trying Times

§

Exception handling is a bit of a pain in Perl. Traditionally, we use eval {}:

Perl eval
eval {
    foo();
};
if (my $err = $@) {
    # Inspect $err…
}

The use of the if block is a bit unfortunate; worse is the use of the global $@ variable, which has inflicted unwarranted pain on developers over the years. Many Perl hackers put Try::Tiny to work to circumvent these shortcomings:

Try::Tiny
try {
    foo();
} catch {
    # Inspect $_…
};

Alas, Try::Tiny introduces its own idiosyncrasies, particularly its use of subroutine references rather than blocks. While a necessity of a pure-Perl implementation, it prevents returning from the calling context. One must work around this deficiency by checking return values:

Return from Try::Tiny
my $rv = try {
   f();
} catch {
   # … (a return here would exit this code ref, not the caller)
};

# Check the result to decide whether to return from the enclosing sub.
if (!$rv) {
   return;
}

I can’t tell you how often this quirk burns me.