Friday, August 19, 2016

MariaDB 10.2 CHECK and DEFAULT Clauses

In this blog post, we’ll look at the MariaDB 10.2 CHECK and DEFAULT clauses.
MariaDB 10.2 includes some long-awaited features. In this blog, we are going to discuss the improvements to table definitions: the DEFAULT clause and CHECK constraints. These clauses describe columns' default values and rules for data validation.
Note that MariaDB 10.2 is still in the alpha stage. This article describes the current state of these features, which could change before MariaDB 10.2 becomes GA.

The DEFAULT Clause

The DEFAULT clause has always been supported in MariaDB/MySQL, but traditionally it only accepted literal values (like “hello world” or “2”). MariaDB 10.2 removes this limitation, so DEFAULT can now accept most SQL expressions. For example:
  • fiscal_year SMALLINT DEFAULT (YEAR(NOW()))
  • valid_until DATE DEFAULT (NOW() + INTERVAL 1 YEAR)
  • owner VARCHAR(100) DEFAULT (USER())
Additionally, MariaDB 10.2 allows you to set a DEFAULT value for TEXT and BLOB columns, which was not possible in previous versions. While this might look like a small detail, it can be hard to add a column to an existing table used by production applications if that column cannot have a default value.
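As a sketch, the expression defaults above might appear together in a table definition like this (the table and column names are hypothetical):

```sql
-- Hypothetical table combining the expression DEFAULTs above
-- with a TEXT default (new in MariaDB 10.2)
CREATE TABLE subscription (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  fiscal_year SMALLINT     DEFAULT (YEAR(NOW())),
  valid_until DATE         DEFAULT (NOW() + INTERVAL 1 YEAR),
  owner       VARCHAR(100) DEFAULT (USER()),
  notes       TEXT         DEFAULT 'none'
);
```

Inserting a row that omits these columns fills them with the computed values at write time.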
The DEFAULT clause has some limitations. For example, it cannot contain a subquery or a stored function. An apparently strange limitation is that a DEFAULT expression can only reference another column if that column is defined earlier in the CREATE TABLE statement.
Note that DEFAULT can make use of non-deterministic functions even if the binary log uses the STATEMENT format. In this case, default non-deterministic values will be logged in the ROW format.

CHECK Constraints

CHECK constraints are SQL expressions that are checked when a row is inserted or updated. If the expression evaluates to false (0, empty string, empty date) or NULL, the statement fails with an error. The error message states which CHECK failed, in a format that is quite easy to parse:
ERROR 4022 (23000): CONSTRAINT `consistent_dates` failed for `test`.`author`

Some examples of CHECK constraints:
  • CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0)
  • CONSTRAINT consistent_dates CHECK (birth_date IS NULL OR death_date IS NULL OR birth_date < death_date)
  • CONSTRAINT past_date CHECK (birth_date < NOW())
A possible trick is checking that a column is different from its default value. This forces users to assign values explicitly.
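Putting these pieces together, here is a sketch of a table that uses both constraints and the trick above (the table name is borrowed from the error message earlier; the rest is hypothetical):

```sql
-- Hypothetical table demonstrating CHECK constraints (MariaDB 10.2+)
CREATE TABLE author (
  name       VARCHAR(100) NOT NULL DEFAULT '',
  birth_date DATE,
  death_date DATE,
  CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0),
  CONSTRAINT consistent_dates CHECK (birth_date IS NULL
      OR death_date IS NULL OR birth_date < death_date)
);
-- non_empty_name also implements the "different from its default"
-- trick: the default '' never passes the check, so a name must be
-- assigned explicitly, and an INSERT that omits it fails with
-- CONSTRAINT `non_empty_name` failed.
```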
CHECK constraints cannot be added to an existing table or altered; it is only possible to drop them. This is an important limitation for production servers.
Another limitation is that CHECK metadata is not accessible via the INFORMATION_SCHEMA. The only way to find out whether a table has CHECK clauses is to parse the output of SHOW CREATE TABLE.
The exact behavior of CHECK constraints in a replication environment depends on the master binary log format. If it is STATEMENT, the slaves will apply CHECK constraints to events received from the master. If it is ROW, only the master will need to apply constraints, because failed statements will not be replicated.
Thus, in all cases, we recommend having identical constraints on master and slaves, and only using deterministic constraints.

Performance

While I didn’t run a professional benchmark, I can say that neither the DEFAULT nor the CHECK clause had a noticeable impact on a simple test that inserts 1 million rows (on my local machine).
However, these clauses evaluate an SQL expression each time a row is inserted or updated. The overhead is at least equal to the SQL expression performance. If high-performing writes are important, you will probably not want to use complex data validation.
To check how fast an expression is, we can use the BENCHMARK() function:
MariaDB [(none)]> SELECT BENCHMARK(10000000, (555 / 100 * 20));
+---------------------------------------+
| BENCHMARK(10000000, (555 / 100 * 20)) |
+---------------------------------------+
| 0                                     |
+---------------------------------------+
1 row in set (1.36 sec)
MariaDB [(none)]> SELECT BENCHMARK(100000000, MD5('hello world'));
+------------------------------------------+
| BENCHMARK(100000000, MD5('hello world')) |
+------------------------------------------+
| 0                                        |
+------------------------------------------+
1 row in set (14.84 sec)

Unfortunately, we don’t have a status variable that tells us how many times MariaDB evaluated CHECK clauses. If a workload performs many writes, such a variable could help us find out whether CHECK constraints are slowing down inserts. Maybe the MariaDB team can take this as a suggestion for the future.

This Week in Hadoop: NiFi, Sparkling Water, Ambari, and Spark

Below is some new tech I found while scrolling through the tech news. 👀


H2O has released Sparkling Water 2.0. I found a few very cool articles on their blog: Spam Detection with ML Pipelines, and H2O TensorFlow on AWS GPU!
Cool Spark article on Clickbait Clustering with Spark (Github)

Incremental Fetch in Apache NiFi with QueryDatabaseTable

Awesome article on Real Architectural Patterns for Microservices by Camille Fournier. Camille is one of the most brilliant people I have had the pleasure of speaking with; this is a must-read.
Combining Agile and Spark: there's the interesting BDD-Spark library (Github).
Hortonworks has a number of interesting demos, labs, and training materials from their introduction-to-Hadoop workshop.

Cool Charting

Check out this article on Data Visualization with D3, DC, Leaflet, and Python.
(For more information, check out DC.js.)

Spring Boot Applications in Ambari

Scala / SBT Tip  

My SBT wasn't building until I upped the memory.  Now this is in a shell script for all my builds:
export SBT_OPTS="-Xmx2G -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=2G -Xss2M  -Duser.timezone=GMT"
sbt -J-Xmx4G -J-Xms4G assembly
Here is an example Spark SQL with Stanford Core NLP SBT Build File (build.sbt):
name := "Sentiment"
version := "1.0"
scalaVersion := "2.10.6"
assemblyJarName in assembly := "sentiment.jar"
libraryDependencies  += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"
libraryDependencies  += "org.apache.spark" % "spark-sql_2.10" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.6.0" % "provided"
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.5.1"
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.5.1" classifier "models"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
You will also need project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")

General Hadoop CLI Tips

  1. Keep an eye on logs! Check them, make sure they rotate, and archive old ones off to cold storage or delete them. Find the biggest files on your box: du -hsx * | sort -rh | head -10.
  2. If you want to see things you have run before, check out:
     /<user>/.beeline/history, /<user>/.hivehistory, /<user>/.sqlline/history, /<user>/.pig_history, /<user>/.spark_history. 

    You can also run history to check on general commands you have run (remember this will return the commands run by the current user, which may be root or whatever account you are currently logged in as).
  3. What Java am I using, and are there others available? alternatives --display java
  4. Sometimes your PATH may not be fully set, so you can miss out on great Java CLI tools like jps.
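As an illustration of tip 4, a shell fragment that exposes the JDK's CLI tools (the JDK path here is hypothetical; substitute the one reported by alternatives in tip 3):

```shell
# Locate the JDK in use (see tip 3), then expose its CLI tools.
# /usr/lib/jvm/java is a hypothetical path; adjust to your installation.
export JAVA_HOME=/usr/lib/jvm/java
export PATH="$PATH:$JAVA_HOME/bin"

jps -lm   # list running JVM processes with main class and arguments
```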