Snowflake’s Elastic Data Warehouse as a Service: Favorite Features

Since I am currently working with Snowflake Elastic Datawarehouse (see www.snowflakecomputing.com), and excited about what experts regard as it’s game-changing innovations (for a summary, see http://www.snowflake.net/blog/top-10-cool-things-i-like-about-snowflake/  ), I want to take a moment and describe my own personal favorite features.  But first, some quick background:  Snowflake is a built-for-cloud, MPP column-oriented, **ANSI-SQL-COMPLIANT** data warehouse service with support for ACID transactions.

Regarding the above blog (…Top 10 Cool Things …Like About Snowflake, #6 is ‘JSON Support with SQL’, why is JSON support such a big deal?  

Here’s why:    JSON (JavaScript Object Notation) is a common message exchange format.  The term ‘message’ vs. (rigidly-structured) ‘data’ indicates that, unlike an attribute in a record in a classically normalized RDBMS table, the granular levels of data in the message may very well vary from one message (record) to the next and are not necessarily constrained to comply with a defined schema, and therefore, again from one message (record) to the next, may exhibit unexpected never-before-seen attributes, unexpected hierarchies, or modified levels of detail within a hierarchy, all of which is inconsistent with classic normalization patterns among RDBMS tables, in which an entities primary key defines each record’s exact granularity (level of detail) and all other fields must be completely, and only dependent, on that primary key.  Any field containing data of a finer granularity violates that venerable principle.  With Big Data’s unpredictability and tendency towards little or no schema (structural control) on write (load), this principle, of course, is intentionally sacrificed mostly in order to optimize loading performance.

With all of that, for a normalized RDBMS structure to store a JSON message, we typically do two things.  First we define which parts of the message are, in fact, compliant with a known structure.  In many cases, only the message header information, with elements such as [Sender, Recipient(s), “JSON”, SentTimeStamp, etc.]  These fields can all be transformed into separate, named fields for each record.  The remainder of the JSON message, however, to the extent that is uncontrolled, ends up getting flattened into a string (with code and data-values) and loaded into a one beast of a field aptly named ‘MessageBody’ with a data type something like VarChar (50,000), so that regardless of the message’s unpredictable body-content, it can be reliably loaded without having to break the load process or discarding any non-compliant records.  That accomplished, two challenges remain.  First, we need to be able to parse this beast of a field with a robust query syntax that can flexibly parse and drill down into the JSON message.  Secondly, we need an underlying performance architecture that will optimize these unusual queries, knowing that neither classic SQL string manipulations nor regular expressions benefit from any of a traditional RDBMS engine’s optimization capabilties, and thus can single-handedly cause such queries to take one or two magnitudes longer to complete than otherwise.

So, yes, it’s important.  Having said that, here is my personal favorite (meaning ‘most currently needed’) feature in Snowflake: extensive built in optimization for semi-structured data such as JSON, XML, or AVRO, using their custom ‘Variant’ data type.  Optimization, did I say?  Sure, any RDBMS can flatten out a JSON message, with its many and varied levels of nested objects and arrays, into a super-lengthy, generic, hard-to-use string, leaving analysts to try parsing it with regular expressions or a few canned functions to make any sense of it, all of which is a far cry from optimization in light of the inherent unpredictable variability of semi-structured data such as a JSON message, with it’s often indeterminate number of attributes and levels of detail in the message body.

Optimization, in my world, means two things, and Snowflake has both:  First, we need a library of SQL-syntax extensions that make parsing the JSON message easy for a typical SQL analyst, once it is, in all it’s unpredictable variability, stored into a single field.  Second, we need a way to optimize queries against such complex field, without which we would otherwise have performance-killing string manipulations within our large datasets.

Happily, Snowflake’s Elastic DW does both of these things!

JSON Query Optimization:  When a JSON document lands in a Snowflake field with it’s Variant data type, the documents data structure including, of course, all sub-elements such as objects and arrays, is automatically parsed, compressed.  Specifically, sub-elements are, under the covers, actually stored as separate columns, thus benefitting from the performance improvement typical of columnar compression, but in this case, without even having to parse it all out into separate RDBMS columns.  In situations, wherein either the JSON message content is uncontrolled, and thus potentially has ragged embedded levels of elements, or just when we don’t yet know how it should be parsed into an RDBMS schema, this is an enormous architectural advantage.  Of course, to be truly advantageous, it will also need to be easily parsed in SQL queries, right?  And with that comes my next favorite feature…

SQL Syntax Extensions for JSON:  The following easy to learn syntax will have the typical SQL analysts productive in short order.  Do you need to learn JSON first?  Yes, but just the basics, and their simplicity is one reason that JSON is popular.   To do so, I recently read, and recommend, Introduction to JavaScript Object Notation, by Lindsay Bassett.   Having said that, here are my favorite extensions.

_____________________________________________________

If the following SQL-extension code is not of interest to you, feel free to scroll down to my conclusion.

As a note, the following code is copied from Snowflake product documentation.

  1. PARSE_JSON: Notice how this simple code parses out the string into a readable output.

CREATE TABLE vartab (v VARIANT);

INSERT INTO vartab SELECT PARSE_JSON(column1) AS v FROM VALUES (‘123.12’), (‘[-1, 12, 17]’), (‘{“x”:”abc”, “y”:false}’);

SELECT * FROM vartab;

V

123.12

[-1,12,17]

{“x”:”abc”,”y”:false} |

 

  1. The Colon ( :  ) and the Dot (  .  ) help you dig into the JSON.  Notice how these items parse the various JSON sub-elements within the field…
    • column:pathelement1.pathelement2.pathelement3
    • For example, themytable table has a column named ITEM of type VARIANT:

DESC TABLE mytable.Item field

name   |  type   |  kind  | null? | default | primary key | unique key | check  | expression |

ITEM   | VARIANT | COLUMN | Y     | [NULL]  | N           | N          | [NULL] | [NULL]

  • So, suppose that the following JSON is stored in the first row:

SELECT * FROM mytable LIMIT 1; ——————————————————————–

{ “firstName”: “John”, “lastName”: “Smith”, “isAlive”: true, “age”: 25,  “height_cm”:167.64, “address”:

{         “streetAddress”: “21 2nd Street”, “city”: “New York”,  “state”: “NY”,  “postalCode”: “10021-3100”

},

“phoneNumbers”:

[   { “type”: “home”, “number”: “212 555-1234” },

{ “type”: “office”,  “number”: “646 555-4567” }

]

}

  • You can retrieve thelastName field using the following statement:

SELECT item:lastName FROM mytable————————+

mytable:lastName    “Smith”

  • A common need is to extract all of the values of a field that may appear multiple times in an object. Snowflake uses theFLATTEN function to convert a repeated field into a set of rows.

So, in the following table below, suppose you want to extract the number field from each element of the phoneNumbers array into a set of rows with one phone number per row.

SELECT * FROM mytable; ——————————————————————–+                                ITEM

{  “firstName”: “John”,   “lastName”: “Smith”,

“phoneNumbers”: [    { “type”: “home”, “number”: “212 555-1234” },

{ “type”: “office”,  “number”: “646 555-4567” }

]

You can extract the number field using the following statement:

SELECT result.value:number FROM mytable, TABLE(FLATTEN(mytable.item,‘phoneNumbers’))

–result

value:number

“212 555-1234”

“646 555-4567”

 

In conclusion, I’ve just summarized how Snowflake optimizes semi-structured data such as ragged JSON messages (meaning those with indeterminate granularity) into an ANSI-SQLcompliant SQL database and, specifically, that it automatically accelerates JSON-parsing in queries via sub-element compression while making those JSON-parsing queries simple and productive using an easy-to-learn set of SQL extensions as described above.

One thought on “Snowflake’s Elastic Data Warehouse as a Service: Favorite Features

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s