Semi-structured data has become another pillar of many data-related deliveries. It provides more flexibility than structured data, and such flexibility is often needed by many workloads, especially those built around APIs, IoT sensors, or web apps. JSON has become the de facto standard for communication between APIs. Not being structured like traditional relational data brings new challenges in how to process and work with such data. The following blog posts go through Snowflake’s capabilities related to processing semi-structured data. Let’s cover the whole data lifecycle, starting with ingestion, through processing, up to serving the data to consumers.
What is semi-structured data?
Semi-structured data is a type of data that does not follow the tabular structure of relational data. It can be thought of as a hybrid between structured and unstructured data: it has some level of organization, but it allows for greater flexibility in how the data is stored and used.
Elements of semi-structured data include:
1. Key-value pairs
This is the most basic form of semi-structured data and is essentially a simple data structure where each item is made up of a key and a corresponding value.
2. Hierarchical structures
Semi-structured data can be organized into nested elements, providing a more complex data structure.
3. Multi-valued fields
Semi-structured data can contain fields that have multiple values. You probably know them as arrays.
Let’s take a JSON document as an example of semi-structured data, because the JSON format is easy for humans to read. There is no fixed format, and a JSON document can have whatever structure suits the needs of the use case. You can see a sample JSON file in the image below. It contains a person element with several key-value pairs and one array.
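For illustration, such a document might look like the following (all field names and values are made up):

```json
{
  "person": {
    "firstName": "Jane",
    "lastName": "Doe",
    "age": 34,
    "hobbies": ["reading", "hiking", "cycling"]
  }
}
```

The `firstName`, `lastName`, and `age` entries are key-value pairs, `hobbies` is a multi-valued field (an array), and the nesting under `person` shows the hierarchical structure.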
Snowflake support for semi-structured data
How does Snowflake work with semi-structured data? There is native support for semi-structured data in Snowflake, which means you can natively import data from and export data to the following formats: JSON, Avro, ORC, Parquet, and XML.
For storing semi-structured data, Snowflake offers the following native data types: ARRAY, OBJECT, and VARIANT.
It is also worth mentioning that you can work with semi-structured data in external tables without loading it into Snowflake. You can keep your JSON or Parquet files in your external data lake and still use the data in Snowflake!
Loading semi-structured data into Snowflake
When loading semi-structured data into Snowflake tables you have two options: either store the whole file/document in a single column, or flatten the data and store individual values per column. It depends on your use case and how you want to work with the data later. A common pattern is to load the data in raw format as it is, store it in a VARIANT column, and then flatten it, because flattening simplifies future data processing. Just be aware of the limitations: a single VARIANT value is limited to 16 MB of compressed data. And a JSON file with 16 MB of compressed data is a pretty big file 🙂
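The raw-load pattern can be sketched like this (the stage, table, and path names are assumptions for illustration):

```sql
-- Target table with a single VARIANT column to hold whole documents
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

-- Load raw JSON documents from a stage, one document per row
COPY INTO raw_events
  FROM @my_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');

-- Individual values can later be extracted with path notation and a cast
SELECT payload:person:firstName::STRING AS first_name
FROM raw_events;
```

Flattening can then happen downstream, for example in views or transformation jobs, once you know which attributes consumers actually need.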
Generate schema based on file structure
The previous approach works for JSON, but semi-structured data can also be stored in formats other than JSON. You will often see data stored in the Apache Parquet format, especially in data lakes. Unlike JSON, which is a text-based format, Parquet is a binary format that is optimized for columnar storage. This means that data is stored in columns rather than rows, which makes it more efficient for querying and analyzing large datasets.
One of the main advantages of Parquet is its ability to handle very large datasets efficiently. Since data is stored in columns, it can be compressed and stored in a more compact form, reducing the amount of storage required and improving the performance of read operations.
Another difference between JSON and Parquet is the way they handle data types. JSON is a loosely typed format, meaning that it does not enforce strict data types for each field. Parquet, on the other hand, has a well-defined schema and supports a variety of data types including integers, floating-point numbers, strings, and timestamps. This makes it easier to work with data in a consistent manner and ensures that data is stored and processed accurately.
Such differences require different approaches when it comes to data ingestion. Because Parquet has a well-defined schema, you need to define it during the data import. If you have ever tried to load Parquet data into Snowflake, you might have faced the time-consuming process of defining the file schema. You also need to create a target table with exactly the same schema. If the file has tens or hundreds of columns, you might spend hours on this task.
Just to recap for those of you who do not know how to load Parquet data into Snowflake, check out the following code. As part of the COPY command, you have to list the attributes from the file together with their data types.
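A minimal sketch of that manual approach might look like this (the table, stage, and column names are assumptions for illustration):

```sql
-- The target table must be created by hand with the exact schema (illustrative names)
CREATE OR REPLACE TABLE customers (
  id NUMBER,
  name VARCHAR,
  signed_up TIMESTAMP_NTZ
);

-- Each attribute is pulled out of the single Parquet column ($1) and cast explicitly
COPY INTO customers
  FROM (
    SELECT
      $1:id::NUMBER,
      $1:name::VARCHAR,
      $1:signed_up::TIMESTAMP_NTZ
    FROM @my_stage/customers/
  )
  FILE_FORMAT = (TYPE = 'PARQUET');
```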
$1 in the SELECT query refers to the single column where all the Parquet data is stored.
To load the data into Snowflake, you first need to create a target table with the required schema and then write the COPY statement listing all the columns.
Snowflake has automated this. You can now use a couple of built-in functions to help you read the schema from the file and prepare the COPY command, and there is also a function to automate the table creation. This can save hours of work, especially with very large Parquet files (100+ columns) where everything previously had to be done manually. Let’s have a look at how to use those functions and automate the whole process.
The INFER_SCHEMA function automatically detects the metadata schema in staged files that contain semi-structured data and returns the column definitions. Currently, this function works for Apache Parquet, Apache Avro, and ORC files.
It is a table function, so it needs to be wrapped in TABLE().
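A call could look like this, assuming a stage named my_stage and a named file format my_parquet_format:

```sql
-- Detect column definitions from staged Parquet files
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_stage/data/',
    FILE_FORMAT => 'my_parquet_format'
  )
);
```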
The output contains one row per detected column, with its name, inferred data type, and nullability, and looks like this:
Please be aware that, by default, the column names detected from staged files are treated as case-sensitive. This can be changed with the IGNORE_CASE => TRUE parameter.
Now we know the structure of the staged files, but how do we move on and create a table based on that schema? We have another function for that: GENERATE_COLUMN_DESCRIPTION.
This function accepts the output of the INFER_SCHEMA function as input and generates a list of columns, which we can use in a CREATE TABLE statement. The function has two parameters; the second one defines what kind of formatting we want on the output side, depending on whether we are going to create a table, a view, or an external table. Here is a combined example with INFER_SCHEMA:
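The stage and file format names below are illustrative; the second argument is 'table' here, but 'view' or 'external_table' can be used the same way:

```sql
-- Turn the inferred schema into a ready-made column list
SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table') AS columns
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_stage/data/',
    FILE_FORMAT => 'my_parquet_format'
  )
);
```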
And the provided output:
Now you can go and take the generated column list into your CREATE TABLE statement. But that is not all. If you want to automate it even further and also create the table automatically based on the detected schema, you can do so thanks to the USING TEMPLATE keyword in the CREATE TABLE statement. This keyword accepts the output of the INFER_SCHEMA function as input.
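Put together, creating the table directly from the detected schema might look like this (the table, stage, and file format names are illustrative):

```sql
-- Create the target table straight from the inferred schema
CREATE TABLE my_table
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@my_stage/data/',
        FILE_FORMAT => 'my_parquet_format'
      )
    )
  );
```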
Pretty cool, isn’t it? By combining several functions we are able to automate the initial phase of semi-structured data ingestion. If you want to go further, you can even encapsulate this logic into a Python stored procedure that accepts parameters like file format, stage, and target table name. Then you have a pretty nice solution for analyzing staged files and providing their schemas automatically.
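A hedged sketch of such a procedure, using a Snowpark Python handler, might look like this (the procedure name, parameters, and runtime settings are all assumptions, not a tested implementation):

```sql
CREATE OR REPLACE PROCEDURE create_table_from_stage(
    stage_path VARCHAR, file_format VARCHAR, target_table VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'run'
AS
$$
def run(session, stage_path, file_format, target_table):
    # Build and run the CREATE TABLE ... USING TEMPLATE statement dynamically
    session.sql(f"""
        CREATE TABLE {target_table}
        USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(INFER_SCHEMA(
                LOCATION => '{stage_path}',
                FILE_FORMAT => '{file_format}')))
    """).collect()
    return f"Created {target_table}"
$$;
```

You could then call it with something like CALL create_table_from_stage('@my_stage/data/', 'my_parquet_format', 'my_table');.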