Snowflake, a cloud-based data warehousing platform, supports various methods for data loading, ensuring flexibility and efficiency for different data scenarios. This document provides an overview of the main options available to load data into Snowflake, including external and internal stages, bulk and continuous loading, simple transformations, and more.
Supported File Locations
Snowflake refers to the location of data files in cloud storage as a “stage.” The COPY INTO <table> command is used for both bulk and continuous data loading, supporting cloud storage accounts managed by your business entity (external stages) and cloud storage contained in your Snowflake account (internal stages).
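As a minimal sketch, a bulk load from a named stage might look like the following (the stage name, table name, and file format settings are illustrative, not taken from the text above):

```sql
-- Load all staged CSV files into a target table.
-- my_stage and my_table are placeholder names.
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```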
External Stages
Data can be loaded from the following cloud storage services:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure
Data in archival cloud storage classes requiring restoration (e.g., Amazon S3 Glacier Flexible Retrieval or Azure Archive Storage) cannot be accessed. External stages are created using the CREATE STAGE command and store the URL to files in cloud storage, access settings, and file format information.
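A sketch of an external stage definition, assuming an S3 bucket and a pre-configured storage integration (all names here are placeholders):

```sql
-- External stage storing the bucket URL, access settings, and file format.
-- my_s3_integration must already exist as a storage integration.
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/raw/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'PARQUET');
```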
Internal Stages
Snowflake maintains several types of internal stages within your account:
- User Stage: Allocated to each user for storing files, designed for single-user management but able to load data into multiple tables.
- Table Stage: Available for each table in Snowflake, for files managed by one or more users and loaded into a single table.
- Named Internal Stage: A database object created in a schema, capable of storing files managed by multiple users and loaded into multiple tables. These stages can be controlled using security access control privileges.
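As a rough example of working with internal stages, assuming the SnowSQL client and an existing orders table (the file path and names are illustrative): the user stage is referenced as @~, a table stage as @%<table_name>, and a named internal stage as @<stage_name>.

```sql
-- Upload a local file to the table stage of "orders", then load it.
PUT file:///tmp/orders.csv @%orders;

COPY INTO orders
  FROM @%orders
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```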
Bulk vs Continuous Loading
Snowflake offers two primary methods for data loading, depending on data volume and loading frequency.
Bulk Loading Using the COPY Command
Bulk loading involves loading data from files already in cloud storage, or staging data files from a local machine in Snowflake’s cloud storage before using the COPY command to load the data into tables. Users must provide virtual warehouses that are appropriately sized to handle the expected loads.
Simple Transformations: Data can be transformed during loading with options such as column reordering, omission, casting, and text truncation.
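For instance, columns can be reordered, cast, and truncated within the COPY statement itself; the positional references ($1, $2, ...) and object names below are assumptions for illustration:

```sql
-- Reorder, cast, and truncate columns from staged CSV files during the load.
COPY INTO customers (id, signup_date, name)
  FROM (
    SELECT $2::NUMBER, TO_DATE($3), SUBSTR($1, 1, 100)
    FROM @my_stage/customers/
  )
  FILE_FORMAT = (TYPE = 'CSV');
```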
Continuous Loading Using Snowpipe
Snowpipe is designed for loading small volumes of data (micro-batches) incrementally. It loads data within minutes after files are added to a stage, giving users near real-time results as raw data becomes available.
Compute Resources: Snowpipe uses Snowflake’s serverless compute model, automatically resizing resources as needed, and charges based on actual workloads.
Transformations: The COPY statement in a Snowpipe definition supports the same transformation options as bulk loading. Additionally, Snowpipe can be combined with automated tasks and change data capture (CDC) information in table streams for more complex transformations.
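A Snowpipe definition is essentially a wrapped COPY statement. A minimal sketch with auto-ingest from an external stage follows; the object names are placeholders, and AUTO_INGEST assumes cloud event notifications have been configured:

```sql
-- Pipe that loads new JSON files from the stage as they arrive.
CREATE PIPE raw_events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @my_s3_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');
```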
Data Pipelines for Complex Transformations
Data pipelines allow complex transformations of loaded data. Typically, they use Snowpipe to load raw data into a staging table and then employ table streams and tasks for transformation and optimization before analysis.
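A rough sketch of that pattern, assuming a raw_events staging table with a single VARIANT column v (the object names, schedule, and target columns are illustrative):

```sql
-- Track new rows arriving in the staging table.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- Periodically transform and move the captured changes.
CREATE TASK transform_events
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO events_clean
  SELECT v:id::NUMBER, v:ts::TIMESTAMP_NTZ, v:payload
  FROM raw_events_stream;

-- Newly created tasks start suspended and must be resumed.
ALTER TASK transform_events RESUME;
```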
Continuous Loading Using Snowpipe Streaming
The Snowpipe Streaming API writes data directly to Snowflake tables without staging files, resulting in lower latency and costs for near real-time data streams. It is also available for the Snowflake Connector for Kafka, offering low-latency and cost-effective data loading.
Loading Data from Apache Kafka Topics
The Snowflake Connector for Kafka enables users to connect to an Apache Kafka server, read data from topics, and load that data into Snowflake tables.
Detection of Column Definitions in Staged Semi-structured Data Files
Snowflake provides solutions for handling semi-structured data, including automatic schema detection for column definitions in data files. Supported formats include Apache Parquet, Apache Avro, ORC, JSON, and CSV files (CSV and JSON are in preview).
SQL Functions for Schema Detection
- INFER_SCHEMA: Detects the column definitions in a set of staged data files and retrieves the metadata in a format suitable for creating Snowflake objects.
- GENERATE_COLUMN_DESCRIPTION: Generates a list of column definitions from staged files using the INFER_SCHEMA output.
These functions work with both internal and external stages. You can create tables or external tables with the detected column definitions, then use the COPY statement with the MATCH_BY_COLUMN_NAME option to load the staged files into the structured table.
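A sketch of the end-to-end pattern for staged Parquet files (the stage, file format, and table names are assumptions):

```sql
CREATE FILE FORMAT parquet_fmt TYPE = PARQUET;

-- Create the table from the detected column definitions.
CREATE TABLE sales
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION => '@my_stage/sales/',
      FILE_FORMAT => 'parquet_fmt'
    ))
  );

-- Load the files, matching columns by name rather than by position.
COPY INTO sales
  FROM @my_stage/sales/
  FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```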
Alternatives to Loading Data
External Tables (Data Lake)
External tables allow querying data stored in external cloud storage without loading it into Snowflake. This is useful for accounts with large amounts of data in external storage, where users can create materialized views on subsets of the data for improved query performance.
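For example, an external table can be defined over staged Parquet files, with a materialized view over a frequently queried subset; the names and columns below are illustrative, and external table data is read from the VALUE variant column:

```sql
-- Query the files in place, without loading them into Snowflake.
CREATE EXTERNAL TABLE ext_sales
  LOCATION = @my_s3_stage/sales/
  FILE_FORMAT = (TYPE = 'PARQUET')
  AUTO_REFRESH = TRUE;

-- Speed up queries over a commonly used subset of the data.
CREATE MATERIALIZED VIEW sales_amounts AS
  SELECT value:order_id::NUMBER      AS order_id,
         value:amount::NUMBER(10, 2) AS amount
  FROM ext_sales;
```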
Working With Amazon S3-compatible Storage
Snowflake supports creating external stages and external tables on storage software and devices that are highly compliant with the Amazon S3 API, making it easier to manage and analyze data stored across different physical locations.
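A hedged sketch of such a stage, where the s3compat:// URL, endpoint, and credentials are placeholders for your own device or service:

```sql
-- External stage on S3-compatible storage outside the major cloud providers.
CREATE STAGE onprem_stage
  URL = 's3compat://my-bucket/data/'
  ENDPOINT = 'storage.example.com'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
```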
Conclusion
Snowflake offers a robust and flexible set of options for loading data, ensuring that users can choose the best method for their specific needs. Whether using bulk or continuous loading, handling complex transformations, or leveraging external tables, Snowflake provides the tools necessary to efficiently manage and analyze data in the cloud.