Snowflake, a cloud-based data warehousing platform, supports various methods for data loading, ensuring flexibility and efficiency for different data scenarios. This document provides an overview of the main options available to load data into Snowflake, including external and internal stages, bulk and continuous loading, simple transformations, and more.
Supported File Locations
Snowflake refers to the location of data files in cloud storage as a “stage.” The COPY INTO <table> command is used for both bulk and continuous data loading, supporting cloud storage accounts managed by your business entity (external stages) and cloud storage contained in your Snowflake account (internal stages).
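As a minimal sketch, a bulk load from a named stage might look like the following (the stage name, table name, and file format settings are illustrative, not taken from the text above):

```sql
-- Load all staged CSV files into a target table.
-- my_stage and my_table are placeholder names.
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```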
External Stages
Data can be loaded from the following cloud storage services:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure
Data in archival cloud storage classes requiring restoration (e.g., Amazon S3 Glacier Flexible Retrieval or Azure Archive Storage) cannot be accessed. External stages are created using the CREATE STAGE command and store the URL to files in cloud storage, access settings, and file format information.
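A sketch of an external stage definition, assuming an S3 bucket and a pre-configured storage integration (all names here are placeholders):

```sql
-- External stage storing the bucket URL, access settings, and file format.
-- my_s3_integration must already exist as a storage integration.
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/raw/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'PARQUET');
```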
Internal Stages
Snowflake maintains several types of internal stages within your account:
- User Stage: Allocated to each user for storing files, designed for single-user management but able to load data into multiple tables.
- Table Stage: Available for each table in Snowflake, for files managed by one or more users and loaded into a single table.
- Named Internal Stage: A database object created in a schema, capable of storing files managed by multiple users and loaded into multiple tables. These stages can be controlled using security access control privileges.
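As a rough example of working with internal stages, assuming the SnowSQL client and an existing orders table (the file path and names are illustrative): the user stage is referenced as @~, a table stage as @%<table_name>, and a named internal stage as @<stage_name>.

```sql
-- Upload a local file to the table stage of "orders", then load it.
PUT file:///tmp/orders.csv @%orders;

COPY INTO orders
  FROM @%orders
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```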
Bulk vs Continuous Loading
Snowflake offers two primary methods for data loading, depending on data volume and loading frequency.
Bulk Loading Using the COPY Command
Bulk loading involves loading data from files already in cloud storage, or staging data files from a local machine in Snowflake’s cloud storage before using the COPY command to load the data into tables. Users must provide virtual warehouses that are appropriately sized to handle the expected loads.
Simple Transformations: Data can be transformed during loading with options such as column reordering, omission, casting, and text truncation.
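For instance, columns can be reordered, cast, and truncated within the COPY statement itself; the positional references ($1, $2, ...) and object names below are assumptions for illustration:

```sql
-- Reorder, cast, and truncate columns from staged CSV files during the load.
COPY INTO customers (id, signup_date, name)
  FROM (
    SELECT $2::NUMBER, TO_DATE($3), SUBSTR($1, 1, 100)
    FROM @my_stage/customers/
  )
  FILE_FORMAT = (TYPE = 'CSV');
```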
Continuous Loading Using Snowpipe
Snowpipe is designed for loading small volumes of data (micro-batches) incrementally. It loads data within minutes after files are added to a stage, giving users near real-time results as raw data becomes available.
Compute Resources: Snowpipe uses Snowflake’s serverless compute model, automatically resizing resources as needed, and charges based on actual workloads.
Transformations: The COPY statement in a Snowpipe definition supports the same transformation options as bulk loading. Additionally, Snowpipe can be combined with automated tasks and change data capture (CDC) information in table streams for more complex transformations.
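A Snowpipe definition is essentially a wrapped COPY statement. A minimal sketch with auto-ingest from an external stage follows; the object names are placeholders, and AUTO_INGEST assumes cloud event notifications have been configured:

```sql
-- Pipe that loads new JSON files from the stage as they arrive.
CREATE PIPE raw_events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_events
  FROM @my_s3_stage/events/
  FILE_FORMAT = (TYPE = 'JSON');
```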
Data Pipelines for Complex Transformations
Data pipelines allow complex transformations of loaded data. Typically, they use Snowpipe to load raw data into a staging table and then employ table streams and tasks for transformation and optimization before analysis.
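A rough sketch of that pattern, assuming a raw_events staging table with a single VARIANT column v (the object names, schedule, and target columns are illustrative):

```sql
-- Track new rows arriving in the staging table.
CREATE STREAM raw_events_stream ON TABLE raw_events;

-- Periodically transform and move the captured changes.
CREATE TASK transform_events
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_EVENTS_STREAM')
AS
  INSERT INTO events_clean
  SELECT v:id::NUMBER, v:ts::TIMESTAMP_NTZ, v:payload
  FROM raw_events_stream;

-- Newly created tasks start suspended and must be resumed.
ALTER TASK transform_events RESUME;
```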
Continuous Loading Using Snowpipe Streaming
The Snowpipe Streaming API writes data directly to Snowflake tables without staging files, resulting in lower latency and costs for near real-time data streams. It is also available for the Snowflake Connector for Kafka, offering low-latency and cost-effective data loading.
Loading Data from Apache Kafka Topics
The Snowflake Connector for Kafka enables users to connect to an Apache Kafka server, read data from topics, and load that data into Snowflake tables.
Detection of Column Definitions in Staged Semi-structured Data Files
Snowflake provides solutions for handling semi-structured data, including automatic schema detection for column definitions in data files. Supported formats include Apache Parquet, Apache Avro, ORC, JSON, and CSV files (CSV and JSON are in preview).
SQL Functions for Schema Detection
- INFER_SCHEMA: Detects the column definitions in a set of staged data files and retrieves the metadata in a format suitable for creating Snowflake objects.
- GENERATE_COLUMN_DESCRIPTION: Generates a list of column definitions from staged files using the INFER_SCHEMA output.
These functions work with both internal and external stages. You can create tables or external tables with the detected column definitions, then use the COPY statement with the MATCH_BY_COLUMN_NAME option to load the staged files into the structured table.
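A sketch of the end-to-end pattern for staged Parquet files (the stage, file format, and table names are assumptions):

```sql
CREATE FILE FORMAT parquet_fmt TYPE = PARQUET;

-- Create the table from the detected column definitions.
CREATE TABLE sales
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(
      LOCATION => '@my_stage/sales/',
      FILE_FORMAT => 'parquet_fmt'
    ))
  );

-- Load the files, matching columns by name rather than by position.
COPY INTO sales
  FROM @my_stage/sales/
  FILE_FORMAT = (FORMAT_NAME = 'parquet_fmt')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```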
Alternatives to Loading Data
External Tables (Data Lake)
External tables allow querying data stored in external cloud storage without loading it into Snowflake. This is useful for accounts with large amounts of data in external storage, where users can create materialized views on subsets of the data for improved query performance.
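For example, an external table can be defined over staged Parquet files, with a materialized view over a frequently queried subset; the names and columns below are illustrative, and external table data is read from the VALUE variant column:

```sql
-- Query the files in place, without loading them into Snowflake.
CREATE EXTERNAL TABLE ext_sales
  LOCATION = @my_s3_stage/sales/
  FILE_FORMAT = (TYPE = 'PARQUET')
  AUTO_REFRESH = TRUE;

-- Speed up queries over a commonly used subset of the data.
CREATE MATERIALIZED VIEW sales_amounts AS
  SELECT value:order_id::NUMBER      AS order_id,
         value:amount::NUMBER(10, 2) AS amount
  FROM ext_sales;
```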
Working With Amazon S3-compatible Storage
Snowflake supports creating external stages and external tables on storage software and devices that are highly compliant with the Amazon S3 API, making it easier to manage and analyze data stored across different physical locations.
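A hedged sketch of such a stage, where the s3compat:// URL, endpoint, and credentials are placeholders for your own device or service:

```sql
-- External stage on S3-compatible storage outside the major cloud providers.
CREATE STAGE onprem_stage
  URL = 's3compat://my-bucket/data/'
  ENDPOINT = 'storage.example.com'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
```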
Conclusion
Snowflake offers a robust and flexible set of options for loading data, ensuring that users can choose the best method for their specific needs. Whether using bulk or continuous loading, handling complex transformations, or leveraging external tables, Snowflake provides the tools necessary to efficiently manage and analyze data in the cloud.