Last week, this author published a Python script that takes a large data file (txt, csv... whatever) and cuts it into smaller pieces so it can be uploaded into Snowflake using Snowflake's Load Table Utility. (And let me tell you, it was a riveting read. Allow yourself the majestic pleasure of reading through it.) But as you can probably guess, that is not the most efficient way to load data into a table in Snowflake.
So is there another way to do this...? Of course. You knew that. I knew that. And now this blog will guide you through it.
Like every major data warehouse out there, Snowflake has created a Python connector that allows you to connect directly to your account. And, like every major data warehouse, there is a Python package where Snowflake has already set up the connection. We can just call it in our script and avoid most of the hassle that surrounded the last blog post's dividing up of the data. There's plenty of good documentation on the connector in Snowflake's docs, but the best part (for those of you who read last week's post) is that there is no limit to the size of the file you can upload (forget about that 50 MB limit from the previous article).
The functional limit for this approach is really your RAM: the whole file has to fit into a pandas dataframe. For anything bigger, Snowflake offers a variety of other data loading methods, or you can stream the file in chunks, as sketched just below. How efficiently Snowflake processes the data it ingests is one of the best reasons to transition to Snowflake. And of course, Mobilize.Net is here to help should that need arise.
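Here's a minimal sketch of that chunked approach, assuming a placeholder path, chunk size, and table name, and using the same connector calls demonstrated in the full script later in this post:

import pandas as pd
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas

# Hypothetical credentials -- replace with your own (see the full script below).
conn = snow.connect(user="USERNAME", password="PASSWORD", account="ACCOUNT",
                    warehouse="WAREHOUSE", database="YOUR_DATABASE_NAME",
                    schema="YOUR_SCHEMA_NAME")

# Stream the file 1,000,000 rows at a time so only one chunk sits in RAM at once.
for chunk in pd.read_csv(r"C:\Users\you\awesome_coding\file.csv", chunksize=1_000_000):
    write_pandas(conn, chunk, "YOUR_TABLE_NAME")  # each call appends to the table

conn.close()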
Let's look again at our example of the 200 MB text file from the previous blog post. Before we get to the code, there are a couple of things you need to make sure are in place. The first is that you actually have to be able to run Python code on your computer. There are plenty of ways to get it, but the Anaconda distribution is the preferred deployment of this Data Scientist (and many others...). It already includes pandas and many other packages that are super helpful for all your analytics tasks.
You will also need to install the Snowflake Connector package in your Python distribution, as well as the extension of that package that works with pandas dataframes. You can pip install the package simply enough by going to your Python command prompt (or the Anaconda Prompt, if you went that route) and typing:
pip install snowflake-connector-python
Once that is complete, get the pandas extension by typing:
pip install "snowflake-connector-python[pandas]"
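(The quotes keep some shells from choking on the square brackets.) Once both installs finish, a quick way to confirm everything imports cleanly (version numbers will vary, of course):

# Confirm the connector and its pandas extension are importable.
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

print(snowflake.connector.__version__)
print(pd.__version__)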
Now you should be good to go. Point the code below at your original (not cut into pieces) file, and point the output at your desired table in Snowflake. Here's the code; I'll highlight what you need to change.
"""
@author: Mobilize.Net
@title: Script for uploading a single large file using the
snowflake connector and a pandas dataframe.
"""
# Note that this code is written to be as easy to understand as possible.
# There are many ways to write this in a more concise manner, but this way
# will still work just fine.

# Importing the required packages for all your data framing needs.
import pandas as pd
# The Snowflake Connector library.
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas
## Phase I: Truncate/Delete the current data in the table
# The connector...
conn = snow.connect(user="USERNAME",
                    password="PASSWORD",
                    account="ACCOUNT",
                    # (ACCOUNT is the prefix in your Snowflake space... for example,
                    # company.snowflakecomputing.com would just be "company".)
                    warehouse="WAREHOUSE",
                    database="YOUR_DATABASE_NAME",
                    schema="YOUR_SCHEMA_NAME")
# (FYI, if you don't want to hard-code your password into a script, there are
# some other options for security.)

# Create a cursor object.
cur = conn.cursor()
# Execute a statement that will delete the data from the current table.
# If you would prefer not to do this, then just comment it out.
# In fact, we'll leave it commented out, just in case the data file you
# are importing will be appended to the existing table, and not replacing it.
# sql = "truncate table if exists YOUR_TABLE_NAME"
# cur.execute(sql)
# Close the cursor.
cur.close()
## Phase II: Upload from the Exported Data File.
# Let's import a new dataframe so that we can test this.
original = r"C:\Users\you\awesome_coding\file.csv" # <- Replace with your path.
delimiter = "," # Replace if you're using a different delimiter.
# Get it as a pandas dataframe.
total = pd.read_csv(original, sep=delimiter)
# Drop any columns you may not need (optional).
# total.drop(columns = ['A_ColumnName',
# 'B_ColumnName'],
# inplace = True)
# Rename the columns in the dataframe if they don't match your existing table.
# This is optional, but ESSENTIAL if you already have created the table format
# in Snowflake.

# total.rename(columns={"A_ColumnName": "A_COLUMN",
#                       "B_ColumnName": "B_COLUMN"},
#              inplace=True)
# Actually write to the table in Snowflake.
write_pandas(conn, total, "YOUR_TABLE_NAME")
# (Optionally, you can check to see if what you loaded is identical
# to what you have in your pandas dataframe. Perhaps... a topic for a future
# blog post.)

## Phase III: Turn off the warehouse.
# Create a cursor object.
cur = conn.cursor()
# Execute a statement that will suspend (turn off) the warehouse.
# (Replace WAREHOUSE with the name of your warehouse.)
sql = "ALTER WAREHOUSE WAREHOUSE SUSPEND"
cur.execute(sql)
# Close your cursor and your connection.
cur.close()
conn.close()
# And that's it. Much easier than using the load data utility, but maybe
# not as user friendly.
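Speaking of that verification teaser: write_pandas actually returns enough information for a basic sanity check. Here's a minimal sketch, meant to stand in for the single write_pandas call above (run it before the warehouse is suspended and the connection closed); the return values are the success flag, chunk count, row count, and the raw COPY output:

# write_pandas reports back on what it loaded.
success, nchunks, nrows, _ = write_pandas(conn, total, "YOUR_TABLE_NAME")
print(success, nchunks, nrows)

# Compare the row count in Snowflake against the dataframe.
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM YOUR_TABLE_NAME")
print(cur.fetchone()[0] == len(total))  # True if the counts match
cur.close()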
Now... note that this script uses a table, schema, and database that you have already created in Snowflake. Can you create the table in Python? Of course. You can create a new table and define its format right before you insert the data into it. In fact, as you can probably guess from the execute(sql) statements above, you can do just about anything you can write in SQL from Python using the Snowflake Connector API. That sounds like another blog post (and sure enough, it is another blog post).
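For a taste, here's a hedged sketch of creating the destination table right before loading it; the column names and types are illustrative placeholders, not from the original script:

# Create the destination table if it doesn't already exist.
# A_COLUMN and B_COLUMN are placeholders -- match these to your file.
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS YOUR_TABLE_NAME (
        A_COLUMN VARCHAR,
        B_COLUMN NUMBER
    )
""")
cur.close()

# Now the write_pandas call from the script above has somewhere to land.
write_pandas(conn, total, "YOUR_TABLE_NAME")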
Loading data into Snowflake is relatively straightforward using scripts like the one above in Python. If you're looking to move an entire workload, Mobilize.Net is here to help migrate data from any legacy SQL data warehouse to Snowflake. If you've got some Teradata BTEQ and want to see how much of it can be automated to Python, check out the free Teradata to Snowflake assessment tool by using the button below. Even the most complex BTEQ scripts can be converted automatically to functionally equivalent Python that can be run directly in Snowflake using the Python connector.
Stay safe. Keep coding.