Snowflake plus Snowpark
by Brandon Carver, on Feb 14, 2022 5:30:00 AM
In case you missed it, Snowpark has finally gone to general availability for Snowflake. If you were participating in the public or private preview, you may already be familiar with this excellent extension to Snowflake's Data Cloud offering. In fact, if you've been following Snowflake's product announcements, it would have been pretty hard to miss, given the promotion at last year's Snowflake Summit and Snowflake BUILD conferences.
But why should we care? I mean... is Snowpark actually a big deal or just another in a series of overhyped announcements from Silicon Valley? These are great questions. Let us try to answer those and some other questions at a high level in this blog, and we'll get into the details later.
What is Snowpark?
So... what is Snowpark? This seems like an appropriate first question. Let's all get on the same page first.
Snowpark is described by Snowflake here, but if you're interested in a more human-friendly version: it essentially lets users take code written in Java, Scala, Python, and potentially other languages (in the future) and run that code in a Snowflake account. Snowpark is not a new interface or a new programming language on its own, but rather an API to connect to Snowflake and use your Snowflake credits to run your code in the cloud. If you already have code in one of these languages, then with a few tweaks (more on that in a bit) you can run it in the cloud using your Snowflake account.
Scala is the first language fully supported in Snowpark, so we'll focus on that for the rest of this blog post. When Snowpark for Python and Java (not just Java UDFs) becomes available, we'll blog about that too. Fear not.
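To make that concrete, here's roughly what connecting to Snowpark from Scala looks like. This is a minimal sketch using the Snowpark Scala API; every connection parameter below is a placeholder you'd swap for your own account details.

```scala
import com.snowflake.snowpark.Session

// Build a Snowpark session against your Snowflake account.
// All values below are placeholders; substitute your own credentials.
val session = Session.builder.configs(Map(
  "URL"       -> "https://<account_identifier>.snowflakecomputing.com",
  "USER"      -> "<user>",
  "PASSWORD"  -> "<password>",
  "ROLE"      -> "<role>",
  "WAREHOUSE" -> "<warehouse>", // the compute that will run your code
  "DB"        -> "<database>",
  "SCHEMA"    -> "<schema>"
)).create
```

Once the session exists, everything you do with it runs against your Snowflake account rather than your own machine.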
So, if we can now run Scala in Snowflake, that begs the next question:
Where is my Scala running right now?
This might be an obvious one, but let's leave no stone unturned. If you're a lone coder on a mission to save the world with better programming, then you're probably running a local instance of one of those languages on your computer. You can download and install a local edition of Scala by visiting the Scala language website. Is that the only option? Of course not. Like Snowflake, there are other cloud computing providers out there that will take your Scala and run it on cloud computing clusters (likely... Spark clusters). This means your code runs in the cloud, not on your own computer's resources. There may be some caveats there, but that's the general idea. If you're writing Scala for your organization, you may have on-premises clusters managed by your organization. If you're less constrained by the limitations of your own compute power and have embraced the cloud computing model, then you're likely using distributed Spark clusters.
If you're an avid Scala code-writer, then you may be more familiar with other cloud-based Spark platforms such as Databricks or Amazon EMR. Those platforms have their own implementations of the Spark API, and that approach has been more and more successful for them.
So...
What is the advantage of using Snowpark?
There might be several depending on your needs, but from a business perspective two jump out immediately: integration and scale. Integration is the obvious one, as there's now no ETL or other process needed to "get" the data in your database to run in Databricks or elsewhere. If you've already got a Snowflake account, you can start using the Snowpark API now. Like... right now. And your data is already there in your Snowflake account. All those tables and views you're building your Scala dataframes from? They're right there in your Snowflake account. It's all integrated. Synergy.
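As a quick sketch of what that integration feels like (the table and column names here are hypothetical), you build your dataframe straight from a table that's already in your account; there's no extract step:

```scala
import com.snowflake.snowpark.functions.col

// Build a dataframe directly from a table that already lives in Snowflake.
// "SALES" and its columns are hypothetical names for illustration.
val sales = session.table("SALES")

// Transformations are lazy; the filter is pushed down and executed
// inside Snowflake rather than pulling the data out first.
val bigDeals = sales
  .filter(col("AMOUNT") > 10000)
  .select(col("CUSTOMER_ID"), col("AMOUNT"))

bigDeals.show() // runs in your Snowflake warehouse
```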
What comes with that synergy, though, is something even more important: scale. Snowpark opens up a lot of possibilities that will help you scale (both up and down as needed) that you may not have now. Snowpark takes all of the advantages that Snowflake has as a cloud-based platform and applies them to your analytics code, not just your database-specific SQL. Let's say you're writing some Scala code to build a logistic regression model (maybe something like this one). As mentioned earlier, you may be running this on a local machine or a Spark cluster. The speed and performance of that piece of code is bounded by the limitations of your machine or your designated resources. Should you need more resources, you have to add them yourself: more clusters or, literally, more compute hardware in your local stack. Snowflake's whole architecture is based on scale. Upgrade or downgrade your compute warehouse at will, as in the sketch below. As your data gets larger, the resources that run your analytics operations can scale accordingly.
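For instance, resizing the warehouse that backs your Snowpark session is a single SQL statement. (The warehouse name below is a placeholder.)

```scala
// Scale the warehouse that runs your Snowpark code up for a heavy job...
// ("ANALYTICS_WH" is a placeholder warehouse name.)
session.sql("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'X-LARGE'").collect()

// ... run the expensive model-building step here ...

// ... then scale back down when you no longer need the big compute.
session.sql("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'SMALL'").collect()
```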
Some of you may be saying, "Nice." Others may be saying, "Alas! I do not have my data in a Snowflake account. I use another data warehouse, or I'm merely exporting data to spreadsheets and building my dataframes from there." If that's you, this author would say it matters a great deal: you can now simplify your organization's data architecture by having all of your custom analytics scripts and data warehousing in one location. And that location is cloud-based, not a back room in what may be a relatively empty office building. Even if you find yourself in that lone-coder situation, you can get started for free with Snowflake and see if it's a solution that scales for you.
What are the disadvantages?
It can't all be advantages, can it? As with any scenario where you're considering a new platform, there will always be differences. While the implementation of Spark-Scala in Snowpark is pretty close to what you may be used to, it's not 100%. There will be gaps between the language and functionality you're used to when writing Scala outside of Snowpark and what Snowpark supports today. Luckily, Mobilize.Net is building a SnowConvert for Snowpark that will be available soon to smooth the road from your current implementation of Scala to Snowflake's. Stay tuned for that.
There are also some limitations around how Snowpark handles aggregate functions and around some of the data types that are supported. As Snowflake continues to evolve its implementation of Snowpark, that will improve over time (the sketch below gives a flavor of how close the two APIs usually are). Mobilize.Net has been participating in the preview versions all along, and we would be happy to chat with you about the positives and negatives as you're considering a move to Snowpark.
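Here's a simple aggregation written against the Snowpark Scala API, with the rough Spark equivalent in a comment; close, but not character-for-character identical. Table and column names are hypothetical.

```scala
import com.snowflake.snowpark.functions.{col, sum}

// Snowpark: group and aggregate, executed inside Snowflake.
// The Spark version is nearly identical, e.g.:
//   df.groupBy($"REGION").agg(sum($"AMOUNT").as("TOTAL_SALES"))
// though Spark's $"..." shorthand comes from spark.implicits._;
// in Snowpark you typically reference columns with col("...").
val totals = session.table("SALES")
  .groupBy(col("REGION"))
  .agg(sum(col("AMOUNT")).as("TOTAL_SALES"))

totals.show()
```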
What's the verdict?
This all sounds magical so far, but really... what does Mobilize.Net think? Why would I care if my analysis code can be run inside of Snowflake? In our experience so far, the benefits are real. Snowpark is about much more than giving coders the ability to write in different languages: it brings together your analytics code written in Java, Scala, or Python with your cloud data platform. There are always considerations when evaluating a new platform or making an important decision like this, but as an organization that has been doing exactly that for a quarter century, we'd say the value you get by simplifying your code stack with Snowpark is worth the risks.
We'll help you get up and running in Snowpark in future blog posts, but if you're thinking about getting started with Snowpark, let us know. If you're already using Snowpark and are looking to get more out of it, let us know. And if you're already using Snowpark and are using it well... let us know. We're always interested in hearing how you're using not just the tools made by Mobilize.Net, but your data platform as a whole... Snowflake or otherwise. Welcome to Snowpark. Watch your head on the lift!