Note: SnowConvert is now available directly from Snowflake. Learn more by visiting the documentation for both SnowConvert and the Snowpark Migration Accelerator (SMA).
Last week, Snowflake announced that its support for Python in Snowflake (and Snowpark) is now generally available. Today, Mobilize.Net is announcing that SnowConvert for PySpark is available in both assessment and conversion modes.
More on SnowConvert below, but why should you care about Python in Snowflake? We described why back when Scala support in Snowpark became generally available, but to summarize in one word: performance. Everything stays in Snowflake. Nothing has to travel. Say you have a Snowflake account and are using Spark today. You may be running an ETL job that pulls data out of Snowflake, transforms it in your Spark cluster, executes whatever task you're looking to do in Spark, and inserts the result back into Snowflake (if that's where it's landing...). All of this could be simplified by leaving the data in Snowflake and running those Spark tasks with Snowpark. There would be no E or L in your ETL. That's a huge time savings and performance boost for anyone working with large amounts of data in their Spark clusters.
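To make that concrete, here's a minimal sketch (ours, not from the announcement) of what a "no E or L" job can look like with the Snowpark Python API; the connection parameters, table names, and column names are placeholders:

```python
# A sketch of an in-Snowflake, transform-only job with Snowpark Python.
# Connection parameters, table names, and column names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}).create()

# The DataFrame is a lazy query that runs inside Snowflake -- no extract, no load.
orders = session.table("ORDERS")
daily_totals = (
    orders.filter(col("STATUS") == "SHIPPED")
          .group_by("ORDER_DATE")
          .agg(sum_("AMOUNT").alias("TOTAL_AMOUNT"))
)

# The result lands directly in another Snowflake table.
daily_totals.write.save_as_table("DAILY_ORDER_TOTALS", mode="overwrite")
```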
That's the why. But what about the how? If you're using Spark now, how do you get started with Snowpark? First, we'd suggest checking out the Snowpark API by taking the 3 minutes to Snowpark challenge in BlackDiamond Studio. This will get you up and running writing Python code in Snowpark, and you can even package that code as a UDF or procedure in Snowflake. (If you get lost in the Snowpark, we've been there.)
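For a flavor of that last step, here's a hedged sketch of packaging a small Python function as a Snowflake UDF with the Snowpark API; it assumes a Snowpark `session` like the one in the sketch above, and the function, table, and column names are made up:

```python
# A sketch of packaging Python logic as a Snowflake UDF with Snowpark.
# Assumes an existing Snowpark `session`; all names here are hypothetical.
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import FloatType

@udf(name="fahrenheit_to_celsius", replace=True,
     input_types=[FloatType()], return_type=FloatType(), session=session)
def fahrenheit_to_celsius(temp_f: float) -> float:
    # Runs inside Snowflake's Python runtime, right next to the data.
    return (temp_f - 32.0) * 5.0 / 9.0

# Use the UDF in a DataFrame expression (or call FAHRENHEIT_TO_CELSIUS from SQL).
readings = session.table("SENSOR_READINGS")
readings.select(fahrenheit_to_celsius(col("TEMP_F")).alias("TEMP_C")).show()
```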
That's how you can start trying out Snowpark with Python from scratch, or by rewriting small pieces of code. But what if you have more than a few lines of PySpark code? How can you actually run that existing Spark code with Snowflake's Snowpark? It starts with an assessment, which we did a complete walkthrough of last week. Getting this assessment will help you understand what is and isn't supported, both in Snowflake and by SnowConvert for PySpark. There's no downside to getting an assessment: it's free, and you can do it right now.
The assessment gives you a basic inventory of the references to the Spark API in your source code. How will you know what converts and what doesn't? When you run the assessment, you'll get next steps for determining how much of your code is "ready" for Snowpark. Once you've run an assessment and understand what comes next, SnowConvert for PySpark can be used to execute the conversion.
Automated conversion is the next step, and that's where the Mobilize.Net SnowConvert team's years of experience come into play. SnowConvert for PySpark takes every reference to the Spark API in your Python code and converts it to a reference to the Snowpark API. All you have to do is specify the Python code or notebook files you'd like to convert and call the tool on those files; you can do so in BlackDiamond Studio if you've already run the assessment. SnowConvert then executes the conversion by building a complete semantic model of the source code. If you're familiar with SnowConvert at all, you may be aware that it is not a glorified find-and-replace or regex tool. It is a functional-equivalence tool that builds an abstract syntax tree (AST) to represent the functionality present in the source code, and then translates that over to the target.
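To give a feel for the kind of rewrite involved (this is our own illustration, not actual SnowConvert output), a Spark session and connector read typically become a Snowpark session and a direct table reference, while most DataFrame operations carry over:

```python
# Our own before/after illustration -- not actual SnowConvert output.
# PySpark original (for reference):
#   spark = SparkSession.builder.appName("sales").getOrCreate()
#   df = spark.read.format("snowflake").options(**sf_options).option("dbtable", "SALES").load()
#   df.groupBy("REGION").count().show()

# A rough Snowpark equivalent:
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

df = session.table("SALES")           # replaces the Spark connector read
df.group_by("REGION").count().show()  # most DataFrame calls carry over
```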
That equivalent functionality can be produced in a few different ways, and if an element can't be mapped, SnowConvert will tell you so. To better understand what can and cannot be converted to Snowpark, let's take a look at the categories the tool assigns to each reference to the Spark API in the source code.
col("col1")
col("col1")
orderBy("date")
sort("date")
instr(str, substr)
def instr(source str, substr: str) => str:
return charindex(substr, str)
col1 = col("col1")
col2 = col("col2")
col1.contains(col2)
col1 = col("col1")
col2 = col("col2")
form snowflake.snowpark.functions as f
f.contains(col, col2)
instr(str, substr)
charindex(substr, str)
df:DataFrame = spark.createDataFrame(rowData, columns)
df.alias("d")
df:DataFrame = spark.createDataFrame(rowData, columns)
# Error Code SPRKPY1107 - DataFrame.alias is not supported
# df.alias("d")
If you're wondering how much of your codebase will fall into one of these categories, running the assessment on your PySpark code will let you know.
SnowConvert for PySpark is now available. Get started with an assessment today. And if you have Spark Scala code, fear not: SnowConvert for Spark Scala is also available today. You can get an assessment on any Spark Scala code you may have by following the same steps in the assessment guide.
Happy migrating.