Automating Migration from PySpark to Snowpark
by Brandon Carver, on Nov 15, 2022 10:08:14 PM
Note: SnowConvert is now available directly from Snowflake. Learn more by visiting the documentation for both SnowConvert and the Snowpark Migration Accelerator (SMA).
Last week, Snowflake announced that their support for Python in Snowflake (and Snowpark) is now generally available. Today, Mobilize.Net announces SnowConvert for PySpark is available in both assessment and conversion modes.
More on SnowConvert below, but why should you care about Python in Snowflake? We described why you should care back when Scala support in Snowpark became generally available, but to summarize in one word: performance. Everything stays in Snowflake. Nothing has to travel. Let's say you have a Snowflake account now and are using Spark. You may be running an ETL job that pulls data from Snowflake, transforms it in your Spark cluster, executes whatever task you're looking to do in Spark, and inserts the result back into Snowflake (if that's where it's landing...). All of this could be simplified by leaving the data in Snowflake and running those Spark tasks with Snowpark. There would be no E or L in your ETL. That's a huge time savings and performance boost for anyone working with large amounts of data in their Spark clusters.
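To make that concrete, here's a rough sketch (hand-written, not SnowConvert output) of the same filter-and-aggregate step done both ways; the table names, sf_options, and connection_parameters are placeholders you'd fill in for your own account:

# Spark: data is extracted from Snowflake into the cluster, transformed, then loaded back.
spark_df = (
    spark.read.format("snowflake")
    .options(**sf_options)              # placeholder connector options
    .option("dbtable", "ORDERS")
    .load()
)
open_orders = spark_df.filter(spark_df["STATUS"] == "OPEN").groupBy("REGION").count()
(open_orders.write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "OPEN_ORDERS_BY_REGION")
    .mode("overwrite")
    .save())

# Snowpark: the same logic runs inside Snowflake; the data never leaves.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs(connection_parameters).create()  # placeholder config
session.table("ORDERS") \
    .filter(col("STATUS") == "OPEN") \
    .group_by("REGION") \
    .count() \
    .write.save_as_table("OPEN_ORDERS_BY_REGION", mode="overwrite")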
Getting Started with Snowpark
That's the why. But what about the how? If you're using Spark now, how do you get started with Snowpark? First, we'd suggest checking out the Snowpark API by taking the 3 minutes to Snowpark challenge in BlackDiamond Studio. This will get you up and running writing Python code in Snowpark, and you can even package that code as a UDF or procedure in Snowflake. (If you get lost in the Snowpark, we've been there.)
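As a quick illustration of what that can look like (a minimal sketch with hypothetical object names and placeholder connection parameters, not a tutorial), you can register a plain Python function as a Snowflake UDF from a Snowpark session and then use it in a query:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import IntegerType

session = Session.builder.configs(connection_parameters).create()  # placeholder config

# A plain Python function...
def add_one(x: int) -> int:
    return x + 1

# ...registered as a UDF that runs inside Snowflake.
add_one_udf = session.udf.register(
    add_one,
    name="add_one",
    return_type=IntegerType(),
    input_types=[IntegerType()],
    replace=True,
)

session.table("NUMBERS").select(add_one_udf(col("N"))).show()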
That's how you can start trying out Snowpark with Python from scratch or rewriting small pieces of code. But what if you have more than a few lines of PySpark code? How can you actually run this existing Spark code with Snowflake's Snowpark? That starts with an assessment, which we did a complete walkthrough of last week. Getting this assessment will allow you to understand what is supported and what is not supported, both in Snowflake and by SnowConvert for PySpark. There's no downside to getting an assessment. It's free and you can do it right now.
This assessment will give you a basic inventory of the references to the Spark API in your source code. How will you know what's converted or not? When you run the assessment, you will be given next steps for finding out how much of your code is "ready" for Snowpark. Once you've had an assessment and understand what comes next, SnowConvert for PySpark can be used to execute the conversion.
Conversion with SnowConvert for PySpark
Automated conversion is the next step, and that's where the Mobilize.Net SnowConvert team's years of experience come into play. SnowConvert for PySpark takes all the references to the Spark API present in your Python code and converts them to references to the Snowpark API. All you have to do is specify the Python code or notebook files that you'd like to convert, and call the tool on those files. You can do so in BlackDiamond Studio if you've already run the assessment. SnowConvert will then execute the conversion by building a complete semantic model of the source code. If you're familiar with SnowConvert at all, you may be aware that it is not a glorified find-and-replace or regex tool. It is a functional equivalence tool that builds an abstract syntax tree (AST) to represent the functionality present in the source code, and subsequently translates that over to the target.
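To illustrate the idea (this is a hand-written sketch, not actual SnowConvert output), a small PySpark snippet and a Snowpark equivalent might look like this, with connection_parameters standing in for your own session configuration:

# PySpark source
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.table("SALES")
df.filter(col("AMOUNT") > 100).orderBy("SALE_DATE").show()

# Snowpark equivalent: the Spark API references are mapped to Snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

session = Session.builder.configs(connection_parameters).create()  # placeholder config
df = session.table("SALES")
df.filter(col("AMOUNT") > 100).sort("SALE_DATE").show()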
This functionality can be created in a variety of different ways. If an element cannot be mapped, then SnowConvert will tell you that. To better understand what can and cannot be converted to Snowpark, let's take a look at the categories the tool outputs for any reference to the Spark API in the source code.
- Direct - Direct translation where the same function exists in both PySpark and Snowpark, with no change needed other than the import call. Example:
- Spark
col("col1")
- Snowpark
col("col1")
- Rename - The function from PySpark exists in Snowpark, but a rename is needed. Example:
- Spark
orderBy("date")
- Snowpark
sort("date")
- Helper - These next two categories are where the power and experience behind Mobilize.Net's proven code conversion technology really shine through. Helper classes are created automatically by SnowConvert to mimic functionality that is present in the source platform but may not yet be available in the target. In the case of PySpark, these are functions that have a small difference in Snowpark from Spark that can be resolved with a helper function. Take the parameters passed to a function as an example: differences such as additional "fixed" parameters or a different parameter order can be resolved with a helper class or function. Example:
- Spark
instr(str, substr)
- Snowpark
def instr(source: str, substr: str) -> str:
    return charindex(substr, source)
- Transformation - As with the helper category above, these are functional recreations of the source code's behavior by SnowConvert. Unlike the helper category, these functions are completely recreated, often without any resemblance to the original function. They can include calling additional functions, adding multiple lines of code, or any number of other transformative operations that will support a functionally equivalent output. Example:
- Spark
col1 = col("col1")
col2 = col("col2")
col1.contains(col2)
- Snowpark
import snowflake.snowpark.functions as f
col1 = f.col("col1")
col2 = f.col("col2")
f.contains(col1, col2)
- Workaround - SnowConvert can identify functions where a workaround has been found, but the implementation cannot be automated at this time. This could be for any number of reasons: the workaround requires user input or a decision from the user, the workaround only works in certain scenarios, or the workaround has not yet been automated. A very basic example:
- Spark
instr(str, substr)
- Snowpark
Manually replace with charindex(substr, str)
- Not Supported - There will always be a functional gap between one platform and another that cannot be bridged through automation. These elements fall into this category. As with the other tools in the family of SnowConvert products, SnowConvert for PySpark does its best to identify where all of these elements occur, and Mobilize.Net will make recommendations on how to address migrating these elements. The tool will insert a comment identifying that the function is not supported. Example:
- Spark
df:DataFrame = spark.createDataFrame(rowData, columns)
df.alias("d")
- Snowpark
df:DataFrame = spark.createDataFrame(rowData, columns)
# Error Code SPRKPY1107 - DataFrame.alias is not supported
# df.alias("d")
- Not Defined - This is the final category of element: the great unknown. If an element returns as Not Defined, it means that the tool has not yet categorized whether the function has a mapping from Spark to Snowpark. This could be because it is a new function or one the tool has never seen before. Any functions that return as Not Defined are evaluated and added to the core reference table in one of the above categories.
If you're wondering how much of your codebase will fall into one of these categories, running the assessment on your PySpark code will let you know.
SnowConvert for PySpark is now available. Get started with an assessment today. And if you have Spark Scala, fear not. SnowConvert for Spark Scala is also available today. You can get an assessment on any Spark Scala code you may have by following the same steps in the assessment guide.
Happy migrating.