Have any of you seen this meme as you peruse the web outside of the wonder that is the Mobilize.Net website? I have had that exact look on my face when faced with another Map Reduce job or when trying to export a massive table from my data warehouse to put into a Pandas dataframe in Python. That… exact… look. I’m not sure if I’ve ever made it to 78 billion rows in a text file, but I’ve certainly made it into the billions before. And that is intimidating, terrifying, and just a bit challenging to someone who feels up for a challenge.
Welcome to the world of data science, where you dream of machine learning but end up with massive text files and debates over how much RAM you can pack into your system.
I found myself dealing with a similar, albeit significantly smaller, problem recently. I was transitioning some data from a third-party source into a table in Snowflake. It was a 200 MB txt file (well, csv, but who's asking) that was aspiring to get to 78 billion lines, but just couldn't quite get there. Snowflake has a "Load Table" utility that is supposed to make this a really simple process. So... what's the problem? Why reference a meme that may already be significantly outdated? Surely, with one txt file (and not even an exceptionally large one at that), it would be simple.
But alas, when staging this particular dataset, I discovered that one can only have files 16 MB in size or lower to load into a table. [EDIT: This is actually not true. If you read the article that this is linked to, you'll see that 16 MB is the limit for the VARIANT data type, which means that this is a row limitation, not a file limitation. In other words, you can't have a row that is larger than 16 MB in size. The file size limit for using the Load Table utility in the GUI is actually 50 MB, as is pretty clearly spelled out in this article. If you read a more recent blog post by this author, you'll see that you can upload a table of any size you want using Snowflake's Python Connector (or any other way they allow you to upload data outside of the web GUI). Thus ends this edit. Keep the feedback coming, y'all!] How tragic for me, and maybe for you if this has ever happened to you. You could get lost in a world of file conversions, potentially losing some integrity of the data, or drown in a sea of altered settings that come with changing filetypes. There are a lot of random software applications out there that claim to be able to achieve this simple task, but most are in need of some modernization and upgrading (VB6, dare I say?). If you were not so programming-ly inclined, you could do this yourself by taking some time to cut and paste rows into different files until you have a series of much smaller files. But fortunately for you, this author and, undoubtedly, this audience are savvy, solution-oriented, programming-ly inclined humans. Why not write your own script that is tailored to this particular problem?
Well, of course, that is what we chose to do. I brushed off some rusty Python skills (feel free to suggest simpler solutions if you feel the need to judge my code, though this author has taken the Stack Overflow rudeness analysis thoroughly to heart) and threw together a tailored solution. Feel free to take advantage of my few minutes of effort if you ever encounter a similar problem. This code will take in one CSV and output a series of CSV files that are smaller than the threshold for Snowflake. When you are finished, you can use Snowflake's "Load Table" utility to finish the job. As mentioned previously, a future edition of this blog will include an upgraded version of this script that can take the output directly into Snowflake without messing with the web GUI. (And again, here is that future version of this blog. Get excited!)
Note that this solution will work with any size file, not just the 200 MB CSV file that I started with. (Could you do a 78-billion-row file...? Possibly, if you had unlimited RAM and an incredible supply of patience.) This code also scans for a delimiter that is not being used in your data, so you don't end up with some weird line breaks anywhere in there (an issue that the world of data science often encounters).
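A quick aside before the script: if you really are staring down something closer to that 78-billion-row monster, the trick is to avoid loading the whole thing into memory at all. Here is a minimal sketch of that idea using pandas' chunksize option; the file path, delimiter, and chunk size are placeholders, and it skips the delimiter scan that the full script performs. For a couple hundred MB, loading everything at once (as the script below does) is perfectly fine.
# A chunked alternative for files that won't fit in RAM (sketch only).
import pandas as pd
# Read the source file a million rows at a time instead of all at once.
reader = pd.read_csv(r"C:\Users\you\awesome_coding\file.csv", sep=",", chunksize=1000000)
for i, chunk in enumerate(reader, start=1):
    # Write each chunk straight out as its own numbered, pipe-delimited file.
    chunk.to_csv(r"C:\Users\you\awesome_coding\file%03d.csv" % i, index=False, sep="|")
Now, on to the full script.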
"""
@author: Mobilize.Net
@title: Script for splitting the issues file into smaller files for upload.
"""
# Importing the required packages for all your file separation needs.
import pandas as pd
import os
import math
## Part 1: Variables that you may need to change/setup each run.
# The original file's filepath. This is the file you want to split
# into multiple smaller files. Use the full pathway, unless you want
# to get fancy and do some additional programming. If you put the full
# pathway in, you can run this python file from any location on
# your computer.
original_file = r"C:\Users\you\awesome_coding\file.csv" # <- Replace with your file path.
# Note the delimiter of the file you just specified. If it's a standard
# .csv, then it will be a comma. But I generally prefer to use pipe
# whenever possible.
delimiter = ","
# Denote the target size. If you're loading in data, check the
# Snowflake documentation to determine the maximum file size
# you can use. 50 MB is a good standard to use.
target_size = 50 # in megabytes
## Part 2: Load in the original spreadsheet.
# Note that read_csv reads any text file, not just those with
# the .csv extension.
issues_total = pd.read_csv(original_file, sep = delimiter)
# What's the size of the file in question?
size = os.path.getsize(original_file) # in bytes
size_MB = round(size/1000000, 2) # in megabytes
# Drop in just below the maximum target file's size.
target_size_safe = 0.8 * target_size
# So, how many partitions do we need?
partitions = math.ceil(size_MB / target_size_safe)
print("We need %s partitions of this file."%partitions)
# And how many rows does that mean we need?
length = len(issues_total)
row_set = math.ceil(length / partitions)
print("There will be approximately %s rows in each output file."%row_set)
## Part 3: Find the best delimiter for this dataset.
# You can't always trust the comma or the pipe! Python (pandas) may be
# able to handle anything, but Snowflake won't. If you have some text
# in your file that has a comma (or whatever else) in it, Snowflake will
# struggle mightily to upload your data (speaking from personal
# experience). You can skip this part if it's not necessary, but why
# not run it anyway? Can't hurt, though it will cost you some time
# depending on the size of your data.
# Here are some potential candidates for a delimiter.
# Feel free to add additional characters if you think this list is not
# sufficient.
possible_delimiters = [",","|","!","@","#","$","%","^","&","*","<",">"]
# Try each possible delimiter to find a delimiter that is not being used.
for x in possible_delimiters:
    count_list = []
    for i in issues_total.columns:
        current_column = issues_total[i].astype(str)
        count = sum(current_column.str.count(r"\%s"%x))
        count_list.append(count)
    if sum(count_list) < 1:
        chosen_delimiter = x
        break
    if sum(count_list) > 0:
        # If you want updates to print out, you can un-comment the next line.
        # print("For %s, the count is %s."%(x,sum(count_list)))
        pass
# And at the end, we print the chosen one.
print("The chosen delimiter is %s."%(chosen_delimiter))
## Part 4: Break up the files.
# So we've got the original file, the delimiter ("the chosen one"!),
# and the number of rows we want for each file... so let's loop through and
# split them apart. The while loop below is very simple, but note that
# it could take a long time to complete depending on the number of files
# being produced. The naming convention below handles up to 999 files;
# adjust the filenum logic if you need more than that.
# The number that will be put at the end of the filename. This will
# be updated by the loop each time it completes.
filenum = "001"
# The next two lines take the original filename and build the output
# filename with filenum inserted before the extension.
filename_index = original_file.rfind(".")
output_file = original_file[0:filename_index] + filenum + original_file[filename_index:]
# Setup piece for the loop.
start_loc = 0
# Looping through the files.
while start_loc < length:
    # Subset of data under consideration.
    consideration = issues_total[start_loc:start_loc+row_set]
    # Note that each file will output to the same directory that the
    # original file was in. They will just have a number at the end.
    consideration.to_csv(output_file, index = False, sep = chosen_delimiter)
    # If you want to track your progress, you can un-comment the line below.
    # print(filenum, start_loc)
    # Resetting the iterating variables.
    start_loc = start_loc + row_set
    file_update = int(filenum) + 1
    index_spot = output_file.rfind(filenum)
    # Applying the correct naming convention to the next file.
    if file_update < 10:
        filenum = "00%s"%str(file_update)
    elif file_update < 100:
        filenum = "0%s"%str(file_update)
    elif file_update > 99:
        filenum = str(file_update)
    # Updating the name of the output file for the next iteration of the loop.
    output_file = output_file[:index_spot] + filenum + output_file[index_spot+3:]
print("There were %s files created from the original."%(int(filenum) - 1))
print("The end. \nCheck the directory with your original file to see the output.")
# P.S. - We'll take this directly into Snowflake by using Snowflake's
# python connector in a future blog post, so stay tuned.
And that is that. The output from this will give you a directory full of files that you can load with Snowflake's "Load Table" utility; simply choose the files you want to upload. You could also just write the Snowflake SQL yourself, but make sure you specify the delimiter (... the chosen one!) from the code.
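And if you'd rather skip the web GUI entirely (the teaser from that P.S. in the code), something along the lines of the sketch below should do the trick with Snowflake's Python connector. To be clear, this is just a sketch: the credentials, table name, and file pattern are placeholders, and you'd swap the pipe in FIELD_DELIMITER for whatever chosen_delimiter the script printed out.
import snowflake.connector
# Connect with your own account details (all placeholders here).
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="YOUR_WAREHOUSE",
    database="YOUR_DATABASE",
    schema="YOUR_SCHEMA",
)
cur = conn.cursor()
try:
    # Stage the numbered output files (file001.csv, file002.csv, ...) into the
    # table's stage. This pattern covers file001 through file099 without
    # grabbing the original file; widen it if you produced more.
    cur.execute(r"PUT file://C:\Users\you\awesome_coding\file0*.csv @%YOUR_TABLE")
    # Copy the staged files into the table, telling Snowflake which
    # delimiter the script settled on (pipe, in this example).
    cur.execute(
        "COPY INTO YOUR_TABLE FROM @%YOUR_TABLE "
        "FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)"
    )
finally:
    cur.close()
    conn.close()
That said, the "Load Table" utility works just fine for a handful of files under the limit, so use whichever route feels less painful.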
As always, feel free to reach out to us to learn more about migrating to Snowflake, data warehouses, or just the magic of SQL (or Python). Or you could always drop us a line in the comments.
Stay safe. Keep coding.