Test Driven Development w/ Apache Spark

Published on June 01, 2017

When I first started building data applications with Spark, I was happy to see that its developers had thought about testing from the start. However, it wasn’t obvious how best to use a test SparkContext to catch real-world problems. In this blog post, we’ll explore how to use ScalaTest to write RSpec-style tests for Spark applications.

I come from the Ruby world, where I was spoiled by amazing tools such as FactoryGirl and RSpec. Before diving into Scala, I only had a little Java experience, but Scala brought together the best of dynamic languages, strict types, and the JVM. It felt like a fast Ruby. That led me to ScalaTest, a testing framework for Scala. In this post, we’ll be looking at its WordSpec testing style.

We first must decide on a common directory structure for applications. This is the structure I use:

Notice src/test/samples. This is where we put our data samples, or factories. It’s not as full-fledged as FactoryGirl for Ruby, but it’ll do. This folder should hold small samples of your data that you can run automated tests against.

First, we’ll start by writing a Spec helper that’ll take care of initializing the Spark context and copying sample files to a temporary directory. It will also delete the temporary directory after each test. This is important so that you don’t get errors or erroneously read data from a previous test.
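A helper along these lines could look like the sketch below. It is only an illustration of the behaviour described above, not the original implementation: the trait body, the `samplesDir` location, the `local[2]` master, and the field names `sc` and `tmpDir` are all assumptions. It expects Spark and ScalaTest on the test classpath.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterEach, Suite}

// Sketch of a spec helper: mix it into any ScalaTest suite that needs
// a SparkContext and a fresh copy of the sample files for each test.
trait SparkHelper extends BeforeAndAfterEach { this: Suite =>

  // Assumed location of the samples, relative to the project root.
  protected val samplesDir: File = new File("src/test/samples")

  protected var sc: SparkContext = _
  protected var tmpDir: File = _

  override def beforeEach(): Unit = {
    // A new temporary directory per test, so tests never see each other's output.
    tmpDir = Files.createTempDirectory("spark-spec-").toFile
    if (samplesDir.isDirectory) copyTree(samplesDir, tmpDir)
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("spec"))
    super.beforeEach()
  }

  override def afterEach(): Unit =
    try super.afterEach()
    finally {
      if (sc != null) sc.stop()
      if (tmpDir != null) deleteTree(tmpDir)
    }

  // Recursively copies a file or directory.
  private def copyTree(from: File, to: File): Unit =
    if (from.isDirectory) {
      to.mkdirs()
      Option(from.listFiles).getOrElse(Array.empty[File])
        .foreach(child => copyTree(child, new File(to, child.getName)))
    } else {
      Files.copy(from.toPath, to.toPath, StandardCopyOption.REPLACE_EXISTING)
    }

  // Recursively deletes a file or directory.
  private def deleteTree(file: File): Unit = {
    Option(file.listFiles).foreach(_.foreach(deleteTree))
    file.delete()
  }
}
```

Because this sketch starts and stops a context around every test, suites that mix it in are best run one at a time so that two contexts are never alive in the same JVM.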

I like to organize projects into tasks, services, and models.

Tasks are the entry points for any given application.

Imagine you have a Spark application that runs different types of jobs: each job gets its own task.
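One way to picture this is a single `main` that looks up a task by name and hands it the remaining arguments. The sketch below is plain Scala with no Spark dependency; the `Task` trait, the `ImportTask` and `ReportTask` objects, and the `Main.dispatch` method are all hypothetical names made up for illustration.

```scala
// A task is an entry point: it receives the remaining command-line arguments.
trait Task {
  def run(args: Seq[String]): String
}

// Hypothetical tasks. Real ones would build a SparkContext and call services.
object ImportTask extends Task {
  def run(args: Seq[String]): String = s"import ${args.mkString(",")}"
}

object ReportTask extends Task {
  def run(args: Seq[String]): String = s"report ${args.mkString(",")}"
}

object Main {
  private val tasks: Map[String, Task] =
    Map("import" -> ImportTask, "report" -> ReportTask)

  // Picks the task named by the first argument and passes it the rest.
  def dispatch(args: Seq[String]): Either[String, String] = args match {
    case name +: rest => tasks.get(name).map(_.run(rest)).toRight(s"Unknown task: $name")
    case _            => Left("Usage: <task> [args...]")
  }

  def main(args: Array[String]): Unit = dispatch(args.toSeq) match {
    case Right(summary) =>
      println(summary)
    case Left(error) =>
      System.err.println(error)
      sys.exit(1)
  }
}
```

Keeping the lookup in `dispatch` rather than in `main` means the routing itself can be tested without starting a JVM process per case.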

Services are small units that follow the Single Responsibility Principle. Each one has only one public method, should have a name that ends with a verb, and does one thing only.

Models are just like MVC or Ruby on Rails models. They are basically case classes that will define your problem domain.
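As a rough illustration of how a model and a service fit together, here is a minimal sketch in plain Scala. The `PageView` model, the `PageViewsCount` service, and its `call` method are invented for this example and are not taken from a real project.

```scala
// Model: a case class describing one record in the problem domain.
final case class PageView(userId: String, url: String)

// Service: one public method, one responsibility, and a name ending in a verb.
object PageViewsCount {
  def call(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.url).map { case (url, hits) => url -> hits.size }
}
```

Calling `PageViewsCount.call` with two views of `/a` and one of `/b` yields a map with `/a -> 2` and `/b -> 1`.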

I also like to use the repository and adapter patterns extensively. The point of these patterns is to enable dependency injection, which both makes the code more generic and makes automated testing easier.
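To make the dependency injection point concrete, the sketch below passes a repository into a service through its constructor, so a test can swap in an in-memory implementation. Every name here (`User`, `UserRepository`, `InMemoryUserRepository`, `UserNamesList`) is hypothetical and exists only for illustration.

```scala
final case class User(id: Int, name: String)

// Repository: hides where the data comes from behind a small interface.
trait UserRepository {
  def findAll(): Seq[User]
}

// Test implementation. A production one might read from S3 or HDFS instead.
final class InMemoryUserRepository(users: Seq[User]) extends UserRepository {
  def findAll(): Seq[User] = users
}

// The service only knows the interface, so any repository can be injected.
final class UserNamesList(repository: UserRepository) {
  def call(): Seq[String] = repository.findAll().map(_.name).sorted
}
```

In a spec, the service can then be built with `new UserNamesList(new InMemoryUserRepository(...))`, with no external storage involved.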

Let’s create a very simple service that reads files from one folder and writes them to another.
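A minimal sketch of such a service might look like this. The class name `FilesCopy` and the method name `call` are assumptions, and treating the files as plain text lines is a simplification; the original service may have read them differently.

```scala
import org.apache.spark.SparkContext

// Hypothetical service: reads every text file under inputPath and
// writes the lines back out under outputPath.
class FilesCopy(sc: SparkContext) {
  def call(inputPath: String, outputPath: String): Unit =
    sc.textFile(inputPath).saveAsTextFile(outputPath)
}
```

Note that `saveAsTextFile` writes a directory of part files rather than a single file, and it fails if the output directory already exists. That is one more reason to start every test from a clean temporary directory.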

Now, we want to develop and test this code without having to spin up a full Spark cluster and waste time waiting for things to fail. We’ll want to make sure that the code not only compiles, but that it works within a Spark context. For this purpose, we’ll write a spec to test the results of this service.

First we’ll want to add some sample data. In this case we’ll create src/test/samples/my.bucket/sample.json.

Then we create the spec accordingly.
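The sketch below shows roughly what a WordSpec-style spec for such a service could look like. To keep it compilable on its own, it defines a hypothetical `FilesCopy` service inline and uses a compact `beforeEach`/`afterEach` fixture that stands in for the SparkHelper described earlier. It also writes a made-up two-line sample file instead of copying src/test/samples. The imports assume ScalaTest 3.1 or later; on 3.0.x the equivalents are `org.scalatest.WordSpec` and `org.scalatest.Matchers`.

```scala
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.BeforeAndAfterEach
import org.scalatest.matchers.should.Matchers
import org.scalatest.wordspec.AnyWordSpec

// Hypothetical service under test: copies text files between folders.
class FilesCopy(sc: SparkContext) {
  def call(inputPath: String, outputPath: String): Unit =
    sc.textFile(inputPath).saveAsTextFile(outputPath)
}

class FilesCopySpec extends AnyWordSpec with Matchers with BeforeAndAfterEach {
  private var sc: SparkContext = _
  private var tmpDir: File = _

  // Compact stand-in for SparkHelper: temp directory, sample file, local context.
  override def beforeEach(): Unit = {
    tmpDir = Files.createTempDirectory("spark-spec-").toFile
    val bucket = new File(tmpDir, "my.bucket")
    bucket.mkdirs()
    // Made-up sample content, only here so the example runs without external files.
    Files.write(new File(bucket, "sample.json").toPath,
      "{\"id\":1}\n{\"id\":2}\n".getBytes(UTF_8))
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("spec"))
    super.beforeEach()
  }

  override def afterEach(): Unit =
    try super.afterEach()
    finally {
      if (sc != null) sc.stop()
      deleteTree(tmpDir)
    }

  private def deleteTree(file: File): Unit = {
    Option(file.listFiles).foreach(_.foreach(deleteTree))
    file.delete()
  }

  "FilesCopy" when {
    "given a folder of sample files" should {
      "write every input line to the output folder" in {
        val inputPath = new File(tmpDir, "my.bucket").getPath
        val outputPath = new File(tmpDir, "output").getPath

        new FilesCopy(sc).call(inputPath, outputPath)

        val lines = sc.textFile(outputPath).collect().sorted.toSeq
        lines shouldBe Seq("{\"id\":1}", "{\"id\":2}")
      }
    }
  }
}
```

The nested `when` / `should` / `in` blocks play the same role as `describe` / `context` / `it` in RSpec.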

Notice how we point inputPath and outputPath to the temporary folder created by SparkHelper. This folder will contain any samples you placed in src/test/samples. In a production environment, you would change the paths to point at S3 or HDFS.

Now you can develop and test your Spark application like any other Scala application. By using RSpec-style testing, we can borrow practices that have worked well in Ruby projects.