Tue 01 November 2016 Nils Amiet

conference spark big data

In October this year I attended Spark Summit Europe 2016 in Brussels, a conference where Apache Spark enthusiasts meet up. I've been using Spark for nearly a year now on multiple projects and was delighted to see so many Spark users at Square Brussels.

Spark Summit Europe 2016 photo  

There were three training sessions to choose from on the first day. I went for “Exploring Wikipedia with Spark (Tackling a unified case)”.

The class was taught in Scala and Databricks notebooks were used. Databricks is a cloud platform that lets data scientists use Spark without having to set up or manage a cluster themselves. Databricks uses AWS as its backend. Clusters can be started and attached to notebooks, from which code is executed on the attached cluster.

The class started with a recap of the basics, covering multiple APIs, including RDDs, Dataframes and the new Datasets. We used publicly available Wikipedia datasets and leveraged Spark SQL, Spark Streaming, GraphFrames, UDFs and machine learning algorithms. I was impressed to see how easy it was to run code snippets on the Databricks platform and get insights into the data.

Another great feature is the support for mixing languages in a notebook. For instance, a UDF can be defined and registered in Python and then used in Scala, as sketched below. The other two trainings which I wasn't able to attend were “Apache Spark Essentials (Python)” and “Data Science with Apache Spark”.
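
The cross-language trick works through the Spark SQL function registry: a UDF registered under a name in one language becomes callable from SQL in another. Here is a minimal Scala-only sketch of the registration side; the function and view names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("udf-sketch")
  .master("local[*]")                       // local run, for illustration only
  .getOrCreate()

// Register a UDF under a name; once registered it is callable from Spark SQL.
// In a Databricks notebook, a UDF registered like this in a Python cell
// can be invoked the same way from a Scala cell.
spark.udf.register("normalizeTitle", (title: String) => title.trim.toLowerCase)

// Hypothetical temporary view, only there to give the query something to run on.
spark.range(1).createOrReplaceTempView("demo")
spark.sql("SELECT normalizeTitle(' Apache_Spark ') AS title FROM demo").show()

spark.stop()
```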

The talks

The following days were conference days. Each day usually started with keynotes, followed by three or four talks to choose from every 30 minutes. I will highlight some of the talks and keynotes I attended.

Simplifying Big Data Applications with Apache Spark 2.0

Spark 2.0 was released and brings many improvements over the 1.6 branch, namely:

  • Performance improvements with whole-stage code generation and vectorization
  • Unified API: a Dataframe is now simply an alias for Dataset[Row]
  • The new SparkSession single entry point, which subsumes the old SQLContext and HiveContext and wraps the SparkContext (a minimal sketch follows below)
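
A minimal sketch of the new entry point, assuming a local run; the input file name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.0: a single entry point instead of juggling several contexts.
val spark = SparkSession.builder()
  .appName("spark-2-entry-point")
  .master("local[*]")                      // local run, for illustration only
  .getOrCreate()

// The older pieces are still reachable from the session when needed.
val sc = spark.sparkContext                // underlying SparkContext

// Hypothetical input file, just to show that reading and SQL go through the session.
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT count(*) FROM people").show()

spark.stop()
```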

The Next AMPLab: Real-Time, Intelligent, and Secure Computing

Spark was born at AMPLab. We were shown what projects AMPLab is currently working on and thus what can be expected for Spark in the next five years. They currently have two main projects: Drizzle and Opaque. Drizzle aims at reducing latency in Spark Streaming, while Opaque is an attempt at improving security in Spark, for instance by protecting against access pattern attacks.

Spark's Performance: The Past, Present, and Future

Performance in Spark 2.0 is improved with whole-stage code generation, a new technique which will optimize the code of the whole pipeline and can boost performance by one order of magnitude in some cases. Another technique used to improve performance is vectorization, or in other words, using an in-memory columnar format for faster data access. Databricks published a blog post discussing this.  
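
As a rough way to observe this from the driver side, the physical plan printed by explain() in Spark 2.0 marks operators that run inside whole-stage generated code with an asterisk. A minimal sketch, assuming a local session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-check")
  .master("local[*]")
  .getOrCreate()

// A simple aggregation over a generated range of numbers.
val totals = spark.range(0, 1000000).selectExpr("sum(id) AS total")

// Operators covered by whole-stage code generation show up
// prefixed with '*' in the physical plan.
totals.explain()
totals.show()

spark.stop()
```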

How to Connect Spark to Your Own Datasource

The author of the MongoDB Spark connector shared his experience writing a Spark connector. There is little official documentation on the subject, so the best way to start writing your own connector is to look at how others did it, for example the Spark Cassandra connector.
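
For a feel of what a connector involves, here is a heavily simplified sketch using Spark's Data Source API (v1) interfaces. The package and class names are hypothetical, and a real connector like the Cassandra one implements far richer interfaces (column pruning, filter pushdown, writes).

```scala
package com.example.dummysource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point that Spark looks up when the data source is referenced by name.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A relation exposes a schema and knows how to produce rows.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", IntegerType)))

  // A real connector would pull data from the external system here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 5).map(Row(_))
}
```

Such a source could then be loaded with spark.read.format("com.example.dummysource").load().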

Dynamic Resource Allocation, Do More With Your Cluster

This technique is useful for shared clusters and jobs of varying load. In this talk we were shown some parameters that can be set for optimizing dynamic resource allocation on a Spark cluster.  
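
The sketch below shows the kind of settings that were discussed, with made-up values; the right numbers depend entirely on the cluster and the workload, and dynamic allocation also requires the external shuffle service.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")              // required by dynamic allocation
  .config("spark.dynamicAllocation.minExecutors", "2")          // illustrative values only
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .getOrCreate()
```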

Vegas, the Missing MatPlotLib for Spark

Two engineers from Netflix presented their project, Vegas. It generates HTML that can be embedded in web pages. Vegas also supports Apache Zeppelin notebooks, has console support and can render to SVG. Vegas uses Vega-Lite underneath and is currently in beta.
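
A minimal plotting sketch, roughly following the project's README at the time; the data is made up, and Vegas also has a module for plotting Spark dataframes directly.

```scala
import vegas._

val plot = Vegas("Country population (sample)").
  withData(Seq(
    Map("country" -> "USA", "population" -> 314),
    Map("country" -> "UK",  "population" -> 64),
    Map("country" -> "DK",  "population" -> 80)
  )).
  encodeX("country", Nom).
  encodeY("population", Quant).
  mark(Bar)

// Renders HTML under the hood; notebooks, the console and SVG output
// are handled by the different frontends.
plot.show
```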

SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster

Groupon announced the availability of SparkLint, a performance debugger for Spark. It can detect over-allocation and provides CPU utilization graphs for Spark jobs. SparkLint is available on GitHub.

Spark and Object Stores — What You Need to Know

This talk gave a set of recommended parameters for working with object stores from Spark. When using the Amazon S3 API, make sure to use the s3a:// scheme in your URLs: it is the only connector that is currently supported, the older s3:// and s3n:// ones being deprecated.
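
A minimal sketch of reading from S3 through s3a, assuming the hadoop-aws module is on the classpath; the credentials and bucket are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-sketch")
  .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   // placeholder credentials
  .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
  .getOrCreate()

// Note the s3a:// scheme rather than the older s3:// or s3n://.
val logs = spark.read.text("s3a://my-bucket/path/to/logs/")
logs.show(5)
```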

Mastering Spark Unit Testing

A few tips and tricks from Blizzard were presented for unit testing Spark jobs. The main idea was that you should not use a Spark context if it's not necessary: most code can be tested outside of a Spark job.

If it's really necessary to run a Spark job in your test, use the local master and run it on your local machine. You can then set breakpoints, for instance in IntelliJ IDEA, and debug both driver and executor code. A nice idea from the speaker was to share the Spark context across unit tests so that initialization is done only once and the tests run faster.
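
A sketch of that last idea, assuming ScalaTest: the trait creates one local SparkSession lazily and every suite mixing it in reuses it.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// One SparkSession per suite, created lazily and shared by all tests.
trait SharedSparkSession extends BeforeAndAfterAll { self: FunSuite =>
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("unit-tests")
    .master("local[2]")          // driver and executors run in-process, so breakpoints work
    .getOrCreate()

  override def afterAll(): Unit = {
    spark.stop()
    super.afterAll()
  }
}

class WordCountSpec extends FunSuite with SharedSparkSession {
  test("counts words") {
    import spark.implicits._
    val counts = Seq("a b", "a").toDS()
      .flatMap(_.split(" "))
      .groupByKey(identity)
      .count()
      .collect()
      .toMap
    assert(counts("a") == 2L)
  }
}
```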

Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs

Flame graphs are a great visualization tool for profiling Spark jobs, finding the most frequent code paths and optimizing bottlenecks. This talk was about the use of flame graphs at CERN to analyze the performance of Spark 1.6 and 2.0.

TensorFrames: Deep Learning with TensorFlow on Apache Spark

Databricks presented TensorFrames, a bridge between Spark and TensorFlow. A TensorFlow graph can be defined and used as a mapper function that can be applied to a Dataframe. TensorFrames can bring a huge performance increase when running on GPUs.  

Apache Spark at Scale: A 60 TB+ Production Use Case

Facebook runs Spark at scale, and during this talk they presented a few tips and tricks they found while working with it. They use flame graphs for profiling and highlighted that the thread dump function available in the Spark UI is useful for debugging.

They gave interesting ideas for configuration:

  • Use off-heap memory
  • Use parallel GC instead of G1GC
  • Tune the shuffle service (number of threads, etc.)
  • Configure the various buffer sizes

They published a blog post about this.
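
To give an idea of where such knobs live, here is a hedged sketch with illustrative values only; the exact settings and numbers come from tuning a real workload, not from this post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-job-tuning-sketch")
  .config("spark.memory.offHeap.enabled", "true")                  // off-heap memory
  .config("spark.memory.offHeap.size", "4g")
  .config("spark.executor.extraJavaOptions", "-XX:+UseParallelGC") // parallel GC instead of G1GC
  .config("spark.shuffle.io.serverThreads", "128")                 // shuffle service threads
  .config("spark.shuffle.file.buffer", "1m")                       // shuffle buffer sizes
  .config("spark.reducer.maxSizeInFlight", "96m")
  .getOrCreate()
```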

Apache Kudu and Spark SQL for Fast Analytics on Fast Data

An engineer from Cloudera presented Apache Kudu, a top level Apache project that sits between HDFS and HBase. The speaker revealed an interesting fact during the Q&A session: Kudu does not store its data on HDFS, but rather on a local file system. Kudu is a data store that has some of the advantages of the Parquet file format: it's a columnar store. Support for Kerberos in Kudu is coming soon.  

SparkOscope: Enabling Apache Spark Optimization Through Cross-Stack Monitoring and Visualization

SparkOscope is an IBM research project. It collects OS-level metrics while Spark jobs are running. It does not guarantee that the metrics correspond to the resource usage of the Spark job alone: if other processes run at the same time as the observed job, their usage is included in the collected metrics as well. SparkOscope is available on GitHub.

Problem Solving Recipes Learned from Supporting Spark

  • OutOfMemoryErrors usually happen when too many objects are allocated. Tune the spark.memory.fraction setting and do not allocate objects in tight loops; be careful when allocating objects inside mapPartitions(), for instance.
  • NoSuchMethodError is usually thrown when there is a library version mismatch. Try to upgrade or downgrade Spark, change the library loading order or shade libraries to fix this.
  • Use spark.speculation to restart slow-running tasks.
  • Use df.explain to debug queries on dataframes (a short example follows this list).
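
A quick sketch combining a few of these recipes, with a local session and a toy query; the values shown are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recipes-sketch")
  .master("local[*]")
  .config("spark.speculation", "true")       // re-launch suspiciously slow tasks
  .config("spark.memory.fraction", "0.6")    // 0.6 is the Spark 2.0 default; tune when OOMs appear
  .getOrCreate()

val df = spark.range(0, 1000).filter("id % 2 = 0")

// Prints the logical and physical plans, handy when a dataframe query misbehaves.
df.explain(true)
```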

Containerized Spark on Kubernetes

There was this excellent talk by William Benton from Red Hat about running Spark on Kubernetes. Don't miss out on this one!  

Spark SQL 2.0 Experiences Using TPC-DS

Very interesting talk and Q&A session about running a large scale benchmark with Spark SQL on a $5.5 million cluster. About 90% of the 99 queries defined in the TPC-DS specification were runnable on Spark SQL.

See the related blog post.  

The talks I missed

There are a few more talks that I couldn't attend but that I will watch as soon as the video streams become available:

  • No One Puts Spark in the Container
  • Hive to Spark—Journey and Lessons Learned
  • Adopting Dataframes and Parquet in an Already Existing Warehouse
  • A Deep Dive into the Catalyst Optimizer  

Closing words

A few interesting things/trends I heard at Spark Summit:

  • Parquet is an efficient, fast, columnar file format
  • Many people use Databricks notebooks, which spare you from having to manage your own cluster
  • There is no single best API (RDDs, Dataframes or Datasets); it's a question of preference
  • Dataframes do not replace RDDs, but they have an advantage one should be aware of: when using Spark SQL and dataframes, the Catalyst optimizer will rewrite your poorly optimized queries. This is not the case when using the low-level RDD API directly (see the sketch after this list).
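
A small sketch of that last point; the Parquet path and column names are made up. The dataframe version of a filter can be rewritten and pushed down by Catalyst, while the RDD version runs exactly as written.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalyst-vs-rdd")
  .master("local[*]")
  .getOrCreate()

// Hypothetical Parquet dataset.
val pageviews = spark.read.parquet("/data/pageviews.parquet")

// Catalyst can push this filter down to the Parquet reader; the
// PushedFilters section of the physical plan shows whether it did.
pageviews.filter("views > 1000").select("title").explain()

// The equivalent RDD code is executed literally: every row is read and
// filtered one by one, with no automatic query rewriting.
val popular = pageviews.rdd.filter(row => row.getAs[Long]("views") > 1000)
```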

Presentation slides and recordings from the event will be available on the Spark Summit website by November 4.

Update: cross-posted on research.kudelskisecurity.com