- Relational data set : TPC-DS : SQL queries that emulate data-I/O-intensive operations
- Malware data set for classification : https://www.kaggle.com/c/malware-classification : ML for memory/core-intensive operations
Performance Monitoring Tools
- Sparklens - A profiling tool for Spark with a built-in Spark Scheduler simulator.
- SparkMeasure simplifies the collection and analysis of Spark performance metrics. It is also intended as a working example of how to use Spark Listeners for collecting Spark task metrics data.
- Sparklint provides advanced metrics and visualization about your Spark application's resource utilization.
- Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Hadoop and Spark.
About Spark performance in general
- Tuning Apache Spark Jobs the Easy Way: Web UI Stage Detail View
Shows how to identify some common Spark issues the easy way: by looking at a particularly informative graphical report that is built into the Spark Web UI.
- Best practices for successfully managing memory for Apache Spark applications on Amazon EMR
Best practices to prevent memory-related issues with Apache Spark on Amazon EMR.
- Tips and Best Practices to Take Advantage of Spark 2.x
Gives a quick overview of what changes were made in 2.x and then some tips to take advantage of these changes.
- How does Facebook tune Apache Spark for Large-Scale Workloads?
A writeup based on the original Spark Summit presentation on "Tuning Apache Spark for Large-Scale Workloads" - by Gaoxiang Liu
- Apache Spark: core concepts, architecture and internals
Covers core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver.
- Apache Spark @Scale: A 60 TB+ production use case
"We describe our experiences and lessons learned while scaling Spark to replace one of our Hive workloads."
- Apache Spark - Performance
Processes the London Cycle Hire data into two separate sets, Weekends and Weekdays, grouping the data into smaller subsets for further processing (a common business requirement), and shows how Spark can help with the task (paraphrased).
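The weekend/weekday split described in that article is a simple partition-by-predicate; as a pure-Python stand-in for the Spark job (the record layout and dates here are made up for illustration), it amounts to:

```python
from datetime import date

def split_weekend_weekday(rides):
    """Partition ride records into weekend and weekday subsets,
    mirroring the grouping task described in the article.
    Each record is a (ride_date, duration_minutes) pair."""
    weekend, weekday = [], []
    for ride_date, duration in rides:
        # weekday() returns 5 for Saturday and 6 for Sunday
        (weekend if ride_date.weekday() >= 5 else weekday).append((ride_date, duration))
    return weekend, weekday

# A Saturday ride and a Monday ride land in different subsets.
rides = [(date(2019, 6, 1), 12), (date(2019, 6, 3), 30)]
weekend, weekday = split_weekend_weekday(rides)
```

In Spark the same split would typically be expressed as two `filter` operations (or a `partitionBy` on write), each producing an independently processable subset.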
Shuffling in Spark
- Apache Spark Shuffles Explained In Depth
A general discussion of how shuffles work in Spark.
- Understand the Shuffle Component in Spark-core
- What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming? (Stackoverflow)
- Spark performance optimization: shuffle tuning
A detailed explanation of how shuffle works, a description of the relevant configuration parameters, and tuning recommendations for those parameters.
- Optimizing Shuffle Performance in Spark (PDF link)
"we identify the bottlenecks in the execution of the current design, and propose alternatives that solve the observed problems. We evaluate our results in terms of application level throughput."
- Shuffling, Partitioning, and Closures (PDF link)
Parallel Programming and Data Analysis
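The core mechanic the shuffle articles above describe, hash-partitioning map output by key so each reducer sees all values for its keys, can be sketched as a toy pure-Python model (not Spark's actual implementation; the function name and data are illustrative only):

```python
from collections import defaultdict

def hash_shuffle(map_outputs, num_partitions):
    """Toy model of a hash shuffle: each map task's (key, value)
    records are bucketed by hash(key) % num_partitions, so every
    record with the same key ends up in the same reduce partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    # Map side: each task writes its records into per-partition buckets.
    for task_records in map_outputs:
        for key, value in task_records:
            buckets[hash(key) % num_partitions][key].append(value)
    # Reduce side: each partition now holds all values for its keys.
    return [dict(bucket) for bucket in buckets]

# Two "map tasks" emit word counts; the shuffle groups equal keys.
partitions = hash_shuffle([[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]],
                          num_partitions=2)
```

The expensive part in real Spark is everything this sketch elides: serializing the buckets to disk on the map side and fetching them over the network on the reduce side, which is why the tuning guides above focus on partition counts, spill buffers, and serialization.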
- Tuning Spark applications
Discusses managing CPU and memory resources in Spark applications.
- Deep Understanding of Spark Memory Management Model
Focuses on memory management of Spark Executor.
- Why your Spark apps are slow or failing: Part 1 memory management
Covers some of the most common reasons why a Spark application fails or slows down; the first and most common is poor memory management (paraphrased).
- Beginner’s Configuration Guide for Spark (IBM Analytics Engine)
"In this blog, we focus on tips for configuring Spark clusters, which can be tricky to configure. In our experience, the default settings don’t work very well (at least, for sparklyr — an R interface for Apache Spark), and changing any of Spark’s many parameters at random often only makes things worse."
Focuses on the IBM Analytics Engine.