  1. Relational data set: TPC-DS (SQL queries that emulate data I/O-intensive operations)
  2. Malware data set for classification: https://www.kaggle.com/c/malware-classification (ML workload for memory/core-intensive operations)

Performance Monitoring Tools

  • Sparklens is a profiling tool for Spark with a built-in Spark scheduler simulator.
  • SparkMeasure simplifies the collection and analysis of Spark performance metrics. It is also intended as a working example of how to use Spark Listeners for collecting Spark task metrics data.
  • Sparklint provides advanced metrics and visualization about your Spark application's resource utilization.
  • Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Hadoop and Spark.

About Spark performance in general

Shuffling in Spark

Memory Management