Spark 2.0 SQL source code tour part 1 : Introduction and Catalyst query parser

by Bipul Kumar

PS: Cross posted from

While digging through Apache Spark SQL’s source code, I realized that a close look at the implementation is valuable for anyone interested in how a SQL-based data processing engine works. Spark SQL has three major API modules: the Data Source API, the Catalyst API, and the Catalog API. This series of posts concentrates mostly on Catalyst, walking through the query processing stages (parsing, binding, validation, optimization, code generation, and query execution) to highlight the entry points for further exploration.

PREREQUISITE: Overview of Spark Catalyst. You can start with this link.

NOTE: Code snippets in these posts refer to Spark SQL 2.0.1.

As we know, since 2.0 SparkSession is the new entry point to the DataFrame and Dataset APIs. Each SparkSession is associated with a SessionState, which holds the session-scoped components (parser, analyzer, optimizer, catalog) formerly reached through SQLContext.
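To make these stages concrete, here is a minimal sketch (my own illustration, not code from the Spark source) showing how the `QueryExecution` attached to every DataFrame exposes the plan produced by each Catalyst stage. It assumes a local Spark 2.0.x setup, e.g. a `spark-shell` session:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the 2.0 entry point; building one sets up a SessionState.
val spark = SparkSession.builder()
  .appName("catalyst-tour")
  .master("local[*]")
  .getOrCreate()

val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")

// QueryExecution exposes the intermediate plans from each Catalyst stage.
val qe = df.queryExecution
println(qe.logical)        // unresolved logical plan (after parsing)
println(qe.analyzed)       // resolved logical plan (after binding/validation)
println(qe.optimizedPlan)  // after the Catalyst optimizer rules
println(qe.executedPlan)   // physical plan ready for execution

spark.stop()
```

Printing each of these plans for a simple query is a quick way to see the pipeline described above before diving into the parser itself.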

Investigations on Spark Join

by Sachin Tyagi

(Note: this post is extracted from an email on the topic and has been posted retroactively.)

TL;DR: In some cases, there are concrete things you can do to improve RDD join performance in Spark.
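As one illustration of the kind of tuning involved (a generic sketch of a standard technique, not code taken from the original email): if both pair RDDs are partitioned with the same partitioner before the join, the join itself does not need to re-shuffle either side.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("join-demo").setMaster("local[*]"))

// Co-partition both sides with the same partitioner up front.
val partitioner = new HashPartitioner(8)

val left = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  .partitionBy(partitioner)  // shuffled once here, then reusable
  .cache()                   // worthwhile if left is joined repeatedly

val right = sc.parallelize(Seq((1, "x"), (2, "y")))
  .partitionBy(partitioner)

// Because both sides share the partitioner, join() avoids a shuffle.
val joined = left.join(right)
joined.collect().foreach(println)

sc.stop()
```

Caching the pre-partitioned side pays off when it participates in several joins, since the shuffle cost is paid only once.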