Spark 2.0 SQL source code tour, part 1: Introduction and the Catalyst query parser

by Bipul Kumar

PS: Cross-posted from blog.imaginea.com

While digging through Apache Spark SQL's source code, I realized that reading the source is a must for anyone interested in the implementation details of a SQL-based data processing engine. Spark SQL has three major API modules: the Data Source API, the Catalyst API, and the Catalog API. This series of posts concentrates mostly on Catalyst, correlating the query processing stages (parsing, binding, validation, optimization, code generation, and query execution) with the source to highlight the entry points for further exploration.
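Those stages are also visible from user code: every Dataset carries a QueryExecution, and calling explain with the extended flag prints the output of each Catalyst phase in order. Here is a minimal sketch, assuming a spark-shell session where spark is the pre-built SparkSession (the sample data and the view name t are made up for illustration):

```scala
// spark-shell imports this automatically; listed here for completeness.
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.createOrReplaceTempView("t")

// Prints one plan per stage: the parsed logical plan, the analyzed
// (bound and validated) plan, the optimized logical plan, and the
// physical plan chosen for execution.
spark.sql("SELECT id FROM t WHERE id > 1").explain(extended = true)
```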

PREREQUISITE: An overview of Spark Catalyst. You can start with this link.

NOTE: Code snippets in these posts refer to Spark SQL 2.0.1.

As we know, since 2.0 SparkSession is the new entry point for the DataFrame API (now a type alias for Dataset[Row]). Each SparkSession is associated with a SessionState, which holds session-scoped state such as the parser, analyzer, and optimizer, and it also wraps a SQLContext for backward compatibility.
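A short sketch of those relationships in the public 2.0 API (SessionState itself is private[sql], so it only shows up indirectly through the objects it builds; the app name below is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("entry-point-demo")
  .master("local[*]")
  .getOrCreate()

// SQLContext is kept for backward compatibility; it simply wraps
// the SparkSession that created it.
val sqlContext = spark.sqlContext
println(sqlContext.sparkSession eq spark) // true

// Each Dataset carries a QueryExecution, built by the session's
// SessionState; its toString dumps every plan in the pipeline.
val ds = spark.range(5)
println(ds.queryExecution)
```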