Netezza on Spark

In this blog we will see how to use Spark to read tables from Netezza.

For background on the YARN capacity scheduler, see https://hortonworks.com/blog/yarn-capacity-scheduler/

  • Queues come into the picture when you have a production cluster and want to limit the resources available to each queue, so that teams can work independently without overloading the cluster (see the configuration sketch after this list)

  • To submit a Spark job to a specific queue, pass --conf spark.yarn.queue=<name>
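For illustration, queues are declared in YARN's capacity-scheduler.xml. A minimal sketch, assuming a gsnmon queue under root (the same queue used in the spark-shell command below); the 30/70 capacity split is a made-up example, not a recommendation:

<!-- capacity-scheduler.xml sketch: declare a gsnmon queue alongside default -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,gsnmon</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.gsnmon.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>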

# Make the Netezza JDBC driver visible on the classpath
export CLASSPATH=/path/to/nzjdbc3.jar

spark-shell --master yarn --conf spark.yarn.queue=root.gsnmon --packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 --driver-class-path /opt/mapr/sqoop/sqoop/lib/nzjdbc3.jar --jars /opt/mapr/sqoop/sqoop/lib/nzjdbc3.jar
# It is mandatory to include the driver path and jar when starting the shell
// Set up the connection options for the Netezza JDBC source
val options = Map("url" -> "jdbc:netezza://hostname:port/db_name",
        "user" -> "unser_name",
        "password" -> "password",
        "dbtable" -> "schema.table_name",
        "driver" -> "org.netezza.Driver",
        "numPartitions" -> "4")

// Read the table through Spark's generic JDBC data source
val df = spark.sqlContext.read.format("jdbc").options(options).load()
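One note on the options above: with Spark's generic JDBC source, "numPartitions" only takes effect when a partition column and its bounds are also supplied; otherwise the table is read in a single partition. A minimal sketch of a parallel read, assuming ACT_RSLT_SEQ_ID is a numeric column and using placeholder bounds:

// numPartitions alone is ignored by the generic JDBC source; a partition
// column plus lower/upper bounds are required for a parallel read
val partitionedOptions = options ++ Map(
  "partitionColumn" -> "ACT_RSLT_SEQ_ID", // assumed numeric column
  "lowerBound" -> "1",                    // placeholder minimum key value
  "upperBound" -> "1000000")              // placeholder maximum key value
val dfParallel = spark.sqlContext.read.format("jdbc").options(partitionedOptions).load()

Alternatively, the spark-netezza package loaded at launch registers its own data source (com.ibm.spark.netezza), which can be swapped in for "jdbc" so the connector handles partitioning natively.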

// Register the DataFrame as a temporary table for SQL queries
df.registerTempTable("table_name") // schema-qualified names (schema.table_name) are not allowed in Spark
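On Spark 2.x, registerTempTable is deprecated in favor of createOrReplaceTempView; the equivalent call is:

// Spark 2.x replacement for the deprecated registerTempTable
df.createOrReplaceTempView("table_name")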

val df1 = spark.sqlContext.sql("SELECT '' AS record_type, ACT_RSLT_SEQ_ID, trim(TRANS_PEGA_KEY_ID) AS TRANS_PEGA_KEY_ID, trim(ACT_RSLT_CD) AS ACT_RSLT_CD, NEXT_REV_DT FROM ROADS_ACT_RSLT_DW") // example query

// Store the results as CSV
df1.write.format("csv").option("header", "true").save("hdfs:///store/to/path/")
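One caveat: the built-in "csv" writer is only available from Spark 2.0 onward. On a Spark 1.x build (which the Scala 2.10 connector above suggests), the Databricks spark-csv package provides the same capability; a sketch, assuming the package is added at launch with --packages com.databricks:spark-csv_2.10:1.5.0:

// Spark 1.x equivalent using the spark-csv package's data source name
df1.write.format("com.databricks.spark.csv").option("header", "true").save("hdfs:///store/to/path/")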
About the author: Mageswaran Dhandapani ("Mages"), Data Science and Engineering team.