Writing data from Spark into Greenplum via JDBC

In this section, you can write data from Spark DataFrame into Greenplum table by using JDBC driver.

  1. Run the script under scripts/download_postgresql.sh to download postgresql jar to the directory ‘scripts’.
     $ scripts/download_postgreql.sh
     ...
     HTTP request sent, awaiting response... 200 OK
     Length: 713037 (696K) [application/java-archive]
     Saving to: ‘postgresql-42.1.4.jar’
     postgresql-42.1.4.jar         100%[=================================================>] 696.33K   850KB/s    in 0.8s
     2017-09-24 20:59:25 (850 KB/s) - ‘postgresql-42.1.4.jar’ saved [713037/713037]
  1. Make sure your spark shell is loaded the Postgresql jar.
  1. Determine the number of records in the “basictable” table by using psql command.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
  1. Configure JDBC URL and connection Properties and use DataFrame write operation to write data from Spark into Greenplum. You can use different write mode
scala> :paste
// Entering paste mode (ctrl-D to finish)
val jdbcUrl = s"jdbc:postgresql://docker_gpdb_1/basic_db?user=gpadmin&password=pivotal"
val connectionProperties = new java.util.Properties()
dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "basictable", connectionProperties = connectionProperties)
// Exiting paste mode, now interpreting.
  1. Verify the write operation is successful by exec into GPDB container and run psql command-line. The total number records in the Greenplum table must be 2x of the original data.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
  1. Next, you can write DataFrame data into an new Greenplum table via append mode.
scala>dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "NEWTable", connectionProperties = connectionProperties)
  1. Run psql commands to verify the new table with new records.
[root@d632f535db87 scripts]# psql -h localhost -U gpadmin -d basic_db -c "\dt"
List of relations
Schema |            Name             | Type  |  Owner
--------+-----------------------------+-------+---------
public | basictable                  | table | gpadmin
public | newtable                    | table | gpadmin
public | spark_7ac1947b17a17725_0_41 | table | gpadmin
public | spark_7ac1947b17a17725_0_42 | table | gpadmin
(4 rows)