Writing data from Spark into Greenplum

In this section, you can write data from Spark DataFrame into Greenplum table by using Greenplum-Spark connector.

  1. Run the script under scripts/download_postgresql.sh to download postgresql jar to the directory ‘scripts’.
     $ scripts/download_postgreql.sh
     ...
     HTTP request sent, awaiting response... 200 OK
     Length: 713037 (696K) [application/java-archive]
     Saving to: ‘postgresql-42.1.4.jar’
     postgresql-42.1.4.jar         100%[=================================================>] 696.33K   850KB/s    in 0.8s
     2017-09-24 20:59:25 (850 KB/s) - ‘postgresql-42.1.4.jar’ saved [713037/713037]
  1. Make sure your spark shell is loaded the Postgresql jar.
  1. Determine the number of records in the “basictable” table by using psql command.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
  1. Configure JDBC URL and connection Properties and use DataFrame write operation to write data from Spark into Greenplum. You can use different write mode
scala> :paste
// Entering paste mode (ctrl-D to finish)
val jdbcUrl = s"jdbc:postgresql://docker_gpdb_1/basic_db?user=gpadmin&password=pivotal"
val connectionProperties = new java.util.Properties()
dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "basictable", connectionProperties = connectionProperties)
// Exiting paste mode, now interpreting.
  1. Verify the write operation is successful by exec into GPDB container and run psql command-line. The total number records in the Greenplum table must be 2x of the original data.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
  1. Next, you can write DataFrame data into an new Greenplum table via append mode.
scala>dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "NEWTable", connectionProperties = connectionProperties)
  1. Run psql commands to verify the new table with new records.
[root@d632f535db87 scripts]# psql -h localhost -U gpadmin -d basic_db -c "\dt"
List of relations
Schema |            Name             | Type  |  Owner
--------+-----------------------------+-------+---------
public | basictable                  | table | gpadmin
public | newtable                    | table | gpadmin
public | spark_7ac1947b17a17725_0_41 | table | gpadmin
public | spark_7ac1947b17a17725_0_42 | table | gpadmin
(4 rows)