Writing data from Spark into Greenplum via JDBC¶
In this section, you can write data from Spark DataFrame into Greenplum table by using JDBC driver.
- Run the script under scripts/download_postgresql.sh to download postgresql jar to the directory ‘scripts’.
$ scripts/download_postgreql.sh
...
HTTP request sent, awaiting response... 200 OK
Length: 713037 (696K) [application/java-archive]
Saving to: ‘postgresql-42.1.4.jar’
postgresql-42.1.4.jar 100%[=================================================>] 696.33K 850KB/s in 0.8s
2017-09-24 20:59:25 (850 KB/s) - ‘postgresql-42.1.4.jar’ saved [713037/713037]
- Make sure your spark shell is loaded the Postgresql jar.
- Determine the number of records in the “basictable” table by using psql command.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
- Configure JDBC URL and connection Properties and use DataFrame write operation to write data from Spark into Greenplum. You can use different write mode
scala> :paste
// Entering paste mode (ctrl-D to finish)
val jdbcUrl = s"jdbc:postgresql://docker_gpdb_1/basic_db?user=gpadmin&password=pivotal"
val connectionProperties = new java.util.Properties()
dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "basictable", connectionProperties = connectionProperties)
// Exiting paste mode, now interpreting.
- Verify the write operation is successful by exec into GPDB container and run psql command-line. The total number records in the Greenplum table must be 2x of the original data.
$ docker exec -it docker_gpdb_1 /bin/bash
[root@d632f535db87 data]# psql -h localhost -U gpadmin -d basic_db -c "select count(*) from basictable"
- Next, you can write DataFrame data into an new Greenplum table via append mode.
scala>dataFrame.write.mode("Append") .jdbc( url = jdbcUrl, table = "NEWTable", connectionProperties = connectionProperties)
- Run psql commands to verify the new table with new records.
[root@d632f535db87 scripts]# psql -h localhost -U gpadmin -d basic_db -c "\dt"
List of relations
Schema | Name | Type | Owner
--------+-----------------------------+-------+---------
public | basictable | table | gpadmin
public | newtable | table | gpadmin
public | spark_7ac1947b17a17725_0_41 | table | gpadmin
public | spark_7ac1947b17a17725_0_42 | table | gpadmin
(4 rows)