Inserting Data into Databricks with Azure Blob
Databricks is a data warehouse solution to contain large amounts of data. They do have a JDBC (Java Database Connectivity) driver, which can insert data into the database. Although, this method can be slow and is only recommend for small data loads. For larger data, it is recommended to load a CSV (or other flat file) into object storage and then execute a COPY INTO command to insert the data into the desired table. The article reviews how load data into Azure Blob, execute a COPY INTO command within Databricks to ingest the data that is within Azure Blob, and finally remove the file from Azure Blob (if required). This method will only process files once in the Azure Blob container. Even if you try to execute them a second time, Databricks retains that the file has already be loaded, and will not re-ingest the file.
Figure 1. General process overview. Branch 1 has a message shape to add test data to the integration.
This article is based on the following two pieces of documentation and adapted for Boomi.
Databricks: Bulk load data into a table with COPY INTO Azure: Bulk load data into a table with COPY INTO
WARNING: The Databricks JDBC driver does not work on the Boomi Public Cloud Atom. It must be used on a private runtime.
Setting Up Azure Blob - Branch 1
I will be walking through each branch of the process above (Figure 1). First, you will need to create an Azure Blob with specific configurations so that it can be used with Databricks. Databricks requires Azure Blob to use the abfss transfer protocol.
Azure Documentation: Create a storage account to use with Azure Data Lake Storage Gen2
Step 1) First, create a storage account. Under the Basic tab, Performance should be Premium and Premium account type set to Block blobs. Under the Advanced tab, Enable Hierarchical namespace should be selected.
Figure 2. Basic tab during Storage Account creation.
Figure 3. Advanced tab during Storage Account creation.
Step 2) Once the storage account is created, go into it and click Containers. Create + Container to create a new one.
Figure 4. Creating a Storage Container
Step 3) Click on Access keys to get the storage account name and primary key. These both will be used in a following next step.
Figure 5. Retrieving Access Key.
Step 4) Get a Shared Access Key (used later to setup the Databricks COPY INTO command). Go to Shared access signature. The extend the end data. The token created will only last for a set amount of time and will need to be updated once expired. Next Generate SAS and connection string. Finally, copy the SAS token. Another option is to use the script within the following article to dynamical create the Shared Access Key during each execution. How to Create a Azure Blob Shared Access Signature (SAS) Token
Figure 6. Acquire SAS token
Step 5) Next setup the Microsoft Azure Blob Storage connector within Boomi. Add the storage account and access key from step 3. The endpoint is core.windows.net. The Authentication Type is Account Access Key.
Figure 7. Microsoft Azure Blob Storage Connection
Step 6) Create a Microsoft Azure Blob Storage Operation CREATE. Add the container name that was created in step 2.
Figure 8. Microsoft Azure Blob Storage Operation
Setting Up Databricks Connectivity - Branch 2
Step 7) Create a table for data to be loaded into. I performed the load by a sample file into the Databricks by clicking Ingest data and it created the table for me with columns that matched the columns within the CSV file that is being loaded into Azure Blob. It’s important that the file being loaded in Azure Blob has matching headers to the columns within the table that you are loading data into. Within the default database, I created a table called databricks_csv to load data into.
Figure 9. Table created within the default database.
Step 8) Next create an access token. Click on Settings -> User Settings -> Generate new token under the Access token tab.
Figure 10. Create Access token.
Step 9) Next to the cluster’s JDBC connection information. Go to Compute -> click on your cluster -> Under Configuration -> Click on Advanced Options -> Click on JDBC/ODBC. Databricks does add some extra information to the JDBC URL. So, remove UID=token;PWD=/<personal-access-token>
from the end. We will use the token information within the user name/password section within Boomi.
Figure 11. Obtaining JDBC Connection details.
Step 10) Download the Databricks JDBC Driver. The current driver is called DatabricksJDBC42.jar. Spark is the legacy driver that is referenced in some places. To download the driver, go to Databricks JDBC Driver and click download.
Step 11) Next set up the connection within Boomi. First, add to driver to Account Libraries. Go to Settings -> Account Information and Setup -> Account Libraries. Add the Databricks jar file. Next create a custom library component. Set the custom library to connector, add DatabricksJDBC42.jar, and deploy to your lowest environment for testing.
Figure 12. Custom Library setup.
Step 12) Set up the Database connector. This example is using the Database connector, since I decided to use the Program Command shape. This would also work with a Database v2 Connector executing the SQL statement below.
Driver Type: Custom
Class Name: com.databricks.client.jdbc.Driver
User Name: token
Password: <Access Token from Step 7>
Connection URL: <URL from Step 8, remember to remove everything after AuthMech>
Figure 13. Databricks Database Connection.
Step 13) Setup a Program Command shape to execute the COPY INTO command. Review the Databricks documentation below for other possible options. The SQL statement below will take all files that have not been synced previously within the Azure Blob container and insert them into the desired table.
In Step 4, there is a reference to a script that will generate an Azure SAS token. The script uses an access key, which does not expire, unlike the SAS token. If you would like to use the script to generate a token, replace
Databricks COPY INTO documentation
Update the 5 parameters below: database, table, container-name, storage-account-name, and sas-token. Also, the FILEFORMAT is based on a CSV, but there are other options that are in the documentation above.
COPY INTO <database>.<table>
FROM 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/'
WITH (
CREDENTIAL (
AZURE_SAS_TOKEN = '<sas-token>'
)
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
'header' = 'true',
'inferSchema' = 'true'
)
An example based on the previous screenshots with an invalid token
COPY INTO default.databricks_csv
FROM 'abfss://abdatabricks@abdatabricks03.dfs.core.windows.net/'
WITH (
CREDENTIAL (
AZURE_SAS_TOKEN = '?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacx&se=2023-05-19T10:47:44Z&st=2022-05-19T02:47:44Z&spr=https,http&sig=notavalidtoken'
)
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
'header' = 'true',
'inferSchema' = 'true'
)
Figure 14. Program Command Shape configuration
Setup Azure Blob Delete Operation - Branch 3 (optional)
Step 14) Databricks will not ingest data a second time. Once the data in Azure Blob has been consumed, it is no longer needed, unless required. The last action is to setup a delete operation. Within branch 1, I have the file name set to a Dynamic Process Property (DPP). Finally, I assume 1 file is being processed at a time, and I use the DPP to know the name of the file to delete.
Article originally posted at Boomi Community.