
Inserting Data into Databricks with Azure Blob

Databricks is a data warehouse solution designed to hold large amounts of data. It does have a JDBC (Java Database Connectivity) driver, which can insert data into the database; however, that method can be slow and is only recommended for small data loads. For larger volumes, the recommended approach is to load a CSV (or other flat file) into object storage and then execute a COPY INTO command to insert the data into the desired table. This article reviews how to load data into Azure Blob, execute a COPY INTO command within Databricks to ingest the data from Azure Blob, and finally remove the file from Azure Blob (if required). This method only processes each file in the Azure Blob container once. Even if you execute the load a second time, Databricks remembers that the file has already been loaded and will not re-ingest it.

General Process Overview

Figure 1. General process overview. Branch 1 has a message shape to add test data to the integration.

This article is based on the following two pieces of documentation and adapted for Boomi.

Databricks: Bulk load data into a table with COPY INTO
Azure: Bulk load data into a table with COPY INTO

WARNING: The Databricks JDBC driver does not work on the Boomi Public Cloud Atom. It must be used on a private runtime.

Setting Up Azure Blob - Branch 1

I will walk through each branch of the process above (Figure 1). First, you need to create an Azure Blob Storage account with a specific configuration so that it can be used with Databricks: Databricks accesses the storage account through the abfss protocol, which requires Azure Data Lake Storage Gen2 (hierarchical namespace enabled).

Azure Documentation: Create a storage account to use with Azure Data Lake Storage Gen2

Step 1) First, create a storage account. Under the Basic tab, set Performance to Premium and Premium account type to Block blobs. Under the Advanced tab, select Enable hierarchical namespace.

Azure Storage Account

Figure 2. Basic tab during Storage Account creation.

Azure Storage Account Data Lake

Figure 3. Advanced tab during Storage Account creation.

Step 2) Once the storage account is created, go into it and click Containers, then + Container to create a new one.

Azure Storage Container

Figure 4. Creating a Storage Container

Step 3) Click Access keys to get the storage account name and primary key. Both will be used in later steps.

Azure Access Key

Figure 5. Retrieving Access Key.

Step 4) Get a shared access signature (SAS) token (used later in the Databricks COPY INTO command). Go to Shared access signature, then extend the end date. The token only lasts for a set amount of time and will need to be updated once it expires. Next, click Generate SAS and connection string. Finally, copy the SAS token. Another option is to use the script in the following article to dynamically create the SAS token during each execution: How to Create an Azure Blob Shared Access Signature (SAS) Token

Azure SAS Token

Figure 6. Acquire SAS token
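
If you prefer to create the SAS token programmatically rather than through the portal, the sketch below shows one way to do it with the azure-storage-blob Python package. This is an illustrative alternative to the portal and the script referenced above, not part of the Boomi process itself; the permissions and expiry are assumptions, and the account name and key are the values from step 3.

# Illustrative sketch only: generate an account-level SAS token with the
# azure-storage-blob package (pip install azure-storage-blob).
from datetime import datetime, timedelta
from azure.storage.blob import (
    AccountSasPermissions,
    ResourceTypes,
    generate_account_sas,
)

sas_token = generate_account_sas(
    account_name="<storage-account-name>",   # from step 3
    account_key="<storage-account-key>",     # from step 3
    resource_types=ResourceTypes(service=True, container=True, object=True),
    permission=AccountSasPermissions(read=True, write=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=24),  # assumed lifetime
)

# The article's example token includes a leading '?';
# generate_account_sas returns the token without one.
print("?" + sas_token)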

Step 5) Next, set up the Microsoft Azure Blob Storage connection within Boomi. Add the storage account name and access key from step 3. The endpoint is core.windows.net, and the Authentication Type is Account Access Key.

Azure Blob Connection

Figure 7. Microsoft Azure Blob Storage Connection

Step 6) Create a Microsoft Azure Blob Storage operation using the CREATE action. Add the container name that was created in step 2.

Azure Blob Operation

Figure 8. Microsoft Azure Blob Storage Operation
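
For reference, the CREATE operation in this branch amounts to writing a file into the container from step 2. Outside of Boomi, a minimal sketch of the same upload with the azure-storage-blob Python package looks like the following; the file and bracketed names are placeholders.

# Illustrative sketch: upload a CSV to the container created in step 2,
# authenticating with the account name and access key from step 3.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential="<storage-account-key>",
)

# service.create_container("<container-name>")  # only if the container does not exist yet
container = service.get_container_client("<container-name>")

with open("databricks_load.csv", "rb") as data:  # hypothetical file name
    container.upload_blob(name="databricks_load.csv", data=data, overwrite=True)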

Setting Up Databricks Connectivity - Branch 2

Step 7) Create a table for the data to be loaded into. I loaded a sample file into Databricks by clicking Ingest data, which created the table for me with columns matching those in the CSV file being loaded into Azure Blob. It is important that the file loaded into Azure Blob has headers that match the columns of the table you are loading data into. Within the default database, I created a table called databricks_csv to load data into.

Databricks Default Database

Figure 9. Table created within the default database.
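
If you would rather define the table yourself instead of letting Ingest data build it, a minimal sketch with the databricks-sql-connector Python package is shown below. The column names are hypothetical placeholders; the only requirement, as noted above, is that they match the headers of the CSV being staged in Azure Blob. The hostname and HTTP path come from the cluster's JDBC/ODBC page (step 9) and the token from step 8.

# Illustrative sketch: create the target table by hand using the
# databricks-sql-connector package (pip install databricks-sql-connector).
# Connection values and column names are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<cluster-http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS default.databricks_csv (
              id INT,
              first_name STRING,
              last_name STRING
            )
        """)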

Step 8) Next create an access token. Click on Settings -> User Settings -> Generate new token under the Access token tab.

Databricks Access Token

Figure 10. Create Access token.

Step 9) Next, get the cluster's JDBC connection information. Go to Compute -> click on your cluster -> Configuration -> Advanced Options -> JDBC/ODBC. Databricks adds some extra information to the JDBC URL, so remove UID=token;PWD=<personal-access-token> from the end. We will supply the token through the user name/password fields within Boomi instead.

Databricks JDBC Connection Details

Figure 11. Obtaining JDBC Connection details.
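
For illustration only, the copied URL generally has a shape like the line below; the workspace host and httpPath here are made up, and your URL's prefix and parameters may differ slightly. The point is simply to keep everything up to and including AuthMech=3 and drop the trailing UID and PWD entries.

jdbc:databricks://adb-0000000000000000.0.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0000000000000000/0000-000000-abcdefgh;AuthMech=3;UID=token;PWD=<personal-access-token>

becomes

jdbc:databricks://adb-0000000000000000.0.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0000000000000000/0000-000000-abcdefgh;AuthMech=3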

Step 10) Download the Databricks JDBC driver. The current driver is DatabricksJDBC42.jar; the Spark driver referenced in some places is the legacy one. To download the driver, go to Databricks JDBC Driver and click download.

Step 11) Next, set up the connection within Boomi. First, add the driver to Account Libraries: go to Settings -> Account Information and Setup -> Account Libraries and add the Databricks jar file. Next, create a Custom Library component, set the custom library type to Connector, add DatabricksJDBC42.jar, and deploy it to your lowest environment for testing.

Custom Library Settings

Figure 12. Custom Library setup.

Step 12) Set up the Database connection. This example uses the Database connector because I decided to use the Program Command shape; it would also work with a Database V2 connector executing the SQL statement below.

Driver Type: Custom
Class Name: com.databricks.client.jdbc.Driver
User Name: token
Password: <Access Token from Step 8>
Connection URL: <URL from Step 9, remember to remove the UID and PWD parameters after AuthMech=3>

Databricks Database Connection

Figure 13. Databricks Database Connection.

Step 13) Set up a Program Command shape to execute the COPY INTO command. Review the Databricks documentation below for other possible options. The SQL statement below takes all files in the Azure Blob container that have not been loaded previously and inserts them into the desired table.

In Step 4, there is a reference to a script that will generate an Azure SAS token. The script uses an access key, which, unlike the SAS token, does not expire. If you would like to use the script to generate the token, replace the '<sas-token>' value with ?. Then, within the variables section of the Program Command shape, add the dynamic document property that was generated by the script.

Databricks COPY INTO documentation

Update the five parameters below: database, table, container-name, storage-account-name, and sas-token. Also, FILEFORMAT is set to CSV here; other file formats are covered in the documentation above.

COPY INTO <database>.<table>
FROM 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/'
WITH (
  CREDENTIAL (
    AZURE_SAS_TOKEN = '<sas-token>'
  )
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)

An example based on the previous screenshots with an invalid token

COPY INTO default.databricks_csv
FROM 'abfss://abdatabricks@abdatabricks03.dfs.core.windows.net/'
WITH (
  CREDENTIAL (
    AZURE_SAS_TOKEN = '?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacx&se=2023-05-19T10:47:44Z&st=2022-05-19T02:47:44Z&spr=https,http&sig=notavalidtoken'
  )
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)

Program Command Shape Configuration

Figure 14. Program Command Shape configuration
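
If you want to verify the statement outside of Boomi before wiring it into the Program Command shape, the sketch below runs the same template with the databricks-sql-connector Python package; the connection values, container, and storage account names are placeholders. The commented COPY_OPTIONS line shows the documented 'force' option, which re-loads files that COPY INTO has already ingested (by default they are skipped, as noted at the start of the article).

# Illustrative sketch: execute the COPY INTO statement and check the row count
# using the databricks-sql-connector package. All bracketed values are placeholders.
from databricks import sql

copy_into = """
COPY INTO default.databricks_csv
FROM 'abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/'
WITH (
  CREDENTIAL (
    AZURE_SAS_TOKEN = '<sas-token>'
  )
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true'
)
-- COPY_OPTIONS ('force' = 'true')  -- uncomment to re-ingest files already loaded
"""

with sql.connect(
    server_hostname="<workspace-hostname>",
    http_path="<cluster-http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(copy_into)
        cursor.execute("SELECT COUNT(*) FROM default.databricks_csv")
        print(cursor.fetchone())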

Setting Up the Azure Blob Delete Operation - Branch 3 (Optional)

Step 14) Databricks will not ingest the same file a second time, so once the data in Azure Blob has been consumed, the file is generally no longer needed. The last action is to set up a DELETE operation. Within branch 1, I set the file name to a Dynamic Process Property (DPP). I assume one file is being processed at a time and use the DPP to determine the name of the file to delete.
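
Outside of Boomi, the equivalent of that DELETE operation is a single call with the azure-storage-blob Python package. A minimal sketch, assuming the same account and container and a hypothetical file name in place of the DPP value:

# Illustrative sketch: remove the processed file from the container once
# COPY INTO has consumed it. In the Boomi process the file name comes from
# the Dynamic Process Property set in branch 1.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential="<storage-account-key>",
)
service.get_container_client("<container-name>").delete_blob("databricks_load.csv")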

Article originally posted at Boomi Community.

Published May 26, 2022
