Parsing HTML in Boomi
Occasionally, HTML data needs to be parsed within Boomi, for example data from an HTML email or from a website. Boomi has no out-of-the-box way to do this. One way to solve the problem is to use Jsoup, a popular Java library for parsing HTML, which is used in this example. This article assumes some familiarity with reading and writing code. Jsoup’s selector syntax is very similar to jQuery’s, and additional documentation is listed at the bottom of the article.
Figure 1. Parse HTML Process Overview.
First, download Jsoup from Maven by clicking on jar next to File. Version 1.15.3 is the current version at the time of writing this article. Add the jar file to your Boomi Account Library, create a Custom Library that includes it, and then deploy the Custom Library to an environment. Documentation on adding jar files is found here.
In this example, we will parse the following HTML email. We will extract the title, email header, and email body, and then map that data to a flat file within Boomi.
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<meta charset="UTF-8">
<title>Welcome to Boomi HTML</title>
</head>
<body>
<table align="center">
<tr>
<th>Email Header</th>
</tr>
<tr>
<td>Email Body</td>
</tr>
</table>
</body>
</html>
The data above will be put into a Message shape for testing, which will be a stand-in for an Email Connector. Next, we will use a Set Properties shape to set a Dynamic Document Property named DDP_CURRENT_DATA to the current data.
Figure 2. DDP_CURRENT_DATA within Data Process Shape.
Within a map, the source profile will be set to a flat file profile with a single element. This placeholder profile is used so that the incoming HTML can pass through the map without causing an exception. The destination profile will be a flat file profile with 3 elements: title, header, and body. In the middle, we will set up a custom function to parse our HTML email.
Figure 3. Map setup.
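To make the destination profile concrete, here is a minimal sketch of the record it is meant to produce for the sample email. Boomi's map generates the real output; the class name `FlatFileRecordSketch`, the comma delimiter, and the double-quote text qualifier are assumptions chosen for illustration and should be matched to the options set on your own flat file profile.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: Boomi's map builds the real flat file output.
class FlatFileRecordSketch {

    // Wrap a field in double quotes only when it contains the delimiter,
    // a double quote, or a line break; embedded quotes are doubled.
    static String escape(String field, char delimiter) {
        boolean needsQuotes = field.indexOf(delimiter) >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the fields in profile order (title, header, body) into one record.
    static String toRecord(List<String> fields, char delimiter) {
        return fields.stream()
                .map(f -> escape(f, delimiter))
                .collect(Collectors.joining(String.valueOf(delimiter)));
    }

    public static void main(String[] args) {
        // Assumed delimiter: comma.
        List<String> fields = Arrays.asList(
                "Welcome to Boomi HTML", "Email Header", "Email Body");
        System.out.println(toRecord(fields, ','));
        // prints: Welcome to Boomi HTML,Email Header,Email Body
    }
}
```

The element order matters more than the delimiter here: the three outputs of the custom function are mapped one-to-one onto title, header, and body.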
Create a custom map function. Within it, add a Get Document Property function and a Custom Script function. The Get Document Property function reads DDP_CURRENT_DATA and passes its value to the script as the html input.
Figure 4. A custom map function to parse the HTML.
The following script was used to parse the title, header, and body.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// ----------------------------
// INPUTS
//   html
// OUTPUTS
//   title
//   header
//   body
// ----------------------------

// Parse the raw HTML string into a Jsoup document
Document doc = Jsoup.parse(html);

// Text of the <title> element
title = doc.select("title").text();

// Text of the <th> cell in the first table row
header = doc.select("table")
    .select("tr")
    .first()
    .select("th")
    .text();

// Text of the <td> cell in the last table row
body = doc.select("table")
    .select("tr")
    .last()
    .select("td")
    .text();
Figure 5. Custom script used to parse HTML.
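It can be convenient to try selectors outside Boomi before pasting them into the custom script. The following is a minimal standalone Java sketch, assuming the Jsoup jar is on the classpath; the class name `HtmlEmailParserSketch` and the `parse` helper are hypothetical. It is a variation on the script above rather than a copy: it uses `selectFirst` with descendant selectors and returns an empty string when an element is missing, instead of chaining `first()` and `last()`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper for testing selectors locally; not part of the Boomi process.
class HtmlEmailParserSketch {

    // Returns {title, header, body}. A missing element yields an empty string.
    static String[] parse(String html) {
        Document doc = Jsoup.parse(html);

        // selectFirst returns null when nothing matches, so guard before calling text().
        Element th = doc.selectFirst("table tr th");
        Element td = doc.selectFirst("table tr td");

        return new String[] {
            doc.title(),
            th == null ? "" : th.text(),
            td == null ? "" : td.text()
        };
    }

    public static void main(String[] args) {
        // Condensed version of the sample email used in this article.
        String html = "<html><head><title>Welcome to Boomi HTML</title></head>"
                + "<body><table align=\"center\">"
                + "<tr><th>Email Header</th></tr>"
                + "<tr><td>Email Body</td></tr>"
                + "</table></body></html>";

        String[] fields = parse(html);
        System.out.println(String.join(" | ", fields));
        // prints: Welcome to Boomi HTML | Email Header | Email Body
    }
}
```

The null checks are a design choice for emails whose layout may vary. In the Boomi script, calling `first()` or `last()` on an empty selection returns null, so the chained call that follows would fail if the table were absent.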
Once complete, connect everything and run the process. Below is an example output from the mapping.
Figure 6. Map output with parsed data.
References
- Jsoup: Javadocs Elements
- Jsoup: Use selector-syntax to find elements
- Jsoup: Use DOM methods to navigate a document
The article was originally posted at Boomi Community.