
Parsing HTML in Boomi

Occasionally you will need to parse HTML data within Boomi, such as the body of an HTML email or a page from a website. Boomi has no out-of-the-box way to do this, but one solution is Jsoup, a popular Java library for parsing HTML, which is used in this example. This article assumes some familiarity with reading and writing code. Jsoup’s selector syntax is very similar to jQuery’s, and additional documentation is provided at the bottom of the article.

Figure 1. Parse HTML Process Overview.

First, download Jsoup from Maven by clicking on jar next to File. Version 1.15.3 is current at the time of writing. Add the jar file to your Boomi Account Library, create a Custom Library that includes it, and then deploy that library to an environment. Documentation on adding jar files is found here.

In this example, we will parse the following HTML email, extracting the title, email header, and email body, and then map that data to a flat file within Boomi.

<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
    <meta charset="UTF-8">
    <title>Welcome to Boomi HTML</title>
</head>
<body>
    <table align="center">
        <tr>
            <th>Email Header</th>
        </tr>
        <tr>
            <td>Email Body</td>
        </tr>
    </table>
</body>
</html>

For testing, the HTML above is placed in a Message shape, which stands in for an Email connector. Next, a Set Properties shape sets a Dynamic Document Property named DDP_CURRENT_DATA to the current data.

Figure 2. DDP_CURRENT_DATA within Set Properties Shape.

In the map, the source profile is a flat file profile with a single element. This minimal profile is used so that the incoming HTML does not cause an exception in the map. The destination profile is a flat file profile with three elements: title, header, and body. Between the two, we will set up a custom function to parse the HTML email.

Figure 3. Map setup.

Create a custom map function and add a Get Document Property function and a Custom Scripting function to it. The Get Document Property function reads DDP_CURRENT_DATA, and its value feeds the script's html input.

Figure 4. A custom map function to parse the HTML.

The following script was used to parse the title, header, and body.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

//----------------------------
// INPUTS
//   html
// OUTPUTS
//   title
//   header
//   body
//----------------------------

// Parse the raw HTML string into a Jsoup document.
Document doc = Jsoup.parse(html);

// Text of the <title> element.
title = doc.select("title").text();
// Text of the <th> cell in the first table row.
header = doc.select("table")
        .select("tr")
        .first()
        .select("th")
        .text();
// Text of the <td> cell in the last table row.
body = doc.select("table")
        .select("tr")
        .last()
        .select("td")
        .text();

Figure 5. Custom script used to parse HTML.

Once complete, connect everything and run the process. Below is an example output from the mapping.

Figure 6. Map output with parsed data.
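
Since the output is only shown as an image, the shape of the mapped record can also be sketched in code. The Java snippet below is a minimal illustration, not Boomi's implementation: it joins the three parsed values into one delimited record. The class name FlatFileRecordSketch, the comma delimiter, and the double-quote text qualifier are assumptions made for illustration; the real delimiter and qualifier depend on the options chosen in the destination flat file profile.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: not part of Boomi or Jsoup. It only illustrates how the
// three mapped elements could be written as one delimited flat-file record.
final class FlatFileRecordSketch {

    // Assumed profile options: comma delimiter, double-quote text qualifier.
    private static final String DELIMITER = ",";
    private static final String QUALIFIER = "\"";

    // Qualify a value only when it contains the delimiter, the qualifier,
    // or a line break; a null value becomes an empty field.
    public static String escape(String value) {
        String v = (value == null) ? "" : value;
        boolean needsQualifier = v.contains(DELIMITER) || v.contains(QUALIFIER)
                || v.contains("\n") || v.contains("\r");
        if (!needsQualifier) {
            return v;
        }
        // Double any embedded qualifier, then wrap the whole value.
        return QUALIFIER + v.replace(QUALIFIER, QUALIFIER + QUALIFIER) + QUALIFIER;
    }

    // Join title, header, and body in the order of the destination profile.
    public static String toRecord(String title, String header, String body) {
        return Stream.of(title, header, body)
                .map(FlatFileRecordSketch::escape)
                .collect(Collectors.joining(DELIMITER));
    }

    public static void main(String[] args) {
        // Values parsed from the sample email in this article.
        System.out.println(toRecord("Welcome to Boomi HTML", "Email Header", "Email Body"));
        // prints: Welcome to Boomi HTML,Email Header,Email Body
    }
}
```

The escape step matters mainly for real emails, where a body cell can easily contain commas or quotes that would otherwise shift the columns of the flat file.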

References

The article was originally posted at Boomi Community.

Published Dec 24, 2022
