
Lessons Learned in Synapse Link

9 March 2024

Photo by Matthew Manuel on Unsplash

Synapse Link Ramblings

After the announcement that the Export to Data Lake add-in would be deprecated for Microsoft Dynamics 365, the focus turned to Synapse Link as the central approach for getting at that valuable system-held data. D365 reporting, while extensive, is still somewhat limited by the functionality built into the system. By exposing the data externally, the data can be modelled and interrogated much more dynamically to allow for significantly improved insights and, in my opinion, a much more cohesive approach to information, reporting and data literacy. So, what are the differences? Well, on the face of it, Synapse Link is a bit simpler to set up, the data seem nicer (to a small degree), the data are structured a bit differently, you can export entities (insert pained expression) and perhaps most usefully, you can export both CRM and ERP data in the same portal. Lots of big things!

Getting Set Up

On the data side, you simply need to ensure that the account with access to Power Platform and the D365 environment is assigned some roles in Azure:

  • Synapse Administrator on the Synapse Workspace (must be configured inside Synapse)
  • Reader on the Resource Group
  • Owner on Synapse Analytics
  • Owner on the Storage Account
  • Storage Blob Data Contributor on the Storage Account

Additionally, a Spark Pool should be configured within Synapse (more on this later). At this point, simply log into the ‘make portal’ in Power Platform (make.powerapps.com) and configure Synapse Link. Ensure the environment is set to the environment that is paired with the D365 F&O instance and (for ease), choose to pin Azure Synapse Link to the menu bar on the left. Configuring a new link is super-straightforward: just click ‘New link’ and select the appropriate resources, being sure to tick the box that enables you to select a Spark Pool. After this, simply select a table to confirm the configuration is working as expected (I’ll typically choose custtable at this point). And that’s it – usually after about 20 minutes you’ll be able to log back into Synapse Analytics and see your pretty new dataverse-based Lake Database, freshly stocked with the (mostly) raw data from custtable in D365.


Photo by Johnyvino on Unsplash

Following the Process

One of the most useful things to understand as a data engineer using Synapse Link is the flow of information. When you click ‘Save’ on a new table export, Synapse Link follows a very predictable track – and that predictability is invaluable when debugging, because it lets you pinpoint exactly where a failure occurred.

The steps, in basic terms, are:

  1. Create a timestamp directory in the storage account container (the container will be named after the database, i.e. dataverse_etc….)
  2. Push .csv files into the timestamp directory, under the appropriate table directory
  3. Shortly thereafter, create another timestamp directory for metadata (optionset, etc.)
  4. Once the previous steps are complete, fire a Spark job to populate the deltalake folder
  5. The Spark job creates Parquet files in the deltalake table directory
  6. The Spark job completes
  7. A short period of time passes, inexplicably, with no database visible
  8. Synapse shows the table in the Lake Database

When you’re exporting that first table, it’s worth watching this process end to end – you can see the data move through each stage and get a good sense of how it operates.
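The steps above can even be sketched as a quick classifier: given the blob paths you see in the container, guess which stage the export has reached. This is purely illustrative – the function name and example paths are mine, and real paths depend on your environment.

```python
def export_stage(blob_paths, table):
    """Rough guess at where a Synapse Link table export has got to,
    based on which files are visible in the storage container."""
    # Step 2: .csv files staged under a timestamp directory for the table
    has_csv = any(f"/{table}/" in p and p.endswith(".csv") for p in blob_paths)
    # Steps 4-5: the Spark job has written Parquet into the deltalake folder
    has_parquet = any(
        p.startswith("deltalake/") and f"/{table}" in p and p.endswith(".parquet")
        for p in blob_paths
    )
    if has_parquet:
        return "spark complete - table should appear in the Lake Database"
    if has_csv:
        return "csv staged - awaiting Spark job"
    return "not started"

# Hypothetical paths for a custtable export
paths = [
    "2024-03-09T09.00.00Z/custtable/part-00001.csv",
    "deltalake/custtable_partitioned/part-0000.snappy.parquet",
]
print(export_stage(paths, "custtable"))
# → spark complete - table should appear in the Lake Database
```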


Photo by Cookie the Pom on Unsplash

The ‘Gotchas’

vCores

This has, so far, been the biggest gotcha I’ve experienced with Synapse Link. The Microsoft documentation suggests a Spark Pool of 5-10 small nodes. While that seems a perfectly reasonable ask on the surface, it becomes much less so once vCore quotas enter the picture. The ‘size’ of a node in a Spark Pool is governed (mostly) by the number of vCores it has; a ‘small’ node, for example, has 4 vCores. By setting up a Spark Pool with 5-10 small nodes, you’re effectively asking for anywhere between 20 and 40 vCores (4 × 5 up to 4 × 10). In many cases, this will be fine. In others, not so much. Each Azure subscription is limited in a number of ways, and one of those limits is a quota on vCores. Some subscription types (pay-as-you-go being one) are limited to 12 vCores by default. Therefore, if you spin up the Spark Pool according to Microsoft’s own documentation, you’ll be requesting a minimum of 20 vCores. This won’t fail, and it won’t throw an error – the pool will set up as normal. Then, when a job is triggered, it will wait for Spark to confirm that the resources are ready, and they quite simply never will be, since 12 vCores is the limit.

The issue presents as a long list of Spark jobs that queue for around 30 minutes and then time out as ‘Cancelled’. You’ll also see a long list of timestamped folders in the data lake where the .csv files are written but never cleaned up by a successful Spark job. Very frustrating! I have a pull request outstanding with Microsoft to help alert people to this – but as yet nothing has been merged into the official documentation.
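The arithmetic is worth spelling out, since the pool happily provisions while the quota quietly blocks every job. A quick sketch (the function name is mine; node sizes per vCore are from the Synapse documentation):

```python
# vCores per node size in a Synapse Spark Pool (Small = 4, Medium = 8, Large = 16)
NODE_VCORES = {"small": 4, "medium": 8, "large": 16}

def pool_vcore_range(node_size, min_nodes, max_nodes):
    """vCores a Spark Pool can demand at its minimum and maximum node counts."""
    per_node = NODE_VCORES[node_size]
    return per_node * min_nodes, per_node * max_nodes

# The documented 5-10 small-node pool, against a default 12-vCore quota:
low, high = pool_vcore_range("small", 5, 10)
print(low, high)   # → 20 40
print(low > 12)    # → True: even the minimum exceeds the quota, so jobs queue forever
```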

Parent Tables

This is an ongoing issue affecting tables (most famously dirpartytable) that have inherited columns from child tables (such as dirorganisation in this example). In Synapse Link, the functionality no longer brings back all the child tables pre-joined. It brings back only the root table – a series of additional tables and some engineering is required to rebuild the table to its historic state. To further complicate matters, some child tables are not supported for row versioning, and therefore cannot be exported at all – so data that used to be available with the parent table is simply not available. You might then choose to create an entity, enable row versioning on it, and export that – but an entity must also meet a fairly extensive list of requirements to be eligible for row versioning and subsequent export. In the majority of cases, around 90%, this will not be an issue – but for some, it could be a showstopper.

Update: In the explicit example of ecoresreleasedengineeringproductversionattributeinstancevalue, we’ve repeatedly seen an error stating that ‘Profile Is In Error State’ when trying to export. It appears, though I’m uncertain, that the name of the table is too long. On creating an entity to mimic the table as a workaround, we experienced the same fault, only to find that the exact same approach with a much shorter entity name was successful. Worth noting as we progress.

Labels and Remodelling

There’s been a fair degree of flip-flopping on this. Some columns that were historically enumerated in the Export to Data Lake functionality are now labels – which renders some historic joins pointless. Similarly, but in the opposite direction, historic labels that may have been applied during the engineering process can now be collected from the optionset and globaloptionset tables, in a way that will feel very familiar to historic users of Synapse Link for Dataverse. In good news, however, for users who have implemented logical views over the top of the Export to Data Lake’s csv file structure: you should be able to access the same tables (different though they may be) using the OPENROWSET functionality in Synapse. Notably, this means that in any existing SQL database in Synapse, you simply need to create a new EXTERNAL DATA SOURCE pointing at the deltalake folder in the data lake and then run a simple select statement over the top:

-- Hypothetical account/container names; the data source points at the deltalake folder
CREATE EXTERNAL DATA SOURCE deltalake
WITH (LOCATION = 'https://<youraccount>.dfs.core.windows.net/<dataversecontainer>/deltalake');

SELECT
    *
FROM
    OPENROWSET(
        BULK '/custtable_partitioned/',
        DATA_SOURCE = 'deltalake',
        FORMAT = 'DELTA'
    ) AS dat

This will read the table with almost no difficulty at all.
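If you’re wrapping many exported tables in views, it can be handy to template that query rather than hand-write it each time. A small sketch – the helper is mine, not part of Synapse, and it assumes the `<table>_partitioned` folder naming seen above holds for every table:

```python
def openrowset_query(table: str, data_source: str = "deltalake") -> str:
    """Template the serverless SQL needed to read one Delta table.
    Assumes the deltalake folder uses '<table>_partitioned' naming."""
    return (
        "SELECT * FROM OPENROWSET(\n"
        f"    BULK '/{table}_partitioned/',\n"
        f"    DATA_SOURCE = '{data_source}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS dat"
    )

print(openrowset_query("custtable"))
```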

Next Steps

There has been a whole swathe of other, lesser issues that I’ve encountered in the short, but intensive, run of adding Synapse Link to a series of accounts. As time passes, Microsoft will continue to improve the experience – but in the short term, get in touch if you have challenges. I’d love to hear other perspectives.