PDF Generation in Azure Functions V2

Wednesday, February 14, 2018

PDF generation.

Yawn.

But, every enterprise application has an “export to PDF” feature.

There are obstacles to overcome when generating PDFs from Azure Web Apps and Functions. The first obstacle is the sandbox Azure uses to execute code. You can read about the sandbox in the “Azure Web App sandbox” documentation. This article explicitly calls out PDF generation as a potential problem. The sandbox prevents an app from using most of the kernel’s graphics API, which many PDF generators use either directly or indirectly.

The sandbox document also lists a few PDF generators known to work in the sandbox. I’m sure the list is not exhaustive (a quick web search will also find solutions using Node), but one library listed indirectly is wkhtmltopdf (open source, LGPLv3). The wkhtmltopdf library is interesting because it is cross-platform: a solution built with .NET Core and wkhtmltopdf should work on Windows, Linux, or Mac.

The Azure Functions Project

For this experiment I used the Azure Functions 2.0 runtime, which is still in beta and has a few shortcomings. However, precompiled projects and the ability to build on .NET Core are both appealing features of v2.

To work with the wkhtmltopdf library from .NET Core I used the DinkToPdf wrapper. This package hides all the P/Invoke messiness, and has friendly options to control margins, headers, page size, etc. All an app needs to do is feed a string of HTML to a Dink converter, and the converter will return a byte array of PDF bits.
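As a rough illustration of those options, here is a minimal sketch of a document with explicit page settings. The `PdfDocuments` class and `Build` method are hypothetical names for this example, and the specific values (A4, 10-unit margins, the header text) are arbitrary choices, not recommendations.

```csharp
using DinkToPdf;

// Hypothetical helper for illustration only.
public static class PdfDocuments
{
    // Wraps a string of HTML in a document with explicit page settings.
    public static HtmlToPdfDocument Build(string html)
    {
        return new HtmlToPdfDocument
        {
            GlobalSettings =
            {
                ColorMode = ColorMode.Color,
                Orientation = Orientation.Portrait,
                PaperSize = PaperKind.A4,
                // Margin values use the wrapper's default unit.
                Margins = new MarginSettings { Top = 10, Bottom = 10 }
            },
            Objects =
            {
                new ObjectSettings
                {
                    HtmlContent = html,
                    WebSettings = { DefaultEncoding = "utf-8" },
                    // [page] and [toPage] are wkhtmltopdf header placeholders.
                    HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true }
                }
            }
        };
    }
}
```

The resulting document can be handed to a converter's `Convert` method in place of a bare `HtmlToPdfDocument`.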

Here’s an HTTP triggered function that takes a URL to convert and returns the bytes as application/pdf.

using DinkToPdf;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Azure.WebJobs.Host;
using System;
using System.Net.Http;
using System.Threading.Tasks;
using IPdfConverter = DinkToPdf.Contracts.IConverter;

namespace PdfConverterYawnSigh
{
    public static class HtmlToPdf
    {
        [FunctionName("HtmlToPdf")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post")]
            ConvertPdfRequest request, TraceWriter log)
        {
            log.Info($"Converting {request.Url} to PDF");

            var html = await FetchHtml(request.Url);
            var pdfBytes = BuildPdf(html);
            var response = BuildResponse(pdfBytes);

            return response;
        }

        private static FileContentResult BuildResponse(byte[] pdfBytes)
        {
            return new FileContentResult(pdfBytes, "application/pdf");
        }

        private static byte[] BuildPdf(string html)
        {
            return pdfConverter.Convert(new HtmlToPdfDocument()
            {
                Objects =
                {
                    new ObjectSettings
                    {
                        HtmlContent = html
                    }
                }
            });
        }

        private static async Task<string> FetchHtml(string url)
        {
            var response = await httpClient.GetAsync(url);
            if (!response.IsSuccessStatusCode)
            {
                throw new InvalidOperationException($"FetchHtml failed {response.StatusCode} : {response.ReasonPhrase}");        
            }
            return await response.Content.ReadAsStringAsync();
        }

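        // Shared across invocations: HttpClient is designed for reuse, and the
        // SynchronizedConverter funnels every conversion through one thread.
        private static readonly HttpClient httpClient = new HttpClient();
        private static readonly IPdfConverter pdfConverter = new SynchronizedConverter(new PdfTools());
    }

    // Request body bound from the posted JSON; assumed to carry only the URL to convert.
    public class ConvertPdfRequest
    {
        public string Url { get; set; }
    }
}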

What to Worry About

Notice the converter class has the name SynchronizedConverter. The word synchronized is a clue that the converter is single-threaded. The library can buffer conversion requests until its thread is free to process them, but that buffer lives in memory, so it would be safer to trigger the function with a message queue to avoid losing conversion requests in case of a restart.
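To make the "synchronized" idea concrete, here is a minimal sketch of the general pattern: callers on any thread queue work, and one dedicated thread runs it. This is not DinkToPdf's code. `SingleThreadWorker` is a hypothetical class written only to illustrate the behavior.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs every submitted job on one dedicated thread.
public sealed class SingleThreadWorker : IDisposable
{
    private readonly BlockingCollection<Action> jobs = new BlockingCollection<Action>();
    private readonly Thread thread;

    public SingleThreadWorker()
    {
        thread = new Thread(() =>
        {
            // Blocks until a job arrives; the loop ends once CompleteAdding
            // has been called and the queue is empty.
            foreach (var job in jobs.GetConsumingEnumerable())
            {
                job();
            }
        })
        { IsBackground = true };
        thread.Start();
    }

    // Queues the work and returns a task that completes after the worker thread has run it.
    public Task<T> Run<T>(Func<T> work)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        jobs.Add(() =>
        {
            try { tcs.SetResult(work()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    public void Dispose()
    {
        jobs.CompleteAdding();
        thread.Join();
        jobs.Dispose();
    }
}
```

Anything still waiting in `jobs` disappears if the process restarts, which is the risk a durable message queue removes.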

You should also know that the function will not execute successfully in a Consumption plan. You’ll need to use a Basic or higher app service plan in Azure.

To deploy the application you’ll need to include the wkhtmltopdf native binaries. You can build the binary you need from source, or download the binaries from various places, including the DinkToPdf repository. Function apps currently only support .NET Core on Windows in a 32-bit process, so use the 32-bit dll. I added the binary to my function app project and set its “Copy to Output Directory” property so the build copies the dll alongside the function. As we are about to see, the 32-bit address space is not a problem.
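In an SDK-style project file, the copy setting looks roughly like the fragment below. The file name libwkhtmltox.dll and its location in the project root are assumptions for this sketch, so adjust the path to wherever the native binary actually sits in your project.

```xml
<ItemGroup>
  <!-- Assumed file name and location of the 32-bit native binary -->
  <None Update="libwkhtmltox.dll">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```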

Performance Testing

To see how the function performs, I created a single instance of the lowest standard app service plan (S1 – single CPU).

For requests pointing to 18KB of HTML, the function regularly produces a PDF in under 3 seconds, although 20 seconds isn’t abnormal either. Even the simplest functions on the v2 runtime show a high standard deviation in response time. Hopefully the base performance characteristics improve when v2 is out of beta.

Using a single-threaded component like wkhtmltopdf in server-side code is generally a situation to avoid. To see what happens with concurrent users I ran some load tests for 5 minutes starting with 1 user. Every 30 seconds the test added another user up to a maximum of 10 concurrent users. The function consistently works well up to 5 concurrent requests, at which point the average response time is ~30 seconds. By the time the test reaches 7 concurrent users the function consistently generates HTTP 502 errors for a subset of requests. Here are the results from one test run. The Y axis labels are for the average response time (in seconds).

[Figure: Load Testing PDF Generation in Azure Functions]
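The ramp used in the load test (1 user at the start, one more every 30 seconds, capped at 10) is easy to reproduce. Below is a minimal sketch of such a step-load driver. It is not the tool used for these results, and `StepLoad` and its members are hypothetical names. The request itself is passed in as a delegate, so the sketch makes no assumptions about the function's URL or key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical step-load driver; names and structure are illustrative only.
public static class StepLoad
{
    // Users active after 'elapsed': start at 1, add one per completed step, cap at maxUsers.
    public static int UsersAt(TimeSpan elapsed, TimeSpan step, int maxUsers)
    {
        var completedSteps = (int)(elapsed.Ticks / step.Ticks);
        return Math.Min(maxUsers, 1 + completedSteps);
    }

    // Starts user N after N steps; each user issues requests back to back until
    // 'duration' has passed. Returns the number of completed requests.
    // A request that throws ends the run with that exception.
    public static async Task<int> RunAsync(Func<Task> request, TimeSpan duration, TimeSpan step, int maxUsers)
    {
        var started = DateTime.UtcNow;
        var users = new List<Task<int>>();
        for (var i = 0; i < maxUsers; i++)
        {
            var startAt = TimeSpan.FromTicks(step.Ticks * i);
            if (startAt >= duration) break;
            users.Add(UserLoop(request, started, startAt, duration));
        }
        var counts = await Task.WhenAll(users);
        return counts.Sum();
    }

    private static async Task<int> UserLoop(Func<Task> request, DateTime started, TimeSpan startAt, TimeSpan duration)
    {
        // Yield first so the caller can go on to schedule the remaining users.
        await Task.Yield();
        var wait = startAt - (DateTime.UtcNow - started);
        if (wait > TimeSpan.Zero) await Task.Delay(wait);

        var completed = 0;
        while (DateTime.UtcNow - started < duration)
        {
            await request();
            completed++;
        }
        return completed;
    }
}
```

With a 30 second step and a cap of 10, `UsersAt` reaches the tenth user at the 270 second mark, so a 5 minute run spends its last 30 seconds at full load.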

Looking at metrics for the app service plan in Azure, you can see the CPU pegged at 100% for most of the test time. With no headroom left for other apps, you’d want to give this function a dedicated plan.

[Figure: Azure App Service Plan Metrics for PDF Load Test]

Summary

I wouldn’t consider this solution viable for a system whose sole purpose is generating a large number of PDF files all day, but for small workloads the function is workable. Much would depend on the amount of HTML in the conversion. In my experience the real headaches with PDFs come down to formatting. HTML to PDF conversions always look like they’ve been constructed by a drunken type-setter using a misaligned printing press, unless you control the HTML and craft the markup specifically for conversion.