When I use Azure Translator to translate Markdown files, the formatting of the result does not exactly match the source file. Bolded text becomes unbolded, file paths for images and other assets are altered, and, when the damage occurs inside the metadata (frontmatter) section, the file becomes unparsable by Markdown metadata parsing libraries.
I tried excluding all of the metadata keys by adding them to a custom glossary, and I also tried wrapping every metadata value in double quotes.
I expected that anything wrapped in quotes, as well as any newlines, tabs, and spaces, would be preserved.
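To be concrete about the glossary attempt: each metadata key maps to itself, so that Translator should pass the key names through unchanged. A simplified sketch of the en-es.csv glossary referenced in the function below (Azure Document Translation accepts source,target pairs in CSV glossaries; the real file lists every key my frontmatter uses):

title,title
excerpt,excerpt
coverImage,coverImage
date,date
author,author
name,name
picture,picture
ogImage,ogImage
url,url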
For example, here is one of the affected files:
---
title: "Document Translation"
excerpt: "Document Translation is a new feature in Azure Translator service which enables enterprises, translation agencies, and consumers who require volumes of complex documents to be translated into one or more languages preserving structure and format in the original document."
coverImage: "/assets/blog/dynamic-routing/cover.jpg"
date: "2024-05-23T23:53:00.000Z"
author:
  name: "JJ Kasper"
  picture: "/assets/blog/authors/jj.jpeg"
ogImage:
  url: "/assets/blog/dynamic-routing/cover.jpg"
---
## Overview
Document Translation is a new feature in [Azure Translator service](https://azure.microsoft.com/en-us/services/cognitive-services/translator/) which enables enterprises, translation agencies, and consumers who require volumes of complex documents to be translated into one or more languages while preserving the structure and format of the original document. It asynchronously translates whole documents in a variety of file formats, including **Text, HTML, Word, Excel, PowerPoint, Outlook, PDF, and Markdown**, across any of the 111 languages and dialects supported by the Translator service.
Standard translation offerings in the market accept only plain text or HTML and limit the number of characters per request. Users translating large documents must parse the documents to extract text, split it into smaller sections, and translate the sections separately. If sentences are split at an unnatural breakpoint, the translation can lose context, producing suboptimal results. Upon receipt of the translation results, the customer must merge the translated pieces back into a translated document, keeping track of which translated piece corresponds to which section of the original. The problem gets more complicated when customers want to translate complex documents with rich content.
Document Translation makes it easy for the customer to translate:
1. volumes of large documents,
2. documents in a variety of file formats,
3. documents that require preserving the original layout and format, and
4. documents into multiple target languages.
## User experience
The user makes a request to the Document Translation service specifying the location of the source and target documents and the list of target languages. The service returns an identifier that lets the user track the status of the translation. Asynchronously, Document Translation pulls each document from the source location, recognizes the document format, applies the right parsing technique to extract the textual content, and translates that content into the target languages. It then reconstructs the translated document, preserving the layout and format of the source document, and stores the translated document in the specified location. Document Translation updates the status of the translation at either the job or document level.
Users can provide a custom model ID built using the Custom Translator portal, custom glossaries, or both as part of the request to translate documents. Document Translation applies this customization, retaining specific terminology and providing domain-specific translations in the translated documents.
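Below is the Azure Function I use to drive this flow. It generates SAS tokens for the input and output containers, submits each blob in the `input-files` container to the batch translation endpoint along with the en-es.csv glossary, and then polls each operation until every translation job finishes: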
import { app } from '@azure/functions';
import {
  BlobServiceClient,
  StorageSharedKeyCredential,
  generateBlobSASQueryParameters,
  BlobSASPermissions,
  ContainerSASPermissions
} from '@azure/storage-blob';
import axios from 'axios';
import delay from 'delay';

app.http('TranslateContent', {
  methods: ['GET'],
  authLevel: 'anonymous',
  handler: async (_, context) => {
    // Generate a new SAS token for the `output-files` container
    const outputSas = generateContainerSas(
      process.env.OUTPUT_BLOB_STORAGE_ACCOUNT_NAME,
      process.env.OUTPUT_BLOB_STORAGE_ACCOUNT_KEY,
      process.env.OUTPUT_BLOB_STORAGE_CONTAINER_NAME,
      'wl' // Write and list permissions
    );

    // Create a container client for the `input-files` container
    const inputSharedKeyCredential = new StorageSharedKeyCredential(
      process.env.INPUT_BLOB_STORAGE_ACCOUNT_NAME,
      process.env.INPUT_BLOB_STORAGE_ACCOUNT_KEY
    );
    const inputBlobServiceClient = new BlobServiceClient(
      `https://${process.env.INPUT_BLOB_STORAGE_ACCOUNT_NAME}.blob.core.windows.net`,
      inputSharedKeyCredential
    );
    const inputContainerClient = inputBlobServiceClient.getContainerClient(
      process.env.INPUT_BLOB_STORAGE_CONTAINER_NAME
    );

    // Initial config for the Azure AI Translator resource
    const translatorEndpoint = `https://${process.env.TRANSLATOR_RESOURCE_NAME}.cognitiveservices.azure.com/translator/text/batch/v1.1`;
    const translatorRoute = `/batches`;
    const translatorKey = process.env.TRANSLATOR_RESOURCE_KEY;

    let processedBlobCounter = 0;
    let activeOperations = [];

    const glossaryContainerClient =
      inputBlobServiceClient.getContainerClient('glossaries');
    const glossaryUrl = generateBlobSas(
      glossaryContainerClient,
      inputSharedKeyCredential,
      'en-es.csv'
    );

    // Submit every blob in the `input-files` container for translation
    for await (const blob of inputContainerClient.listBlobsFlat()) {
      // Generate a blob-scoped SAS token for the source file
      const inputBlobSasUrl = generateBlobSas(
        inputContainerClient,
        inputSharedKeyCredential,
        blob.name
      );
      const targetFileName = blob.name.replace('en-us', 'es-es');

      // Azure AI Translator blob-specific config
      const data = JSON.stringify({
        inputs: [
          {
            source: {
              sourceUrl: inputBlobSasUrl,
            },
            targets: [
              {
                // Even though we specify the blob name, we still use the
                // container-scoped SAS token: the target file doesn't exist
                // yet, so a blob-specific SAS token can't be issued for it.
                targetUrl: `https://${process.env.OUTPUT_BLOB_STORAGE_ACCOUNT_NAME}.blob.core.windows.net/${process.env.OUTPUT_BLOB_STORAGE_CONTAINER_NAME}/${targetFileName}?${outputSas}`,
                // See https://learn.microsoft.com/en-us/azure/ai-services/translator/language-support
                // for which code to use for each language/dialect
                language: 'es', // Spanish
                glossaries: [
                  {
                    glossaryUrl,
                    format: 'csv',
                  },
                ],
              },
            ],
            storageType: 'File', // Default: 'Folder'
          },
        ],
      });

      // Axios request config
      const config = {
        method: 'post',
        url: translatorEndpoint + translatorRoute,
        headers: {
          'Ocp-Apim-Subscription-Key': translatorKey,
          'Content-Type': 'application/json',
        },
        data: data,
      };

      // Submit the blob to the Azure AI Translator service for translation
      await axios(config)
        .then((response) => {
          activeOperations.push({
            blobName: blob.name,
            operationUrl: response.headers['operation-location'],
          });
          const result = {
            statusText: response.statusText,
            statusCode: response.status,
            headers: response.headers,
          };
          context.log(JSON.stringify(result));
        })
        .catch((err) => context.error(err));
    }

    do {
      context.log(`Waiting 30 seconds before checking translation status...`);
      await delay(30000); // Wait 30 seconds before checking the status of each translation
      context.log(
        `Checking translation status of ${activeOperations.length} operations...`
      );
      for (const operationObj of activeOperations) {
        // await delay(1000); // Wait 1 second between each status request
        const operationUrl = operationObj.operationUrl;
        context.log(`Checking status of '${operationObj.blobName}'...`);
        const statusResponse = await checkTranslationStatus(
          operationUrl,
          translatorKey
        );
        const operationStatus = statusResponse.data.status;
        context.log(`operationStatus: ${operationStatus}`);
        if (operationStatus === 'Succeeded') {
          // The translation is complete: increment the number of processed
          // blobs and drop the operation from the `activeOperations` array
          context.log(`'${operationObj.blobName}' translated!`);
          processedBlobCounter++;
          activeOperations = activeOperations.filter((operation) => {
            if (operation.operationUrl === operationUrl) {
              context.log(`Removed ${JSON.stringify(operation)}`);
              return false; // Drop the finished operation
            }
            return true; // Keep everything else
          });
          context.log(
            `${processedBlobCounter} files translated so far, ${activeOperations.length} remaining.`
          );
        } else if (
          operationStatus === 'Failed' ||
          operationStatus === 'ValidationFailed'
        ) {
          // The translation failed: drop the operation from the
          // `activeOperations` array, but don't increment the counter
          context.log(
            `'${operationObj.blobName}' couldn't be translated. Error message: ${statusResponse.data.error.message}.`
          );
          activeOperations = activeOperations.filter((operation) => {
            if (operation.operationUrl === operationUrl) {
              context.log(`Removed ${JSON.stringify(operation)}`);
              return false;
            }
            return true;
          });
          // Unless the failure was simply a file with nothing to translate,
          // rethrow it
          if (
            statusResponse.data.error.message !==
            'The document does not have any translatable text.'
          ) {
            throw new Error(statusResponse.data.error.message);
          }
        } else {
          // Otherwise, just record the current status on the operation object
          operationObj.status = operationStatus;
        }
      }
    } while (activeOperations.length > 0);

    return {
      status: 200,
      body: `${processedBlobCounter} files translated!`,
      headers: {
        'Content-Type': 'text/plain',
      },
    };
  },
});

async function checkTranslationStatus(operationUrl, translatorKey) {
  const statusConfig = {
    method: 'get',
    url: operationUrl,
    headers: {
      'Ocp-Apim-Subscription-Key': translatorKey,
    },
  };
  return await axios(statusConfig);
}

function generateBlobSas(containerClient, sharedKeyCredential, blobName) {
  const blobClient = containerClient.getBlobClient(blobName);
  const startDate = new Date();
  const expiryDate = new Date(startDate);
  expiryDate.setMinutes(startDate.getMinutes() + 100);
  const permissions = BlobSASPermissions.parse('r'); // Read-only permission
  const sasQueryParameters = generateBlobSASQueryParameters(
    {
      containerName: containerClient.containerName,
      blobName,
      permissions,
      startsOn: startDate,
      expiresOn: expiryDate,
    },
    sharedKeyCredential
  );
  const sasUrl = blobClient.url + '?' + sasQueryParameters.toString();
  return sasUrl;
}

function generateContainerSas(
  accountName,
  accountKey,
  containerName,
  permissions
) {
  const sharedKeyCredential = new StorageSharedKeyCredential(
    accountName,
    accountKey
  );
  const sasOptions = {
    containerName,
    // Parse the permission string (e.g. 'wl') into a permissions object
    permissions: ContainerSASPermissions.parse(permissions),
    startsOn: new Date(),
    expiresOn: new Date(new Date().valueOf() + 3600 * 1000), // Expires in 1 hour
  };
  return generateBlobSASQueryParameters(
    sasOptions,
    sharedKeyCredential
  ).toString();
}
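For completeness, this is how I detect that a translated file's metadata is broken. A minimal sketch, assuming gray-matter as the frontmatter parser (any Markdown metadata parsing library fails the same way on the mangled files); the file path is illustrative:

import matter from 'gray-matter';
import { readFile } from 'node:fs/promises';

// Attempt to parse a translated file's frontmatter; an exception here is
// the "unparsable" failure described above.
async function checkFrontMatter(path) {
  const raw = await readFile(path, 'utf8');
  try {
    const { data } = matter(raw);
    console.log(`${path}: frontmatter OK`, data);
  } catch (err) {
    console.error(`${path}: frontmatter unparsable - ${err.message}`);
  }
}

await checkFrontMatter('./output-files/es-es/document-translation.md');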