AWS CDK Serverless Text-to-Speech pt II

In the first part of this article (check it out right here), we saw an overview of the solution. Now we will delve into the code and build the solution from scratch. I will highlight the key parts of the codebase I created while exploring CDK and constructing this serverless solution.

If you haven’t read the first part yet, I recommend doing so before proceeding. However, if you are eager to get your hands dirty and build this solution step by step, you can continue reading.

Alternatively, if you simply want to deploy and test how this solution works, you can follow the instructions provided in the readme of the GitHub repository.

Build from scratch

To create this application you’ll need to have Node.js (>= 16.3.0), AWS CDK, and AWS CLI installed. You’ll also need an AWS account. The AWS CLI should be configured with your AWS credentials.

If you try to run CDK with a Node version below the one indicated above, you will see the warning message: “v16.2.0 is unsupported and has known compatibility issues with this software.” Update your Node version, and it should be fine.

Here are the relevant links to the official AWS documentation that will help you with the pre-requisites:

AWS CDK: The AWS CDK Developer Guide provides detailed information on how to install and configure AWS CDK. The AWS CDK (Cloud Development Kit) can be installed via npm (Node Package Manager) using the command: npm install -g aws-cdk
AWS CLI: You can install the AWS CLI (Command Line Interface) following the instructions in the AWS CLI User Guide. You can install the AWS CLI using pip, which is a package manager for Python. The command to install the AWS CLI is: pip install awscli
Configuring the AWS CLI: Once the AWS CLI is installed, you will need to configure it with your AWS credentials. This includes your Access Key ID and Secret Access Key. Instructions for this can be found in the AWS CLI User Guide. The command to configure the AWS CLI is: aws configure

Step 1: Create a new CDK project

Create a new directory for the project:

mkdir text-to-speech-app
cd text-to-speech-app

Initialize a new CDK project:

cdk init --language=typescript

Step 2: Install AWS CDK packages

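You will need several packages for this application. Install them as follows (note that every snippet in this article imports from aws-cdk-lib, the CDK v2 package that cdk init already adds, so these scoped CDK v1 packages are optional):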

npm install @aws-cdk/aws-s3 @aws-cdk/aws-dynamodb @aws-cdk/aws-sqs @aws-cdk/aws-lambda @aws-cdk/aws-lambda-event-sources @aws-cdk/aws-iam --save

After that, the dependencies section of your package.json should look like this:

 "dependencies": {
    "@aws-cdk/aws-dynamodb": "^1.204.0",
    "@aws-cdk/aws-iam": "^1.204.0",
    "@aws-cdk/aws-lambda": "^1.204.0",
    "@aws-cdk/aws-lambda-event-sources": "^1.204.0",
    "@aws-cdk/aws-s3": "^1.204.0",
    "@aws-cdk/aws-s3-notifications": "^1.204.0",
    "@aws-cdk/aws-sqs": "^1.204.0",
    "aws-cdk-lib": "2.87.0",
    "constructs": "^10.0.0",
    "source-map-support": "^0.5.21"
  }

Step 3: Create S3 Buckets

Open the lib/text-to-speech-app-stack.ts file and start creating the S3 buckets:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';

import * as s3 from 'aws-cdk-lib/aws-s3';

export class TextToSpeechAppStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const textBucket = new s3.Bucket(this, 'TextBucket', {
      versioned: false,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });

    const audioBucket = new s3.Bucket(this, 'AudioBucket', {
      versioned: false,
      encryption: s3.BucketEncryption.S3_MANAGED,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });
  }
}

Step 4: Create a DynamoDB Table

Add the following lines to create the MetadataTable DynamoDB table:

import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

// inside the constructor, below the audioBucket creation

    const metadataTable = new dynamodb.Table(this, 'MetadataTable', {
      partitionKey: { name: 'uuid', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'submissionTime', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
    });

Step 5: Create an SQS Queue

Next, create the ProcessingQueue SQS queue:

import * as sqs from 'aws-cdk-lib/aws-sqs';

// inside the constructor, below metadataTable

    const processingQueue = new sqs.Queue(this, 'ProcessingQueue', {
      visibilityTimeout: cdk.Duration.seconds(300), // 5 minutes
    });

Step 6: Create an S3 Event Notification with SQS as target

Now that we have textBucket and processingQueue, we can create an S3 event notification that sends a message to the processing queue whenever a file is uploaded under the upload/ prefix.

import * as s3n from 'aws-cdk-lib/aws-s3-notifications';

// inside the constructor, below processingQueue

    textBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.SqsDestination(processingQueue),
      { prefix: 'upload/' }
    );
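
To make the flow concrete, here is a minimal sketch of how a consumer could read the message that S3 places on the queue. The function name `extractObjectKeys` and the trimmed-down types are hypothetical and not part of the project. The sketch relies on two documented S3 behaviors: an `s3:TestEvent` message without `Records` is sent when the notification is first configured, and object keys arrive URL-encoded, with spaces written as `+`.

```typescript
// Hypothetical helper: the names and trimmed-down types are illustrative only.
interface S3NotificationRecord {
  s3: { bucket: { name: string }; object: { key: string } };
}

interface S3NotificationBody {
  Event?: string; // "s3:TestEvent" on the initial test message
  Records?: S3NotificationRecord[];
}

// Takes the raw SQS message body and returns the decoded S3 object keys.
function extractObjectKeys(sqsBody: string): string[] {
  const body: S3NotificationBody = JSON.parse(sqsBody);
  // The test message has no Records, so it yields an empty list.
  const records = body.Records ?? [];
  return records.map((record) =>
    // Keys are URL-encoded in the notification; "+" stands for a space.
    decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
  );
}

const sample = JSON.stringify({
  Records: [
    { s3: { bucket: { name: 'text-bucket' }, object: { key: 'upload/my+first+text.txt' } } },
  ],
});
console.log(extractObjectKeys(sample)); // [ 'upload/my first text.txt' ]
```

Decoding matters as soon as a file name contains spaces or special characters, because the encoded key does not match the object's real key in the bucket.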

We’ve now created the main infrastructure for our solution, and the highlight is that it took only 38 lines of TypeScript.

You can run cdk deploy now to deploy the current state of the stack, before we add the Lambda function as a construct.

CDK shows the deployment progress right in your terminal:

[Image: cdk deploy progress in the terminal]

Once the process has finished, you can open the AWS console to check the status of the infrastructure.

[Image: deployed resources in the AWS console]

Now we can move on and create the Lambda function that will actually do the magic for us.

Step 7: Create the TextToSpeech Construct

Our Lambda function will be created as a new construct in our CDK project. To do this, create a new folder called TextToSpeech inside /lib, and inside it create two new files, TextToSpeech.ts and TextToSpeech.lambda.ts. The project structure will look like this:

[Image: project structure]

Now open TextToSpeech.ts and create the construct; this is where we’ll add all the Lambda function configuration:

import { Construct } from "constructs";

import { IBucket } from "aws-cdk-lib/aws-s3";
import { ITable } from "aws-cdk-lib/aws-dynamodb";
import { IQueue } from "aws-cdk-lib/aws-sqs";

export class TextToSpeech extends Construct {
    constructor(scope: Construct, id: string, props: {
        textBucket: IBucket,
        audioBucket: IBucket,
        metadataTable: ITable,
        processingQueue: IQueue
    }) {
        super(scope, id);
    }
}

This is our base construct, where we’ll define all the Lambda resources. The highlight here is props, where we declare every resource the Lambda function will interact with.

Step 8: Define the Lambda Function

import path = require("path");
import * as cdk from 'aws-cdk-lib';

import { NodejsFunction } from "aws-cdk-lib/aws-lambda-nodejs";
import * as iam from 'aws-cdk-lib/aws-iam';
import { IBucket } from "aws-cdk-lib/aws-s3";
import { ITable } from "aws-cdk-lib/aws-dynamodb";
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';
import { IQueue } from "aws-cdk-lib/aws-sqs";

// Inside the constructor
        const textToSpeechFunction = new NodejsFunction(this, "NodejsFunction", {
            entry: path.resolve(__dirname, "TextToSpeech.lambda.ts"),
            bundling: {
                nodeModules: ['aws-sdk'],
            },
            environment: {
                TEXT_BUCKET_NAME: props.textBucket.bucketName,
                AUDIO_BUCKET_NAME: props.audioBucket.bucketName,
                METADATA_TABLE_NAME: props.metadataTable.tableName
            },
            timeout: cdk.Duration.seconds(60) // 1 min
        });

The highlight here is that we use path to point the entry property at the Lambda code, and we list aws-sdk under the bundling property’s nodeModules so the dependency is installed into the Lambda bundle.

In our tests, this 60-second timeout was enough, since Amazon Polly responds reasonably fast.
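
The function timeout also interacts with the queue's visibility timeout from Step 5. Lambda rejects an SQS event source when the queue's visibility timeout is shorter than the function timeout, and the AWS documentation recommends a visibility timeout of at least six times the function timeout. The sketch below encodes both rules; `checkTimeouts` is a hypothetical helper, not part of CDK.

```typescript
// Hypothetical helper, not part of CDK: classifies a timeout pair given in seconds.
type TimeoutCheck = 'invalid' | 'below-recommended' | 'ok';

function checkTimeouts(visibilitySeconds: number, functionSeconds: number): TimeoutCheck {
  // The event source mapping is rejected when visibility < function timeout.
  if (visibilitySeconds < functionSeconds) return 'invalid';
  // AWS recommends a visibility timeout of at least 6x the function timeout.
  if (visibilitySeconds < 6 * functionSeconds) return 'below-recommended';
  return 'ok';
}

// Values used in this article: 300 s on the queue, 60 s on the function.
console.log(checkTimeouts(300, 60)); // below-recommended
console.log(checkTimeouts(360, 60)); // ok
```

With the values from this article the deployment works; raising the queue's visibility timeout to 360 seconds would bring it in line with the recommendation.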

Step 9: Grant Permissions to the Resources

Now we need to grant the necessary permissions to our resources.

        props.textBucket.grantRead(textToSpeechFunction);
        props.metadataTable.grantReadWriteData(textToSpeechFunction);
        props.audioBucket.grantWrite(textToSpeechFunction);

        // Allow the function to call Amazon Polly
        textToSpeechFunction.addToRolePolicy(
            new iam.PolicyStatement({
                actions: ['polly:SynthesizeSpeech'],
                resources: ['*'],
            })
        );

        // Trigger the function from the processing queue
        textToSpeechFunction.addEventSource(
            new SqsEventSource(props.processingQueue)
        );

Here we use the resources declared in the construct’s props to configure the necessary permissions. The grantRead, grantReadWriteData, and grantWrite methods spare us from writing roles and policies by hand. We then use an IAM PolicyStatement to add a custom policy to the Lambda role that allows the polly:SynthesizeSpeech action, and finally register the processing queue as the function’s event source.

The full file can be found on GitHub.

Step 10: Create the Lambda Function Code

Now open TextToSpeech.lambda.ts and define the Lambda code as below:

import { S3, DynamoDB, Polly } from 'aws-sdk';
import { SQSEvent } from 'aws-lambda';
import { GenerateUUID } from './utils/UUIDGenerator';

const s3 = new S3();
const dynamodb = new DynamoDB.DocumentClient();
const polly = new Polly();

exports.handler = async (event: SQSEvent) => {
    const textBucketName = process.env.TEXT_BUCKET_NAME || '';
    const audioBucketName = process.env.AUDIO_BUCKET_NAME || '';
    const metadataTableName = process.env.METADATA_TABLE_NAME || '';

    for (const record of event.Records) {
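        const body = JSON.parse(record.body);
        // S3 sends an s3:TestEvent message without Records when the notification is first configured
        if (!body.Records?.length) continue;
        // Object keys arrive URL-encoded in S3 notifications (spaces become '+')
        const textKey = decodeURIComponent(body.Records[0].s3.object.key.replace(/\+/g, ' '));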

        // Read the text file from the text bucket
        const textObject = await s3.getObject({
            Bucket: textBucketName,
            Key: textKey
        }).promise();
        if (!textObject.Body) {
            throw new Error(`Failed to get text body from S3 object: ${textKey}`);
        }
        const text = textObject.Body.toString();

        const pollyResponse = await polly.synthesizeSpeech({
            OutputFormat: 'mp3',
            Text: text,
            VoiceId: 'Joanna',
        }).promise();

        if (pollyResponse.AudioStream) {
            const date = new Date();
            const year = date.getFullYear();
            const month = date.getMonth() + 1;
            const day = date.getDate();
            const timestamp = Date.now();

            const uuid = GenerateUUID();

            const audioKey = `synthesized/${uuid}/year=${year}/month=${month}/day=${day}/${timestamp}.mp3`;
            await s3.putObject({
                Bucket: audioBucketName,
                Key: audioKey,
                Body: pollyResponse.AudioStream
            }).promise();

            await dynamodb.put({
                TableName: metadataTableName,
                Item: {
                    'uuid': uuid,
                    // submissionTime is the table's sort key, so it must never be undefined
                    'submissionTime': (textObject.LastModified ?? new Date()).toISOString(),
                    'textKey': textKey,
                    'audioKey': audioKey,
                    'characters': pollyResponse.RequestCharacters,
                    'status': 'completed'
                }
            }).promise();

            console.log(`Process ${uuid} synthesized file ${textKey} with ${pollyResponse.RequestCharacters} characters, saved at: ${audioKey}`);
        }
    }

    return {
        statusCode: 200,
        body: JSON.stringify('Text to speech conversion completed successfully!'),
    };
};

Nothing new here: the handler reads the SQS event, grabs the S3 object key, and then uses Amazon Polly through aws-sdk to synthesize the text from the file into audio. When it finishes, it saves the details to the metadata table.
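
The handler imports `GenerateUUID` from `./utils/UUIDGenerator`, a file this article does not show. As a stand-in, and purely as an assumption about what that helper does, a minimal version could wrap Node's built-in `randomUUID` from the `crypto` module:

```typescript
// lib/TextToSpeech/utils/UUIDGenerator.ts (assumed location, derived from the import path)
import { randomUUID } from 'crypto';

// Assumed implementation: returns a random version 4 UUID string.
export function GenerateUUID(): string {
  return randomUUID();
}
```

Any generator that returns a unique string would work here, since the value is only used as the partition key and as part of the audio object key.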

Step 11: Use the TextToSpeech Construct

Now we just have to import our newly created construct into the stack file, text-to-speech-app-stack.ts:

import { TextToSpeech } from './TextToSpeech/TextToSpeech';

// Below the addEventNotification call

    new TextToSpeech(this, 'TextToSpeech', {
      textBucket: textBucket,
      audioBucket: audioBucket,
      metadataTable: metadataTable,
      processingQueue: processingQueue
    });

After this setup, we can deploy the stack and test whether everything works.

Step 12: Deploy the Application

Finally, build and deploy the application:

npm run build
cdk deploy

This will build your application and deploy it to your AWS account.

Testing our App

Once the deployment is complete, open your AWS account and locate the text bucket. Upload a text file under the upload/ prefix, since the event notification only fires for keys that start with upload/. After a short while, the audio file will appear in the audio bucket under synthesized/.

To view detailed information about each processed file and the number of characters used to synthesize the texts you have sent, refer to the metadata table.
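
As a rough guide to what you should find there, the sketch below rebuilds the audio key and the metadata item with the same layout the handler in Step 10 uses. The helper names `buildAudioKey` and `buildMetadataItem` are hypothetical, and UTC date getters are used here (an assumption, the handler uses local-time getters) so the output is deterministic.

```typescript
// Hypothetical helpers mirroring the handler's key and item layout.
interface MetadataItem {
  uuid: string;           // partition key
  submissionTime: string; // sort key, ISO 8601
  textKey: string;
  audioKey: string;
  characters: number;
  status: 'completed';
}

function buildAudioKey(uuid: string, now: Date): string {
  const year = now.getUTCFullYear();
  const month = now.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const day = now.getUTCDate();
  return `synthesized/${uuid}/year=${year}/month=${month}/day=${day}/${now.getTime()}.mp3`;
}

function buildMetadataItem(
  uuid: string,
  textKey: string,
  characters: number,
  submitted: Date,
  now: Date
): MetadataItem {
  return {
    uuid,
    submissionTime: submitted.toISOString(),
    textKey,
    audioKey: buildAudioKey(uuid, now),
    characters,
    status: 'completed',
  };
}

const now = new Date(Date.UTC(2023, 6, 15, 12, 0, 0)); // 15 July 2023, 12:00 UTC
console.log(buildAudioKey('1234', now));
// synthesized/1234/year=2023/month=7/day=15/1689422400000.mp3
```

The year=/month=/day= segments make it easy to browse the audio bucket by date, while the uuid ties each audio file back to its row in the metadata table.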

If you prefer, you can download the audio files directly from the S3 console. Alternatively, you can take this further and put an API in front of the solution with CDK. I highly recommend this option, as it is simple to implement.

Final Thoughts and Further Exploration

After exploring this text-to-speech solution, I am excited about the potential to create more solutions using CDK and TypeScript. TypeScript is particularly useful because it is a typed language that already fits backend and frontend development, and with CDK the same language now covers infrastructure code too. I am also interested in exploring CDK in other programming languages. In my opinion, CDK removes many of the challenges we previously faced with tools like Terraform or CloudFormation, the latter being what CDK uses behind the scenes.

If you have any thoughts or would like to discuss further, please let me know. I also welcome any feedback. Thank you for joining me on this journey, and stay tuned for potential new posts on my pages.
