Getting hands-on with Amazon Textract – AWS Application Services for AI/ML – MLS-C01 Study Guide

Getting hands-on with Amazon Textract

In this section, you will use the Amazon Textract API to read an image file from an S3 bucket and print the FORM details to CloudWatch Logs. The same details can be stored in S3 in your desired format for further use, or in DynamoDB as key-value pairs. Let’s get started:

  1. First, create an IAM role called textract-use-case-role with the following policies. This will allow the Lambda function to execute so that it can read from S3, use Amazon Textract, and print the output in CloudWatch logs:
    • CloudWatchFullAccess
    • AmazonTextractFullAccess
    • AmazonS3ReadOnlyAccess
  2. Create an S3 bucket called textract-document-analysis and upload the receipt.png image file. This image contains the FORM details that will be extracted. The image file is available at https://github.com/PacktPublishing/AWS-Certified-Machine-Learning-Specialty-MLS-C01-Certification-Guide-Second-Edition/tree/main/Chapter08/Amazon%20Textract%20Demo/input_doc:

Figure 8.20 – An S3 bucket with an image (.png) file uploaded to the input folder
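
The role and upload steps above can also be scripted instead of clicked through in the console. The following is a minimal sketch, not the book's method: it assumes AWS credentials are already configured, that the bucket already exists, and that the image is uploaded under the key input/receipt.png (an assumption based on the input folder shown in Figure 8.20). The function and constant names are illustrative.

```python
import json

ROLE_NAME = "textract-use-case-role"

# AWS managed policies listed in step 1.
MANAGED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    "arn:aws:iam::aws:policy/AmazonTextractFullAccess",
    "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
]


def lambda_trust_policy():
    """Trust policy that lets the Lambda service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role_and_upload(local_path="receipt.png",
                           bucket="textract-document-analysis",
                           key="input/receipt.png"):
    """Create the role, attach the policies, and upload the image.

    Assumes configured credentials and an existing bucket; the key is
    a guess at the folder layout, so adjust it to your own bucket.
    """
    import boto3  # imported lazily so the helpers above work without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(lambda_trust_policy()),
    )
    for arn in MANAGED_POLICY_ARNS:
        iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=arn)

    boto3.client("s3").upload_file(local_path, bucket, key)
```

The trust policy is the part the console fills in for you when you pick Lambda as the trusted service; without it, the function cannot assume the role even if all three policies are attached.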

  • The next step is to create a Lambda function called read-scanned-doc, as shown in Figure 8.21, with an existing execution role called textract-use-case-role:

Figure 8.21 – The AWS Lambda Create function dialog

  • Once the function has been created, paste the following code and deploy it. Scroll down to Basic Settings to change the default timeout to a higher value (40 seconds) to prevent timeout errors. The code uses the analyze_document API from Amazon Textract to get the table and form details via the FeatureTypes parameter of the API:

import boto3
from urllib.parse import unquote_plus

# trp is the Amazon Textract response parser. It is not part of the
# default Lambda runtime, so it must be packaged with the function.
from trp import Document

textract_client = boto3.client('textract')

def lambda_handler(event, context):
    print("- - - Amazon Textract Demo - - -")

    # read the bucket name from the event
    name_of_the_bucket = event['Records'][0]['s3']['bucket']['name']

    # read the object key from the event (S3 URL-encodes keys in events)
    name_of_the_doc = unquote_plus(event['Records'][0]['s3']['object']['key'])

    print(name_of_the_bucket)
    print(name_of_the_doc)

    # analyze the document in place in S3, requesting tables and forms
    response = textract_client.analyze_document(
        Document={'S3Object': {'Bucket': name_of_the_bucket,
                               'Name': name_of_the_doc}},
        FeatureTypes=["TABLES", "FORMS"])
    print(str(response))

    # wrap the raw response so pages, tables, and fields are easy to walk
    doc = Document(response)

    for page in doc.pages:
        # Print tables
        for table in page.tables:
            for r, row in enumerate(table.rows):
                for c, cell in enumerate(row.cells):
                    print("Table[{}][{}] = {}".format(r, c, cell.text))

    for page in doc.pages:
        # Print fields
        print("Fields:")
        for field in page.form.fields:
            print("Key: {}, Value: {}".format(field.key, field.value))
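
The trp helper hides how form data is laid out in the raw analyze_document response. As a rough sketch of that layout (not the trp implementation itself), the function below walks the KEY_VALUE_SET blocks directly and returns a plain dictionary, which could then be serialized to JSON for S3 or written to DynamoDB as key-value pairs. The function names and the sample response are illustrative; the sample is a hand-built stand-in, not real Textract output.

```python
import json


def _text_of(block, blocks_by_id):
    """Join the WORD children (and selection marks) of a KEY or VALUE block."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def form_fields(response):
    """Return {key_text: value_text} from an analyze_document response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            # A KEY block points at its VALUE block through a VALUE relationship.
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        fields[_text_of(block, blocks_by_id)] = value_text
    return fields


# Hand-built sample with one key-value pair (illustrative data only).
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Total:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "12.50"},
    {"Id": "w3", "BlockType": "WORD", "Text": "USD"},
]}

print(json.dumps(form_fields(sample)))
```

If two fields share the same key text, this sketch keeps only the last one; a list of pairs would be a safer shape for documents with repeated labels.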

Unlike the previous examples, you will create a test configuration to run your code.

  • Click on the dropdown to the left of the Test button.
  • Select Configure test events and choose Create new test event.
  • Select Amazon S3 Put from the Event template dropdown.
  • In the JSON body, change the highlighted values to match your bucket name and object key, as shown here:

Figure 8.22 – The Event template for testing the Lambda function

  • In the Event name field, name the test configuration TextractDemo.
  • Click Save.
  • Select your test configuration (TextractDemo) and click on Test:

Figure 8.23 – Selecting the test configuration before running your test
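
For reference, the part of the S3 Put test event that the Lambda function actually reads is small. The sketch below shows a trimmed-down event as a Python dictionary, together with the lookup that pulls out the bucket name and key. The console template contains more fields than shown here, and the key input/receipt.png is an assumption based on the input folder in Figure 8.20; bucket_and_key is an illustrative helper name.

```python
from urllib.parse import unquote_plus

# Trimmed-down S3 Put event; only the fields the handler reads are kept.
# The key below is an assumption, so replace it with your own object key.
test_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "textract-document-analysis"},
                "object": {"key": "input/receipt.png"},
            },
        }
    ]
}


def bucket_and_key(event):
    """Extract the bucket name and (URL-decoded) object key from the event."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


print(bucket_and_key(test_event))
```

Decoding the key with unquote_plus makes no difference for this file name, but it matters for real S3 notifications, where spaces and special characters in keys arrive URL-encoded.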

  1. This will trigger the Lambda function. You can monitor the logs from CloudWatch > CloudWatch Logs > Log groups > /aws/lambda/read-scanned-doc.
  2. Click on the log streams and select the latest one. The output will look as follows; the key-value pairs can be seen in Figure 8.24:

Figure 8.24 – The logs in CloudWatch for verifying the output
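
The same log output can be fetched programmatically rather than through the console. The following is a minimal sketch that assumes the log group name used above and a caller with permission to read CloudWatch Logs; latest_log_messages is an illustrative name, and the function takes the client as a parameter so it can be exercised without AWS access. It reads a single page of events and does not follow pagination tokens.

```python
LOG_GROUP = "/aws/lambda/read-scanned-doc"


def latest_log_messages(logs_client, log_group=LOG_GROUP):
    """Return the messages from the most recently active log stream."""
    streams = logs_client.describe_log_streams(
        logGroupName=log_group,
        orderBy="LastEventTime",
        descending=True,
        limit=1,
    )["logStreams"]
    if not streams:
        return []
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        startFromHead=True,
    )["events"]
    return [event["message"] for event in events]
```

With credentials configured, passing boto3.client("logs") as logs_client should return the same lines shown in the console, including the Key and Value lines printed by the function.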