Learn how to build a serverless document processing pipeline using AWS Lambda. Complete setup with Python 3.11, Lambda Layers, S3 triggers, and PDF processing with PyPDF2 and ReportLab.
Building a serverless document processing pipeline is essential for modern applications that need to automatically process documents as they arrive. This comprehensive guide walks you through creating a production-ready Lambda function that automatically triggers when files arrive in your S3 bucket, processes PDFs with metadata extraction, and saves enhanced results for delivery.
Our serverless pipeline provides automated, scalable document processing:
internal-processing bucket
│
│ S3 Event Notification
↓
┌─────────────────────┐
│ Lambda Function │
│ (Python 3.11) │
│ │
│ Dependencies: │
│ - pandas │
│ - PyPDF2 │
│ - Pillow │
│ - reportlab │
└──────────┬──────────┘
│
│ Write processed file
↓
processed-output bucket
│
│ S3 Replication (from Phase 1)
↓
delivery bucket
Create a new directory structure:
cd /path/to/secure-doc-pipeline
# Create Lambda directories
mkdir -p lambda
mkdir -p lambda/function
mkdir -p lambda/layer/python
cd lambda
Your structure should look like:
secure-doc-pipeline/
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ └── terraform.tfvars
└── lambda/
├── function/
│ └── lambda_function.py
└── layer/
└── python/
└── (dependencies will go here)
Create file: lambda/function/lambda_function.py
import json
import boto3
import os
from datetime import datetime
from urllib.parse import unquote_plus
import logging
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize AWS clients
s3_client = boto3.client('s3')
def lambda_handler(event, context):
"""
Main Lambda handler function.
Triggered by S3 PUT events in the internal-processing bucket.
Args:
event: S3 event notification
context: Lambda context object
Returns:
dict: Status response
"""
logger.info(f"Lambda function invoked. Event: {json.dumps(event)}")
try:
# Parse S3 event
for record in event['Records']:
# Get bucket and object information
source_bucket = record['s3']['bucket']['name']
source_key = unquote_plus(record['s3']['object']['key'])
file_size = record['s3']['object']['size']
logger.info(f"Processing file: {source_key} from bucket: {source_bucket}")
logger.info(f"File size: {file_size} bytes")
# Validate file
if not source_key.lower().endswith('.pdf'):
logger.warning(f"Skipping non-PDF file: {source_key}")
continue
if file_size == 0:
logger.warning(f"Skipping empty file: {source_key}")
continue
# Process the document
result = process_document(source_bucket, source_key)
if result['success']:
logger.info(f"Successfully processed: {source_key}")
logger.info(f"Output file: {result['output_key']}")
else:
logger.error(f"Failed to process: {source_key}. Error: {result['error']}")
return {
'statusCode': 200,
'body': json.dumps({
'message': 'Document processing completed',
'processed_files': len(event['Records'])
})
}
except Exception as e:
logger.error(f"Unexpected error in lambda_handler: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps({
'message': 'Error processing documents',
'error': str(e)
})
}
def process_document(source_bucket, source_key):
"""
Process a PDF document: extract metadata, add processing info, and save to output bucket.
Args:
source_bucket: Source S3 bucket name
source_key: Source object key
Returns:
dict: Processing result with success status
"""
try:
# Get output bucket from environment variable
output_bucket = os.environ.get('OUTPUT_BUCKET')
if not output_bucket:
raise ValueError("OUTPUT_BUCKET environment variable not set")
# Download the source file to /tmp
local_input_path = f"/tmp/{os.path.basename(source_key)}"
logger.info(f"Downloading {source_key} to {local_input_path}")
s3_client.download_file(source_bucket, source_key, local_input_path)
# Get file metadata
file_metadata = get_file_metadata(source_bucket, source_key)
logger.info(f"File metadata: {json.dumps(file_metadata)}")
# Process the PDF
processed_content = process_pdf(local_input_path, file_metadata)
# Generate output filename
timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
base_name = os.path.splitext(os.path.basename(source_key))[0]
output_key = f"processed/{base_name}-processed-{timestamp}.pdf"
local_output_path = f"/tmp/processed-{os.path.basename(source_key)}"
# Create enhanced PDF with processing metadata
create_enhanced_pdf(local_input_path, local_output_path, processed_content)
# Upload to output bucket
logger.info(f"Uploading processed file to {output_bucket}/{output_key}")
s3_client.upload_file(
local_output_path,
output_bucket,
output_key,
ExtraArgs={
'Metadata': {
'original-file': source_key,
'processed-timestamp': timestamp,
'processor': 'secure-doc-pipeline-lambda',
'original-size': str(file_metadata['size']),
}
}
)
# Clean up local files
cleanup_temp_files([local_input_path, local_output_path])
return {
'success': True,
'output_bucket': output_bucket,
'output_key': output_key,
'metadata': processed_content
}
except Exception as e:
logger.error(f"Error processing document: {str(e)}", exc_info=True)
return {
'success': False,
'error': str(e)
}
def get_file_metadata(bucket, key):
"""
Retrieve metadata about the S3 object.
Args:
bucket: S3 bucket name
key: S3 object key
Returns:
dict: File metadata
"""
try:
response = s3_client.head_object(Bucket=bucket, Key=key)
return {
'size': response['ContentLength'],
'last_modified': response['LastModified'].isoformat(),
'content_type': response.get('ContentType', 'unknown'),
'etag': response.get('ETag', '').strip('"'),
'metadata': response.get('Metadata', {})
}
except Exception as e:
logger.error(f"Error getting metadata: {str(e)}")
return {
'size': 0,
'last_modified': 'unknown',
'content_type': 'unknown',
'etag': 'unknown',
'metadata': {}
}
def process_pdf(input_path, metadata):
"""
Process the PDF file: extract text, analyze content.
Args:
input_path: Path to input PDF file
metadata: File metadata
Returns:
dict: Processed content and analysis
"""
try:
import PyPDF2
# Open and read PDF
with open(input_path, 'rb') as file:
pdf_reader = PyPDF2.PdfReader(file)
num_pages = len(pdf_reader.pages)
# Extract text from all pages
full_text = ""
for page_num in range(num_pages):
page = pdf_reader.pages[page_num]
full_text += page.extract_text()
# Get PDF metadata
pdf_info = pdf_reader.metadata if pdf_reader.metadata else {}
return {
'num_pages': num_pages,
'text_length': len(full_text),
'word_count': len(full_text.split()),
'pdf_title': pdf_info.get('/Title', 'N/A'),
'pdf_author': pdf_info.get('/Author', 'N/A'),
'pdf_subject': pdf_info.get('/Subject', 'N/A'),
'processing_timestamp': datetime.utcnow().isoformat(),
'file_size_bytes': metadata['size']
}
except Exception as e:
logger.error(f"Error processing PDF: {str(e)}")
return {
'num_pages': 0,
'text_length': 0,
'word_count': 0,
'error': str(e),
'processing_timestamp': datetime.utcnow().isoformat()
}
def create_enhanced_pdf(input_path, output_path, processing_info):
"""
Create an enhanced PDF with processing metadata appended.
Args:
input_path: Path to input PDF
output_path: Path to save output PDF
processing_info: Processing metadata to include
"""
try:
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from PyPDF2 import PdfReader, PdfWriter
import io
# Create a metadata page
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
# Add processing information
can.setFont("Helvetica-Bold", 16)
can.drawString(50, 750, "Document Processing Report")
can.setFont("Helvetica", 12)
y_position = 720
report_lines = [
f"Processed by: Secure Document Pipeline",
f"Processing Date: {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')}",
f"",
f"Original Document Analysis:",
f" - Number of Pages: {processing_info.get('num_pages', 'N/A')}",
f" - Word Count: {processing_info.get('word_count', 'N/A')}",
f" - File Size: {processing_info.get('file_size_bytes', 0)} bytes",
f" - Title: {processing_info.get('pdf_title', 'N/A')}",
f" - Author: {processing_info.get('pdf_author', 'N/A')}",
f"",
f"Status: Successfully Processed",
f"",
f"--- Original Document Follows ---"
]
for line in report_lines:
can.drawString(50, y_position, line)
y_position -= 20
can.save()
# Move to the beginning of the BytesIO buffer
packet.seek(0)
# Read the metadata page
metadata_pdf = PdfReader(packet)
metadata_page = metadata_pdf.pages[0]
# Read the original PDF
original_pdf = PdfReader(input_path)
# Create a PDF writer object
pdf_writer = PdfWriter()
# Add the metadata page first
pdf_writer.add_page(metadata_page)
# Add all pages from the original PDF
for page in original_pdf.pages:
pdf_writer.add_page(page)
# Add metadata to the PDF
pdf_writer.add_metadata({
'/Title': 'Processed Document',
'/Author': 'Secure Document Pipeline',
'/Subject': 'Processed PDF with metadata',
'/Creator': 'AWS Lambda Function',
'/Producer': 'secure-doc-pipeline'
})
# Write to output file
with open(output_path, 'wb') as output_file:
pdf_writer.write(output_file)
logger.info(f"Enhanced PDF created successfully: {output_path}")
except Exception as e:
logger.error(f"Error creating enhanced PDF: {str(e)}")
# Fallback: just copy the original file
import shutil
shutil.copy(input_path, output_path)
logger.info("Fallback: Copied original file as processed output")
def cleanup_temp_files(file_paths):
"""
Clean up temporary files in /tmp directory.
Args:
file_paths: List of file paths to delete
"""
import os
for file_path in file_paths:
try:
if os.path.exists(file_path):
os.remove(file_path)
logger.info(f"Cleaned up temporary file: {file_path}")
except Exception as e:
logger.warning(f"Could not delete {file_path}: {str(e)}")
Lambda Layers allow you to package dependencies separately from your function code, reducing deployment size and enabling reuse.
cd /path/to/secure-doc-pipeline/lambda/layer
# Create requirements file
cat > requirements.txt << EOF
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
EOF
# Install dependencies to python/ directory
pip install -r requirements.txt -t python/ --platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.11
# Create ZIP file (the python/ directory must sit at the root of the archive)
zip -r ../layer.zip python/
cd ..
Create file: lambda/layer/Dockerfile
FROM public.ecr.aws/lambda/python:3.11
# Copy requirements file
COPY requirements.txt .
# Install dependencies
RUN pip install -r requirements.txt -t /asset/python/
# Create output directory
RUN mkdir -p /out
# Create ZIP file
RUN cd /asset && zip -r /out/layer.zip python/
CMD ["echo", "Layer built successfully"]
Create file: lambda/layer/requirements.txt
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
Build the layer:
cd /path/to/secure-doc-pipeline/lambda/layer
# Build with Docker
docker build -t lambda-layer-builder .
# Extract the layer ZIP
docker create --name temp lambda-layer-builder
docker cp temp:/out/layer.zip ../layer.zip
docker rm temp
cd ..
AWS CloudShell is a browser-based shell that comes pre-installed with AWS CLI and Python. Perfect for Windows users without Docker!
Open AWS CloudShell:
Create Layer Directory in CloudShell:
# Create working directory
mkdir -p lambda-layer/python
cd lambda-layer
# Create requirements file
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
EOF
Install Dependencies:
# Install packages to python/ directory
pip install -r requirements.txt -t python/ --platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.11
# Check installation
ls -la python/
Create ZIP File:
# Create the layer ZIP
zip -r layer.zip python/
# Verify ZIP contents
unzip -l layer.zip | head -20
# Check file size (should be around 10-15 MB)
ls -lh layer.zip
Download to Your Local Machine:
In CloudShell, use Actions > Download file to download layer.zip, then save it as secure-doc-pipeline/lambda/layer.zip in your project.
Alternative: Upload Directly to S3 from CloudShell:
# Create a temporary S3 bucket for the layer
aws s3 mb s3://my-lambda-layers-temp-bucket-$(date +%s)
# Upload layer ZIP
aws s3 cp layer.zip s3://my-lambda-layers-temp-bucket-XXXXX/
# Note the S3 URL - you'll need this for Terraform
echo "s3://my-lambda-layers-temp-bucket-XXXXX/layer.zip"
Update Terraform to use S3:
# In terraform/main.tf, modify the lambda layer resource:
resource "aws_lambda_layer_version" "pdf_processing_layer" {
s3_bucket = "my-lambda-layers-temp-bucket-XXXXX"
s3_key = "layer.zip"
layer_name = "${var.project_name}-pdf-processing-layer"
description = "Dependencies for PDF processing: PyPDF2, reportlab, Pillow"
compatible_runtimes = ["python3.11"]
}
Clean Up CloudShell (Optional):
# Remove working directory
cd ~
rm -rf lambda-layer
If you prefer a fully GUI approach:
Use GitHub Actions or an online Python environment:
Create a GitHub Repository (free):
Add a .github/workflows/build-layer.yml file:
name: Build Lambda Layer
on:
workflow_dispatch:
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.11"
- name: Install dependencies
run: |
mkdir -p python
pip install PyPDF2==3.0.1 reportlab==4.0.7 Pillow==10.1.0 boto3==1.28.85 -t python/
- name: Create ZIP
run: |
zip -r layer.zip python/
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: lambda-layer
path: layer.zip
Run the workflow manually from the GitHub Actions tab
Download the artifact (layer.zip) to your local machine
Open Lambda Console:
Configure Layer:
Name: secure-doc-pipeline-pdf-processing-layer
Description: Dependencies for PDF processing: PyPDF2, reportlab, Pillow
Upload: layer.zip
Compatible runtime: Python 3.11
Note the Layer ARN (for example, arn:aws:lambda:ap-south-1:123456789012:layer:secure-doc-pipeline-pdf-processing-layer:1).
Update Terraform to Use Console-Created Layer:
In terraform/main.tf, replace the layer resource with a data source:
# Instead of creating the layer, reference the existing one:
# Comment out or remove the aws_lambda_layer_version resource
# Add this data source instead:
data "aws_lambda_layer_version" "pdf_processing_layer" {
layer_name = "secure-doc-pipeline-pdf-processing-layer"
version = 1 # Use the version number from console
}
# Update the lambda function to use the data source:
resource "aws_lambda_function" "document_processor" {
# ... other configuration ...
layers = [data.aws_lambda_layer_version.pdf_processing_layer.arn]
# ... rest of configuration ...
}
Some organizations publish pre-built Lambda Layers. Here are some options:
Visit: https://api.klayers.cloud/api/v2/p3.11/layers/latest/ap-south-1/
Find ARNs for the packages you need (for example, PyPDF2 and Pillow):
Example usage in Terraform:
resource "aws_lambda_function" "document_processor" {
# ... other configuration ...
layers = [
"arn:aws:lambda:ap-south-1:770693421928:layer:Klayers-p311-Pillow:1", # Example ARN
"arn:aws:lambda:ap-south-1:770693421928:layer:Klayers-p311-PyPDF2:1", # Example ARN
# Note: You'll need to find reportlab separately or create a layer with just reportlab
]
# ... rest of configuration ...
}
Note: Public layers may not have all dependencies. You might need to combine multiple layers or create one custom layer for missing packages.
# Check the ZIP file structure
unzip -l layer.zip | head -20
# You should see:
# python/
# python/PyPDF2/
# python/reportlab/
# python/PIL/
# etc.
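As a cross-check from Python, the sketch below (the lambda/layer.zip path is an assumption) confirms that every entry sits under python/, which is what Lambda needs in order to put the packages on the import path:
# check_layer_zip.py - hypothetical sanity check of the layer archive layout
import zipfile

with zipfile.ZipFile("lambda/layer.zip") as zf:
    names = zf.namelist()

# Python Lambda layers must keep everything under a top-level python/ directory.
misplaced = [n for n in names if not n.startswith("python/")]
packages = sorted({n.split("/")[1] for n in names if n.count("/") > 1})

print(f"{len(names)} entries, top-level packages: {packages[:10]}")
print("Layout OK" if not misplaced else f"Misplaced entries: {misplaced[:5]}")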
Now we'll update the Terraform configuration to include the Lambda function, layer, and S3 trigger.
Open terraform/main.tf and append the following resources to the end of the file:
# ============================================
# Lambda Layer for PDF Processing Dependencies
# ============================================
resource "aws_lambda_layer_version" "pdf_processing_layer" {
filename = "../lambda/layer.zip"
layer_name = "${var.project_name}-pdf-processing-layer"
description = "Dependencies for PDF processing: PyPDF2, reportlab, Pillow"
compatible_runtimes = ["python3.11"]
source_code_hash = filebase64sha256("../lambda/layer.zip")
}
# ============================================
# IAM Role for Lambda Function
# ============================================
resource "aws_iam_role" "lambda_execution_role" {
name = "${var.project_name}-lambda-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
tags = {
Name = "${var.project_name}-lambda-execution-role"
}
}
# ============================================
# IAM Policy for Lambda Function
# ============================================
resource "aws_iam_policy" "lambda_execution_policy" {
name = "${var.project_name}-lambda-execution-policy"
description = "Policy for Lambda function to access S3 buckets and CloudWatch Logs"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowCloudWatchLogs"
Effect = "Allow"
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
Resource = "arn:aws:logs:${var.aws_region}:*:*"
},
{
Sid = "AllowS3Read"
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:HeadObject"
]
Resource = "${aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].arn}/*"
},
{
Sid = "AllowS3Write"
Effect = "Allow"
Action = [
"s3:PutObject",
"s3:PutObjectAcl"
]
Resource = "${aws_s3_bucket.doc_buckets[local.bucket_names.processed_output].arn}/*"
}
]
})
}
resource "aws_iam_role_policy_attachment" "lambda_execution_attach" {
role = aws_iam_role.lambda_execution_role.name
policy_arn = aws_iam_policy.lambda_execution_policy.arn
}
# ============================================
# Lambda Function
# ============================================
data "archive_file" "lambda_function_zip" {
type = "zip"
source_dir = "../lambda/function"
output_path = "../lambda/function.zip"
}
resource "aws_lambda_function" "document_processor" {
filename = data.archive_file.lambda_function_zip.output_path
function_name = "${var.project_name}-document-processor"
role = aws_iam_role.lambda_execution_role.arn
handler = "lambda_function.lambda_handler"
source_code_hash = data.archive_file.lambda_function_zip.output_base64sha256
runtime = "python3.11"
timeout = 300 # 5 minutes
memory_size = 512 # MB
layers = [aws_lambda_layer_version.pdf_processing_layer.arn]
environment {
variables = {
OUTPUT_BUCKET = aws_s3_bucket.doc_buckets[local.bucket_names.processed_output].id
LOG_LEVEL = "INFO"
}
}
tags = {
Name = "${var.project_name}-document-processor"
}
}
# ============================================
# CloudWatch Log Group for Lambda
# ============================================
resource "aws_cloudwatch_log_group" "lambda_log_group" {
name = "/aws/lambda/${aws_lambda_function.document_processor.function_name}"
retention_in_days = 14 # Keep logs for 14 days
tags = {
Name = "${var.project_name}-lambda-logs"
}
}
# ============================================
# S3 Bucket Notification to Trigger Lambda
# ============================================
resource "aws_lambda_permission" "allow_s3_invoke" {
statement_id = "AllowS3Invoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.document_processor.function_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].arn
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].id
lambda_function {
lambda_function_arn = aws_lambda_function.document_processor.arn
events = ["s3:ObjectCreated:*"]
filter_suffix = ".pdf"
}
depends_on = [aws_lambda_permission.allow_s3_invoke]
}
Add these outputs to terraform/outputs.tf:
output "lambda_function_name" {
description = "Name of the Lambda function"
value = aws_lambda_function.document_processor.function_name
}
output "lambda_function_arn" {
description = "ARN of the Lambda function"
value = aws_lambda_function.document_processor.arn
}
output "lambda_log_group" {
description = "CloudWatch Log Group for Lambda"
value = aws_cloudwatch_log_group.lambda_log_group.name
}
output "lambda_layer_arn" {
description = "ARN of the Lambda Layer"
value = aws_lambda_layer_version.pdf_processing_layer.arn
}
Follow the instructions in Phase 2 to create lambda/layer.zip.
Terraform will automatically package the function code using the archive_file data source.
cd terraform
terraform init
# Validate configuration
terraform validate
# Preview changes
terraform plan
# Apply changes
terraform apply
Type yes when prompted.
Deployment time: 2-4 minutes
# Create a simple test PDF (or use any PDF file)
# For testing, you can download a sample PDF:
curl -o sample-document.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
# Upload to uploads bucket using third-party profile
aws s3 cp sample-document.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
Watch the Lambda function process the file in real-time:
# Get the log stream name (wait 30 seconds after upload)
aws logs describe-log-streams \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--order-by LastEventTime \
--descending \
--max-items 1 \
--query 'logStreams[0].logStreamName' \
--output text
# Replace LOG_STREAM_NAME with the output from above
aws logs get-log-events \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--log-stream-name LOG_STREAM_NAME
# Wait 2-3 minutes for processing and replication
# Check processed-output bucket
aws s3 ls s3://secure-doc-pipeline-processed-output/processed/
# Check delivery bucket (after replication)
aws s3 ls s3://secure-doc-pipeline-delivery/processed/
# Download the processed file
aws s3 cp s3://secure-doc-pipeline-delivery/processed/sample-document-processed-20251017-045805.pdf ./downloaded-processed.pdf --profile third-party-test
# Get metadata of processed file
aws s3api head-object \
--bucket secure-doc-pipeline-processed-output \
--key processed/sample-document-processed-20251016-120000.pdf
Look for custom metadata:
{
"Metadata": {
"original-file": "sample-document.pdf",
"processed-timestamp": "20251016-120000",
"processor": "secure-doc-pipeline-lambda",
"original-size": "13264"
}
}
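The same check can be scripted with boto3 if you want to verify processed objects automatically; here is a sketch with placeholder bucket and key names:
# verify_metadata.py - hypothetical check of the processed object's custom metadata
import boto3

s3 = boto3.client("s3")

def verify_processed_object(bucket, key):
    # head_object returns the user metadata without downloading the object body.
    response = s3.head_object(Bucket=bucket, Key=key)
    metadata = response.get("Metadata", {})
    expected_keys = {"original-file", "processed-timestamp", "processor", "original-size"}
    missing = expected_keys - metadata.keys()
    if missing:
        raise ValueError(f"Missing expected metadata keys: {sorted(missing)}")
    return metadata

if __name__ == "__main__":
    # Placeholder bucket/key - substitute an object listed under your processed/ prefix.
    print(verify_processed_object(
        "secure-doc-pipeline-processed-output",
        "processed/sample-document-processed-20251016-120000.pdf",
    ))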
Symptoms: File uploaded to uploads bucket, replicated to internal-processing, but Lambda doesn’t run
Solutions:
Check S3 notification configuration:
aws s3api get-bucket-notification-configuration \
--bucket secure-doc-pipeline-internal-processing
Verify Lambda has S3 permission:
aws lambda get-policy \
--function-name secure-doc-pipeline-document-processor
Check CloudWatch Logs:
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
Manually invoke Lambda for testing:
aws lambda invoke \
--function-name secure-doc-pipeline-document-processor \
--cli-binary-format raw-in-base64-out \
--payload file://test-event.json \
response.json
# (--cli-binary-format raw-in-base64-out is needed with AWS CLI v2 when the payload is raw JSON)
Create test-event.json:
{
"Records": [
{
"s3": {
"bucket": {
"name": "secure-doc-pipeline-internal-processing"
},
"object": {
"key": "sample-document.pdf",
"size": 13264
}
}
}
]
}
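You can also trigger the same test from Python instead of the CLI; a sketch assuming the function name used above and the test-event.json file you just created:
# invoke_lambda.py - hypothetical manual invocation using boto3
import base64
import json
import boto3

lambda_client = boto3.client("lambda")

with open("test-event.json") as f:
    payload = f.read()

response = lambda_client.invoke(
    FunctionName="secure-doc-pipeline-document-processor",  # assumed function name
    InvocationType="RequestResponse",  # synchronous, returns the handler's result
    LogType="Tail",                    # include the last 4 KB of execution logs
    Payload=payload,
)

print("Status:", response["StatusCode"])
print("Logs:\n", base64.b64decode(response["LogResult"]).decode())
print("Result:", json.loads(response["Payload"].read()))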
Symptoms: Lambda execution fails with “Task timed out after 300.00 seconds”
Solutions:
Increase timeout in Terraform:
resource "aws_lambda_function" "document_processor" {
...
timeout = 600 # Increase to 10 minutes
}
Apply changes:
terraform apply
Check file size: Very large PDFs may need more time or memory
Symptoms: Error: “Runtime exited with error: signal: killed”
Solutions:
Increase memory in Terraform:
resource "aws_lambda_function" "document_processor" {
...
memory_size = 1024 # Increase to 1 GB
}
Apply changes:
terraform apply
Symptoms: “No module named ‘PyPDF2’” or similar
Solutions:
Verify layer is attached:
aws lambda get-function \
--function-name secure-doc-pipeline-document-processor \
--query 'Configuration.Layers'
Check layer ZIP structure:
unzip -l layer.zip | grep -E "(PyPDF2|reportlab|PIL)"
Rebuild layer with correct structure: Ensure dependencies are in python/ directory
Update Terraform and redeploy:
terraform apply
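If the import error persists after redeploying, a throwaway diagnostic handler (a sketch; deploy it temporarily in place of the real handler or as a separate test function) shows exactly what the attached layer exposes at runtime:
# layer_debug.py - hypothetical diagnostic handler to inspect the layer at runtime
import importlib
import os
import sys

def lambda_handler(event, context):
    # Layers are unpacked under /opt; for Python runtimes, /opt/python is on sys.path.
    layer_root = "/opt/python"
    contents = sorted(os.listdir(layer_root)) if os.path.isdir(layer_root) else []
    results = {}
    for name in ("PyPDF2", "reportlab", "PIL"):
        try:
            module = importlib.import_module(name)
            results[name] = getattr(module, "__file__", "built-in")
        except Exception as exc:  # report any import failure, not just ImportError
            results[name] = f"FAILED: {exc}"
    return {
        "opt_python_contents": contents[:25],
        "imports": results,
        "sys_path": sys.path,
    }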
Symptoms:
cannot import name '_imaging' from 'PIL'
Root Cause:
Pillow was installed for a platform that doesn't match the Lambda runtime, so its compiled _imaging module can't be loaded.
Solutions:
Your current test file might not be a real PDF. Use a real PDF:
# Download a sample PDF
curl -o sample.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
# Upload it
aws s3 cp sample.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
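If you want to pre-check a file locally before uploading, here is a small PyPDF2 sketch (assumes PyPDF2 is installed locally; the filename is just a placeholder):
# is_valid_pdf.py - hypothetical pre-upload check that a file parses as a PDF
import sys
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

def is_valid_pdf(path):
    try:
        # A readable PDF should open and report at least one page.
        return len(PdfReader(path).pages) > 0
    except (PdfReadError, OSError):
        return False

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "sample.pdf"
    print(f"{path}: {'valid PDF' if is_valid_pdf(path) else 'not a readable PDF'}")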
The Lambda Layer needs to be built on a Linux environment compatible with Lambda.
Step 1: Open AWS CloudShell
Step 2: Create the Layer
Run these commands in CloudShell:
# Create directory structure
mkdir -p lambda-layer/python
cd lambda-layer
# Create requirements.txt
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.29.7
EOF
# Install dependencies for Lambda runtime
pip install -r requirements.txt \
-t python/ \
--platform manylinux2014_x86_64 \
--only-binary=:all: \
--python-version 3.11
# Create the layer zip
zip -r layer.zip python/
# Upload to S3 (so you can download it)
aws s3 cp layer.zip s3://secure-doc-pipeline-uploads/lambda-layer.zip
Step 3: Download and Update Terraform
# Download the layer to your local machine
aws s3 cp s3://secure-doc-pipeline-uploads/lambda-layer.zip ./lambda/layer.zip
# Move it to the correct location
mv lambda/layer.zip secure-doc-pipeline/lambda/layer.zip
Step 4: Update Lambda Layer
cd secure-doc-pipeline/terraform
# Apply the update (this will update just the layer)
terraform apply -target=aws_lambda_layer_version.pdf_processing_layer
If you have Docker installed:
cd secure-doc-pipeline/lambda
# Create Dockerfile
cat > Dockerfile << 'EOF'
FROM public.ecr.aws/lambda/python:3.11
RUN mkdir /tmp/python
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt -t /tmp/python/
EOF
# Create requirements.txt
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.29.7
EOF
# Build and extract
docker build -t lambda-layer-builder .
docker create --name temp-container lambda-layer-builder
docker cp temp-container:/tmp/python ./python
docker rm temp-container
# Create layer zip (python/ must be at the root of the archive)
zip -r layer.zip python/
Then update Terraform as shown in Step 4 above.
If you don’t want to rebuild the layer, you can modify the Lambda function to skip the enhanced PDF creation and just copy files:
This is actually what’s happening now as a fallback - the function is still working, just without the enhanced PDF features.
Testing After Fix:
# Test with a real PDF
curl -o test-doc.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
aws s3 cp test-doc.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Check CloudWatch Logs
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
# Verify processed output
aws s3 ls s3://secure-doc-pipeline-processed-output/processed/
Recommended Approach: rebuild the layer in a Lambda-compatible Linux environment (AWS CloudShell or Docker, as shown above) so the compiled dependencies match the Lambda runtime.
Why the Current Setup Still Works:
Your Lambda has a fallback mechanism: if the enhanced PDF creation fails, it simply copies the original file through to the output bucket.
So your infrastructure is working correctly - you just need a proper PDF file for testing!
Symptoms: Lambda can’t read from source or write to destination bucket
Solutions:
Check IAM role permissions:
# The execution policy is a managed policy attached to the role (not an inline policy):
aws iam list-attached-role-policies \
--role-name secure-doc-pipeline-lambda-execution-role
Verify bucket permissions: Ensure Lambda role has proper S3 permissions in Terraform
Check bucket encryption: If using KMS, Lambda needs KMS permissions (covered in Phase 3)
# Get Lambda invocation count
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=secure-doc-pipeline-document-processor \
--start-time 2025-10-16T00:00:00Z \
--end-time 2025-10-16T23:59:59Z \
--period 3600 \
--statistics Sum
# Get Lambda error count
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=secure-doc-pipeline-document-processor \
--start-time 2025-10-16T00:00:00Z \
--end-time 2025-10-16T23:59:59Z \
--period 3600 \
--statistics Sum
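The same numbers can be pulled from Python if you want a quick health-check script; a sketch assuming the function name above and a 24-hour window:
# lambda_health.py - hypothetical summary of invocations vs. errors
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
FUNCTION_NAME = "secure-doc-pipeline-document-processor"  # assumed function name

def metric_sum(metric_name, hours=24):
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])

if __name__ == "__main__":
    invocations = metric_sum("Invocations")
    errors = metric_sum("Errors")
    print(f"Last 24h: {invocations:.0f} invocations, {errors:.0f} errors")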
# Follow logs (keeps connection open)
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
aws logs filter-log-events \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--filter-pattern "ERROR"
Lambda CPU power scales with memory. Test different configurations:
memory_size = 512 # Baseline
memory_size = 1024 # 2x CPU power
memory_size = 2048 # 4x CPU power
Cost vs Performance:
Add provisioned concurrency (costs more but eliminates cold starts):
resource "aws_lambda_provisioned_concurrency_config" "processor_concurrency" {
function_name = aws_lambda_function.document_processor.function_name
provisioned_concurrent_executions = 1
qualifier = aws_lambda_function.document_processor.version
}
Keep layers under 50 MB for faster cold starts:
# Check layer size
ls -lh lambda/layer.zip
# Remove unnecessary dependencies
pip install --no-deps PyPDF2==3.0.1 -t python/
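To see which packages dominate the layer size before trimming dependencies, here is a small sketch over the archive (the lambda/layer.zip path is an assumption):
# layer_size_report.py - hypothetical per-package size breakdown of layer.zip
import zipfile
from collections import defaultdict

sizes = defaultdict(int)
with zipfile.ZipFile("lambda/layer.zip") as zf:
    for info in zf.infolist():
        parts = info.filename.split("/")
        if len(parts) > 2 and parts[0] == "python":
            sizes[parts[1]] += info.compress_size

for package, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{size / 1_048_576:6.2f} MB  {package}")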
Lambda Compute (ap-south-1 pricing):
CloudWatch Logs:
Data Transfer:
With 1,000 documents/month (avg 2 MB each):
# Test 1: Very small PDF
echo "%PDF-1.4" > tiny.pdf
aws s3 cp tiny.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 2: Large PDF (create or download a multi-page PDF)
aws s3 cp large-document.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 3: Non-PDF file (should be skipped gracefully)
echo "not a pdf" > test.txt
aws s3 cp test.txt s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 4: PDF with special characters in filename
aws s3 cp "document with spaces & special!chars.pdf" s3://secure-doc-pipeline-uploads/ --profile third-party-test
Check CloudWatch Logs to verify proper handling of each case.
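If you prefer to drive these edge-case uploads from Python, here is a minimal boto3 sketch (the profile and bucket names are taken from the examples above and may differ in your setup); the large-PDF case still needs a real multi-page file:
# edge_case_uploads.py - hypothetical driver for the edge-case tests above
import io
import boto3

session = boto3.Session(profile_name="third-party-test")  # assumes this profile exists
s3 = session.client("s3")
bucket = "secure-doc-pipeline-uploads"  # uploads bucket from the examples above

test_objects = {
    "tiny.pdf": b"%PDF-1.4",                                  # minimal header only
    "test.txt": b"not a pdf",                                 # should be skipped by the handler
    "document with spaces & special!chars.pdf": b"%PDF-1.4",  # key-encoding test
}

for key, body in test_objects.items():
    s3.upload_fileobj(io.BytesIO(body), bucket, key)
    print(f"Uploaded s3://{bucket}/{key}")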
Before moving to Phase 3, verify:
Congratulations! You’ve built a fully functional serverless document processing pipeline with:
Phase 3 will enhance security and monitoring with custom KMS encryption for all buckets, CloudTrail for detailed audit logging, CloudWatch alarms for failure detection, and SNS notifications for critical events.
Proceed to: AWS Secure Document Pipeline - Part 3: Security and Monitoring.