Learn how to build a serverless document processing pipeline using AWS Lambda. Complete setup with Python 3.11, Lambda Layers, S3 triggers, and PDF processing with PyPDF2 and ReportLab.
Building a serverless document processing pipeline is essential for modern applications that need to automatically process documents as they arrive. This comprehensive guide walks you through creating a production-ready Lambda function that automatically triggers when files arrive in your S3 bucket, processes PDFs with metadata extraction, and saves enhanced results for delivery.
Our serverless pipeline provides automated, scalable document processing:
internal-processing bucket
│
│ S3 Event Notification
↓
┌─────────────────────┐
│ Lambda Function │
│ (Python 3.11) │
│ │
│ Dependencies: │
│ - pandas │
│ - PyPDF2 │
│ - Pillow │
│ - reportlab │
└──────────┬──────────┘
│
│ Write processed file
↓
processed-output bucket
│
│ S3 Replication (from Phase 1)
↓
delivery bucket
Create a new directory structure:
cd /path/to/secure-doc-pipeline
# Create Lambda directories
mkdir -p lambda
mkdir -p lambda/function
mkdir -p lambda/layer/python
cd lambda
Your structure should look like:
secure-doc-pipeline/
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ └── terraform.tfvars
└── lambda/
├── function/
│ └── lambda_function.py
└── layer/
└── python/
└── (dependencies will go here)
Create file: lambda/function/lambda_function.py
import json
import boto3
import os
from datetime import datetime
from urllib.parse import unquote_plus
import logging
# Configure logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize AWS clients
s3_client = boto3.client('s3')
def lambda_handler(event, context):
"""
Main Lambda handler function.
Triggered by S3 PUT events in the internal-processing bucket.
Args:
event: S3 event notification
context: Lambda context object
Returns:
dict: Status response
"""
logger.info(f"Lambda function invoked. Event: {json.dumps(event)}")
try:
# Parse S3 event
for record in event['Records']:
# Get bucket and object information
source_bucket = record['s3']['bucket']['name']
source_key = unquote_plus(record['s3']['object']['key'])
file_size = record['s3']['object']['size']
logger.info(f"Processing file: {source_key} from bucket: {source_bucket}")
logger.info(f"File size: {file_size} bytes")
# Validate file
if not source_key.lower().endswith('.pdf'):
logger.warning(f"Skipping non-PDF file: {source_key}")
continue
if file_size == 0:
logger.warning(f"Skipping empty file: {source_key}")
continue
# Process the document
result = process_document(source_bucket, source_key)
if result['success']:
logger.info(f"Successfully processed: {source_key}")
logger.info(f"Output file: {result['output_key']}")
else:
logger.error(f"Failed to process: {source_key}. Error: {result['error']}")
return {
'statusCode': 200,
'body': json.dumps({
'message': 'Document processing completed',
'processed_files': len(event['Records'])
})
}
except Exception as e:
logger.error(f"Unexpected error in lambda_handler: {str(e)}", exc_info=True)
return {
'statusCode': 500,
'body': json.dumps({
'message': 'Error processing documents',
'error': str(e)
})
}
def process_document(source_bucket, source_key):
"""
Process a PDF document: extract metadata, add processing info, and save to output bucket.
Args:
source_bucket: Source S3 bucket name
source_key: Source object key
Returns:
dict: Processing result with success status
"""
try:
# Get output bucket from environment variable
output_bucket = os.environ.get('OUTPUT_BUCKET')
if not output_bucket:
raise ValueError("OUTPUT_BUCKET environment variable not set")
# Download the source file to /tmp
local_input_path = f"/tmp/{os.path.basename(source_key)}"
logger.info(f"Downloading {source_key} to {local_input_path}")
s3_client.download_file(source_bucket, source_key, local_input_path)
# Get file metadata
file_metadata = get_file_metadata(source_bucket, source_key)
logger.info(f"File metadata: {json.dumps(file_metadata)}")
# Process the PDF
processed_content = process_pdf(local_input_path, file_metadata)
# Generate output filename
timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
base_name = os.path.splitext(os.path.basename(source_key))[0]
output_key = f"processed/{base_name}-processed-{timestamp}.pdf"
local_output_path = f"/tmp/processed-{os.path.basename(source_key)}"
# Create enhanced PDF with processing metadata
create_enhanced_pdf(local_input_path, local_output_path, processed_content)
# Upload to output bucket
logger.info(f"Uploading processed file to {output_bucket}/{output_key}")
s3_client.upload_file(
local_output_path,
output_bucket,
output_key,
ExtraArgs={
'Metadata': {
'original-file': source_key,
'processed-timestamp': timestamp,
'processor': 'secure-doc-pipeline-lambda',
'original-size': str(file_metadata['size']),
}
}
)
# Clean up local files
cleanup_temp_files([local_input_path, local_output_path])
return {
'success': True,
'output_bucket': output_bucket,
'output_key': output_key,
'metadata': processed_content
}
except Exception as e:
logger.error(f"Error processing document: {str(e)}", exc_info=True)
return {
'success': False,
'error': str(e)
}
def get_file_metadata(bucket, key):
"""
Retrieve metadata about the S3 object.
Args:
bucket: S3 bucket name
key: S3 object key
Returns:
dict: File metadata
"""
try:
response = s3_client.head_object(Bucket=bucket, Key=key)
return {
'size': response['ContentLength'],
'last_modified': response['LastModified'].isoformat(),
'content_type': response.get('ContentType', 'unknown'),
'etag': response.get('ETag', '').strip('"'),
'metadata': response.get('Metadata', {})
}
except Exception as e:
logger.error(f"Error getting metadata: {str(e)}")
return {
'size': 0,
'last_modified': 'unknown',
'content_type': 'unknown',
'etag': 'unknown',
'metadata': {}
}
def process_pdf(input_path, metadata):
"""
Process the PDF file: extract text, analyze content.
Args:
input_path: Path to input PDF file
metadata: File metadata
Returns:
dict: Processed content and analysis
"""
try:
import PyPDF2
# Open and read PDF
with open(input_path, 'rb') as file:
pdf_reader = PyPDF2.PdfReader(file)
num_pages = len(pdf_reader.pages)
# Extract text from all pages
full_text = ""
for page_num in range(num_pages):
page = pdf_reader.pages[page_num]
full_text += page.extract_text()
# Get PDF metadata
pdf_info = pdf_reader.metadata if pdf_reader.metadata else {}
return {
'num_pages': num_pages,
'text_length': len(full_text),
'word_count': len(full_text.split()),
'pdf_title': pdf_info.get('/Title', 'N/A'),
'pdf_author': pdf_info.get('/Author', 'N/A'),
'pdf_subject': pdf_info.get('/Subject', 'N/A'),
'processing_timestamp': datetime.utcnow().isoformat(),
'file_size_bytes': metadata['size']
}
except Exception as e:
logger.error(f"Error processing PDF: {str(e)}")
return {
'num_pages': 0,
'text_length': 0,
'word_count': 0,
'error': str(e),
'processing_timestamp': datetime.utcnow().isoformat()
}
def create_enhanced_pdf(input_path, output_path, processing_info):
"""
Create an enhanced PDF with processing metadata appended.
Args:
input_path: Path to input PDF
output_path: Path to save output PDF
processing_info: Processing metadata to include
"""
try:
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from PyPDF2 import PdfReader, PdfWriter
import io
# Create a metadata page
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
# Add processing information
can.setFont("Helvetica-Bold", 16)
can.drawString(50, 750, "Document Processing Report")
can.setFont("Helvetica", 12)
y_position = 720
report_lines = [
f"Processed by: Secure Document Pipeline",
f"Processing Date: {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')}",
f"",
f"Original Document Analysis:",
f" - Number of Pages: {processing_info.get('num_pages', 'N/A')}",
f" - Word Count: {processing_info.get('word_count', 'N/A')}",
f" - File Size: {processing_info.get('file_size_bytes', 0)} bytes",
f" - Title: {processing_info.get('pdf_title', 'N/A')}",
f" - Author: {processing_info.get('pdf_author', 'N/A')}",
f"",
f"Status: Successfully Processed",
f"",
f"--- Original Document Follows ---"
]
for line in report_lines:
can.drawString(50, y_position, line)
y_position -= 20
can.save()
# Move to the beginning of the BytesIO buffer
packet.seek(0)
# Read the metadata page
metadata_pdf = PdfReader(packet)
metadata_page = metadata_pdf.pages[0]
# Read the original PDF
original_pdf = PdfReader(input_path)
# Create a PDF writer object
pdf_writer = PdfWriter()
# Add the metadata page first
pdf_writer.add_page(metadata_page)
# Add all pages from the original PDF
for page in original_pdf.pages:
pdf_writer.add_page(page)
# Add metadata to the PDF
pdf_writer.add_metadata({
'/Title': 'Processed Document',
'/Author': 'Secure Document Pipeline',
'/Subject': 'Processed PDF with metadata',
'/Creator': 'AWS Lambda Function',
'/Producer': 'secure-doc-pipeline'
})
# Write to output file
with open(output_path, 'wb') as output_file:
pdf_writer.write(output_file)
logger.info(f"Enhanced PDF created successfully: {output_path}")
except Exception as e:
logger.error(f"Error creating enhanced PDF: {str(e)}")
# Fallback: just copy the original file
import shutil
shutil.copy(input_path, output_path)
logger.info("Fallback: Copied original file as processed output")
def cleanup_temp_files(file_paths):
"""
Clean up temporary files in /tmp directory.
Args:
file_paths: List of file paths to delete
"""
import os
for file_path in file_paths:
try:
if os.path.exists(file_path):
os.remove(file_path)
logger.info(f"Cleaned up temporary file: {file_path}")
except Exception as e:
logger.warning(f"Could not delete {file_path}: {str(e)}")
Lambda Layers allow you to package dependencies separately from your function code, reducing deployment size and enabling reuse.
cd /path/to/secure-doc-pipeline/lambda/layer
# Create requirements file
cat > requirements.txt << EOF
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
EOF
# Install dependencies to python/ directory
pip install -r requirements.txt -t python/ --platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.11
# Create ZIP file (the python/ directory must sit at the root of the archive)
zip -r ../layer.zip python/
cd ..
Create file: lambda/layer/Dockerfile
FROM public.ecr.aws/lambda/python:3.11
# Copy requirements file
COPY requirements.txt .
# Install dependencies
RUN pip install -r requirements.txt -t /asset/python/
# Create output directory
RUN mkdir -p /out
# Create ZIP file
RUN cd /asset && zip -r /out/layer.zip python/
CMD ["echo", "Layer built successfully"]
Create file: lambda/layer/requirements.txt
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
Build the layer:
cd /path/to/secure-doc-pipeline/lambda/layer
# Build with Docker
docker build -t lambda-layer-builder .
# Extract the layer ZIP
docker create --name temp lambda-layer-builder
docker cp temp:/out/layer.zip ../layer.zip
docker rm temp
cd ..
AWS CloudShell is a browser-based shell that comes pre-installed with AWS CLI and Python. Perfect for Windows users without Docker!
Open AWS CloudShell:
Create Layer Directory in CloudShell:
# Create working directory
mkdir -p lambda-layer/python
cd lambda-layer
# Create requirements file
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.28.85
EOF
Install Dependencies:
# Install packages to python/ directory
pip install -r requirements.txt -t python/ --platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.11
# Check installation
ls -la python/
Create ZIP File:
# Create the layer ZIP
zip -r layer.zip python/
# Verify ZIP contents
unzip -l layer.zip | head -20
# Check file size (should be around 10-15 MB)
ls -lh layer.zip
Download to Your Local Machine:
In CloudShell, use Actions > Download file to download layer.zip, then save it as secure-doc-pipeline/lambda/layer.zip in your project.
Alternative: Upload Directly to S3 from CloudShell:
# Create a temporary S3 bucket for the layer
aws s3 mb s3://my-lambda-layers-temp-bucket-$(date +%s)
# Upload layer ZIP
aws s3 cp layer.zip s3://my-lambda-layers-temp-bucket-XXXXX/
# Note the S3 URL - you'll need this for Terraform
echo "s3://my-lambda-layers-temp-bucket-XXXXX/layer.zip"
Update Terraform to use S3:
# In terraform/main.tf, modify the lambda layer resource:
resource "aws_lambda_layer_version" "pdf_processing_layer" {
s3_bucket = "my-lambda-layers-temp-bucket-XXXXX"
s3_key = "layer.zip"
layer_name = "${var.project_name}-pdf-processing-layer"
description = "Dependencies for PDF processing: PyPDF2, reportlab, Pillow"
compatible_runtimes = ["python3.11"]
}
Clean Up CloudShell (Optional):
# Remove working directory
cd ~
rm -rf lambda-layer
If you prefer a fully GUI approach:
Use GitHub Actions or an online Python environment:
Create a GitHub Repository (free):
Add a .github/workflows/build-layer.yml file:
name: Build Lambda Layer
on:
workflow_dispatch:
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.11"
- name: Install dependencies
run: |
mkdir -p python
pip install PyPDF2==3.0.1 reportlab==4.0.7 Pillow==10.1.0 boto3==1.28.85 -t python/
- name: Create ZIP
run: |
zip -r layer.zip python/
- name: Upload artifact
uses: actions/upload-artifact@v4
with:
name: lambda-layer
path: layer.zip
Run the workflow manually from the GitHub Actions tab
Download the artifact (layer.zip) to your local machine
Open Lambda Console:
Configure Layer:
Name: secure-doc-pipeline-pdf-processing-layer
Description: Dependencies for PDF processing: PyPDF2, reportlab, Pillow
Upload: layer.zip
Compatible runtime: Python 3.11
Note the Layer ARN (for example, arn:aws:lambda:ap-south-1:123456789012:layer:secure-doc-pipeline-pdf-processing-layer:1).
Update Terraform to Use Console-Created Layer:
In terraform/main.tf, replace the layer resource with a data source:
# Instead of creating the layer, reference the existing one:
# Comment out or remove the aws_lambda_layer_version resource
# Add this data source instead:
data "aws_lambda_layer_version" "pdf_processing_layer" {
layer_name = "secure-doc-pipeline-pdf-processing-layer"
version = 1 # Use the version number from console
}
# Update the lambda function to use the data source:
resource "aws_lambda_function" "document_processor" {
# ... other configuration ...
layers = [data.aws_lambda_layer_version.pdf_processing_layer.arn]
# ... rest of configuration ...
}
Some organizations publish pre-built Lambda Layers. Here are some options:
Visit: https://api.klayers.cloud/api/v2/p3.11/layers/latest/ap-south-1/
Find ARNs for the packages you need (for example, PyPDF2 and Pillow):
Example usage in Terraform:
resource "aws_lambda_function" "document_processor" {
# ... other configuration ...
layers = [
"arn:aws:lambda:ap-south-1:770693421928:layer:Klayers-p311-Pillow:1", # Example ARN
"arn:aws:lambda:ap-south-1:770693421928:layer:Klayers-p311-PyPDF2:1", # Example ARN
# Note: You'll need to find reportlab separately or create a layer with just reportlab
]
# ... rest of configuration ...
}
Note: Public layers may not have all dependencies. You might need to combine multiple layers or create one custom layer for missing packages.
# Check the ZIP file structure
unzip -l layer.zip | head -20
# You should see:
# python/
# python/PyPDF2/
# python/reportlab/
# python/PIL/
# etc.
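As a cross-check from Python, the sketch below (the lambda/layer.zip path is an assumption) confirms that every entry sits under python/, which is what Lambda needs in order to put the packages on the import path:
# check_layer_zip.py - hypothetical sanity check of the layer archive layout
import zipfile

with zipfile.ZipFile("lambda/layer.zip") as zf:
    names = zf.namelist()

# Python Lambda layers must keep everything under a top-level python/ directory.
misplaced = [n for n in names if not n.startswith("python/")]
packages = sorted({n.split("/")[1] for n in names if n.count("/") > 1})

print(f"{len(names)} entries, top-level packages: {packages[:10]}")
print("Layout OK" if not misplaced else f"Misplaced entries: {misplaced[:5]}")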
Now we'll update the Terraform configuration to include the Lambda function, layer, and S3 trigger.
Open terraform/main.tf and append the following resources to the end of the file:
# ============================================
# Lambda Layer for PDF Processing Dependencies
# ============================================
resource "aws_lambda_layer_version" "pdf_processing_layer" {
filename = "../lambda/layer.zip"
layer_name = "${var.project_name}-pdf-processing-layer"
description = "Dependencies for PDF processing: PyPDF2, reportlab, Pillow"
compatible_runtimes = ["python3.11"]
source_code_hash = filebase64sha256("../lambda/layer.zip")
}
# ============================================
# IAM Role for Lambda Function
# ============================================
resource "aws_iam_role" "lambda_execution_role" {
name = "${var.project_name}-lambda-execution-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
tags = {
Name = "${var.project_name}-lambda-execution-role"
}
}
# ============================================
# IAM Policy for Lambda Function
# ============================================
resource "aws_iam_policy" "lambda_execution_policy" {
name = "${var.project_name}-lambda-execution-policy"
description = "Policy for Lambda function to access S3 buckets and CloudWatch Logs"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowCloudWatchLogs"
Effect = "Allow"
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
]
Resource = "arn:aws:logs:${var.aws_region}:*:*"
},
{
Sid = "AllowS3Read"
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:GetObjectVersion",
"s3:HeadObject"
]
Resource = "${aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].arn}/*"
},
{
Sid = "AllowS3Write"
Effect = "Allow"
Action = [
"s3:PutObject",
"s3:PutObjectAcl"
]
Resource = "${aws_s3_bucket.doc_buckets[local.bucket_names.processed_output].arn}/*"
}
]
})
}
resource "aws_iam_role_policy_attachment" "lambda_execution_attach" {
role = aws_iam_role.lambda_execution_role.name
policy_arn = aws_iam_policy.lambda_execution_policy.arn
}
# ============================================
# Lambda Function
# ============================================
data "archive_file" "lambda_function_zip" {
type = "zip"
source_dir = "../lambda/function"
output_path = "../lambda/function.zip"
}
resource "aws_lambda_function" "document_processor" {
filename = data.archive_file.lambda_function_zip.output_path
function_name = "${var.project_name}-document-processor"
role = aws_iam_role.lambda_execution_role.arn
handler = "lambda_function.lambda_handler"
source_code_hash = data.archive_file.lambda_function_zip.output_base64sha256
runtime = "python3.11"
timeout = 300 # 5 minutes
memory_size = 512 # MB
layers = [aws_lambda_layer_version.pdf_processing_layer.arn]
environment {
variables = {
OUTPUT_BUCKET = aws_s3_bucket.doc_buckets[local.bucket_names.processed_output].id
LOG_LEVEL = "INFO"
}
}
tags = {
Name = "${var.project_name}-document-processor"
}
}
# ============================================
# CloudWatch Log Group for Lambda
# ============================================
resource "aws_cloudwatch_log_group" "lambda_log_group" {
name = "/aws/lambda/${aws_lambda_function.document_processor.function_name}"
retention_in_days = 14 # Keep logs for 14 days
tags = {
Name = "${var.project_name}-lambda-logs"
}
}
# ============================================
# S3 Bucket Notification to Trigger Lambda
# ============================================
resource "aws_lambda_permission" "allow_s3_invoke" {
statement_id = "AllowS3Invoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.document_processor.function_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].arn
}
resource "aws_s3_bucket_notification" "bucket_notification" {
bucket = aws_s3_bucket.doc_buckets[local.bucket_names.internal_processing].id
lambda_function {
lambda_function_arn = aws_lambda_function.document_processor.arn
events = ["s3:ObjectCreated:*"]
filter_suffix = ".pdf"
}
depends_on = [aws_lambda_permission.allow_s3_invoke]
}
Add these outputs to terraform/outputs.tf:
output "lambda_function_name" {
description = "Name of the Lambda function"
value = aws_lambda_function.document_processor.function_name
}
output "lambda_function_arn" {
description = "ARN of the Lambda function"
value = aws_lambda_function.document_processor.arn
}
output "lambda_log_group" {
description = "CloudWatch Log Group for Lambda"
value = aws_cloudwatch_log_group.lambda_log_group.name
}
output "lambda_layer_arn" {
description = "ARN of the Lambda Layer"
value = aws_lambda_layer_version.pdf_processing_layer.arn
}
Follow the instructions in Phase 2 to create lambda/layer.zip.
Terraform will automatically package the function code using the archive_file data source.
cd terraform
terraform init
# Validate configuration
terraform validate
# Preview changes
terraform plan
# Apply changes
terraform apply
Type yes when prompted.
Deployment time: 2-4 minutes
# Create a simple test PDF (or use any PDF file)
# For testing, you can download a sample PDF:
curl -o sample-document.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
# Upload to uploads bucket using third-party profile
aws s3 cp sample-document.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
Watch the Lambda function process the file in real-time:
# Get the log stream name (wait 30 seconds after upload)
aws logs describe-log-streams \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--order-by LastEventTime \
--descending \
--max-items 1 \
--query 'logStreams[0].logStreamName' \
--output text
# Replace LOG_STREAM_NAME with the output from above
aws logs get-log-events \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--log-stream-name LOG_STREAM_NAME
# Wait 2-3 minutes for processing and replication
# Check processed-output bucket
aws s3 ls s3://secure-doc-pipeline-processed-output/processed/
# Check delivery bucket (after replication)
aws s3 ls s3://secure-doc-pipeline-delivery/processed/
# Download the processed file
aws s3 cp s3://secure-doc-pipeline-delivery/processed/sample-document-processed-20251017-045805.pdf ./downloaded-processed.pdf --profile third-party-test
# Get metadata of processed file
aws s3api head-object \
--bucket secure-doc-pipeline-processed-output \
--key processed/sample-document-processed-20251016-120000.pdf
Look for custom metadata:
{
"Metadata": {
"original-file": "sample-document.pdf",
"processed-timestamp": "20251016-120000",
"processor": "secure-doc-pipeline-lambda",
"original-size": "13264"
}
}
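The same check can be scripted with boto3 if you want to verify processed objects automatically; here is a sketch with placeholder bucket and key names:
# verify_metadata.py - hypothetical check of the processed object's custom metadata
import boto3

s3 = boto3.client("s3")

def verify_processed_object(bucket, key):
    # head_object returns the user metadata without downloading the object body.
    response = s3.head_object(Bucket=bucket, Key=key)
    metadata = response.get("Metadata", {})
    expected_keys = {"original-file", "processed-timestamp", "processor", "original-size"}
    missing = expected_keys - metadata.keys()
    if missing:
        raise ValueError(f"Missing expected metadata keys: {sorted(missing)}")
    return metadata

if __name__ == "__main__":
    # Placeholder bucket/key - substitute an object listed under your processed/ prefix.
    print(verify_processed_object(
        "secure-doc-pipeline-processed-output",
        "processed/sample-document-processed-20251016-120000.pdf",
    ))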
Symptoms: File uploaded to uploads bucket, replicated to internal-processing, but Lambda doesn’t run
Solutions:
Check S3 notification configuration:
aws s3api get-bucket-notification-configuration \
--bucket secure-doc-pipeline-internal-processing
Verify Lambda has S3 permission:
aws lambda get-policy \
--function-name secure-doc-pipeline-document-processor
Check CloudWatch Logs:
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
Manually invoke Lambda for testing:
aws lambda invoke \
--function-name secure-doc-pipeline-document-processor \
--cli-binary-format raw-in-base64-out \
--payload file://test-event.json \
response.json
# (--cli-binary-format raw-in-base64-out is needed with AWS CLI v2 when the payload is raw JSON)
Create test-event.json:
{
"Records": [
{
"s3": {
"bucket": {
"name": "secure-doc-pipeline-internal-processing"
},
"object": {
"key": "sample-document.pdf",
"size": 13264
}
}
}
]
}
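You can also trigger the same test from Python instead of the CLI; a sketch assuming the function name used above and the test-event.json file you just created:
# invoke_lambda.py - hypothetical manual invocation using boto3
import base64
import json
import boto3

lambda_client = boto3.client("lambda")

with open("test-event.json") as f:
    payload = f.read()

response = lambda_client.invoke(
    FunctionName="secure-doc-pipeline-document-processor",  # assumed function name
    InvocationType="RequestResponse",  # synchronous, returns the handler's result
    LogType="Tail",                    # include the last 4 KB of execution logs
    Payload=payload,
)

print("Status:", response["StatusCode"])
print("Logs:\n", base64.b64decode(response["LogResult"]).decode())
print("Result:", json.loads(response["Payload"].read()))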
Symptoms: Lambda execution fails with “Task timed out after 300.00 seconds”
Solutions:
Increase timeout in Terraform:
resource "aws_lambda_function" "document_processor" {
...
timeout = 600 # Increase to 10 minutes
}
Apply changes:
terraform apply
Check file size: Very large PDFs may need more time or memory
Symptoms: Error: “Runtime exited with error: signal: killed”
Solutions:
Increase memory in Terraform:
resource "aws_lambda_function" "document_processor" {
...
memory_size = 1024 # Increase to 1 GB
}
Apply changes:
terraform apply
Symptoms: “No module named ‘PyPDF2’” or similar
Solutions:
Verify layer is attached:
aws lambda get-function \
--function-name secure-doc-pipeline-document-processor \
--query 'Configuration.Layers'
Check layer ZIP structure:
unzip -l layer.zip | grep -E "(PyPDF2|reportlab|PIL)"
Rebuild layer with correct structure: Ensure dependencies are in python/ directory
Update Terraform and redeploy:
terraform apply
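If the import error persists after redeploying, a throwaway diagnostic handler (a sketch; deploy it temporarily in place of the real handler or as a separate test function) shows exactly what the attached layer exposes at runtime:
# layer_debug.py - hypothetical diagnostic handler to inspect the layer at runtime
import importlib
import os
import sys

def lambda_handler(event, context):
    # Layers are unpacked under /opt; for Python runtimes, /opt/python is on sys.path.
    layer_root = "/opt/python"
    contents = sorted(os.listdir(layer_root)) if os.path.isdir(layer_root) else []
    results = {}
    for name in ("PyPDF2", "reportlab", "PIL"):
        try:
            module = importlib.import_module(name)
            results[name] = getattr(module, "__file__", "built-in")
        except Exception as exc:  # report any import failure, not just ImportError
            results[name] = f"FAILED: {exc}"
    return {
        "opt_python_contents": contents[:25],
        "imports": results,
        "sys_path": sys.path,
    }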
Symptoms:
cannot import name '_imaging' from 'PIL'
Root Cause:
Pillow was installed for a platform that doesn't match the Lambda runtime, so its compiled _imaging module can't be loaded.
Solutions:
Your current test file might not be a real PDF. Use a real PDF:
# Download a sample PDF
curl -o sample.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
# Upload it
aws s3 cp sample.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
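If you want to pre-check a file locally before uploading, here is a small PyPDF2 sketch (assumes PyPDF2 is installed locally; the filename is just a placeholder):
# is_valid_pdf.py - hypothetical pre-upload check that a file parses as a PDF
import sys
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError

def is_valid_pdf(path):
    try:
        # A readable PDF should open and report at least one page.
        return len(PdfReader(path).pages) > 0
    except (PdfReadError, OSError):
        return False

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "sample.pdf"
    print(f"{path}: {'valid PDF' if is_valid_pdf(path) else 'not a readable PDF'}")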
The Lambda Layer needs to be built on a Linux environment compatible with Lambda.
Step 1: Open AWS CloudShell
Step 2: Create the Layer
Run these commands in CloudShell:
# Create directory structure
mkdir -p lambda-layer/python
cd lambda-layer
# Create requirements.txt
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.29.7
EOF
# Install dependencies for Lambda runtime
pip install -r requirements.txt \
-t python/ \
--platform manylinux2014_x86_64 \
--only-binary=:all: \
--python-version 3.11
# Create the layer zip
zip -r layer.zip python/
# Upload to S3 (so you can download it)
aws s3 cp layer.zip s3://secure-doc-pipeline-uploads/lambda-layer.zip
Step 3: Download and Update Terraform
# Download the layer to your local machine
aws s3 cp s3://secure-doc-pipeline-uploads/lambda-layer.zip ./lambda/layer.zip
# Move it to the correct location
mv lambda/layer.zip secure-doc-pipeline/lambda/layer.zip
Step 4: Update Lambda Layer
cd secure-doc-pipeline/terraform
# Apply the update (this will update just the layer)
terraform apply -target=aws_lambda_layer_version.pdf_processing_layer
If you have Docker installed:
cd secure-doc-pipeline/lambda
# Create Dockerfile
cat > Dockerfile << 'EOF'
FROM public.ecr.aws/lambda/python:3.11
RUN mkdir /tmp/python
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt -t /tmp/python/
EOF
# Create requirements.txt
cat > requirements.txt << 'EOF'
PyPDF2==3.0.1
reportlab==4.0.7
Pillow==10.1.0
boto3==1.29.7
EOF
# Build and extract
docker build -t lambda-layer-builder .
docker create --name temp-container lambda-layer-builder
docker cp temp-container:/tmp/python ./python
docker rm temp-container
# Create layer zip (python/ must be at the root of the archive)
zip -r layer.zip python/
Then update Terraform as shown in Step 4 above.
If you don’t want to rebuild the layer, you can modify the Lambda function to skip the enhanced PDF creation and just copy files:
This is actually what’s happening now as a fallback - the function is still working, just without the enhanced PDF features.
Testing After Fix:
# Test with a real PDF
curl -o test-doc.pdf https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
aws s3 cp test-doc.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Check CloudWatch Logs
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
# Verify processed output
aws s3 ls s3://secure-doc-pipeline-processed-output/processed/
Recommended Approach: rebuild the layer in a Lambda-compatible Linux environment (AWS CloudShell or Docker, as shown above) so the compiled dependencies match the Lambda runtime.
Why the Current Setup Still Works:
Your Lambda has a fallback mechanism: if the enhanced PDF creation fails, it simply copies the original file through to the output bucket.
So your infrastructure is working correctly - you just need a proper PDF file for testing!
Symptoms: Lambda can’t read from source or write to destination bucket
Solutions:
Check IAM role permissions:
# The execution policy is a managed policy attached to the role (not an inline policy):
aws iam list-attached-role-policies \
--role-name secure-doc-pipeline-lambda-execution-role
Verify bucket permissions: Ensure Lambda role has proper S3 permissions in Terraform
Check bucket encryption: If using KMS, Lambda needs KMS permissions (covered in Phase 3)
# Get Lambda invocation count
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=secure-doc-pipeline-document-processor \
--start-time 2025-10-16T00:00:00Z \
--end-time 2025-10-16T23:59:59Z \
--period 3600 \
--statistics Sum
# Get Lambda error count
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=secure-doc-pipeline-document-processor \
--start-time 2025-10-16T00:00:00Z \
--end-time 2025-10-16T23:59:59Z \
--period 3600 \
--statistics Sum
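The same numbers can be pulled from Python if you want a quick health-check script; a sketch assuming the function name above and a 24-hour window:
# lambda_health.py - hypothetical summary of invocations vs. errors
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
FUNCTION_NAME = "secure-doc-pipeline-document-processor"  # assumed function name

def metric_sum(metric_name, hours=24):
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric_name,
        Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])

if __name__ == "__main__":
    invocations = metric_sum("Invocations")
    errors = metric_sum("Errors")
    print(f"Last 24h: {invocations:.0f} invocations, {errors:.0f} errors")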
# Follow logs (keeps connection open)
aws logs tail /aws/lambda/secure-doc-pipeline-document-processor --follow
aws logs filter-log-events \
--log-group-name /aws/lambda/secure-doc-pipeline-document-processor \
--filter-pattern "ERROR"
Lambda CPU power scales with memory. Test different configurations:
memory_size = 512 # Baseline
memory_size = 1024 # 2x CPU power
memory_size = 2048 # 4x CPU power
Cost vs Performance:
Add provisioned concurrency (costs more but eliminates cold starts):
resource "aws_lambda_provisioned_concurrency_config" "processor_concurrency" {
function_name = aws_lambda_function.document_processor.function_name
provisioned_concurrent_executions = 1
qualifier = aws_lambda_function.document_processor.version
}
Keep layers under 50 MB for faster cold starts:
# Check layer size
ls -lh lambda/layer.zip
# Remove unnecessary dependencies
pip install --no-deps PyPDF2==3.0.1 -t python/
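To see which packages dominate the layer size before trimming dependencies, here is a small sketch over the archive (the lambda/layer.zip path is an assumption):
# layer_size_report.py - hypothetical per-package size breakdown of layer.zip
import zipfile
from collections import defaultdict

sizes = defaultdict(int)
with zipfile.ZipFile("lambda/layer.zip") as zf:
    for info in zf.infolist():
        parts = info.filename.split("/")
        if len(parts) > 2 and parts[0] == "python":
            sizes[parts[1]] += info.compress_size

for package, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{size / 1_048_576:6.2f} MB  {package}")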
Lambda Compute (ap-south-1 pricing):
CloudWatch Logs:
Data Transfer:
With 1,000 documents/month (avg 2 MB each):
# Test 1: Very small PDF
echo "%PDF-1.4" > tiny.pdf
aws s3 cp tiny.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 2: Large PDF (create or download a multi-page PDF)
aws s3 cp large-document.pdf s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 3: Non-PDF file (should be skipped gracefully)
echo "not a pdf" > test.txt
aws s3 cp test.txt s3://secure-doc-pipeline-uploads/ --profile third-party-test
# Test 4: PDF with special characters in filename
aws s3 cp "document with spaces & special!chars.pdf" s3://secure-doc-pipeline-uploads/ --profile third-party-test
Check CloudWatch Logs to verify proper handling of each case.
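If you prefer to drive these edge-case uploads from Python, here is a minimal boto3 sketch (the profile and bucket names are taken from the examples above and may differ in your setup); the large-PDF case still needs a real multi-page file:
# edge_case_uploads.py - hypothetical driver for the edge-case tests above
import io
import boto3

session = boto3.Session(profile_name="third-party-test")  # assumes this profile exists
s3 = session.client("s3")
bucket = "secure-doc-pipeline-uploads"  # uploads bucket from the examples above

test_objects = {
    "tiny.pdf": b"%PDF-1.4",                                  # minimal header only
    "test.txt": b"not a pdf",                                 # should be skipped by the handler
    "document with spaces & special!chars.pdf": b"%PDF-1.4",  # key-encoding test
}

for key, body in test_objects.items():
    s3.upload_fileobj(io.BytesIO(body), bucket, key)
    print(f"Uploaded s3://{bucket}/{key}")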
Before moving to Phase 3, verify:
Congratulations! You’ve built a fully functional serverless document processing pipeline with:
Phase 3 will enhance security and monitoring with custom KMS encryption for all buckets, CloudTrail for detailed audit logging, CloudWatch alarms for failure detection, and SNS notifications for critical events.
Proceed to: AWS Secure Document Pipeline - Part 3: Security and Monitoring.