April 2025 · 7 min read

Building Multi-Tenant AI Infrastructure on AWS Without Cutting Corners on Isolation

There's a pattern I keep seeing with AI startups building on Bedrock: they start with a single-tenant prototype, it works great, and then someone says "let's make it multi-tenant" and the team reaches for metadata filters on a shared OpenSearch index. It's the fastest path to multi-tenancy, and it's also the fastest path to a data leak that ends your company.

Metadata filters are application-level isolation. If your Lambda function has a bug, if someone forgets to include the tenant filter in a query, if a new developer doesn't understand the pattern — tenant data leaks. Your SOC 2 auditor will ask "what prevents Tenant A from seeing Tenant B's data?" and "we always remember to add the filter" is not a control.

The three layers that actually work

Real tenant isolation on AWS needs to work at three levels, and if any one of them fails, the other two still prevent a breach:

Layer one is application routing. Every request comes in with a JWT that contains a tenant_id claim. A Lambda authorizer extracts it, looks up the tenant's dedicated resources in a DynamoDB registry, and passes that context downstream. The application code routes to the right S3 bucket, the right OpenSearch collection, the right Knowledge Base. This is the layer most people stop at.
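A minimal sketch of that routing layer, assuming an API Gateway Lambda authorizer and an already-decoded JWT (the DynamoDB registry lookup is elided, and all names here are illustrative):

```python
def tenant_context_from_claims(claims: dict) -> dict:
    """Extract the tenant_id claim and fail closed if it is missing."""
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        raise PermissionError("JWT is missing the tenant_id claim")
    return {"tenant_id": tenant_id}

def authorizer_response(principal_id: str, method_arn: str, context: dict) -> dict:
    """Build the API Gateway Lambda authorizer response; the tenant
    context rides downstream in the authorizer `context` field."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow",
                "Resource": method_arn,
            }],
        },
        # Available to the backend integration as $context.authorizer.*
        "context": context,
    }
```

The important property is failing closed: a token without a tenant_id claim is rejected outright rather than falling through to some default tenant.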

Layer two is IAM enforcement. Instead of using the Lambda's execution role directly for data access, the function assumes a separate "tenant data access" role using STS with the tenant_id injected as a session tag. The tenant's S3 bucket policy checks aws:PrincipalTag/tenant_id — if the session tag doesn't match, IAM denies the request. Even if the application code has a bug and tries to access the wrong bucket, IAM blocks it. This is the layer almost nobody implements, and it's the one that matters most.
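A sketch of that assume-role step, with the STS client injected so the logic stays testable (role ARN and session naming are illustrative; the tenant-data-access role's trust policy must also allow sts:TagSession):

```python
def tenant_session_tags(tenant_id: str) -> list:
    """Session tags attached to the assumed role; the bucket policy
    matches these via the aws:PrincipalTag/tenant_id condition key."""
    return [{"Key": "tenant_id", "Value": tenant_id}]

def assume_tenant_data_role(sts_client, role_arn: str, tenant_id: str) -> dict:
    """Assume the tenant-data-access role with tenant_id as a session tag.
    sts_client is a boto3 STS client."""
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"tenant-{tenant_id}",
        Tags=tenant_session_tags(tenant_id),
    )
    # Short-lived credentials scoped to this one tenant for this one request.
    return resp["Credentials"]
```

Every S3 or OpenSearch client for tenant data gets built from these credentials, never from the Lambda's own execution role.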

Layer three is resource-level isolation. Each tenant gets their own S3 bucket, their own OpenSearch Serverless collection, their own Bedrock Knowledge Base, and their own KMS encryption key. There's no shared index with metadata filters. Tenant data physically cannot leak because it resides in separate AWS resources.
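The bucket policy that ties layers two and three together might look like this. Bucket name and tenant id are placeholders, and a production policy would also exempt administrative and break-glass roles from the Deny:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessWithoutMatchingTenantTag",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::tenant-a-documents",
        "arn:aws:s3:::tenant-a-documents/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalTag/tenant_id": "tenant-a"
        }
      }
    }
  ]
}
```

Because StringNotEquals treats a missing tag as a mismatch, a session with no tenant_id tag at all is denied too — the policy fails closed.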

What gets dedicated vs. what gets shared

Going fully dedicated for everything would be wasteful. The trick is knowing which resources need isolation and which can be safely shared.

Dedicated per tenant: S3 buckets (document storage), OpenSearch Serverless collections (vector embeddings), Bedrock Knowledge Bases, and KMS keys. These hold tenant data — they must be isolated.

Shared across tenants: CloudFront, API Gateway, WAF, Lambda functions, Bedrock agents, and DynamoDB tables (with tenant_id as the partition key). These are compute and routing layers that don't store tenant data — they process it in the context of a tenant-scoped session.

DynamoDB is an interesting case. The tables are shared, but a Query must always specify the full partition key. With tenant_id as the partition key, every query is structurally scoped to a single tenant — you can't accidentally read across tenants, because omitting the key isn't a valid query. (A Scan can cross partitions, so deny dynamodb:Scan on these tables in IAM.)
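A small sketch of that query shape, assuming the low-level boto3 DynamoDB client and an illustrative table name:

```python
def tenant_query_kwargs(table_name: str, tenant_id: str) -> dict:
    """Build Query arguments for the low-level DynamoDB client. The key
    condition is mandatory, so every read is scoped to one tenant."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "tenant_id = :tid",
        "ExpressionAttributeValues": {":tid": {"S": tenant_id}},
    }
```

Application code calls `dynamodb.query(**tenant_query_kwargs("sessions", ctx["tenant_id"]))` — there is simply no code path that queries without a tenant.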

The Knowledge Base architecture that auditors like

Most AI applications need two types of data: the tenant's private documents and some shared reference data. For a regulated SaaS platform, that might be a client's private case files (tenant-specific) plus industry reference data (shared across all tenants).

The pattern that works: a two-tier Knowledge Base setup. One shared KB containing public reference data, managed by the platform team, read-only for all tenants. One dedicated KB per tenant containing their private documents, backed by their dedicated S3 bucket and OpenSearch collection.

The Bedrock agent queries both KBs on every invocation and combines the results. The tenant's private data never touches the shared KB, and the shared reference data doesn't need to be duplicated into every tenant's collection. This saves roughly $350/month per tenant in OpenSearch Serverless costs for the shared data alone.
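The dual retrieval can be sketched like this, using the Bedrock Agent Runtime Retrieve API; the client is injected, the KB ids come from the tenant registry entry, and the merge strategy here (simple score sort) is an assumption — production systems often re-rank more carefully:

```python
def merge_kb_results(*result_sets):
    """Interleave chunks from multiple KBs, highest relevance score first."""
    merged = [r for rs in result_sets for r in rs]
    return sorted(merged, key=lambda r: r.get("score", 0.0), reverse=True)

def retrieve_two_tier(client, shared_kb_id: str, tenant_kb_id: str, query: str):
    """client is a boto3 bedrock-agent-runtime client; shared_kb_id is the
    platform-wide reference KB, tenant_kb_id the tenant's private KB."""
    results = []
    for kb_id in (shared_kb_id, tenant_kb_id):
        resp = client.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
        )
        results.append(resp["retrievalResults"])
    return merge_kb_results(*results)
```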

Automated tenant onboarding

None of this works if onboarding a new tenant requires an engineer to manually create 6 AWS resources and wire them together. The onboarding flow needs to be fully automated — CloudFormation StackSets or Step Functions that provision the S3 bucket, KMS key, OpenSearch collection, Knowledge Base, Cognito configuration, and tenant registry entry in sequence.
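A sketch of the driver for a Step Functions-based flow. The naming scheme and input shape are assumptions; the point is that resource names are derived deterministically from the tenant id, so a retried run converges instead of duplicating resources:

```python
import json

def tenant_resource_names(tenant_id: str) -> dict:
    """Deterministic per-tenant resource names; re-running onboarding for
    the same tenant targets the same resources (idempotent provisioning)."""
    prefix = f"tenant-{tenant_id}"
    return {
        "bucket": f"{prefix}-documents",
        "kms_alias": f"alias/{prefix}",
        "collection": f"{prefix}-vectors",
        "knowledge_base": f"{prefix}-kb",
    }

def start_onboarding(sfn_client, state_machine_arn: str, tenant_id: str):
    """Kick off the provisioning state machine; each state creates one
    resource and the final state writes the tenant registry entry."""
    return sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        name=f"onboard-{tenant_id}",  # execution name doubles as an idempotency key
        input=json.dumps({
            "tenant_id": tenant_id,
            "resources": tenant_resource_names(tenant_id),
        }),
    )
```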

On a previous engagement, we got tenant onboarding down from a week of manual work to under an hour of automated provisioning. The infrastructure was more complex, but the operational burden was dramatically lower. That's the tradeoff worth making.

Offboarding matters too. When a tenant leaves, you need to delete their S3 bucket, OpenSearch collection, Knowledge Base, Cognito users, DynamoDB records, and schedule their KMS key for deletion (KMS enforces a waiting period of 7 to 30 days). Every step gets logged in CloudTrail for compliance evidence. If your auditor asks "how do you handle data deletion when a customer leaves?" you want to point at an automated runbook, not a checklist in Confluence.
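The teardown sequence can be expressed as an ordered plan that a runner executes step by step; the registry entry shape here is an assumption, and each step corresponds to a real AWS API call (so execution is logged in CloudTrail automatically):

```python
def offboarding_plan(tenant: dict) -> list:
    """Ordered teardown steps for a departing tenant, as
    (service, action, target) tuples consumed by an executor."""
    return [
        ("s3", "delete_bucket", tenant["bucket"]),  # bucket must be emptied first
        ("opensearchserverless", "delete_collection", tenant["collection_id"]),
        ("bedrock-agent", "delete_knowledge_base", tenant["kb_id"]),
        ("cognito-idp", "admin_delete_user", tenant["user_pool_id"]),
        ("dynamodb", "delete_item", tenant["tenant_id"]),
        # KMS keys can't be deleted immediately; schedule with a 7-30 day window.
        ("kms", "schedule_key_deletion", tenant["kms_key_id"]),
    ]
```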

The cost reality

The main cost driver is OpenSearch Serverless — each collection has a 2 OCU minimum at roughly $350/month. For a startup with 5 tenants, that's $1,750/month just for vector storage. At 50+ tenants, it makes sense to evaluate migrating to OpenSearch Provisioned with index-per-tenant on shared domains.
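Using the figures above (2 OCUs per collection at roughly $175 per OCU-month — check current AWS pricing for your region), the per-tenant floor works out as:

```python
AOSS_MIN_OCUS = 2        # OpenSearch Serverless minimum per collection
OCU_MONTHLY_USD = 175.0  # approximate; verify against current AWS pricing

def vector_storage_floor(tenant_count: int) -> float:
    """Monthly OpenSearch Serverless floor with one collection per tenant."""
    return tenant_count * AOSS_MIN_OCUS * OCU_MONTHLY_USD
```

Five tenants puts the floor at $1,750/month; at fifty it's $17,500, which is why the provisioned-domain migration becomes worth evaluating.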

Everything else is marginal. KMS keys are $1/month. S3 buckets cost the same whether data is in one bucket or ten. Bedrock Knowledge Bases have no standalone cost. The per-tenant infrastructure overhead is dominated by that one OpenSearch line item.

The question isn't whether you can afford dedicated resources per tenant. It's whether you can afford the alternative — a shared-index architecture that fails an audit, or worse, leaks data between tenants. For most B2B AI companies, especially those handling sensitive healthcare or financial data, the $350/month per tenant is table stakes.

Building a multi-tenant AI application? We can design the infrastructure.

Start a conversation