Current Architecture: ComplyAI Compliance Platform
Executive Summary
ComplyAI operates a distributed microservices architecture comprising 13 separate services built primarily on Python (77%), with a modern React/TypeScript frontend stack. The platform implements an AI-powered compliance checking system for advertising content, utilizing multiple ML models (TensorFlow, PyTorch, Google Vertex AI) distributed across specialized services. The architecture employs a "musical" naming convention (triangle, violin, maestro, gong) suggesting orchestrated workflows, with dual API implementations (sync/async), cross-cloud deployment (AWS + GCP), and a mixed Flask/Django backend pattern. Current state shows ~64,000 lines of code with significant technical debt: average 8.5% test coverage, hardcoded secrets in 3 services, and service fragmentation requiring consolidation for Series A readiness.
Business Context
Problem Statement
ComplyAI addresses the complex challenge of ensuring advertising compliance across multiple regulatory frameworks and platforms. The system must:
- Process high volumes of ad content in real-time and batch modes
- Apply ML-based compliance analysis using multiple specialized models
- Manage complex compliance policies and regulatory content
- Provide actionable insights and compliance scores
- Integrate with external platforms (Facebook, Stripe, analytics)
- Support both synchronous user interactions and asynchronous background processing
Success Criteria
- Scalability: Handle 10K+ compliance checks per day with <200ms API response time
- Accuracy: ML models maintain >95% accuracy in compliance detection
- Availability: 99.9% uptime for customer-facing services
- Security: SOC 2 compliance with comprehensive audit logging and encryption
- Developer Velocity: Onboard new developers within 2 weeks (currently limited by documentation gaps)
- Cost Efficiency: Optimize ML inference costs while maintaining performance
Constraints
Technical Constraints:
- Legacy microservices fragmentation (13 separate repos)
- Mixed technology stack (Flask + Django + React)
- Zero-downtime deployment requirements
- Cross-cloud dependencies (AWS + GCP)
Business Constraints:
- Limited engineering team (10 contributors)
- Series A readiness timeline
- Existing customer SLA commitments
- Budget constraints on infrastructure costs
Regulatory Constraints:
- GDPR compliance for user data
- PCI DSS for payment processing (via Stripe)
- Advertising platform compliance requirements
- Data retention and audit requirements
Architecture Overview
System Context Diagram
                        External Systems
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
[Facebook Business API]  [Stripe Payment]        [Google Sheets]
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                │
                                ▼
               ┌─────────────────────────────────┐
               │        ComplyAI Platform        │
               │                                 │
               │  ┌───────────────────────────┐  │
               │  │   Frontend Applications   │  │
               │  │   (App + Marketing Site)  │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │        API Gateway        │  │
               │  │    (Sync + Async APIs)    │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │     Core Engine + ML      │  │
               │  │   + Orchestration Layer   │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │      Data & Storage       │  │
               │  │  (PostgreSQL + S3 + CMS)  │  │
               │  └───────────────────────────┘  │
               └─────────────────────────────────┘
                                │
                                ▼
                     [Users & Administrators]
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                          PRESENTATION LAYER                          │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │  complyai-frontend  │          │         www         │           │
│   │ (React/TypeScript)  │          │  (Marketing Site)   │           │
│   │  Auth0 + Stripe UI  │          │  (React/Tailwind)   │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          API GATEWAY LAYER                           │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-api     │          │      api-async      │           │
│   │    (Flask/Sync)     │◄────────►│   (Flask/Celery)    │           │
│   │     15,630 LOC      │          │     20,143 LOC      │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                 BUSINESS LOGIC & ORCHESTRATION LAYER                 │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-core    │          │  complyai-maestro   │           │
│   │ (Flask/Core Logic)  │◄────────►│   (Orchestration)   │           │
│   │     13,654 LOC      │          │   GCP Integration   │           │
│   └─────────────────────┘          └─────────────────────┘           │
│                                                                      │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-gong    │          │   complyai-adhoc    │           │
│   │ (Vertex AI/Policy)  │          │ (Utilities/8K LOC)  │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          ML INFERENCE LAYER                          │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│   │  triangle   │      │   violin    │      │     ipu     │          │
│   │ (TF/PyTorch)│      │ (TF/PyTorch)│      │ (TF/PyTorch)│          │
│   │   551 LOC   │      │   357 LOC   │      │   418 LOC   │          │
│   └─────────────┘      └─────────────┘      └─────────────┘          │
│           All use: chromadb (vectors) + gunicorn + boto3             │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                              DATA LAYER                              │
│   ┌────────────────────┐    ┌──────────────┐    ┌────────────────┐   │
│   │     PostgreSQL     │    │    AWS S3    │    │  complyai_cms  │   │
│   │ (Primary Database) │    │ (File Store) │    │(Django/Wagtail)│   │
│   │   Shared by APIs   │    │    Models    │    │  Content Mgmt  │   │
│   └────────────────────┘    └──────────────┘    └────────────────┘   │
│                                                                      │
│   ┌────────────────────┐    ┌──────────────┐                         │
│   │      chromadb      │    │   BigQuery   │                         │
│   │   (Vector Store)   │    │  (Analytics) │                         │
│   └────────────────────┘    └──────────────┘                         │
└──────────────────────────────────────────────────────────────────────┘
Architecture Style
- ✅ Microservices (13 separate services)
- ❌ Monolithic
- ⚠️ Serverless (partial - potential Lambda usage)
- ✅ Event-driven (Celery task queues, async processing)
- ✅ Layered (presentation, API, business logic, data)
- ❌ Service-oriented
- ✅ Hybrid: Microservices with layered organization, event-driven async processing
Key Characteristics:
- Distributed architecture with separate repositories per service
- Musical naming pattern: triangle, violin, ipu, maestro, gong (orchestration metaphor)
- Dual API pattern: Synchronous (complyai-api) + Asynchronous (api-async)
- Specialized ML services: Three separate inference engines
- Cross-cloud deployment: AWS (API, storage) + GCP (Core, ML, Frontend)
- Mixed frameworks: Flask (9 services), Django (1 service), React (2 frontends)
Components
Core Components
| Component | Purpose | Technology | LOC | Status | Test Coverage |
|---|---|---|---|---|---|
| complyai-api | Main synchronous REST API | Flask, PostgreSQL, Celery | 15,630 | Active | 1.39% |
| api-async | Asynchronous processing API | Flask, Celery, TensorFlow | 20,143 | Active | 0% ⚠️ |
| complyai-core | Core compliance engine & models | Flask, PostgreSQL, Auth | 13,654 | Active | 8.20% |
| complyai-frontend | Main application UI | React 18, TypeScript, Vite | N/A | Active | N/A |
| www | Marketing website | React 18, Tailwind CSS | 73 | Active | 40% |
| complyai-triangle | ML inference service #1 | TensorFlow, PyTorch, Flask | 551 | Active | 11.76% |
| complyai-violin | ML inference service #2 | TensorFlow, PyTorch, Flask | 357 | Active | 0% ⚠️ |
| complyai-ipu | ML inference service #3 | TensorFlow, PyTorch, Flask | 418 | Active | 18.18% |
| complyai-maestro | Service orchestration | Flask, GCP, BigQuery | 2,289 | Active | 17.65% |
| complyai-gong | Policy management + Vertex AI | Flask, Vertex AI, GCP | 1,031 | Active | 0% |
| complyai_cms | Content management system | Django, Wagtail, Webpack | 2,771 | Active | 1.32% |
| complyai-adhoc | Utility scripts & tools | Python scripts | 8,156 | Active | 60.71% ✅ |
| .github | CI/CD workflows | GitHub Actions | N/A | Active | N/A |
Total Lines of Code: ~64,000+ across all services
Component Interactions
Technology Stack
Frontend
complyai-frontend (Main Application):
- Framework: React 18.2 with TypeScript
- Build Tool: Vite (fast HMR, modern bundling)
- State Management:
- Zustand (global state)
- Jotai (atomic state)
- TanStack Query (server state, caching)
- UI Library:
- Tailwind CSS (utility-first styling)
- Headless UI (accessible components)
- Heroicons (icon library)
- Framer Motion (animations)
- Authentication: Auth0 React SDK
- Payment: React Stripe.js (PCI compliant)
- Data Visualization: recharts, react-gauge-component
- Testing: Cypress (E2E), Jest, React Testing Library
- Code Quality: ESLint, Prettier, Husky, Commitlint
- Routing: React Router DOM v6
www (Marketing Site):
- Framework: React 18.2 with JavaScript
- Build Tool: Create React App (React Scripts)
- Styling: Tailwind CSS + SCSS/Sass
- SEO: React Helmet Async (meta tags)
- Icons: FontAwesome
- Animations: AOS (Animate On Scroll)
- Testing: Jest, React Testing Library
- Performance: Web Vitals monitoring
Backend
Primary Stack (9/13 services):
- Language: Python 3.x
- Framework: Flask (complyai-api, api-async, core, triangle, violin, ipu, maestro, gong)
- ORM: Flask-SQLAlchemy, Alembic (migrations)
- Auth: Flask-Security-Too, PyJWT, Authlib (OAuth)
- WSGI Server: Gunicorn
- API Features: Flask-CORS, Flask-Migrate
Alternative Stack (1 service):
- Framework: Django + Wagtail CMS (complyai_cms)
- Frontend Build: Webpack + Tailwind CSS
- Static Files: WhiteNoise
API Specifications:
- Style: REST (JSON)
- Versioning: URL-based (v1, v2 auth endpoints)
- Authentication: JWT tokens, OAuth2, API keys
- Rate Limiting: Configured but inconsistently implemented
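The JWT flow named above can be illustrated without external dependencies. This is a stand-in sketch (the services themselves use PyJWT/Authlib) showing the HS256-style sign-and-verify round trip by hand; the secret and claim names are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # illustrative only; real keys come from a secret manager


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as JWT uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(payload, ttl=3600):
    """Sign a JWT-like token: header.payload.signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps({**payload, "exp": int(time.time()) + ttl}).encode())
    sig = _b64(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"


def verify_token(token):
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        header, body, sig = token.split(".")
    except ValueError:
        return None
    expected = _b64(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    padded = body + "=" * (-len(body) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload if payload["exp"] > time.time() else None
```

Tampering with any character of the token invalidates the HMAC, which is the property the API-to-API auth relies on.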
Data Layer
Primary Database:
- RDBMS: PostgreSQL (via psycopg2-binary/psycopg)
- Usage: Shared across complyai-api, api-async, complyai-core
- Migrations: Alembic (Flask), Django migrations (CMS)
Vector Storage:
- Database: chromadb (ML embeddings)
- Usage: ML services (triangle, violin, ipu)
- Purpose: Semantic search, similarity matching
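Conceptually, the lookup the ML services run against chromadb is a nearest-neighbour search over embeddings. A dependency-free sketch of that cosine-similarity step (the real code goes through the chromadb client API; the store layout here is an assumption for illustration):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query, store, k=1):
    """store: {doc_id: embedding}. Return the k most similar doc ids."""
    ranked = sorted(store, key=lambda doc_id: cosine(query, store[doc_id]), reverse=True)
    return ranked[:k]
```

chromadb performs the same ranking with approximate-nearest-neighbour indexing so it stays fast at scale.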
File Storage:
- Primary: AWS S3 (via boto3)
- Usage: ML models, user uploads, media files, data lake
- CDN: CloudFront (inferred)
Analytics:
- Warehouse: Google BigQuery
- Usage: complyai-maestro (data pipelines, reporting)
- Integration: google-cloud-bigquery SDK
Content Management:
- CMS: Django Wagtail
- Storage: PostgreSQL + S3 (media)
- API: Wagtail REST API for frontend consumption
Cache Layer:
- Not explicitly detected (Redis recommended for implementation)
Message Queue:
- Queue: Celery with SQS backend (inferred from Celery[sqs])
- Broker: AWS SQS or Redis (not explicitly documented)
- Workers: Distributed Celery workers for async tasks
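A Celery-on-SQS setup consistent with the Celery[sqs] dependency can be sketched as a config module. Queue name, region, and timeout values below are illustrative assumptions, not documented settings:

```python
# celeryconfig.py - sketch only; values are assumptions, not documented config.
broker_url = "sqs://"            # empty URL: credentials resolved via IAM role
broker_transport_options = {
    "region": "us-east-1",       # assumption: actual region not documented
    "visibility_timeout": 3600,  # seconds a task stays invisible after pickup
    "polling_interval": 1,       # seconds between SQS polls
}
task_default_queue = "complyai-default"  # hypothetical queue name
task_acks_late = True            # re-deliver tasks if a worker dies mid-task
worker_prefetch_multiplier = 1   # fair dispatch for long-running ML tasks
result_backend = None            # results are written to PostgreSQL by the tasks
```

With `visibility_timeout` shorter than the longest task, SQS would re-deliver in-flight work, so it should exceed the worst-case task duration.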
ML/AI Stack
Deep Learning Frameworks:
- TensorFlow: triangle, violin, ipu, api-async
- PyTorch: triangle, violin, ipu, api-async
- Transformers: Hugging Face (NLP models)
- ONNX: Model optimization and portability
- Optimum: Hugging Face optimization library
Google Cloud AI:
- Platform: Google Vertex AI (complyai-gong)
- Usage: Managed ML inference, policy analysis
- Integration: vertexai Python SDK
ML Infrastructure:
- Vector Database: chromadb (embeddings storage)
- Model Storage: AWS S3 (via boto3)
- Data Processing: pandas, numpy, pyarrow
- Computer Vision: opencv-python (complyai-violin)
- NLP: gensim, scikit-learn, datasets
ML Deployment:
- Serving: Flask REST endpoints
- Scaling: Horizontal (multiple inference instances)
- Model Format: Native + ONNX for optimization
Infrastructure
Cloud Providers:
- AWS: Primary for API services, S3 storage, SQS queues
- Services: EC2 (inferred), S3, SQS, CloudFront
- SDK: boto3 (8 services)
- Google Cloud Platform: Core engine, ML, analytics
- Services: Vertex AI, BigQuery, Cloud Storage, Cloud SQL, Secret Manager
- SDK: google-cloud-* libraries (maestro, gong)
Deployment:
- Containerization: Docker (Dockerfile present in multiple services)
- Orchestration: Not explicitly documented (likely ECS or GKE)
- WSGI: Gunicorn for Python services
- Static Hosting: S3 + CloudFront or similar for frontends
CI/CD:
- Platform: GitHub Actions (.github repository)
- Hooks: Husky (pre-commit) in frontend projects
- Commit Standards: Commitlint (conventional commits)
- Testing: Automated via pytest, Cypress
Monitoring (Partial Implementation):
- Logging: Python logging module (comprehensive)
- Notifications: Slack webhooks (across services)
- User Analytics: FullStory (frontend)
- Metrics: Web Vitals (frontend)
- APM: Not detected (needs implementation)
- Error Tracking: Not explicitly configured
Development Tools:
- Version Control: Git + GitHub
- Package Management: npm (frontend), pip + requirements.txt (backend)
- Testing: pytest, Cypress, Jest
- Code Quality: ESLint, Prettier (frontend), Python linting (backend)
- Dependency Management: Dependabot potential via .github
Data Architecture
Data Flow Diagram
┌──────────────────────────────────────────────────────────────────────┐
│                            DATA INGESTION                            │
│                                                                      │
│  [User Input] ──► [Frontend] ──► [complyai-api] ──► [PostgreSQL]     │
│                                                                      │
│  [Facebook API] ──► [api-async] ──► [PostgreSQL]                     │
│                                                                      │
│  [Stripe Webhooks] ──► [complyai-api] ──► [PostgreSQL]               │
│                                                                      │
│  [Content Editors] ──► [CMS] ──► [PostgreSQL + S3]                   │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        DATA PROCESSING LAYER                         │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │              SYNCHRONOUS PROCESSING                │              │
│  │ [complyai-api] ─► [complyai-core] ─► [Validation]  │              │
│  │                        │                           │              │
│  │                 [Business Logic]                   │              │
│  │                        │                           │              │
│  │                [PostgreSQL Write]                  │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │             ASYNCHRONOUS PROCESSING                │              │
│  │ [api-async] ─► [Celery Tasks] ─► [External APIs]   │              │
│  │      │               │                 │           │              │
│  │ [FB Ad Data]   [AI Scoring]      [Batch Jobs]      │              │
│  │      │               │                 │           │              │
│  │ [PostgreSQL]    [ML Models]      [S3/BigQuery]     │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │               ML INFERENCE PIPELINE                │              │
│  │ [Input Data] ─► [Preprocessing] ─► [chromadb]      │              │
│  │                        │                           │              │
│  │     [Model Inference (triangle/violin/ipu)]        │              │
│  │                        │                           │              │
│  │          [Compliance Score + Insights]             │              │
│  │                        │                           │              │
│  │                  [PostgreSQL]                      │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │          ORCHESTRATED WORKFLOWS (maestro)          │              │
│  │ [Trigger] ─► [Workflow Definition] ─► [Task Queue] │              │
│  │                        │                           │              │
│  │    [Parallel Service Calls: API + ML + Data]       │              │
│  │                        │                           │              │
│  │               [Aggregate Results]                  │              │
│  │                        │                           │              │
│  │      [BigQuery (Analytics) + Notifications]        │              │
│  └────────────────────────────────────────────────────┘              │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           DATA CONSUMPTION                           │
│                                                                      │
│  [Frontend] ◄── [REST API] ◄── [PostgreSQL + S3]                     │
│                                                                      │
│  [Reports] ◄── [BigQuery] ◄── [Aggregated Analytics]                 │
│                                                                      │
│  [Dashboards] ◄── [TanStack Query Cache] ◄── [API]                   │
│                                                                      │
│  [Exports] ◄── [Google Sheets Integration] ◄── [API]                 │
└──────────────────────────────────────────────────────────────────────┘
Data Models
Key Entities
1. User & Authentication
- Purpose: User accounts, authentication, authorization
- Key Attributes: user_id, email, password_hash, roles, permissions, oauth_tokens
- Relationships: → Organizations, → Subscriptions, → Ad Accounts
- Storage: PostgreSQL (complyai-core models)
- Security: Encrypted passwords (Flask-Security-Too), JWT tokens, OAuth2
2. Ad Account
- Purpose: Facebook ad account data and monitoring
- Key Attributes: account_id, fb_account_id, status, spend, metrics, sync_timestamp
- Relationships: → User, → Ad Creatives, → Compliance Checks
- Storage: PostgreSQL (shared models)
- Sync: Real-time via api-async + Facebook Business API
3. Ad Creative
- Purpose: Advertising creative content for compliance checking
- Key Attributes: creative_id, content, images, video_urls, status, platform
- Relationships: → Ad Account, → Compliance Results, → Policy Violations
- Storage: PostgreSQL (metadata), S3 (media files)
- Processing: Async via Celery tasks
4. Compliance Check
- Purpose: Compliance analysis results and scoring
- Key Attributes: check_id, creative_id, ai_score, policy_violations[], timestamp, model_version
- Relationships: → Ad Creative, → Policy Rules, → ML Inference Results
- Storage: PostgreSQL (structured), chromadb (embeddings)
- Generation: ML services (triangle, violin, ipu)
5. Policy
- Purpose: Compliance policies and regulatory rules
- Key Attributes: policy_id, name, rules[], jurisdiction, version, effective_date
- Relationships: → Compliance Checks, → Content (CMS)
- Storage: PostgreSQL (structured), Wagtail CMS (content)
- Management: complyai-gong (AI-powered analysis)
6. Subscription
- Purpose: User subscriptions and payment tracking
- Key Attributes: subscription_id, user_id, plan, status, stripe_subscription_id
- Relationships: → User, → Payments
- Storage: PostgreSQL
- Processing: Stripe webhooks → complyai-api
7. Workflow
- Purpose: Orchestrated multi-service workflows
- Key Attributes: workflow_id, type, status, steps[], results, created_by
- Relationships: → Tasks, → Services
- Storage: PostgreSQL (complyai-maestro)
- Execution: Celery task chains
8. ML Model Metadata
- Purpose: ML model versioning and tracking
- Key Attributes: model_id, name, version, s3_path, accuracy_metrics, created_date
- Relationships: → Inference Results
- Storage: PostgreSQL (metadata), S3 (model files)
- Services: triangle, violin, ipu
9. Content
- Purpose: Regulatory content and documentation
- Key Attributes: content_id, title, body, category, publish_date, author
- Relationships: → Policies
- Storage: PostgreSQL (via Wagtail), S3 (media)
- Management: complyai_cms (Django/Wagtail)
10. Audit Log
- Purpose: Comprehensive audit trail for compliance
- Key Attributes: log_id, user_id, action, entity_type, entity_id, timestamp, ip_address
- Relationships: → All entities
- Storage: PostgreSQL (across services - 8+ files with logging)
- Retention: 7 years (compliance requirement)
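The Workflow entity (7) above maps naturally onto chained task execution. A minimal, framework-free sketch of that pattern (in production this is a Celery chain run by complyai-maestro; the step names here are hypothetical):

```python
def run_workflow(steps, payload):
    """Run steps in order, feeding each step's output to the next.

    steps: list of (name, callable) pairs. Returns (status, results), where
    results records per-step output, mirroring the Workflow.steps[]/results
    attributes described above.
    """
    results = {}
    for name, func in steps:
        try:
            payload = func(payload)
        except Exception as exc:
            # Record the failure and stop, so the workflow status is queryable.
            results[name] = {"status": "failed", "error": str(exc)}
            return "failed", results
        results[name] = {"status": "done", "output": payload}
    return "completed", results


# Hypothetical two-step compliance workflow: fetch creatives, then score them.
steps = [
    ("fetch", lambda p: p + ["ad-1"]),
    ("score", lambda p: {"ads": p, "score": 0.97}),
]
```

Celery's `chain` adds what this sketch omits: durable queuing, retries, and distribution across workers.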
Data Storage Strategy
| Data Type | Storage Solution | Rationale | Retention | Backup |
|---|---|---|---|---|
| Transactional | PostgreSQL | ACID compliance, relational integrity | 7 years | Daily |
| User Credentials | PostgreSQL (encrypted) | Security, fast auth lookups | Account lifetime | Daily |
| ML Models | AWS S3 | Large files, versioning, cost-effective | Indefinite | Versioned |
| Media Files | AWS S3 | Scalable, CDN integration | 2 years | Versioned |
| Vector Embeddings | chromadb | Fast similarity search, ML optimized | 90 days | Weekly |
| Analytics | Google BigQuery | Complex queries, data warehouse | 2 years | N/A (managed) |
| Cache | (Not implemented) | Need Redis for session/API cache | 24 hours | None |
| Logs | File system + potential CloudWatch | Debugging, audit trails | 90 days | Weekly |
| CMS Content | PostgreSQL (Wagtail) | Structured content, versioning | Indefinite | Daily |
| Task Queue | SQS/Redis (Celery) | Async job management | 14 days | None |
Storage Patterns:
- Shared Database: complyai-api, api-async, complyai-core share PostgreSQL instance
- Service-Specific: ML services have isolated chromadb instances
- Hybrid Cloud: PostgreSQL on AWS, BigQuery on GCP, cross-cloud data movement
- Data Lake: S3 used as data lake for analytics and ML training
Data Security
Encryption at Rest:
- Method: AES-256 encryption for sensitive fields
- Implementation: Dedicated encryption module in complyai-core
- Coverage: User credentials, payment data (tokenized via Stripe), PII
- Key Management: Environment variables (needs upgrade to AWS KMS/Google Secret Manager)
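The shape of such a field-encryption module can be illustrated with a small encrypt-then-MAC wrapper. Note the cipher below is a stdlib placeholder so the example runs anywhere; the real module uses AES-256 (e.g. via the `cryptography` package). The authenticate-before-decrypt structure is the point, not the cipher:

```python
import base64
import hashlib
import hmac
import os


def _keys(master_key):
    """Derive separate encryption and integrity keys from one master key."""
    return (hashlib.sha256(master_key + b"enc").digest(),
            hashlib.sha256(master_key + b"mac").digest())


def encrypt_field(plaintext, master_key):
    """Encrypt-then-MAC a field value. NOT real crypto: the repeating-keystream
    XOR below is a placeholder for AES-256-GCM and is insecure for fields
    longer than 32 bytes."""
    enc_key, mac_key = _keys(master_key)
    nonce = os.urandom(16)
    stream = hashlib.sha256(enc_key + nonce).digest()
    ct = bytes(b ^ stream[i % len(stream)] for i, b in enumerate(plaintext.encode()))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return base64.b64encode(nonce + ct + tag).decode()


def decrypt_field(token, master_key):
    """Verify integrity first, then decrypt; raise on any tampering."""
    enc_key, mac_key = _keys(master_key)
    raw = base64.b64decode(token)
    nonce, ct, tag = raw[:16], raw[16:-32], raw[-32:]
    if not hmac.compare_digest(tag, hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()):
        raise ValueError("integrity check failed")
    stream = hashlib.sha256(enc_key + nonce).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(ct)).decode()
```

The fresh random nonce per call means equal plaintexts encrypt to different ciphertexts, which prevents equality leaks in the database.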
Encryption in Transit:
- Method: TLS 1.2+ (HTTPS)
- Implementation: Load balancer termination, internal service calls
- Issues: Insecure HTTP detected in 21 occurrences across 7 services ⚠️
- Recommendation: Enforce HTTPS for all internal communication
Access Controls:
- Database: PostgreSQL role-based access control
- Application: Flask-Security-Too, JWT-based authorization
- API: Token-based auth, OAuth2 for third-party
- Cloud: IAM roles (AWS), Cloud IAM (GCP)
- Issue: ML services lack authentication ⚠️
Data Classification:
- Public: Marketing content (www, CMS public pages)
- Internal: Business logic, configurations
- Confidential: User data, compliance results
- Restricted: Payment data (tokenized), ML models
PII Handling:
- Approach: Minimize collection, encrypt at rest, tokenize payments
- GDPR: Basic patterns detected (data_privacy in complyai-core)
- Retention: Configurable per data type
- Deletion: Soft delete with anonymization (needs verification)
- Access Logging: Comprehensive audit trails (8+ services)
Secrets Management:
- Current: Environment variables, google-cloud-secret-manager (maestro, gong)
- Issues:
- Hardcoded secrets in 3 services (maestro, CMS, adhoc) ⚠️ CRITICAL
- Inconsistent approach across services
- Recommendation: Migrate all to AWS Secrets Manager or HashiCorp Vault
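A low-friction path to that migration is a single helper every service resolves secrets through, preferring the managed store and falling back to the environment during cutover. This is a sketch with injectable fetchers; in production the first fetcher would wrap, e.g., boto3's Secrets Manager `get_secret_value` call:

```python
import os


def get_secret(name, fetchers=()):
    """Resolve a secret by trying each fetcher in order, then the environment.

    fetchers: callables taking a secret name and returning its value or None.
    Making them injectable keeps this testable and lets services migrate to a
    managed store (AWS Secrets Manager, Vault) without touching call sites.
    """
    for fetch in fetchers:
        try:
            value = fetch(name)
            if value is not None:
                return value
        except Exception:
            continue  # fall through to the next source
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} not found in any source")
    return value
```

Once every hardcoded secret is routed through this one entry point, rotating the backing store becomes a one-line change.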
Integration Architecture
External Integrations
| System | Integration Type | Purpose | Protocol | Auth Method | Services Using |
|---|---|---|---|---|---|
| Facebook Business API | REST API | Ad account data, creative fetch | HTTPS | OAuth2 + API Key | api-async, complyai-adhoc |
| Stripe | REST API + Webhooks | Payment processing | HTTPS | API Key | complyai-api, complyai-core, frontend |
| Auth0 | OAuth2/OIDC | User authentication | HTTPS | Client credentials | complyai-frontend |
| Google Sheets | REST API | Data export/import | HTTPS | OAuth2 | complyai-api |
| Slack | Webhooks | Notifications, alerts | HTTPS | Webhook URL | 8 services (widespread) |
| BigQuery | Cloud SDK | Analytics, data warehouse | gRPC/HTTPS | Service Account | complyai-maestro |
| Google Vertex AI | Cloud SDK | Managed ML inference | gRPC/HTTPS | Service Account | complyai-gong |
| AWS S3 | SDK (boto3) | File storage, models | HTTPS | IAM Role/Keys | 8 services |
| SQS | SDK (boto3) | Message queue (Celery) | HTTPS | IAM Role/Keys | api-async, maestro |
Internal Service Communication
Synchronous Communication:
- Primary: REST APIs (JSON over HTTPS)
- Pattern: complyai-frontend → complyai-api → complyai-core → PostgreSQL
- Libraries: requests, Flask built-in
- Issues:
- No API gateway (direct service-to-service)
- Inconsistent error handling
- Missing circuit breakers
Asynchronous Communication:
- Primary: Celery task queues (SQS/Redis backend)
- Pattern: API triggers → Celery task → Worker execution → Result storage
- Use Cases:
- Facebook ad data fetching (fetch_fb_current_ad_data, fetch_fb_total_ad_data)
- AI score generation (generate_ai_score)
- Batch processing (spend_script)
- Health monitoring (heart_beat)
- Orchestration: complyai-maestro coordinates complex workflows
Service Discovery:
- Method: Environment variables with service URLs
- Limitations: No dynamic service discovery (Consul, Eureka)
- Configuration: Hardcoded endpoints in config files
Load Balancing:
- Approach: Likely ALB/NLB (AWS) or Cloud Load Balancer (GCP)
- Level: L7 (application level)
- Worker Scaling: Celery workers auto-scale based on queue depth
Inter-Service Data Sharing:
- Shared Models: complyai-core provides common models for API services
- Shared Database: PostgreSQL shared across complyai-api, api-async, core
- Pattern: Shared database anti-pattern (tight coupling) ⚠️
- Recommendation: Move to API-based communication with separate databases
API Design
API Style:
- Primary: REST with JSON payloads
- Potential: GraphQL not detected (could benefit frontend)
- Documentation: Missing (needs OpenAPI/Swagger) ⚠️
API Endpoints (complyai-api):
- /auth/* - Authentication (v1 and v2 versions)
- /precheck/* - Pre-compliance validation
- /checker/* - Compliance checking
- /maestro/* - Workflow orchestration
- /policies/* - Policy management
- /prompts/* - AI prompt management
Versioning Strategy:
- Current: URL-based versioning (v1, v2 for auth)
- Inconsistency: Not applied uniformly across endpoints
- Recommendation: Standardize v1 prefix for all endpoints
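One low-risk way to apply that standardization is a thin WSGI shim in front of each Flask app that rewrites legacy unversioned paths, so old clients keep working while all routes are registered under /v1. A sketch (the legacy prefix list mirrors the endpoint groups above; treat it as illustrative):

```python
class VersionPrefixMiddleware:
    """Rewrite legacy unversioned paths to their /v1 equivalents."""

    LEGACY_PREFIXES = ("/precheck", "/checker", "/maestro", "/policies", "/prompts")

    def __init__(self, app):
        self.app = app  # the wrapped WSGI application (e.g. a Flask app)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.LEGACY_PREFIXES):
            environ["PATH_INFO"] = "/v1" + path  # transparent rewrite
        return self.app(environ, start_response)
```

Wrapping is one line per service (`app.wsgi_app = VersionPrefixMiddleware(app.wsgi_app)` in Flask), and the shim can be deleted once clients migrate.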
Authentication Method:
- Frontend: Auth0 (OAuth2/OIDC) with JWT tokens
- API-to-API: JWT tokens, API keys
- ML Services: No auth detected ⚠️ CRITICAL
- External: OAuth2 for Facebook, Stripe API keys
Rate Limiting:
- Implementation: ratelimiter.py files detected in ML services
- Status: Configured but not consistently applied
- Missing: API gateway rate limiting
- Recommendation: Implement at gateway level with Redis backend
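The gateway-level limiter recommended here is typically a sliding window keyed per client. A self-contained sketch with an injectable clock; in production the counters would live in Redis (e.g. INCR plus EXPIRE) so all gateway instances share state:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for deterministic tests
        self.hits = defaultdict(deque)  # client_id -> recent request timestamps

    def allow(self, client_id):
        now = self.clock()
        q = self.hits[client_id]
        while q and now - q[0] >= self.window:
            q.popleft()  # drop requests that slid out of the window
        if len(q) >= self.limit:
            return False  # over the limit: caller should return HTTP 429
        q.append(now)
        return True
```

The in-process deque is the only Redis-specific part to swap out, which keeps the policy logic identical across the existing per-service `ratelimiter.py` files and a future gateway.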
Error Handling:
- Standard: HTTP status codes
- Response Format: JSON with error messages
- Logging: Comprehensive logging across services
- Issue: Inconsistent error schemas
Pagination:
- Detection: Not explicitly documented
- Recommendation: Implement cursor-based pagination for large datasets
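Cursor-based pagination can be sketched as an opaque cursor encoding the last-seen id; the SQL equivalent is `WHERE id > :after ORDER BY id LIMIT :page_size`, which stays cheap even deep into a result set (unlike OFFSET). The row shape below is illustrative:

```python
import base64
import json


def encode_cursor(last_id):
    """Opaque cursor so clients cannot depend on its internals."""
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["after"]


def paginate(rows, cursor=None, page_size=2):
    """rows must be sorted by 'id'; returns (page, next_cursor)."""
    after = decode_cursor(cursor) if cursor else None
    window = [r for r in rows if after is None or r["id"] > after]
    page = window[:page_size]
    # Only hand out a cursor when more rows remain past this page.
    next_cursor = encode_cursor(page[-1]["id"]) if len(window) > page_size else None
    return page, next_cursor
```

Because the cursor pins a position rather than an offset, inserts and deletes between requests cannot cause skipped or duplicated rows.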
Security Architecture
Security Layers
1. Network Security
- Firewalls: VPC security groups (AWS/GCP)
- VPC Configuration: Private subnets for databases, public for frontends
- Network Segmentation: Services isolated by VPC/subnet
- Issues:
- Insecure HTTP in 21 occurrences (7 services) ⚠️
- ML services potentially exposed without auth ⚠️
2. Application Security
- Authentication:
- Frontend: Auth0 (MFA support, OAuth2/OIDC)
- Backend: Flask-Security-Too, PyJWT, Authlib
- Detection: 14 authentication patterns across 4 services
- Authorization:
- Role-based access control (RBAC)
- Flask-Security decorators
- Missing in ML services ⚠️
- Input Validation:
- Dedicated validation module in complyai-core
- Flask-WTF forms (inferred)
- Inconsistent application across services
- Output Encoding:
- React auto-escaping (XSS protection)
- Flask Jinja2 auto-escaping
- CSRF Protection:
- Flask-WTF CSRF tokens (inferred)
- React SPA pattern (token-based)
3. Data Security
- Encryption at Rest:
- Dedicated encryption module (complyai-core)
- 32 encryption patterns detected across 5 services
- AES-256 for sensitive fields
- Encryption in Transit:
- TLS 1.2+ for external communication
- Issues with internal HTTP (21 occurrences) ⚠️
- Tokenization:
- Stripe payment tokenization (PCI compliance)
- Access Controls:
- Database role-based access
- Flask-Security permissions
- Audit Logging:
- Comprehensive across services (8 services, multiple files)
- Tracks user actions, system events
- Retention: 7 years (compliance)
Security Controls
| Control | Implementation | Status | Services | Priority |
|---|---|---|---|---|
| Multi-Factor Auth | Auth0 MFA | ✅ Implemented | Frontend | ✅ |
| Encryption at Rest | AES-256 (core module) | ✅ Implemented | Core, API | ✅ |
| Encryption in Transit | TLS 1.2+ | ⚠️ Partial (21 HTTP issues) | All | 🔴 High |
| API Authentication | JWT, OAuth2 | ⚠️ Partial (ML missing) | API, Frontend | 🔴 High |
| Rate Limiting | ratelimiter.py | ⚠️ Inconsistent | ML services | 🟡 Medium |
| Input Validation | Validation module | ✅ Implemented | Core, API | ✅ |
| Audit Logging | Python logging | ✅ Comprehensive | 8 services | ✅ |
| Secrets Management | GCP Secret Manager | ⚠️ Partial (3 hardcoded) | Maestro, Gong | 🔴 Critical |
| Dependency Scanning | Dependabot potential | ❌ Not detected | All | 🟡 Medium |
| SAST | Not detected | ❌ Missing | All | 🟡 Medium |
| Penetration Testing | Not detected | ❌ Missing | All | 🟡 Medium |
| WAF | Not detected | ❌ Missing | Frontend | 🟡 Medium |
| DDoS Protection | CloudFlare/AWS Shield potential | ❓ Unknown | Frontend | 🟡 Medium |
Critical Security Issues:
- Hardcoded Secrets (3 services): maestro/config.py, cms/dev.py, cms/staging.py, adhoc/slack_data ⚠️ CRITICAL
- Insecure Communication (21 occurrences, 7 services): HTTP instead of HTTPS ⚠️ HIGH
- Missing ML Auth (3 services): triangle, violin, ipu have no authentication ⚠️ HIGH
- Inconsistent Rate Limiting: Not uniformly applied ⚠️ MEDIUM
- No API Gateway: Direct service exposure increases attack surface ⚠️ MEDIUM
Compliance Requirements
Current Compliance Status:
- GDPR (Partial): Data privacy patterns detected, needs full audit
- HIPAA: Not applicable (no healthcare data)
- SOC 2 Type 1: Not achieved (required for Series A)
- SOC 2 Type 2: Future requirement
- ISO 27001: Not pursued
- PCI DSS (Delegated): Stripe handles, no card data storage ✅
GDPR Compliance Measures:
- Data privacy patterns in complyai-core
- Audit logging (right to access)
- Encryption at rest (data protection)
- OAuth2 consent flows
- Gaps:
- No documented data retention policies
- Missing right to erasure implementation
- No data processing agreements documented
Audit Requirements:
- Comprehensive audit logging (8 services)
- 7-year retention (financial compliance)
- User action tracking
- System event logging
- Gaps:
- No centralized audit log aggregation
- Missing audit log integrity verification
- No automated compliance reporting
Data Residency:
- Cross-cloud deployment (AWS + GCP)
- No explicit data residency controls
- Potential GDPR residency issues ⚠️
Scalability & Performance
Scalability Strategy
Horizontal Scaling:
- API Services: Gunicorn workers, multiple instances behind load balancer
- Celery Workers: Auto-scaling based on queue depth
- ML Services: Multiple inference instances (currently limited)
- Frontend: Static CDN, infinite scale potential
- Database: Read replicas needed (not currently implemented) ⚠️
Vertical Scaling:
- Current: Single large instances for databases
- Limitations: PostgreSQL vertical limit, eventual need for sharding
- ML Services: GPU instances for inference (costly)
Database Scaling:
- Current: Single PostgreSQL instance (shared anti-pattern)
- Read Replicas: Not implemented ⚠️
- Sharding: Not implemented
- Connection Pooling: SQLAlchemy pooling
- Recommendation:
- Implement read replicas immediately
- Consider database per service for microservices isolation
- Future: PostgreSQL partitioning or migration to distributed DB
Caching Strategy:
- Current: TanStack Query (frontend client-side cache)
- Missing:
- Redis for API caching ⚠️
- Database query caching
- ML inference result caching
- Recommendation: Implement Redis cluster for:
- Session storage
- API response caching
- Celery broker (if not using SQS)
- ML prediction caching
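The caching layer recommended above can be introduced incrementally as a decorator around expensive calls (API responses, ML predictions). A process-local sketch with an injectable clock; in production the store would be Redis (SETEX key ttl value) so all instances share the cache:

```python
import functools
import time


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache a function's results per argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (cached_at, value); Redis in production

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit and now - hit[0] < ttl_seconds:
                return hit[1]  # fresh entry: skip the expensive call
            value = func(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```

A hypothetical use: wrapping an ML scoring call with `@ttl_cache(ttl_seconds=300)` would make repeat checks of the same creative free within a five-minute window.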
CDN & Static Assets:
- Frontend: S3 + CloudFront (inferred)
- CMS Media: S3 with CDN
- Optimization: Image optimization, lazy loading (frontend)
Auto-Scaling Configuration:
- Triggers: CPU > 70%, Memory > 80%, Queue depth > 1000
- Min Instances: 2 (high availability)
- Max Instances: Dynamic based on service
- Cool-down: 5 minutes
Performance Targets
| Metric | Target | Current | Gap | Service |
|---|---|---|---|---|
| API Response Time (P95) | < 200ms | Unknown ⚠️ | Needs measurement | complyai-api |
| API Response Time (P99) | < 500ms | Unknown ⚠️ | Needs measurement | complyai-api |
| ML Inference Latency | < 1s | Unknown ⚠️ | Needs measurement | triangle/violin/ipu |
| Async Task Processing | < 5min | Unknown ⚠️ | Needs measurement | api-async |
| Throughput | 10K req/day | Unknown ⚠️ | Needs measurement | Platform |
| Concurrent Users | 1,000 | Unknown ⚠️ | Needs measurement | Frontend |
| Database Query Time | < 50ms | Unknown ⚠️ | Needs measurement | PostgreSQL |
| Frontend Load Time | < 2s | Unknown ⚠️ | Web Vitals present | Frontend |
| Uptime | 99.9% | Unknown ⚠️ | Needs SLA tracking | All services |
| Error Rate | < 0.1% | Unknown ⚠️ | Needs monitoring | All services |
Performance Monitoring Gaps:
- No APM (Application Performance Monitoring) ⚠️
- No distributed tracing ⚠️
- Limited metrics collection
- No performance budgets
- Missing SLA tracking
Bottlenecks
Current Bottlenecks:
1. Database (HIGH IMPACT)
- Issue: Single shared PostgreSQL instance across API, async, core
- Symptoms: Potential connection exhaustion, query contention
- Mitigation: Connection pooling (SQLAlchemy)
- Solution: Read replicas, database per service, query optimization
2. ML Inference (HIGH IMPACT)
- Issue: Cold start latency, GPU resource constraints
- Symptoms: Variable response times, potential timeouts
- Mitigation: Model loading optimization
- Solution: Model warming, inference result caching, model optimization (ONNX)
3. External API Rate Limits (MEDIUM IMPACT)
- Issue: Facebook API rate limits, Stripe API limits
- Symptoms: Async task delays, batch processing slowdowns
- Mitigation: Backoff and retry logic
- Solution: Request batching, caching, circuit breakers
4. Celery Task Queue (MEDIUM IMPACT)
- Issue: Queue depth during peak loads
- Symptoms: Task processing delays
- Mitigation: Worker auto-scaling
- Solution: Priority queues, task optimization, parallel processing
5. Shared Database Writes (MEDIUM IMPACT)
- Issue: Write contention on shared PostgreSQL
- Symptoms: Lock timeouts, slow writes
- Mitigation: Transaction optimization
- Solution: Event sourcing, CQRS pattern, database sharding
6. No Caching Layer (MEDIUM IMPACT)
- Issue: Repeated database queries, API calls
- Symptoms: Higher latency, database load
- Mitigation: Application-level memoization
- Solution: Redis cache cluster, query result caching
7. Monolithic Frontend Bundle (LOW IMPACT)
- Issue: Large JavaScript bundle size
- Symptoms: Slow initial page load
- Mitigation: Code splitting (Vite default)
- Solution: Lazy loading, route-based splitting, tree shaking
Performance Optimization Roadmap:
- Immediate (Week 1-2): Implement APM, establish baselines, identify hot paths
- Short-term (Month 1-2): Redis caching, database read replicas, ML result caching
- Medium-term (Month 3-4): Database per service, query optimization, CDN optimization
- Long-term (Month 5-6): Distributed tracing, database sharding, ML model optimization
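Several mitigations above reference backoff and retry logic for rate-limited external APIs. A minimal sketch of exponential backoff with jitter (helper name and defaults are illustrative, not taken from the codebase):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Randomized jitter prevents synchronized retries (thundering herd).
            sleep(delay * random.uniform(0.5, 1.0))
```

Injecting `sleep` keeps the helper testable; production callers would simply use the default `time.sleep`.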
Reliability & Availability
High Availability Design
Redundancy:
- API Services: Multi-AZ deployment (assumed), 2+ instances minimum
- Databases: PostgreSQL failover (needs verification) ⚠️
- ML Services: Multiple inference instances
- Frontend: CDN (globally distributed)
- Gaps:
- No documented multi-region deployment
- Single database instance (SPOF) ⚠️
Failover Mechanisms:
- Load Balancer: Health check-based failover (ALB/NLB)
- Database: PostgreSQL streaming replication (needs verification) ⚠️
- Celery: Worker redundancy, task retry logic
- Frontend: CDN failover (automatic)
- Issues: No documented failover procedures
Load Distribution:
- Method: Round-robin load balancing (assumed)
- Sticky Sessions: Not required (stateless APIs)
- WebSocket: Not detected (if needed, requires session affinity)
Health Checks:
- Implementation:
- /health endpoints (inferred)
- heart_beat Celery task (api-async)
- Gunicorn worker health
- Frequency: Every 30 seconds (typical)
- Actions: Auto-recovery, instance replacement
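Assuming the inferred /health endpoints aggregate per-component probes, the response payload might be assembled like this. A hedged sketch: the `health_status` helper and the component names are hypothetical, not confirmed from the services.

```python
import time

def health_status(checks: dict) -> dict:
    """Aggregate component probes into a /health-style payload.

    `checks` maps a component name to a zero-arg callable that raises on failure
    (e.g. a database ping or a Celery heartbeat lookup).
    """
    components = {}
    for name, probe in checks.items():
        try:
            probe()
            components[name] = "ok"
        except Exception as exc:
            components[name] = f"error: {exc}"
    healthy = all(v == "ok" for v in components.values())
    return {
        "status": "healthy" if healthy else "degraded",
        "components": components,
        "checked_at": int(time.time()),
    }
```

A load balancer health check would treat any non-"healthy" status as a signal to stop routing traffic to the instance.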
Circuit Breakers:
- Status: Not implemented ⚠️
- Need: External API calls (Facebook, Stripe), ML services
- Recommendation: Implement with exponential backoff
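The recommended circuit breaker could start from a sketch like the following. This is a simplified state machine under stated assumptions: the class name, thresholds, and half-open behavior are illustrative, not the platform's implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    half-opens after a cooldown, and closes again on the next success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open protects Celery workers from piling up on a dead Facebook or Stripe endpoint; combined with backoff-and-retry it bounds both latency and load during an outage.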
Disaster Recovery
Backup Strategy:
- PostgreSQL:
- Frequency: Daily full, hourly incremental (assumed)
- Retention: 30 days
- Method: pg_dump or AWS RDS automated backups
- Testing: Not documented ⚠️
- S3 Data:
- Versioning enabled (assumed)
- Cross-region replication (not verified) ⚠️
- Application State:
- Stateless design (good for DR)
- Gaps:
- No documented backup testing procedures ⚠️
- No backup encryption verification
Recovery Time Objective (RTO):
- Target: < 1 hour
- Current: Unknown (needs DR testing) ⚠️
- Components:
- Database restore: 15-30 minutes
- Service redeployment: 10-15 minutes
- DNS propagation: 5-10 minutes
Recovery Point Objective (RPO):
- Target: < 1 hour (hourly backups)
- Current: Unknown ⚠️
- Critical Data: Financial transactions, compliance results
- Acceptable Loss: Minimal for compliance data
DR Testing:
- Frequency: Not documented ⚠️ (should be quarterly)
- Method: Not documented ⚠️
- Last Test: Unknown ⚠️
- Recommendation: Implement quarterly DR drills with runbooks
DR Runbooks:
- Status: Not detected ⚠️ CRITICAL
- Required:
- Database failover procedure
- Service restoration steps
- Data recovery process
- Communication plan
- Rollback procedures
Monitoring & Observability
Logging:
- Tool: Python logging module (comprehensive across services)
- Aggregation: Not centralized ⚠️ (needs ELK, Datadog, or CloudWatch)
- Coverage:
- complyai-api: 8 files with logging
- api-async: 35 files (most comprehensive)
- complyai-core: 9 files
- Other services: 3+ files each
- Retention: File-based (90 days assumed)
- Structure: Unstructured logs (needs JSON logging)
- Gaps:
- No log aggregation platform ⚠️
- No log-based alerting
- No correlation IDs for distributed tracing
Metrics:
- Frontend:
- Web Vitals (performance monitoring)
- FullStory (user session recording)
- Backend: Not implemented ⚠️
- Database: PostgreSQL stats (not centralized)
- Infrastructure: Cloud provider metrics (AWS CloudWatch, GCP Monitoring)
- Gaps:
- No application metrics (request counts, latencies) ⚠️
- No business metrics dashboard
- No SLA tracking
Tracing:
- Status: Not implemented ⚠️ CRITICAL
- Need: Distributed tracing for microservices
- Recommendation: Implement OpenTelemetry, Jaeger, or Datadog APM
- Benefits: Request flow visualization, bottleneck identification
Alerting:
- Tool: Slack webhooks (8 services)
- Scope: Error notifications, task completions
- Channels: Development team Slack
- Issues:
- No PagerDuty/OpsGenie for on-call ⚠️
- No severity classification
- No escalation procedures
- Alert fatigue potential (Slack-only)
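Severity classification could start as a simple routing table in front of the existing Slack webhooks. The mapping below is a hypothetical example, not the current configuration; channel names and the PagerDuty target are placeholders.

```python
SEVERITY_ROUTES = {
    # Hypothetical routing; real channels/tools would be configured per team.
    "critical": ["pagerduty", "slack:#incidents"],
    "error":    ["slack:#alerts"],
    "warning":  ["slack:#alerts"],
    "info":     [],  # logged only, no notification -> reduces alert fatigue
}

def route_alert(severity: str, message: str) -> list:
    """Fan an alert out to the channels mapped to its severity."""
    channels = SEVERITY_ROUTES.get(severity)
    if channels is None:
        raise ValueError(f"unknown severity: {severity}")
    return [{"channel": c, "severity": severity, "message": message}
            for c in channels]
```

Routing only `critical` to an on-call tool while demoting routine notifications to a shared channel addresses both the missing escalation path and the Slack-only fatigue risk.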
Observability Maturity:
- Logging: 6/10 (comprehensive but not aggregated)
- Metrics: 3/10 (minimal, frontend-only)
- Tracing: 0/10 (not implemented)
- Alerting: 4/10 (basic Slack notifications)
- Overall: 3.25/10 (needs significant improvement)
Observability Roadmap:
- Phase 1: Centralized logging (ELK or Datadog), structured JSON logs
- Phase 2: Application metrics (Prometheus + Grafana or Datadog)
- Phase 3: Distributed tracing (OpenTelemetry)
- Phase 4: Advanced alerting (PagerDuty integration, SLA tracking)
Deployment Architecture
Environments
| Environment | Purpose | Configuration | Services | Data |
|---|---|---|---|---|
| Development | Developer work | Single instance, possibly SQLite | All services, docker-compose | Synthetic/test data |
| Staging | Pre-production testing | Production-like, smaller scale | All services | Anonymized production data |
| Production | Live customer system | Highly available, multi-AZ | All services | Live customer data |
| QA/Test | Automated testing | CI/CD ephemeral | Critical services | Test fixtures |
Environment Configuration:
- Method: Environment variables, .env files
- Issues:
- Hardcoded values in staging (cms/staging.py) ⚠️
- Inconsistent config management
- Secrets: Google Cloud Secret Manager (partial), hardcoded (3 services) ⚠️
Infrastructure Differences:
- Development: Local or single cloud VM, minimal resources
- Staging: AWS/GCP, production topology, 50% capacity
- Production: Multi-AZ, auto-scaling, high redundancy
CI/CD Pipeline
Flow: Developer commits → git push to branch → GitHub Actions triggered → build & test.

Build & Test Stage:
- Linting (ESLint, Python linters)
- Unit tests (pytest, Jest)
- Integration tests
- Build Docker images
- Quality gates: coverage > X%, no critical issues

Deployment Stage:
- Staging deployment: deploy to staging environment, run E2E tests (Cypress), smoke tests, manual approval gate
- Production deployment: rolling deployment (or blue-green), health checks, automated rollback on failure, Slack notification
Current CI/CD Status:
- Platform: GitHub Actions (.github repository)
- Trigger: Git push, pull request
- Testing: pytest (backend), Jest (frontend), Cypress (E2E - configured)
- Quality Gates:
- Linting: ESLint (frontend), Python linters (backend)
- Code formatting: Prettier (frontend), Husky hooks
- Commit validation: Commitlint (conventional commits)
- Test coverage: gates undermined by low coverage (8.5% avg) ⚠️
- Deployment: Docker images, container deployment
- Issues:
- No documented deployment workflows ⚠️
- Test coverage too low for reliable gates
- No security scanning in pipeline ⚠️
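A coverage quality gate can be as small as a script CI runs after the test suite. This is an illustrative sketch, assuming the measured percentage is passed in on the command line; the 40% threshold mirrors the Month 1 coverage target.

```python
import sys

def coverage_gate(coverage_pct: float, threshold: float = 40.0) -> bool:
    """True when measured coverage meets the gate threshold."""
    return coverage_pct >= threshold

def main(argv: list) -> int:
    # Hypothetical CI invocation: python coverage_gate.py <measured-percent>
    pct = float(argv[1])
    threshold = 40.0
    if not coverage_gate(pct, threshold):
        print(f"FAIL: coverage {pct:.1f}% is below the {threshold:.0f}% gate")
        return 1  # nonzero exit fails the CI job
    print(f"PASS: coverage {pct:.1f}%")
    return 0
```

The nonzero return code is what makes GitHub Actions mark the job failed, so the gate is enforced rather than advisory.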
Deployment Strategy
Candidate strategies:
- Blue-Green
- Canary (10% → 50% → 100%)
- Rolling (likely current approach)
- Recreation
- A/B Testing
Rolling Deployment Process:
- Build and push Docker image
- Update 1 instance with new version
- Health check validation
- Continue to next instance
- Full fleet updated incrementally
- Automated rollback on failure
Deployment Frequency:
- Current: Unknown, likely weekly/bi-weekly
- Target for Scale: Daily (continuous deployment)
Deployment Automation:
- Containerization: Docker (Dockerfiles present)
- Orchestration: Not explicitly documented (ECS, GKE, or Kubernetes)
- Configuration: Environment variables, secrets management
- Zero-Downtime: Rolling updates, health checks
Infrastructure as Code
Status: Not explicitly detected ⚠️
Recommendations:
- Tool: Terraform (preferred) or AWS CloudFormation/GCP Deployment Manager
- Repository: Separate infra repo or monorepo /infrastructure directory
- Coverage:
- VPC and networking
- Compute instances (ECS/GKE)
- Databases (RDS/Cloud SQL)
- Load balancers
- S3 buckets, IAM roles
- Versioning: Git-based, tagged releases
- State Management: Terraform Cloud or S3 backend with locking
Current Gaps:
- No IaC detected (manual provisioning risk) ⚠️
- No infrastructure versioning
- Difficult disaster recovery
- Inconsistent environments
Cost Architecture
Cost Breakdown (Estimated)
| Component | Monthly Cost | % of Total | Optimization Opportunity |
|---|---|---|---|
| Compute (EC2/GCE) | $3,000 - $5,000 | ~36% | Right-sizing instances, spot instances for dev/staging |
| Database (RDS/Cloud SQL) | $1,500 - $2,500 | ~18% | Reserved instances, read replica optimization |
| ML Inference (GPU) | $2,000 - $4,000 | ~27% | Model optimization, serverless inference, caching |
| Storage (S3/GCS) | $500 - $1,000 | ~7% | Lifecycle policies, compression, tiering |
| Data Transfer | $300 - $600 | ~4% | CloudFront/CDN, cross-region optimization |
| External APIs | $200 - $500 | ~3% | Rate limiting, caching, batch requests |
| Monitoring/Logging | $100 - $300 | ~2% | Log retention policies, sampling |
| Other (SQS, Lambda, etc.) | $200 - $400 | ~3% | - |
| Total Estimated | $7,800 - $14,300 | 100% | Potential 30-40% savings |
Cost Drivers:
- Compute Sprawl: Largest line item (~36%); 13 separate services invite over-provisioning
- ML GPU Instances: Second-largest (~27%), driven by always-on inference servers
- Cross-Cloud: Data transfer between AWS and GCP
- No Caching: Redundant computations and API calls
Cost Optimization Strategies
Immediate (Month 1):
1. ML Inference Optimization ($800-1,500/month savings)
- Implement prediction caching (Redis)
- Use CPU instances for lower-volume models
- Batch inference requests
- Implement model quantization (ONNX)
2. Right-Sizing Instances ($500-1,000/month savings)
- Audit instance utilization
- Downsize over-provisioned instances
- Use burstable instances (T3, E2) for variable workloads
3. Reserved Instances ($400-800/month savings)
- 1-year RDS reserved instances (40% discount)
- 1-year compute reserved instances for production
- Savings Plans for flexible compute
Short-term (Month 2-3):
4. Service Consolidation ($600-1,200/month savings)
- Merge ML services (3 → 1): eliminate 2 instance costs
- Merge API services (2 → 1): consolidate infrastructure
- Reduce operational overhead
5. Storage Optimization ($200-400/month savings)
- S3 lifecycle policies (Standard → IA → Glacier)
- Compress large files (images, videos)
- Delete unused/old data
6. Data Transfer Optimization ($150-300/month savings)
- Minimize cross-cloud transfer (AWS ↔ GCP)
- Use CloudFront/CDN for static assets
- Compress API responses
Medium-term (Month 4-6):
7. Serverless Migration ($1,000-2,000/month savings)
- Migrate low-volume endpoints to Lambda/Cloud Functions
- Use SageMaker/Vertex AI serverless inference
- Pay only for actual usage
8. Auto-Scaling Optimization ($400-800/month savings)
- Aggressive scale-down during off-hours
- Use spot instances for non-critical workloads
- Implement predictive scaling
9. Database Optimization ($300-600/month savings)
- Query optimization (reduce IOPS)
- Connection pooling tuning
- Consider Aurora Serverless for variable loads
Total Potential Savings: $4,350-8,600/month (35-60%)
Cost Monitoring:
- Current: Cloud provider dashboards (AWS Cost Explorer, GCP Billing)
- Recommendation:
- Implement cost allocation tags by service
- Set up budget alerts
- Use CloudHealth or Cloudability for multi-cloud visibility
- Create cost dashboards for engineering team visibility
Design Decisions (ADRs)
ADR-001: Microservices Architecture with Musical Naming
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed to scale different components independently (ML models, API, async processing) while allowing parallel team development. The platform required flexibility to deploy and update services without impacting the entire system.
Decision: Implement microservices architecture with separate repositories for each service. Adopt "musical" naming convention (triangle, violin, maestro, gong, ipu) suggesting orchestrated workflows and harmony between services.
Consequences:
- Positive:
- Independent scaling of services (ML vs API)
- Parallel team development
- Technology flexibility (Flask + Django + React)
- Fault isolation
- Clear orchestration metaphor
- Negative:
- Service fragmentation (13 repos)
- Operational complexity
- Increased deployment overhead
- Naming ambiguity (purpose unclear from name)
- Higher infrastructure costs
- Shared database coupling
Alternatives Considered:
- Monolithic architecture (rejected: scaling limitations)
- Modular monolith (rejected: deployment coupling)
- Functional naming (rejected in favor of thematic consistency)
Current Recommendation: Consolidate to 5-6 core services while maintaining microservices benefits
ADR-002: Dual API Pattern (Sync + Async)
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed to handle both real-time user requests (<200ms) and long-running operations (Facebook data fetch, ML inference, batch processing). Single API couldn't efficiently serve both patterns.
Decision: Implement two separate API services:
- complyai-api: Synchronous REST API for real-time operations
- api-async: Asynchronous API with Celery for background tasks
Consequences:
- Positive:
- Optimized performance for each pattern
- Async processing doesn't block sync requests
- Independent scaling
- Clear separation of concerns
- Negative:
- Code duplication (shared models via complyai-core)
- Shared database (tight coupling)
- Deployment complexity (2 services vs 1)
- Inconsistent patterns across APIs
Alternatives Considered:
- Single API with async task queue (rejected: mixing concerns)
- API Gateway with backend services (not evaluated initially)
- GraphQL with subscriptions (rejected: team expertise)
Current Recommendation: Consolidate into unified API Gateway with async capabilities built-in
ADR-003: Three Separate ML Inference Services
Status: Accepted (Historical), Under Review
Date: Unknown (pre-2024)
Context: ML compliance checking required specialized models for different domains (text, image, policy analysis). Models had different resource requirements and update cycles.
Decision: Create three separate ML inference services (triangle, violin, ipu) each with identical tech stack (TensorFlow, PyTorch, Flask, chromadb) but different models.
Consequences:
- Positive:
- Model-specific optimization
- Independent deployment/updates
- Fault isolation (one model failure doesn't affect others)
- Specialized resource allocation (GPU vs CPU)
- Negative:
- 3x infrastructure costs
- Code duplication (551 + 357 + 418 = 1,326 LOC, mostly identical)
- Operational overhead (3 services to monitor/deploy)
- No centralized model registry
- Inconsistent security (no auth)
Alternatives Considered:
- Single ML service with model routing (rejected: complexity concerns)
- Serverless inference (not evaluated: team expertise)
- Managed ML platform (partially adopted with Vertex AI in gong)
Current Recommendation: CONSOLIDATE into unified ML platform with model registry
ADR-004: Cross-Cloud Deployment (AWS + GCP)
Status: Accepted (Historical), Under Review
Date: Unknown (pre-2024)
Context: ComplyAI started on one cloud provider but needed specific services from another (e.g., BigQuery for analytics, Vertex AI for managed ML).
Decision: Deploy across both AWS and GCP:
- AWS: API services, S3 storage, SQS queues
- GCP: Core engine, ML services, BigQuery, Vertex AI
Consequences:
- Positive:
- Best-of-breed services from each cloud
- Vendor lock-in reduction
- Access to GCP's AI/ML platforms
- BigQuery for powerful analytics
- Negative:
- Increased complexity (dual cloud management)
- Higher data transfer costs (inter-cloud)
- Dual IAM/security management
- Team needs expertise in both clouds
- Disaster recovery complexity
Alternatives Considered:
- AWS-only with Redshift + SageMaker (rejected: GCP AI preference)
- GCP-only migration (rejected: AWS investment)
- Multi-cloud with abstraction layer (rejected: overhead)
Current Recommendation: Consolidate to single cloud or implement clear service boundaries
ADR-005: Shared Database Pattern
Status: Accepted (Historical), Deprecated
Date: Unknown (pre-2024)
Context: complyai-api, api-async, and complyai-core needed access to the same data models. Team prioritized rapid development over strict service isolation.
Decision: Share a single PostgreSQL database across complyai-api, api-async, and complyai-core services. Use complyai-core for shared model definitions.
Consequences:
- Positive:
- No data synchronization needed
- Shared transactions (ACID compliance)
- Faster initial development
- Simplified data access
- Negative:
- Tight coupling (microservices anti-pattern) ⚠️
- Schema changes affect all services
- Single point of failure
- Scaling bottleneck
- Can't independently deploy database migrations
Alternatives Considered:
- Database per service with API communication (rejected: development speed)
- Event sourcing with CQRS (rejected: complexity)
- GraphQL federation (not evaluated)
Current Status: DEPRECATED - Violates microservices principles
Recommendation: Migrate to database per service with API-based communication
ADR-006: Django/Wagtail CMS Alongside Flask Services
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed robust content management for regulatory content, documentation, and knowledge base. Wagtail CMS provided rich editing experience but required Django.
Decision: Introduce Django/Wagtail as separate service (complyai_cms) despite rest of backend using Flask.
Consequences:
- Positive:
- Powerful CMS with editorial workflow
- Rich content editing experience
- Version control and publishing workflow
- Strong community and plugins
- Negative:
- Technology stack fragmentation (Flask + Django)
- Different ORM patterns (SQLAlchemy vs Django ORM)
- Separate deployment pipeline
- Team needs Django expertise
- Different security patterns
Alternatives Considered:
- Flask-Admin (rejected: insufficient CMS features)
- Headless CMS (Contentful, Strapi) (not evaluated initially)
- Custom Flask CMS (rejected: development time)
Current Recommendation: Evaluate headless CMS migration to unify tech stack
ADR-007: Frontend Technology Split (App vs Marketing)
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed both a feature-rich application (complyai-frontend) and a marketing website (www) with different requirements and update cycles.
Decision: Build two separate React applications:
- complyai-frontend: TypeScript, Vite, complex state management (Zustand, TanStack Query)
- www: JavaScript, Create React App, simpler marketing site
Consequences:
- Positive:
- Optimized for different use cases
- Independent deployment/updates
- Smaller bundle size for marketing site
- Clear separation of concerns
- Negative:
- Code/component duplication
- Inconsistent user experience potential
- Duplicate dependencies/tooling
- 2 deployment pipelines
Alternatives Considered:
- Monorepo with shared components (not evaluated initially)
- Unified app with marketing routes (rejected: bundle size)
- Next.js for both (not evaluated: team expertise)
Current Recommendation: Maintain separation but share component library via monorepo
ADR-008: Auth0 for Authentication
Status: Accepted
Date: Unknown (2024 or earlier)
Context: ComplyAI needed enterprise-grade authentication with MFA, OAuth2, social logins, and compliance features without building custom auth system.
Decision: Adopt Auth0 as managed authentication provider for complyai-frontend.
Consequences:
- Positive:
- Enterprise-grade security out-of-box
- MFA, OAuth2, OIDC support
- Social login integrations
- SOC 2 compliant
- Reduces development time
- Regular security updates
- Negative:
- Monthly cost per active user
- Vendor lock-in
- Less control over auth flow
- Requires internet connectivity
Alternatives Considered:
- Custom Flask-Security (rejected: security risk, maintenance)
- AWS Cognito (rejected: Auth0 features preferred)
- Firebase Auth (rejected: GCP preference for managed services)
Current Recommendation: Retain Auth0 - appropriate choice for Series A stage
Technical Debt
Known Issues
1. Test Coverage Crisis ⚠️ CRITICAL
- Issue: Average test coverage 8.5%, with 7 services at 0%
- Impact: HIGH - Production bugs, risky deployments, slow development
- Affected Services:
- api-async: 0% (20,143 LOC) - HIGHEST RISK
- complyai-violin: 0% (357 LOC)
- complyai-gong: 0% (1,031 LOC)
- complyai-api: 1.39% (15,630 LOC)
- complyai_cms: 1.32% (2,771 LOC)
- Mitigation: Manual testing, cautious deployments
- Resolution Plan:
- Month 1: Achieve 40% coverage for critical paths
- Month 2: Achieve 60% for business logic
- Month 3: Achieve 80% for core services
2. Hardcoded Secrets ⚠️ CRITICAL
- Issue: Production secrets hardcoded in 3 services
- Impact: CRITICAL - Security breach risk, audit failure
- Locations:
- complyai-maestro/config.py
- complyai_cms/dev.py, staging.py
- complyai-adhoc/slack_data/*.py
- Mitigation: Limited access to repositories
- Resolution Plan:
- Week 1: Migrate all secrets to AWS Secrets Manager/GCP Secret Manager
- Week 2: Implement automated secret scanning in CI/CD
- Week 3: Security audit and rotation
3. Missing ML Service Authentication ⚠️ HIGH
- Issue: triangle, violin, ipu have no authentication
- Impact: HIGH - Unauthorized access, resource abuse, data leakage
- Affected: 3 ML services
- Mitigation: Network-level isolation (VPC)
- Resolution Plan:
- Week 1-2: Implement JWT authentication for ML endpoints
- Week 2-3: Add rate limiting
- Week 4: API key management system
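The API key plan could be prototyped with stdlib HMAC signing before adopting a full JWT library. This sketch is a simplified stand-in (not RFC 7519 JWT) with hypothetical helper names; a production rollout would use a vetted library such as PyJWT and per-service secrets from a secrets manager.

```python
import hashlib
import hmac
import time

def sign_token(secret: bytes, service: str, expires_at: int) -> str:
    """Issue a signed token binding a caller identity to an expiry timestamp."""
    message = f"{service}:{expires_at}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{service}:{expires_at}:{sig}"

def verify_token(secret: bytes, token: str, now=None) -> bool:
    """Reject tokens with a malformed shape, bad signature, or past expiry."""
    try:
        service, expires_str, sig = token.rsplit(":", 2)
        expires_at = int(expires_str)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{service}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison prevents timing side-channel attacks.
    if not hmac.compare_digest(sig, expected):
        return False
    return (now if now is not None else time.time()) < expires_at
```

Each ML endpoint would verify the token before running inference, closing the unauthenticated-access gap even before rate limiting lands.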
4. Insecure HTTP Communication ⚠️ HIGH
- Issue: 21 occurrences across 7 services
- Impact: HIGH - Man-in-the-middle attacks, data interception
- Affected: 7 services (API, core, maestro, CMS, adhoc, triangle, violin, ipu)
- Mitigation: VPC isolation, load balancer HTTPS termination
- Resolution Plan:
- Week 1: Audit all HTTP calls
- Week 2: Enforce HTTPS for internal service communication
- Week 3: Update configurations, remove HTTP fallbacks
5. Shared Database Anti-Pattern ⚠️ HIGH
- Issue: PostgreSQL shared across complyai-api, api-async, core
- Impact: HIGH - Tight coupling, scaling bottleneck, deployment risk
- Affected: 3 core services
- Mitigation: Careful schema migration coordination
- Resolution Plan:
- Month 1-2: Implement API-based inter-service communication
- Month 3-4: Separate databases with data migration
- Month 5: Remove direct database coupling
6. No Monitoring/Observability ⚠️ HIGH
- Issue: No APM, distributed tracing, or centralized metrics
- Impact: HIGH - Blind to performance issues, slow incident resolution
- Affected: All services
- Mitigation: Reactive debugging, log file analysis
- Resolution Plan:
- Month 1: Implement APM (Datadog, New Relic)
- Month 2: Centralized logging (ELK, Datadog)
- Month 3: Distributed tracing (OpenTelemetry)
7. Missing API Documentation ⚠️ MEDIUM
- Issue: No OpenAPI/Swagger specs for 13 services
- Impact: MEDIUM - Slow integration, developer friction, errors
- Affected: All API services
- Mitigation: Code review, team knowledge
- Resolution Plan:
- Month 1: Generate OpenAPI specs for core APIs
- Month 2: Interactive API documentation (Swagger UI)
- Month 3: Auto-generate from code annotations
8. No Infrastructure as Code ⚠️ MEDIUM
- Issue: Manual infrastructure provisioning
- Impact: MEDIUM - Inconsistent environments, slow DR, config drift
- Affected: All infrastructure
- Mitigation: Documented manual steps
- Resolution Plan:
- Month 1: Terraform for new resources
- Month 2-3: Import existing infrastructure to Terraform
- Month 4: Full IaC with CI/CD integration
9. Service Fragmentation ⚠️ MEDIUM
- Issue: 13 separate services, some redundant
- Impact: MEDIUM - High operational overhead, cost, complexity
- Affected: All services
- Mitigation: Clear service boundaries, documentation
- Resolution Plan:
- Month 1: Consolidate ML services (3 → 1)
- Month 2: Merge API services (2 → 1)
- Month 3: Evaluate frontend consolidation
- Target: 13 → 5-6 services
10. Cross-Cloud Complexity ⚠️ MEDIUM
- Issue: AWS + GCP deployment increases complexity and cost
- Impact: MEDIUM - Higher costs, complex DR, dual expertise needed
- Affected: Infrastructure
- Mitigation: Clear cloud service boundaries
- Resolution Plan:
- Month 1-2: Analyze cloud usage patterns
- Month 3-4: Develop single-cloud migration plan
- Month 5-6: Execute migration (if beneficial)
Improvement Roadmap
Q4 2025 (Months 1-3) - Foundation & Critical Issues:
Month 1:
- Remove all hardcoded secrets → Secrets Manager ⚠️ CRITICAL
- Implement authentication for ML services ⚠️ HIGH
- Fix insecure HTTP communication (21 occurrences) ⚠️ HIGH
- Achieve 40% test coverage for critical paths ⚠️ CRITICAL
- Implement APM and centralized logging ⚠️ HIGH
- Generate OpenAPI specs for core APIs
- Database read replicas for performance
Month 2:
- Achieve 60% test coverage for business logic
- Consolidate ML services (triangle + violin + ipu → unified)
- Implement Redis caching layer
- Start Terraform implementation
- Security audit and penetration testing
- Implement circuit breakers for external APIs
Month 3:
- Achieve 80% test coverage for core services
- Merge API services (complyai-api + api-async → unified)
- Distributed tracing implementation
- Complete IaC migration to Terraform
- Implement database per service pattern
- DR testing and runbook creation
Q1 2026 (Months 4-6) - Optimization & Scale:
Month 4:
- Frontend consolidation evaluation (app + www)
- Database sharding for scale
- Implement GraphQL layer for frontend
- Advanced caching strategies
- Cost optimization (reserved instances, serverless)
Month 5:
- Service mesh implementation (Istio, Linkerd)
- Kubernetes migration planning
- Comprehensive E2E test suite (Cypress)
- ML model registry and versioning
- Advanced monitoring and SLA tracking
Month 6:
- Complete service consolidation (13 → 5-6 services)
- Multi-region deployment
- Automated performance testing
- SOC 2 Type 1 certification
- Series A technical readiness complete
Q2 2026 (Months 7-9) - Advanced Features:
- SOC 2 Type 2 audit preparation
- Advanced ML features (model explainability, A/B testing)
- International expansion support (i18n, data residency)
- Advanced analytics and BI integration
- Platform API for third-party integrations
Q3 2026 (Months 10-12) - Enterprise Ready:
- SOC 2 Type 2 certification
- Enterprise features (SSO, SAML, advanced RBAC)
- White-label capabilities
- Advanced compliance reporting
- Platform maturity for Series B
Risk Assessment
Architecture Risks
| Risk | Likelihood | Impact | Mitigation | Priority |
|---|---|---|---|---|
| Security breach from hardcoded secrets | High | Critical | Immediate migration to Secrets Manager, audit | P0 |
| Production outage from untested code (8.5% coverage) | High | Critical | Aggressive test development, canary deployments | P0 |
| ML service abuse (no auth) | Medium | High | Implement auth + rate limiting immediately | P0 |
| Database failure (single PostgreSQL) | Medium | Critical | Implement read replicas, backup testing | P1 |
| Service fragmentation causing deployment failures | Medium | High | Service consolidation roadmap | P1 |
| Cross-cloud data transfer costs spike | Medium | Medium | Monitor costs, implement caching | P2 |
| Scaling bottleneck from shared database | High | High | Database per service migration | P1 |
| Vendor lock-in with Auth0 | Low | Medium | Document migration path, API abstraction | P3 |
| Compliance failure (GDPR, SOC 2) | Medium | High | Security audit, compliance roadmap | P1 |
| Team knowledge concentration (bus factor) | Medium | High | Documentation, knowledge sharing, cross-training | P1 |
| Infrastructure drift (no IaC) | Medium | Medium | Terraform implementation | P2 |
| API breaking changes affecting clients | Low | Medium | Versioning strategy, deprecation policy | P3 |
| ML model performance degradation | Medium | Medium | Model monitoring, A/B testing, rollback capability | P2 |
| Cost overrun from inefficient architecture | Medium | Medium | Cost optimization roadmap, monitoring | P2 |
Dependencies & Single Points of Failure
Single Points of Failure (SPOFs):
1. PostgreSQL Database ⚠️ CRITICAL
- Impact: Complete platform outage
- Affected: complyai-api, api-async, core, CMS (4 services)
- Current Mitigation: Daily backups (assumed)
- Required:
- Immediate: Set up read replicas with auto-failover
- Short-term: Database per service
- Long-term: Distributed database (CockroachDB, PostgreSQL with Patroni)
2. Auth0 Service ⚠️ HIGH
- Impact: No user authentication, platform inaccessible
- Affected: complyai-frontend, all authenticated endpoints
- Current Mitigation: Auth0 SLA (99.99% uptime)
- Required:
- Session token caching for short-term access
- Fallback authentication mechanism
- Regular backup of user data
3. AWS S3 (Model Storage) ⚠️ HIGH
- Impact: ML inference failure, no model loading
- Affected: triangle, violin, ipu (3 ML services)
- Current Mitigation: S3 durability (99.999999999%)
- Required:
- Model caching on inference servers
- Cross-region S3 replication
- Fallback model versions
4. Facebook Business API ⚠️ MEDIUM
- Impact: No ad data sync, delayed compliance checks
- Affected: api-async, adhoc
- Current Mitigation: Retry logic, error handling
- Required:
- Circuit breaker pattern
- Data caching for known accounts
- Alternative data sources
5. Stripe Payment API ⚠️ MEDIUM
- Impact: Payment processing failure, subscription issues
- Affected: complyai-api, core, frontend
- Current Mitigation: Stripe SLA (99.99% uptime)
- Required:
- Webhook retry mechanism
- Payment status reconciliation
- Manual payment processing fallback
Critical External Dependencies:
| Dependency | Service Owner | SLA | Fallback Strategy |
|---|---|---|---|
| Auth0 | Auth0 (Okta) | 99.99% | Session caching, grace period |
| Stripe | Stripe Inc. | 99.99% | Webhook retries, manual processing |
| Facebook API | Meta | Best effort | Cached data, scheduled retries |
| Google Vertex AI | GCP | 99.95% | Fallback to local models |
| BigQuery | GCP | 99.99% | Query caching, delayed analytics |
| AWS S3 | AWS | 99.99% | Cross-region replication, cache |
| PostgreSQL | Self-hosted | Unknown ⚠️ | Backups, failover (needed) |
Mitigation Priorities:
- P0 (This Week): PostgreSQL read replicas and failover
- P1 (This Month): Model caching, circuit breakers for external APIs
- P2 (Next Quarter): Multi-region deployment, service redundancy
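The P1 "circuit breakers for external APIs" item can be prototyped in a few lines. The sketch below (class name, failure threshold, and cooldown are illustrative assumptions, not the platform's code) opens after a run of consecutive failures and allows a single probe call once the cooldown elapses:

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow one probe call after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping external call")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping the Facebook Business API client in a breaker like this, combined with the cached-data fallback listed above, turns a Meta outage into degraded service rather than queue backlog.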
Evolution & Roadmap
Current State Assessment (SWOT)
Strengths:
- ✅ Modern Tech Stack: React 18, TypeScript, Python 3, TensorFlow/PyTorch
- ✅ Microservices Foundation: Independent scaling and deployment
- ✅ AI/ML Capabilities: Multiple specialized models for compliance analysis
- ✅ Security Awareness: Encryption patterns, audit logging, Auth0
- ✅ Active Development: Recent commits, engaged team
- ✅ Good Documentation Coverage: 60%+ in several services
- ✅ Best-in-Class Integrations: Stripe, Auth0, Facebook, Google Cloud
- ✅ Managed Services: Leveraging cloud-native (Vertex AI, BigQuery)
Weaknesses:
- ⚠️ Critical Test Gap: 8.5% average coverage, 7 services at 0%
- ⚠️ Hardcoded Secrets: 3 services with production credentials in code
- ⚠️ Service Fragmentation: 13 repos creating operational burden
- ⚠️ Missing Auth: ML services exposed without authentication
- ⚠️ Shared Database: Tight coupling between core services
- ⚠️ No Observability: Missing APM, tracing, centralized metrics
- ⚠️ Insecure Communication: 21 occurrences of plain HTTP
- ⚠️ No IaC: Manual infrastructure, config drift risk
- ⚠️ Cross-Cloud Complexity: AWS + GCP operational overhead
- ⚠️ Missing API Docs: No OpenAPI/Swagger specifications
Opportunities:
- 💡 Service Consolidation: 13 → 5-6 services (30-40% cost savings)
- 💡 ML Platform: Unified ML service with model registry
- 💡 API Gateway: Single entry point, better security/observability
- 💡 Serverless ML: Cost-effective inference for variable loads
- 💡 GraphQL: Flexible frontend queries, reduced API calls
- 💡 Platform API: Third-party integrations, ecosystem growth
- 💡 International Expansion: Multi-language, data residency
- 💡 Enterprise Features: SSO, SAML, advanced RBAC for upmarket
- 💡 Compliance Certifications: SOC 2, ISO 27001 for enterprise sales
- 💡 ML Marketplace: Pluggable compliance models for different industries
Threats:
- 🚨 Security Breach: Hardcoded secrets, missing auth (investor concern)
- 🚨 Production Outage: Low test coverage risk (customer churn)
- 🚨 Scaling Failure: Shared database bottleneck (growth blocker)
- 🚨 Technical Debt: Slowing development velocity (competitive risk)
- 🚨 Talent Retention: Complexity burden (team burnout)
- 🚨 Cost Overrun: Inefficient architecture (runway impact)
- 🚨 Compliance Failure: Missing SOC 2 (enterprise sales blocker)
- 🚨 Vendor Lock-in: Auth0, cross-cloud dependencies
- 🚨 API Breaking Changes: No versioning strategy (customer impact)
- 🚨 Data Loss: No DR testing (existential risk)
Future State Vision (12 months)
Target Architecture (Q4 2026):
┌─────────────────────────────────────────────────────────────────────┐
│                        UNIFIED WEB PLATFORM                         │
│                  (React/Next.js - App + Marketing)                  │
│                     Auth0 + Stripe + Analytics                      │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          API GATEWAY LAYER                          │
│                 (GraphQL + REST with Rate Limiting)                 │
│             Authentication + Authorization + Logging                │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         MICROSERVICES LAYER                         │
│  ┌────────────┐ ┌────────────┐ ┌─────────────┐ ┌──────────────┐     │
│  │ Compliance │ │ML Platform │ │Orchestration│ │Content/Policy│     │
│  │  Service   │ │ (Unified)  │ │   Service   │ │   Service    │     │
│  │   (Core)   │ │            │ │  (Maestro)  │ │    (Gong)    │     │
│  └────────────┘ └────────────┘ └─────────────┘ └──────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             DATA LAYER                              │
│  ┌────────────┐ ┌────────────┐ ┌─────────────┐ ┌────────────┐       │
│  │ PostgreSQL │ │   Redis    │ │  Vector DB  │ │  BigQuery  │       │
│  │ (Per Svc)  │ │  (Cache)   │ │(Embeddings) │ │(Analytics) │       │
│  └────────────┘ └────────────┘ └─────────────┘ └────────────┘       │
└─────────────────────────────────────────────────────────────────────┘
Key Characteristics:
- 5 Core Services: Consolidated from 13
- API Gateway: Kong, AWS API Gateway, or GCP Apigee
- Unified ML Platform: Single service with model registry
- Database Per Service: Proper microservices isolation
- Comprehensive Observability: OpenTelemetry, Datadog/New Relic
- Infrastructure as Code: 100% Terraform-managed
- 80%+ Test Coverage: Across all services
- SOC 2 Type 2 Certified: Enterprise-ready security
- Multi-Region: AWS/GCP with active-active DR
- Platform API: Third-party integrations enabled
Technology Evolution:
- Frontend: Migrate to Next.js for SSR, better SEO, unified app
- API: GraphQL gateway with REST fallback
- ML: Serverless inference (SageMaker/Vertex AI) + GPU when needed
- Cache: Redis Cluster with persistence
- Monitoring: OpenTelemetry → Datadog/New Relic + Grafana
- IaC: Terraform with Atlantis for automation
- CI/CD: GitHub Actions with advanced workflows, security scanning
Migration Path
Phase 1: Foundation (Months 1-3) - "Security & Stability"
Goals: Fix critical security issues, establish observability, improve reliability
Deliverables:
- ✅ Remove all hardcoded secrets → AWS Secrets Manager/GCP Secret Manager
- ✅ Implement ML service authentication (JWT + rate limiting)
- ✅ Fix all insecure HTTP communication (enforce HTTPS)
- ✅ Achieve 60% test coverage for critical paths
- ✅ Implement APM and distributed tracing (Datadog/OpenTelemetry)
- ✅ Centralized logging (ELK stack or Datadog)
- ✅ Database read replicas with auto-failover
- ✅ Redis caching layer for API and ML predictions
- ✅ Generate OpenAPI specs for all APIs
- ✅ DR testing and runbook creation
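The secrets-removal deliverable typically means every service loads credentials at startup through a helper like the one sketched below; the AWS call (`get_secret_value`) is the real Secrets Manager API, but the function name, env-var naming scheme, and dev fallback are assumptions for illustration:

```python
import json
import os


def get_secret(name: str) -> dict:
    """Load a secret from AWS Secrets Manager; fall back to an env var for local dev."""
    env_key = name.upper().replace("/", "_").replace("-", "_")
    local = os.environ.get(env_key)
    if local is not None:
        return json.loads(local)  # dev-only fallback; never commit real values
    import boto3  # deferred import: local development needs no AWS credentials
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=name)
    return json.loads(resp["SecretString"])
```

Paired with an automated scanner (e.g. gitleaks or trufflehog) in CI, this is what the "zero hardcoded secrets detected" success metric below would verify.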
Success Metrics:
- Zero hardcoded secrets detected by automated scanning
- 100% HTTPS enforcement (0 HTTP violations)
- 60%+ test coverage (up from 8.5%)
- < 200ms P95 API response time with observability
- 99.9% uptime with monitoring evidence
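The "Redis caching layer for API and ML predictions" deliverable is a cache-aside pattern: key predictions by a hash of the creative content and skip inference on repeats. A minimal sketch, assuming a redis-py-compatible client (the key prefix, TTL, and function names are illustrative):

```python
import hashlib
import json


def cached_prediction(redis_client, model_fn, creative_text: str, ttl: int = 3600):
    """Cache-aside: reuse a prior compliance result when the same creative recurs."""
    key = "pred:" + hashlib.sha256(creative_text.encode()).hexdigest()
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no inference cost
    result = model_fn(creative_text)  # expensive ML inference
    redis_client.setex(key, ttl, json.dumps(result))  # expire stale scores
    return result
```

Because ad creatives are frequently resubmitted unchanged, even a short TTL can cut inference spend noticeably, which also feeds the cost-efficiency goal.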
Phase 2: Consolidation (Months 4-6) - "Simplification & Scale"
Goals: Reduce service count, unify architecture, prepare for scale
Deliverables:
- ✅ Consolidate ML services: triangle + violin + ipu → Unified ML Platform
  - Single codebase with model registry
  - Model versioning and A/B testing
  - Serverless inference option
- ✅ Merge API services: complyai-api + api-async → Unified API Gateway
  - GraphQL + REST support
  - Built-in async capabilities (Celery)
  - Rate limiting and auth at gateway
- ✅ Migrate to database-per-service pattern
  - Separate databases for each service
  - API-based inter-service communication
  - Event-driven patterns for data sync
- ✅ Infrastructure as Code: 100% Terraform coverage
- ✅ Service mesh implementation (Istio/Linkerd)
- ✅ Achieve 80% test coverage across all services
- ✅ Frontend consolidation evaluation (app + www)
- ✅ Security audit and penetration testing
- ✅ SOC 2 Type 1 certification initiated
- ✅ Cost optimization: 30%+ reduction through consolidation
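The "event-driven patterns for data sync" item can be prototyped with a trivial publish/subscribe abstraction before committing to a broker; in production this would sit on SQS, Pub/Sub, or Kafka rather than in-process, and the topic name and handler below are purely illustrative:

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a broker (SQS/PubSub/Kafka) to show the pattern."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict):
        for handler in self._subscribers[topic]:
            handler(payload)
```

Each service maintains its own read model from the events it subscribes to, which is what makes the database-per-service split possible without cross-service SQL.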
Success Metrics:
- Service count: 13 → 7 (nearly 50% reduction)
- Infrastructure cost: 30%+ reduction
- Deployment time: 50% faster
- Test coverage: 80%+ (up from 60%)
- SOC 2 Type 1 audit passed
Phase 3: Enterprise Ready (Months 7-9) - "Compliance & Features"
Goals: Enterprise features, compliance certifications, platform maturity
Deliverables:
- ✅ SOC 2 Type 2 certification
- ✅ Enterprise SSO (SAML, SCIM provisioning)
- ✅ Advanced RBAC and audit trails
- ✅ Multi-region deployment (active-active)
- ✅ Platform API for third-party integrations
- ✅ Comprehensive E2E test suite (Cypress)
- ✅ Advanced ML features:
  - Model explainability (SHAP, LIME)
  - Drift detection and retraining automation
  - Custom model upload for enterprise
- ✅ Frontend migration to Next.js (unified app + marketing)
- ✅ GraphQL API gateway fully operational
- ✅ Final consolidation: 7 → 5 services
Success Metrics:
- SOC 2 Type 2 certified
- Enterprise features: SSO, SCIM, advanced RBAC
- Multi-region: < 10ms cross-region latency
- API ecosystem: 5+ third-party integrations
- Service count: 5 core services (final target)
Phase 4: Platform & Scale (Months 10-12) - "Growth Enablement"
Goals: Platform scalability, international expansion, Series B readiness
Deliverables:
- ✅ International expansion:
  - Multi-language support (i18n)
  - Data residency compliance (EU, APAC)
  - Regional compliance models
- ✅ ML Marketplace:
  - Industry-specific compliance models
  - Partner model integrations
  - Custom model training service
- ✅ Advanced Analytics & BI:
  - Real-time dashboards
  - Predictive compliance insights
  - Customer success metrics
- ✅ White-label capabilities for enterprise
- ✅ Kubernetes migration (if not already done)
- ✅ Advanced auto-scaling and cost optimization
- ✅ Chaos engineering and resiliency testing
- ✅ Performance: 10K concurrent users, < 100ms P95
- ✅ Documentation: Comprehensive platform docs, API guides
- ✅ Series B technical due diligence package complete
Success Metrics:
- 10K+ concurrent users supported
- Multi-region: 3+ regions (US, EU, APAC)
- Platform API: 20+ third-party integrations
- Revenue: API/platform revenue stream
- Series B ready: All technical gates passed
Migration Risks & Mitigation:
| Risk | Impact | Mitigation |
|---|---|---|
| Service consolidation causes downtime | High | Blue-green deployments, feature flags, gradual migration |
| Database migration data loss | Critical | Comprehensive backup, migration rehearsals, rollback plan |
| Breaking changes for clients | Medium | API versioning, deprecation notices, backward compatibility |
| Team bandwidth for migration | High | Hire contractors for feature development, dedicate team to migration |
| Cost spike during transition | Medium | Gradual migration, parallel run minimization, cost monitoring |
| Security regression | Critical | Security testing at each phase, penetration testing, audit gates |
References
Related Documentation
Internal Documents:
API Documentation (To Be Created):
- OpenAPI/Swagger specifications (NEEDED ⚠️)
- GraphQL schema documentation (FUTURE)
- Integration guides (NEEDED ⚠️)
Operational Runbooks (To Be Created):
- Service deployment procedures (NEEDED ⚠️)
- Disaster recovery playbooks (NEEDED ⚠️ CRITICAL)
- Incident response guides (NEEDED ⚠️)
- Security incident procedures (NEEDED ⚠️)
External Resources
Technology Documentation:
- Flask Documentation
- Django/Wagtail CMS
- React Documentation
- TensorFlow, PyTorch
- Google Vertex AI
- Auth0 Documentation
- Stripe API
Best Practices & Patterns:
- Microservices Patterns (Chris Richardson)
- 12-Factor App Methodology
- Google SRE Books
- AWS Well-Architected Framework
- GCP Architecture Framework
Compliance Frameworks:
Architecture Decision Records:
- ADR GitHub - ADR methodology and examples
Appendices
Glossary
Architecture Terms:
- Microservices: Architectural style structuring application as collection of loosely coupled services
- API Gateway: Single entry point for all client requests, routing to appropriate microservice
- Circuit Breaker: Design pattern preventing cascading failures in distributed systems
- CQRS: Command Query Responsibility Segregation - separating read/write operations
- Event Sourcing: Storing state changes as sequence of events
- Service Mesh: Infrastructure layer handling service-to-service communication
- Blue-Green Deployment: Deployment strategy using two identical environments
- Canary Deployment: Gradual rollout to subset of users before full deployment
ComplyAI-Specific Terms:
- Musical Services: ComplyAI's pattern of naming services after musical terms (triangle, violin, maestro, gong, ipu)
- Sync API: complyai-api - handles real-time synchronous requests
- Async API: api-async - handles background/long-running asynchronous tasks
- Core Engine: complyai-core - shared business logic and models
- ML Triad: Triangle, violin, ipu - three specialized ML inference services
- Orchestrator: complyai-maestro - coordinates multi-service workflows
- Policy Engine: complyai-gong - AI-powered policy analysis using Vertex AI
- Compliance Score: AI-generated score indicating ad compliance level
- Ad Creative: Advertising content (images, video, text) submitted for compliance checking
Technology Terms:
- TensorFlow/PyTorch: Deep learning frameworks for ML model training and inference
- chromadb: Vector database for storing ML embeddings and similarity search
- Celery: Distributed task queue for async processing
- Gunicorn: Python WSGI HTTP server for running Flask applications
- Wagtail: Django-based content management system
- Auth0: Managed authentication and authorization platform
- Vertex AI: Google Cloud's managed ML platform
- BigQuery: Google Cloud's data warehouse for analytics
Compliance Terms:
- GDPR: General Data Protection Regulation (EU data privacy)
- SOC 2: Service Organization Control 2 (security audit framework)
- PCI DSS: Payment Card Industry Data Security Standard
- Audit Trail: Chronological record of system activities for compliance
- Data Residency: Requirement for data to be stored in specific geographic location
- Right to Erasure: GDPR right for users to have personal data deleted
Metrics Collection
How Metrics Are Collected:
Application Metrics (To Be Implemented):
- Method: Prometheus exporters or Datadog APM agents
- Metrics: Request counts, latency histograms, error rates, throughput
- Storage: Time-series database (Prometheus/Datadog)
- Visualization: Grafana dashboards or Datadog UI
- Location: To be centralized at `/metrics` endpoints
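Before a Prometheus exporter or Datadog agent is wired in, the latency-histogram idea can be illustrated with a stdlib sketch that buckets observations the way Prometheus `le` (less-or-equal) labels do. The bucket boundaries below are assumptions, chosen so one edge sits at the platform's <200ms SLO:

```python
import bisect
from collections import Counter


class LatencyHistogram:
    """Stdlib stand-in for a Prometheus histogram: count observations per bucket."""

    BUCKETS = [0.05, 0.1, 0.2, 0.5, 1.0]  # seconds; 0.2 matches the <200ms SLO

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, seconds: float):
        # bisect_left puts a value exactly on a boundary into that bucket (le semantics)
        i = bisect.bisect_left(self.BUCKETS, seconds)
        label = f"le={self.BUCKETS[i]}" if i < len(self.BUCKETS) else "le=+Inf"
        self.counts[label] += 1
        self.total += 1
```

A real `prometheus_client.Histogram` adds cumulative bucket counts and a `/metrics` exposition endpoint, but the bucketing behavior is the same.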
Infrastructure Metrics (Current):
- AWS: CloudWatch metrics (EC2, RDS, S3, SQS)
- GCP: Cloud Monitoring (GCE, BigQuery, Vertex AI)
- Access: AWS Console, GCP Console, CLI tools
- Dashboards: Cloud provider dashboards (limited)
Frontend Metrics (Current):
- Tool: FullStory (user sessions), Web Vitals
- Metrics: Page load time, FCP, LCP, CLS, FID, TTI
- Access: FullStory dashboard, browser DevTools
- Location: Client-side instrumentation
Business Metrics (To Be Implemented):
- Source: Application events, database queries
- Metrics: User signups, compliance checks performed, subscriptions, revenue
- Dashboards: To be built in BI tool (Looker, Tableau, Metabase)
Where to Find Metrics:
- Cloud Dashboards: AWS CloudWatch, GCP Monitoring Console
- Application Logs: Service log files (needs centralization)
- FullStory: User behavior and frontend performance
- Database: PostgreSQL stats, slow query logs
- Future: Datadog/New Relic unified dashboard (recommended)
Contact Information
| Role | Name | Contact | Responsibility |
|---|---|---|---|
| Solutions Architect | SkaFld Studio Team | - | Architecture documentation, Series A readiness |
| Technical Lead | [To Be Filled] | - | Architecture decisions, technical strategy |
| Engineering Manager | [To Be Filled] | - | Team coordination, delivery |
| DevOps Lead | [To Be Filled] | - | Infrastructure, CI/CD, deployments |
| Security Lead | [To Be Filled] | - | Security architecture, compliance |
| ML Lead | [To Be Filled] | - | ML models, inference optimization |
| Product Manager | [To Be Filled] | - | Product roadmap, requirements |
Document Metadata
Review Schedule
- Next Review Date: 2025-11-02 (Monthly)
- Review Frequency: Monthly during migration phases, then quarterly
- Review Triggers:
- Major architectural changes
- Post-incident reviews
- Pre-fundraising updates
- Quarterly business reviews
Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-10-02 | SkaFld Studio Team | Initial comprehensive current architecture documentation based on codebase analysis |
Tags
#architecture #complyai #current-state #microservices #python #ml-services #compliance #series-a-readiness #technical-analysis #flask #django #react #tensorflow #pytorch #aws #gcp #security-audit #technical-debt #consolidation-roadmap