Current Architecture: ComplyAI Compliance Platform

Executive Summary​

ComplyAI operates a distributed microservices architecture comprising 13 separate services built primarily on Python (77%), with a modern React/TypeScript frontend stack. The platform implements an AI-powered compliance checking system for advertising content, utilizing multiple ML models (TensorFlow, PyTorch, Google Vertex AI) distributed across specialized services. The architecture employs a "musical" naming convention (triangle, violin, maestro, gong) suggesting orchestrated workflows, with dual API implementations (sync/async), cross-cloud deployment (AWS + GCP), and a mixed Flask/Django backend pattern. Current state shows ~64,000 lines of code with significant technical debt: average 8.5% test coverage, hardcoded secrets in 3 services, and service fragmentation requiring consolidation for Series A readiness.

Business Context​

Problem Statement​

ComplyAI addresses the complex challenge of ensuring advertising compliance across multiple regulatory frameworks and platforms. The system must:

  • Process high volumes of ad content in real-time and batch modes
  • Apply ML-based compliance analysis using multiple specialized models
  • Manage complex compliance policies and regulatory content
  • Provide actionable insights and compliance scores
  • Integrate with external platforms (Facebook, Stripe, analytics)
  • Support both synchronous user interactions and asynchronous background processing

Success Criteria​

  • Scalability: Handle 10K+ compliance checks per day with <200ms API response time
  • Accuracy: ML models maintain >95% accuracy in compliance detection
  • Availability: 99.9% uptime for customer-facing services
  • Security: SOC 2 compliance with comprehensive audit logging and encryption
  • Developer Velocity: Onboard new developers within 2 weeks (currently limited by documentation gaps)
  • Cost Efficiency: Optimize ML inference costs while maintaining performance
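
The scalability and availability targets above can be turned into concrete numbers with a little arithmetic. The sketch below assumes traffic is spread evenly across the day and uses a 30-day window; both are simplifying assumptions, and real traffic will be burstier than the average suggests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(checks_per_day: int) -> float:
    """Average request rate if load were spread evenly across the day."""
    return checks_per_day / SECONDS_PER_DAY


def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a window of `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # 10K checks/day is roughly 0.12 checks per second on average.
    print(f"{average_rate_per_second(10_000):.3f} checks/s on average")
    # 99.9% uptime leaves about 43 minutes of downtime per 30 days.
    print(f"{downtime_budget_minutes(0.999):.1f} min of downtime per 30 days")
```

The average rate is low, so the <200ms target is more likely to be stressed by bursts and by ML inference latency than by sustained volume.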

Constraints​

Technical Constraints:

  • Legacy microservices fragmentation (13 separate repos)
  • Mixed technology stack (Flask + Django + React)
  • Zero-downtime deployment requirements
  • Cross-cloud dependencies (AWS + GCP)

Business Constraints:

  • Limited engineering team (10 contributors)
  • Series A readiness timeline
  • Existing customer SLA commitments
  • Budget constraints on infrastructure costs

Regulatory Constraints:

  • GDPR compliance for user data
  • PCI DSS for payment processing (via Stripe)
  • Advertising platform compliance requirements
  • Data retention and audit requirements

Architecture Overview​

System Context Diagram​

                           External Systems
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           ▼                       ▼                       ▼
[Facebook Business API]    [Stripe Payment]         [Google Sheets]
           │                       │                       │
           └───────────────────────┼───────────────────────┘
                                   │
                                   ▼
                  ┌──────────────────────────────────┐
                  │        ComplyAI Platform         │
                  │                                  │
                  │  ┌────────────────────────────┐  │
                  │  │   Frontend Applications    │  │
                  │  │   (App + Marketing Site)   │  │
                  │  └────────────────────────────┘  │
                  │                │                 │
                  │                ▼                 │
                  │  ┌────────────────────────────┐  │
                  │  │        API Gateway         │  │
                  │  │    (Sync + Async APIs)     │  │
                  │  └────────────────────────────┘  │
                  │                │                 │
                  │                ▼                 │
                  │  ┌────────────────────────────┐  │
                  │  │      Core Engine + ML      │  │
                  │  │   + Orchestration Layer    │  │
                  │  └────────────────────────────┘  │
                  │                │                 │
                  │                ▼                 │
                  │  ┌────────────────────────────┐  │
                  │  │       Data & Storage       │  │
                  │  │  (PostgreSQL + S3 + CMS)   │  │
                  │  └────────────────────────────┘  │
                  └──────────────────────────────────┘
                                   │
                                   ▼
                       [Users & Administrators]

High-Level Architecture​

┌────────────────────────────────────────────────────────────────────┐
│                         PRESENTATION LAYER                         │
│     ┌────────────────────┐              ┌────────────────────┐     │
│     │ complyai-frontend  │              │ www                │     │
│     │ (React/TypeScript) │              │ (Marketing Site)   │     │
│     │ Auth0 + Stripe UI  │              │ (React/Tailwind)   │     │
│     └────────────────────┘              └────────────────────┘     │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                         API GATEWAY LAYER                          │
│     ┌────────────────────┐              ┌────────────────────┐     │
│     │ complyai-api       │              │ api-async          │     │
│     │ (Flask/Sync)       │◄────────────►│ (Flask/Celery)     │     │
│     │ 15,630 LOC         │              │ 20,143 LOC         │     │
│     └────────────────────┘              └────────────────────┘     │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                BUSINESS LOGIC & ORCHESTRATION LAYER                │
│     ┌────────────────────┐              ┌────────────────────┐     │
│     │ complyai-core      │              │ complyai-maestro   │     │
│     │ (Flask/Core Logic) │◄────────────►│ (Orchestration)    │     │
│     │ 13,654 LOC         │              │ GCP Integration    │     │
│     └────────────────────┘              └────────────────────┘     │
│                                                                    │
│     ┌────────────────────┐              ┌────────────────────┐     │
│     │ complyai-gong      │              │ complyai-adhoc     │     │
│     │ (Vertex AI/Policy) │              │ (Utilities/8K LOC) │     │
│     └────────────────────┘              └────────────────────┘     │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                         ML INFERENCE LAYER                         │
│    ┌─────────────┐       ┌─────────────┐       ┌─────────────┐     │
│    │ triangle    │       │ violin      │       │ ipu         │     │
│    │ (TF/PyTorch)│       │ (TF/PyTorch)│       │ (TF/PyTorch)│     │
│    │ 551 LOC     │       │ 357 LOC     │       │ 418 LOC     │     │
│    └─────────────┘       └─────────────┘       └─────────────┘     │
│    All use: chromadb (vectors) + gunicorn + boto3                  │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                             DATA LAYER                             │
│  ┌────────────────────┐   ┌──────────────┐   ┌──────────────────┐  │
│  │ PostgreSQL         │   │ AWS S3       │   │ complyai_cms     │  │
│  │ (Primary Database) │   │ (File Store) │   │ (Django/Wagtail) │  │
│  │ Shared by APIs     │   │ Models       │   │ Content Mgmt     │  │
│  └────────────────────┘   └──────────────┘   └──────────────────┘  │
│                                                                    │
│  ┌────────────────────┐   ┌──────────────┐                         │
│  │ chromadb           │   │ BigQuery     │                         │
│  │ (Vector Store)     │   │ (Analytics)  │                         │
│  └────────────────────┘   └──────────────┘                         │
└────────────────────────────────────────────────────────────────────┘

Architecture Style​

  • Microservices (13 separate services)
  • Monolithic: does not apply (the platform is split across 13 services)
  • Serverless (partial - potential Lambda usage)
  • Event-driven (Celery task queues, async processing)
  • Layered (presentation, API, business logic, data)
  • Service-oriented
  • Hybrid: Microservices with layered organization, event-driven async processing

Key Characteristics:

  • Distributed architecture with separate repositories per service
  • Musical naming pattern: triangle, violin, ipu, maestro, gong (orchestration metaphor)
  • Dual API pattern: Synchronous (complyai-api) + Asynchronous (api-async)
  • Specialized ML services: Three separate inference engines
  • Cross-cloud deployment: AWS (API, storage) + GCP (Core, ML, Frontend)
  • Mixed frameworks: Flask (8 services), Django (1 service), React (2 frontends)

Components​

Core Components​

| Component | Purpose | Technology | LOC | Status | Test Coverage |
|---|---|---|---|---|---|
| complyai-api | Main synchronous REST API | Flask, PostgreSQL, Celery | 15,630 | Active | 1.39% |
| api-async | Asynchronous processing API | Flask, Celery, TensorFlow | 20,143 | Active | 0% ⚠️ |
| complyai-core | Core compliance engine & models | Flask, PostgreSQL, Auth | 13,654 | Active | 8.20% |
| complyai-frontend | Main application UI | React 18, TypeScript, Vite | N/A | Active | N/A |
| www | Marketing website | React 18, Tailwind CSS | 73 | Active | 40% |
| complyai-triangle | ML inference service #1 | TensorFlow, PyTorch, Flask | 551 | Active | 11.76% |
| complyai-violin | ML inference service #2 | TensorFlow, PyTorch, Flask | 357 | Active | 0% ⚠️ |
| complyai-ipu | ML inference service #3 | TensorFlow, PyTorch, Flask | 418 | Active | 18.18% |
| complyai-maestro | Service orchestration | Flask, GCP, BigQuery | 2,289 | Active | 17.65% |
| complyai-gong | Policy management + Vertex AI | Flask, Vertex AI, GCP | 1,031 | Active | 0% |
| complyai_cms | Content management system | Django, Wagtail, Webpack | 2,771 | Active | 1.32% |
| complyai-adhoc | Utility scripts & tools | Python scripts | 8,156 | Active | 60.71% ✅ |
| .github | CI/CD workflows | GitHub Actions | N/A | Active | N/A |

Total Lines of Code: ~64,000+ across all services

Component Interactions​

Technology Stack​

Frontend​

complyai-frontend (Main Application):

  • Framework: React 18.2 with TypeScript
  • Build Tool: Vite (fast HMR, modern bundling)
  • State Management:
    • Zustand (global state)
    • Jotai (atomic state)
    • TanStack Query (server state, caching)
  • UI Library:
    • Tailwind CSS (utility-first styling)
    • Headless UI (accessible components)
    • Heroicons (icon library)
    • Framer Motion (animations)
  • Authentication: Auth0 React SDK
  • Payment: React Stripe.js (PCI compliant)
  • Data Visualization: recharts, react-gauge-component
  • Testing: Cypress (E2E), Jest, React Testing Library
  • Code Quality: ESLint, Prettier, Husky, Commitlint
  • Routing: React Router DOM v6

www (Marketing Site):

  • Framework: React 18.2 with JavaScript
  • Build Tool: Create React App (React Scripts)
  • Styling: Tailwind CSS + SCSS/Sass
  • SEO: React Helmet Async (meta tags)
  • Icons: FontAwesome
  • Animations: AOS (Animate On Scroll)
  • Testing: Jest, React Testing Library
  • Performance: Web Vitals monitoring

Backend​

Primary Stack (9/13 services):

  • Language: Python 3.x
  • Framework: Flask (complyai-api, api-async, core, triangle, violin, ipu, maestro, gong)
  • ORM: Flask-SQLAlchemy, Alembic (migrations)
  • Auth: Flask-Security-Too, PyJWT, Authlib (OAuth)
  • WSGI Server: Gunicorn
  • API Features: Flask-CORS, Flask-Migrate

Alternative Stack (1 service):

  • Framework: Django + Wagtail CMS (complyai_cms)
  • Frontend Build: Webpack + Tailwind CSS
  • Static Files: WhiteNoise

API Specifications:

  • Style: REST (JSON)
  • Versioning: URL-based (v1, v2 auth endpoints)
  • Authentication: JWT tokens, OAuth2, API keys
  • Rate Limiting: Configured but inconsistently implemented

Data Layer​

Primary Database:

  • RDBMS: PostgreSQL (via psycopg2-binary/psycopg)
  • Usage: Shared across complyai-api, api-async, complyai-core
  • Migrations: Alembic (Flask), Django migrations (CMS)

Vector Storage:

  • Database: chromadb (ML embeddings)
  • Usage: ML services (triangle, violin, ipu)
  • Purpose: Semantic search, similarity matching

File Storage:

  • Primary: AWS S3 (via boto3)
  • Usage: ML models, user uploads, media files, data lake
  • CDN: CloudFront (inferred)

Analytics:

  • Warehouse: Google BigQuery
  • Usage: complyai-maestro (data pipelines, reporting)
  • Integration: google-cloud-bigquery SDK

Content Management:

  • CMS: Django Wagtail
  • Storage: PostgreSQL + S3 (media)
  • API: Wagtail REST API for frontend consumption

Cache Layer:

  • Not explicitly detected (Redis recommended for implementation)

Message Queue:

  • Queue: Celery using SQS as the message transport (inferred from the celery[sqs] dependency)
  • Broker: AWS SQS or Redis (not explicitly documented)
  • Workers: Distributed Celery workers for async tasks

ML/AI Stack​

Deep Learning Frameworks:

  • TensorFlow: triangle, violin, ipu, api-async
  • PyTorch: triangle, violin, ipu, api-async
  • Transformers: Hugging Face (NLP models)
  • ONNX: Model optimization and portability
  • Optimum: Hugging Face optimization library

Google Cloud AI:

  • Platform: Google Vertex AI (complyai-gong)
  • Usage: Managed ML inference, policy analysis
  • Integration: vertexai Python SDK

ML Infrastructure:

  • Vector Database: chromadb (embeddings storage)
  • Model Storage: AWS S3 (via boto3)
  • Data Processing: pandas, numpy, pyarrow
  • Computer Vision: opencv-python (complyai-violin)
  • NLP: gensim, scikit-learn, datasets

ML Deployment:

  • Serving: Flask REST endpoints
  • Scaling: Horizontal (multiple inference instances)
  • Model Format: Native + ONNX for optimization

Infrastructure​

Cloud Providers:

  • AWS: Primary for API services, S3 storage, SQS queues
    • Services: EC2 (inferred), S3, SQS, CloudFront
    • SDK: boto3 (8 services)
  • Google Cloud Platform: Core engine, ML, analytics
    • Services: Vertex AI, BigQuery, Cloud Storage, Cloud SQL, Secret Manager
    • SDK: google-cloud-* libraries (maestro, gong)

Deployment:

  • Containerization: Docker (Dockerfile present in multiple services)
  • Orchestration: Not explicitly documented (likely ECS or GKE)
  • WSGI: Gunicorn for Python services
  • Static Hosting: S3 + CloudFront or similar for frontends

CI/CD:

  • Platform: GitHub Actions (.github repository)
  • Hooks: Husky (pre-commit) in frontend projects
  • Commit Standards: Commitlint (conventional commits)
  • Testing: Automated via pytest, Cypress

Monitoring (Partial Implementation):

  • Logging: Python logging module (comprehensive)
  • Notifications: Slack webhooks (across services)
  • User Analytics: FullStory (frontend)
  • Metrics: Web Vitals (frontend)
  • APM: Not detected (needs implementation)
  • Error Tracking: Not explicitly configured

Development Tools:

  • Version Control: Git + GitHub
  • Package Management: npm (frontend), pip + requirements.txt (backend)
  • Testing: pytest, Cypress, Jest
  • Code Quality: ESLint, Prettier (frontend), Python linting (backend)
  • Dependency Management: Dependabot could be enabled through the .github repository (not confirmed)

Data Architecture​

Data Flow Diagram​

┌────────────────────────────────────────────────────────────────────┐
│                           DATA INGESTION                           │
│                                                                    │
│  [User Input] ──→ [Frontend] ──→ [complyai-api] ──→ [PostgreSQL]   │
│                                         │                          │
│  [Facebook API] ──→ [api-async] ────────┘                          │
│                                                                    │
│  [Stripe Webhooks] ──→ [complyai-api] ──→ [PostgreSQL]             │
│                                                                    │
│  [Content Editors] ──→ [CMS] ──→ [PostgreSQL + S3]                 │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                       DATA PROCESSING LAYER                        │
│                                                                    │
│       ┌────────────────────────────────────────────────────┐       │
│       │  SYNCHRONOUS PROCESSING                            │       │
│       │  [complyai-api] → [complyai-core] → [Validation]   │       │
│       │                                          ↓         │       │
│       │                                  [Business Logic]  │       │
│       │                                          ↓         │       │
│       │                                 [PostgreSQL Write] │       │
│       └────────────────────────────────────────────────────┘       │
│                                                                    │
│       ┌────────────────────────────────────────────────────┐       │
│       │  ASYNCHRONOUS PROCESSING                           │       │
│       │  [api-async] → [Celery Tasks] → [External APIs]    │       │
│       │       ↓              ↓                 ↓           │       │
│       │  [FB Ad Data]  [AI Scoring]      [Batch Jobs]      │       │
│       │       ↓              ↓                 ↓           │       │
│       │  [PostgreSQL] → [ML Models] → [S3/BigQuery]        │       │
│       └────────────────────────────────────────────────────┘       │
│                                                                    │
│       ┌────────────────────────────────────────────────────┐       │
│       │  ML INFERENCE PIPELINE                             │       │
│       │  [Input Data] → [Preprocessing] → [chromadb]       │       │
│       │                        ↓                           │       │
│       │     [Model Inference (triangle/violin/ipu)]        │       │
│       │                        ↓                           │       │
│       │          [Compliance Score + Insights]             │       │
│       │                        ↓                           │       │
│       │                  [PostgreSQL]                      │       │
│       └────────────────────────────────────────────────────┘       │
│                                                                    │
│       ┌────────────────────────────────────────────────────┐       │
│       │  ORCHESTRATED WORKFLOWS (maestro)                  │       │
│       │  [Trigger] → [Workflow Definition] → [Task Queue]  │       │
│       │                        ↓                           │       │
│       │    [Parallel Service Calls: API + ML + Data]       │       │
│       │                        ↓                           │       │
│       │               [Aggregate Results]                  │       │
│       │                        ↓                           │       │
│       │     [BigQuery (Analytics) + Notifications]         │       │
│       └────────────────────────────────────────────────────┘       │
└────────────────────────────────────────────────────────────────────┘
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                          DATA CONSUMPTION                          │
│                                                                    │
│  [Frontend] ←── [REST API] ←── [PostgreSQL + S3]                   │
│                                                                    │
│  [Reports] ←── [BigQuery] ←── [Aggregated Analytics]               │
│                                                                    │
│  [Dashboards] ←── [TanStack Query Cache] ←── [API]                 │
│                                                                    │
│  [Exports] ←── [Google Sheets Integration] ←── [API]               │
└────────────────────────────────────────────────────────────────────┘

Data Models​

Key Entities​

1. User & Authentication

  • Purpose: User accounts, authentication, authorization
  • Key Attributes: user_id, email, password_hash, roles, permissions, oauth_tokens
  • Relationships: → Organizations, → Subscriptions, → Ad Accounts
  • Storage: PostgreSQL (complyai-core models)
  • Security: Encrypted passwords (Flask-Security-Too), JWT tokens, OAuth2

2. Ad Account

  • Purpose: Facebook ad account data and monitoring
  • Key Attributes: account_id, fb_account_id, status, spend, metrics, sync_timestamp
  • Relationships: → User, → Ad Creatives, → Compliance Checks
  • Storage: PostgreSQL (shared models)
  • Sync: Real-time via api-async + Facebook Business API

3. Ad Creative

  • Purpose: Advertising creative content for compliance checking
  • Key Attributes: creative_id, content, images, video_urls, status, platform
  • Relationships: → Ad Account, → Compliance Results, → Policy Violations
  • Storage: PostgreSQL (metadata), S3 (media files)
  • Processing: Async via Celery tasks

4. Compliance Check

  • Purpose: Compliance analysis results and scoring
  • Key Attributes: check_id, creative_id, ai_score, policy_violations[], timestamp, model_version
  • Relationships: → Ad Creative, → Policy Rules, → ML Inference Results
  • Storage: PostgreSQL (structured), chromadb (embeddings)
  • Generation: ML services (triangle, violin, ipu)

5. Policy

  • Purpose: Compliance policies and regulatory rules
  • Key Attributes: policy_id, name, rules[], jurisdiction, version, effective_date
  • Relationships: → Compliance Checks, → Content (CMS)
  • Storage: PostgreSQL (structured), Wagtail CMS (content)
  • Management: complyai-gong (AI-powered analysis)

6. Subscription

  • Purpose: User subscriptions and payment tracking
  • Key Attributes: subscription_id, user_id, plan, status, stripe_subscription_id
  • Relationships: → User, → Payments
  • Storage: PostgreSQL
  • Processing: Stripe webhooks → complyai-api

7. Workflow

  • Purpose: Orchestrated multi-service workflows
  • Key Attributes: workflow_id, type, status, steps[], results, created_by
  • Relationships: → Tasks, → Services
  • Storage: PostgreSQL (complyai-maestro)
  • Execution: Celery task chains

8. ML Model Metadata

  • Purpose: ML model versioning and tracking
  • Key Attributes: model_id, name, version, s3_path, accuracy_metrics, created_date
  • Relationships: → Inference Results
  • Storage: PostgreSQL (metadata), S3 (model files)
  • Services: triangle, violin, ipu

9. Content

  • Purpose: Regulatory content and documentation
  • Key Attributes: content_id, title, body, category, publish_date, author
  • Relationships: → Policies
  • Storage: PostgreSQL (via Wagtail), S3 (media)
  • Management: complyai_cms (Django/Wagtail)

10. Audit Log

  • Purpose: Comprehensive audit trail for compliance
  • Key Attributes: log_id, user_id, action, entity_type, entity_id, timestamp, ip_address
  • Relationships: → All entities
  • Storage: PostgreSQL (across services - 8+ files with logging)
  • Retention: 7 years (compliance requirement)
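
As an illustration of the Audit Log entity and its 7-year retention rule, the sketch below models one entry and computes when it becomes eligible for purging. The attribute names come from the list above; the Python types, the `retention_deadline` and `may_purge` helpers, and the 29 February handling are assumptions rather than the platform's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

RETENTION_YEARS = 7  # retention period stated for audit logs


@dataclass(frozen=True)
class AuditLogEntry:
    log_id: int
    user_id: int
    action: str
    entity_type: str
    entity_id: str
    timestamp: datetime
    ip_address: str


def retention_deadline(entry: AuditLogEntry) -> datetime:
    """Earliest moment at which the entry may be purged."""
    ts = entry.timestamp
    try:
        return ts.replace(year=ts.year + RETENTION_YEARS)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 1 March.
        return ts.replace(year=ts.year + RETENTION_YEARS, month=3, day=1)


def may_purge(entry: AuditLogEntry, now: datetime) -> bool:
    """True once the retention period has fully elapsed."""
    return now >= retention_deadline(entry)
```

Freezing the dataclass mirrors the append-only nature of an audit trail: entries are created once and never mutated in application code.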

Data Storage Strategy​

| Data Type | Storage Solution | Rationale | Retention | Backup |
|---|---|---|---|---|
| Transactional | PostgreSQL | ACID compliance, relational integrity | 7 years | Daily |
| User Credentials | PostgreSQL (encrypted) | Security, fast auth lookups | Account lifetime | Daily |
| ML Models | AWS S3 | Large files, versioning, cost-effective | Indefinite | Versioned |
| Media Files | AWS S3 | Scalable, CDN integration | 2 years | Versioned |
| Vector Embeddings | chromadb | Fast similarity search, ML optimized | 90 days | Weekly |
| Analytics | Google BigQuery | Complex queries, data warehouse | 2 years | N/A (managed) |
| Cache | (Not implemented) | Need Redis for session/API cache | 24 hours | None |
| Logs | File system + potential CloudWatch | Debugging, audit trails | 90 days | Weekly |
| CMS Content | PostgreSQL (Wagtail) | Structured content, versioning | Indefinite | Daily |
| Task Queue | SQS/Redis (Celery) | Async job management | 14 days | None |

Storage Patterns:

  • Shared Database: complyai-api, api-async, complyai-core share PostgreSQL instance
  • Service-Specific: ML services have isolated chromadb instances
  • Hybrid Cloud: PostgreSQL on AWS, BigQuery on GCP, cross-cloud data movement
  • Data Lake: S3 used as data lake for analytics and ML training

Data Security​

Encryption at Rest:

  • Method: AES-256 encryption for sensitive fields
  • Implementation: Dedicated encryption module in complyai-core
  • Coverage: User credentials, payment data (tokenized via Stripe), PII
  • Key Management: Environment variables (needs upgrade to AWS KMS/Google Secret Manager)

Encryption in Transit:

  • Method: TLS 1.2+ (HTTPS)
  • Implementation: Load balancer termination, internal service calls
  • Issues: Insecure HTTP detected in 21 occurrences across 7 services ⚠️
  • Recommendation: Enforce HTTPS for all internal communication
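
One way to track the insecure-HTTP findings is a small scanner that reports every plain `http://` URL in a configuration or source file. This is a minimal sketch using only the standard library; the loopback exclusions and the `find_insecure_urls` name are assumptions, and it is not the tool that produced the count of 21 occurrences.

```python
import re

# Matches http:// URLs, excluding loopback addresses that never leave the host.
INSECURE_URL = re.compile(r"http://(?!localhost\b|127\.0\.0\.1\b)[^\s\"'<>]+")


def find_insecure_urls(text: str) -> list[tuple[int, str]]:
    """Return (line_number, url) pairs for every plain-HTTP URL in `text`."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in INSECURE_URL.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

Run against each service's config files in CI, a non-empty result could fail the build until the URL is switched to HTTPS.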

Access Controls:

  • Database: PostgreSQL role-based access control
  • Application: Flask-Security-Too, JWT-based authorization
  • API: Token-based auth, OAuth2 for third-party
  • Cloud: IAM roles (AWS), Cloud IAM (GCP)
  • Issue: ML services lack authentication ⚠️

Data Classification:

  • Public: Marketing content (www, CMS public pages)
  • Internal: Business logic, configurations
  • Confidential: User data, compliance results
  • Restricted: Payment data (tokenized), ML models

PII Handling:

  • Approach: Minimize collection, encrypt at rest, tokenize payments
  • GDPR: Basic patterns detected (data_privacy in complyai-core)
  • Retention: Configurable per data type
  • Deletion: Soft delete with anonymization (needs verification)
  • Access Logging: Comprehensive audit trails (8+ services)

Secrets Management:

  • Current: Environment variables, google-cloud-secret-manager (maestro, gong)
  • Issues:
    • Hardcoded secrets in 3 services (maestro, CMS, adhoc) ⚠️ CRITICAL
    • Inconsistent approach across services
  • Recommendation: Migrate all to AWS Secrets Manager or HashiCorp Vault
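
Until a dedicated secrets manager is in place, hardcoded values can at least be replaced by environment lookups that fail fast at startup. The sketch below assumes hypothetical variable names (`DATABASE_URL`, `SLACK_WEBHOOK_URL`) and helper names; it is not taken from any of the services.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str, env=None) -> str:
    """Return the secret called `name`, or fail fast if it is unset or empty."""
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value


def load_settings(env=None) -> dict:
    # Hypothetical variable names; the real services may use different ones.
    return {
        "database_url": require_secret("DATABASE_URL", env),
        "slack_webhook_url": require_secret("SLACK_WEBHOOK_URL", env),
    }
```

Failing at startup makes a missing secret visible during deployment rather than at the first request that needs it.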

Integration Architecture​

External Integrations​

| System | Integration Type | Purpose | Protocol | Auth Method | Services Using |
|---|---|---|---|---|---|
| Facebook Business API | REST API | Ad account data, creative fetch | HTTPS | OAuth2 + API Key | api-async, complyai-adhoc |
| Stripe | REST API + Webhooks | Payment processing | HTTPS | API Key | complyai-api, complyai-core, frontend |
| Auth0 | OAuth2/OIDC | User authentication | HTTPS | Client credentials | complyai-frontend |
| Google Sheets | REST API | Data export/import | HTTPS | OAuth2 | complyai-api |
| Slack | Webhooks | Notifications, alerts | HTTPS | Webhook URL | 8 services (widespread) |
| BigQuery | Cloud SDK | Analytics, data warehouse | gRPC/HTTPS | Service Account | complyai-maestro |
| Google Vertex AI | Cloud SDK | Managed ML inference | gRPC/HTTPS | Service Account | complyai-gong |
| AWS S3 | SDK (boto3) | File storage, models | HTTPS | IAM Role/Keys | 8 services |
| SQS | SDK (boto3) | Message queue (Celery) | HTTPS | IAM Role/Keys | api-async, maestro |

Internal Service Communication​

Synchronous Communication:

  • Primary: REST APIs (JSON over HTTPS)
  • Pattern: complyai-frontend → complyai-api → complyai-core → PostgreSQL
  • Libraries: requests, Flask built-in
  • Issues:
    • No API gateway (direct service-to-service)
    • Inconsistent error handling
    • Missing circuit breakers
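
To illustrate what a circuit breaker would add to the synchronous call path, here is a minimal in-process sketch. The class name, thresholds, and half-open behavior are assumptions chosen for brevity; a production setup would more likely use an established library or a service mesh feature.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("downstream service marked unavailable")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

Wrapping each outbound `requests` call in `breaker.call(...)` would let complyai-api fail quickly when complyai-core is down instead of tying up workers on timeouts.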

Asynchronous Communication:

  • Primary: Celery task queues (SQS/Redis backend)
  • Pattern: API triggers → Celery task → Worker execution → Result storage
  • Use Cases:
    • Facebook ad data fetching (fetch_fb_current_ad_data, fetch_fb_total_ad_data)
    • AI score generation (generate_ai_score)
    • Batch processing (spend_script)
    • Health monitoring (heart_beat)
  • Orchestration: complyai-maestro coordinates complex workflows

Service Discovery:

  • Method: Environment variables with service URLs
  • Limitations: No dynamic service discovery (Consul, Eureka)
  • Configuration: Hardcoded endpoints in config files

Load Balancing:

  • Approach: Likely ALB/NLB (AWS) or Cloud Load Balancer (GCP)
  • Level: L7 (application level)
  • Worker Scaling: Celery workers auto-scale based on queue depth

Inter-Service Data Sharing:

  • Shared Models: complyai-core provides common models for API services
  • Shared Database: PostgreSQL shared across complyai-api, api-async, core
  • Pattern: Shared database anti-pattern (tight coupling) ⚠️
  • Recommendation: Move to API-based communication with separate databases

API Design​

API Style:

  • Primary: REST with JSON payloads
  • Potential: GraphQL not detected (could benefit frontend)
  • Documentation: Missing (needs OpenAPI/Swagger) ⚠️

API Endpoints (complyai-api):

  • /auth/* - Authentication (v1 and v2 versions)
  • /precheck/* - Pre-compliance validation
  • /checker/* - Compliance checking
  • /maestro/* - Workflow orchestration
  • /policies/* - Policy management
  • /prompts/* - AI prompt management

Versioning Strategy:

  • Current: URL-based versioning (v1, v2 for auth)
  • Inconsistency: Not applied uniformly across endpoints
  • Recommendation: Standardize v1 prefix for all endpoints
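
The standardization recommended above amounts to prefixing every unversioned route while leaving already-versioned ones untouched. The helper below is a sketch of that rule; the function name and the example paths are hypothetical. In Flask the same effect is usually achieved by registering a Blueprint with a `url_prefix`.

```python
import re

# A path counts as versioned when its first segment is "v" followed by digits.
_VERSIONED = re.compile(r"^/v\d+(/|$)")


def with_version(path: str, version: str = "v1") -> str:
    """Prefix `path` with `/<version>` unless it already carries a version."""
    if not path.startswith("/"):
        path = "/" + path
    if _VERSIONED.match(path):
        return path
    return f"/{version}{path}"
```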

Authentication Method:

  • Frontend: Auth0 (OAuth2/OIDC) with JWT tokens
  • API-to-API: JWT tokens, API keys
  • ML Services: No auth detected ⚠️ CRITICAL
  • External: OAuth2 for Facebook, Stripe API keys
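
One lightweight option for closing the ML-service authentication gap is a shared-secret request signature verified with the standard library. This is a swapped-in illustration, not the platform's JWT scheme: the function names, the `timestamp.body` message layout, and the 300-second window are all assumptions.

```python
import hashlib
import hmac
import time


def sign_request(secret: bytes, body: bytes, timestamp: int) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>'."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, body: bytes, timestamp: int,
                   signature: str, max_age: int = 300, now=None) -> bool:
    """Check freshness first, then compare signatures in constant time."""
    current = int(time.time()) if now is None else now
    if abs(current - timestamp) > max_age:
        return False  # stale or future-dated request; limits replay
    expected = sign_request(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

The caller would send the timestamp and signature as headers, and each inference service would reject any request for which `verify_request` returns `False`.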

Rate Limiting:

  • Implementation: ratelimiter.py files detected in ML services
  • Status: Configured but not consistently applied
  • Missing: API gateway rate limiting
  • Recommendation: Implement at gateway level with Redis backend
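
For reference, the token-bucket algorithm commonly used for this kind of rate limiting fits in a few lines. The sketch below keeps state in process memory, so it only limits a single worker; the recommended gateway-level limiter would hold the same counters in Redis. It is not a description of what the existing ratelimiter.py files do.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill according to elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A request handler would call `allow()` per client key and answer with HTTP 429 when it returns `False`.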

Error Handling:

  • Standard: HTTP status codes
  • Response Format: JSON with error messages
  • Logging: Comprehensive logging across services
  • Issue: Inconsistent error schemas
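
A shared error envelope is one way to address the inconsistent schemas. The sketch below defines a hypothetical `ApiError` type whose field names (`status`, `code`, `message`, `details`) are illustrative, not an existing ComplyAI contract. It returns a `(body, status, headers)` tuple, which is a shape Flask views accept as a response.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ApiError:
    status: int    # HTTP status code
    code: str      # stable, machine-readable identifier
    message: str   # human-readable explanation
    details: dict = field(default_factory=dict)

    def to_response(self) -> tuple[str, int, dict]:
        """Serialize to a (body, status, headers) tuple."""
        body = json.dumps({"error": asdict(self)})
        return body, self.status, {"Content-Type": "application/json"}
```

If every service raised and serialized errors through one such type, the frontend could handle failures from any endpoint with a single code path.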

Pagination:

  • Detection: Not explicitly documented
  • Recommendation: Implement cursor-based pagination for large datasets
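
Cursor-based pagination can be sketched over an in-memory list as below. The cursor is an opaque, URL-safe encoding of the last seen `id`; the function names and the assumption of a unique, ascending integer `id` are illustrative. Against PostgreSQL the filter would become a `WHERE id > :last_id ORDER BY id LIMIT :limit` query.

```python
import base64
import json


def encode_cursor(last_id: int) -> str:
    """Pack the last seen id into an opaque, URL-safe token."""
    raw = json.dumps({"last_id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    raw = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return int(json.loads(raw)["last_id"])


def paginate(rows, limit, cursor=None):
    """Return (page, next_cursor); `rows` must be sorted ascending by unique 'id'."""
    last_id = decode_cursor(cursor) if cursor else None
    remaining = [row for row in rows if last_id is None or row["id"] > last_id]
    page = remaining[:limit]
    has_more = len(remaining) > len(page)
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return page, next_cursor
```

Unlike offset pagination, the cursor stays stable when new rows are inserted ahead of the current position.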

Security Architecture​

Security Layers​

1. Network Security

  • Firewalls: VPC security groups (AWS/GCP)
  • VPC Configuration: Private subnets for databases, public for frontends
  • Network Segmentation: Services isolated by VPC/subnet
  • Issues:
    • Insecure HTTP in 21 occurrences (7 services) ⚠️
    • ML services potentially exposed without auth ⚠️

2. Application Security

  • Authentication:
    • Frontend: Auth0 (MFA support, OAuth2/OIDC)
    • Backend: Flask-Security-Too, PyJWT, Authlib
    • Detection: 14 authentication patterns across 4 services
  • Authorization:
    • Role-based access control (RBAC)
    • Flask-Security decorators
    • Missing in ML services ⚠️
  • Input Validation:
    • Dedicated validation module in complyai-core
    • Flask-WTF forms (inferred)
    • Inconsistent application across services
  • Output Encoding:
    • React auto-escaping (XSS protection)
    • Flask Jinja2 auto-escaping
  • CSRF Protection:
    • Flask-WTF CSRF tokens (inferred)
    • React SPA pattern (token-based)

3. Data Security

  • Encryption at Rest:
    • Dedicated encryption module (complyai-core)
    • 32 encryption patterns detected across 5 services
    • AES-256 for sensitive fields
  • Encryption in Transit:
    • TLS 1.2+ for external communication
    • Issues with internal HTTP (21 occurrences) ⚠️
  • Tokenization:
    • Stripe payment tokenization (PCI compliance)
  • Access Controls:
    • Database role-based access
    • Flask-Security permissions
  • Audit Logging:
    • Comprehensive across services (8 services, multiple files)
    • Tracks user actions, system events
    • Retention: 7 years (compliance)

Security Controls​

| Control | Implementation | Status | Services | Priority |
| --- | --- | --- | --- | --- |
| Multi-Factor Auth | Auth0 MFA | ✅ Implemented | Frontend | ✅ |
| Encryption at Rest | AES-256 (core module) | ✅ Implemented | Core, API | ✅ |
| Encryption in Transit | TLS 1.2+ | ⚠️ Partial (21 HTTP issues) | All | 🔴 High |
| API Authentication | JWT, OAuth2 | ⚠️ Partial (ML missing) | API, Frontend | 🔴 High |
| Rate Limiting | ratelimiter.py | ⚠️ Inconsistent | ML services | 🟡 Medium |
| Input Validation | Validation module | ✅ Implemented | Core, API | ✅ |
| Audit Logging | Python logging | ✅ Comprehensive | 8 services | ✅ |
| Secrets Management | GCP Secret Manager | ⚠️ Partial (3 hardcoded) | Maestro, Gong | 🔴 Critical |
| Dependency Scanning | Dependabot potential | ❌ Not detected | All | 🟡 Medium |
| SAST | Not detected | ❌ Missing | All | 🟡 Medium |
| Penetration Testing | Not detected | ❌ Missing | All | 🟡 Medium |
| WAF | Not detected | ❌ Missing | Frontend | 🟡 Medium |
| DDoS Protection | CloudFlare/AWS Shield potential | ❓ Unknown | Frontend | 🟡 Medium |

Critical Security Issues:

  1. Hardcoded Secrets (3 services): maestro/config.py, cms/dev.py, cms/staging.py, adhoc/slack_data ⚠️ CRITICAL
  2. Insecure Communication (21 occurrences, 7 services): HTTP instead of HTTPS ⚠️ HIGH
  3. Missing ML Auth (3 services): triangle, violin, ipu have no authentication ⚠️ HIGH
  4. Inconsistent Rate Limiting: Not uniformly applied ⚠️ MEDIUM
  5. No API Gateway: Direct service exposure increases attack surface ⚠️ MEDIUM

Compliance Requirements​

Current Compliance Status:

  • GDPR (Partial): Data privacy patterns detected, needs full audit
  • HIPAA: Not applicable (no healthcare data)
  • SOC 2 Type 1: Not achieved (required for Series A)
  • SOC 2 Type 2: Future requirement
  • ISO 27001: Not pursued
  • PCI DSS (Delegated): Stripe handles, no card data storage ✅

GDPR Compliance Measures:

  • Data privacy patterns in complyai-core
  • Audit logging (right to access)
  • Encryption at rest (data protection)
  • OAuth2 consent flows
  • Gaps:
    • No documented data retention policies
    • Missing right to erasure implementation
    • No data processing agreements documented

Audit Requirements:

  • Comprehensive audit logging (8 services)
  • 7-year retention (financial compliance)
  • User action tracking
  • System event logging
  • Gaps:
    • No centralized audit log aggregation
    • Missing audit log integrity verification
    • No automated compliance reporting
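
The missing audit log integrity verification could be approached with a hash chain, where each entry commits to the hash of the previous one. The following is a sketch under that assumption; the function names and entry layout are hypothetical, and a list stands in for the real log store.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: Dict[str, Any]) -> str:
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, event)})


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects tampering but does not prevent it; truncation of the newest entries is only detectable if the latest hash is also stored somewhere the writer cannot modify.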

Data Residency:

  • Cross-cloud deployment (AWS + GCP)
  • No explicit data residency controls
  • Potential GDPR residency issues ⚠️

Scalability & Performance​

Scalability Strategy​

Horizontal Scaling:

  • API Services: Gunicorn workers, multiple instances behind load balancer
  • Celery Workers: Auto-scaling based on queue depth
  • ML Services: Multiple inference instances (currently limited)
  • Frontend: Static CDN, infinite scale potential
  • Database: Read replicas needed (not currently implemented) ⚠️

Vertical Scaling:

  • Current: Single large instances for databases
  • Limitations: PostgreSQL vertical limit, eventual need for sharding
  • ML Services: GPU instances for inference (costly)

Database Scaling:

  • Current: Single PostgreSQL instance (shared anti-pattern)
  • Read Replicas: Not implemented ⚠️
  • Sharding: Not implemented
  • Connection Pooling: SQLAlchemy pooling
  • Recommendation:
    • Implement read replicas immediately
    • Consider database per service for microservices isolation
    • Future: PostgreSQL partitioning or migration to distributed DB

Caching Strategy:

  • Current: TanStack Query (frontend client-side cache)
  • Missing:
    • Redis for API caching ⚠️
    • Database query caching
    • ML inference result caching
  • Recommendation: Implement Redis cluster for:
    • Session storage
    • API response caching
    • Celery broker (if not using SQS)
    • ML prediction caching
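
ML prediction caching, as recommended above, can be sketched as a TTL cache keyed by model name and a hash of the input content. The `PredictionCache` class is hypothetical and keeps entries in process memory for illustration; the recommendation in this document is a Redis cluster, which would replace the dict.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Tuple


class PredictionCache:
    """TTL cache for inference results (in-memory stand-in for Redis)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Maps cache key -> (expiry timestamp, cached result).
        self._store: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def key_for(model_name: str, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        return f"{model_name}:{digest}"

    def get_or_compute(self, model_name: str, content: str,
                       predict: Callable[[str], Any]) -> Any:
        key = self.key_for(model_name, content)
        now = self.clock()
        cached = self._store.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]
        # Cache miss or expired entry: run the model and store the result.
        result = predict(content)
        self._store[key] = (now + self.ttl_seconds, result)
        return result
```

Including the model name in the key keeps results from different models separate; a model version would also belong in the key so that a redeployed model does not serve stale predictions.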

CDN & Static Assets:

  • Frontend: S3 + CloudFront (inferred)
  • CMS Media: S3 with CDN
  • Optimization: Image optimization, lazy loading (frontend)

Auto-Scaling Configuration:

  • Triggers: CPU > 70%, Memory > 80%, Queue depth > 1000
  • Min Instances: 2 (high availability)
  • Max Instances: Dynamic based on service
  • Cool-down: 5 minutes
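
The scale-out rule implied by these settings can be written as a small decision function. This is a sketch, not the platform's actual autoscaler: `ServiceMetrics`, `desired_instances`, and the one-instance step size are assumptions, scale-in is omitted because the configuration above does not specify it, and `max_instances` is a parameter because the maximum varies by service.

```python
from dataclasses import dataclass

# Thresholds taken from the configuration listed above.
CPU_THRESHOLD_PERCENT = 70.0
MEMORY_THRESHOLD_PERCENT = 80.0
QUEUE_DEPTH_THRESHOLD = 1000
MIN_INSTANCES = 2
COOLDOWN_SECONDS = 5 * 60


@dataclass
class ServiceMetrics:
    cpu_percent: float
    memory_percent: float
    queue_depth: int


def desired_instances(current: int, metrics: ServiceMetrics,
                      seconds_since_last_scale: float,
                      max_instances: int) -> int:
    """Return the instance count to run, adding one instance per breach."""
    # Never run below the high-availability floor.
    current = max(current, MIN_INSTANCES)
    # Inside the cool-down window, hold the current size.
    if seconds_since_last_scale < COOLDOWN_SECONDS:
        return current
    breached = (
        metrics.cpu_percent > CPU_THRESHOLD_PERCENT
        or metrics.memory_percent > MEMORY_THRESHOLD_PERCENT
        or metrics.queue_depth > QUEUE_DEPTH_THRESHOLD
    )
    if breached:
        return min(current + 1, max_instances)
    return current
```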

Performance Targets​

| Metric | Target | Current | Gap | Service |
| --- | --- | --- | --- | --- |
| API Response Time (P95) | < 200ms | Unknown ⚠️ | Needs measurement | complyai-api |
| API Response Time (P99) | < 500ms | Unknown ⚠️ | Needs measurement | complyai-api |
| ML Inference Latency | < 1s | Unknown ⚠️ | Needs measurement | triangle/violin/ipu |
| Async Task Processing | < 5min | Unknown ⚠️ | Needs measurement | api-async |
| Throughput | 10K req/day | Unknown ⚠️ | Needs measurement | Platform |
| Concurrent Users | 1,000 | Unknown ⚠️ | Needs measurement | Frontend |
| Database Query Time | < 50ms | Unknown ⚠️ | Needs measurement | PostgreSQL |
| Frontend Load Time | < 2s | Unknown ⚠️ | Web Vitals present | Frontend |
| Uptime | 99.9% | Unknown ⚠️ | Needs SLA tracking | All services |
| Error Rate | < 0.1% | Unknown ⚠️ | Needs monitoring | All services |
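
Because every "Current" value is unknown, the first step is computing percentiles from raw latency samples. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the function names are hypothetical and the default targets are the P95 and P99 API targets from the table.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_targets(samples_ms: Sequence[float],
                          p95_target_ms: float = 200.0,
                          p99_target_ms: float = 500.0) -> bool:
    """Check samples against the P95 < 200ms and P99 < 500ms API targets."""
    return (percentile(samples_ms, 95) < p95_target_ms
            and percentile(samples_ms, 99) < p99_target_ms)
```

An APM tool would normally compute these values continuously; a function like this is mainly useful for establishing a baseline from access logs before such tooling is in place.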

Performance Monitoring Gaps:

  • No APM (Application Performance Monitoring) ⚠️
  • No distributed tracing ⚠️
  • Limited metrics collection
  • No performance budgets
  • Missing SLA tracking

Bottlenecks​

Current Bottlenecks:

  1. Database (HIGH IMPACT)

    • Issue: Single shared PostgreSQL instance across API, async, core
    • Symptoms: Potential connection exhaustion, query contention
    • Mitigation: Connection pooling (SQLAlchemy)
    • Solution: Read replicas, database per service, query optimization
  2. ML Inference (HIGH IMPACT)

    • Issue: Cold start latency, GPU resource constraints
    • Symptoms: Variable response times, potential timeouts
    • Mitigation: Model loading optimization
    • Solution: Model warming, inference result caching, model optimization (ONNX)
  3. External API Rate Limits (MEDIUM IMPACT)

    • Issue: Facebook API rate limits, Stripe API limits
    • Symptoms: Async task delays, batch processing slowdowns
    • Mitigation: Backoff and retry logic
    • Solution: Request batching, caching, circuit breakers
  4. Celery Task Queue (MEDIUM IMPACT)

    • Issue: Queue depth during peak loads
    • Symptoms: Task processing delays
    • Mitigation: Worker auto-scaling
    • Solution: Priority queues, task optimization, parallel processing
  5. Shared Database Writes (MEDIUM IMPACT)

    • Issue: Write contention on shared PostgreSQL
    • Symptoms: Lock timeouts, slow writes
    • Mitigation: Transaction optimization
    • Solution: Event sourcing, CQRS pattern, database sharding
  6. No Caching Layer (MEDIUM IMPACT)

    • Issue: Repeated database queries, API calls
    • Symptoms: Higher latency, database load
    • Mitigation: Application-level memoization
    • Solution: Redis cache cluster, query result caching
  7. Monolithic Frontend Bundle (LOW IMPACT)

    • Issue: Large JavaScript bundle size
    • Symptoms: Slow initial page load
    • Mitigation: Code splitting (Vite default)
    • Solution: Lazy loading, route-based splitting, tree shaking

Performance Optimization Roadmap:

  1. Immediate (Week 1-2): Implement APM, establish baselines, identify hot paths
  2. Short-term (Month 1-2): Redis caching, database read replicas, ML result caching
  3. Medium-term (Month 3-4): Database per service, query optimization, CDN optimization
  4. Long-term (Month 5-6): Distributed tracing, database sharding, ML model optimization

Reliability & Availability​

High Availability Design​

Redundancy:

  • API Services: Multi-AZ deployment (assumed), 2+ instances minimum
  • Databases: PostgreSQL failover (needs verification) ⚠️
  • ML Services: Multiple inference instances
  • Frontend: CDN (globally distributed)
  • Gaps:
    • No documented multi-region deployment
    • Single database instance (SPOF) ⚠️

Failover Mechanisms:

  • Load Balancer: Health check-based failover (ALB/NLB)
  • Database: PostgreSQL streaming replication (needs verification) ⚠️
  • Celery: Worker redundancy, task retry logic
  • Frontend: CDN failover (automatic)
  • Issues: No documented failover procedures

Load Distribution:

  • Method: Round-robin load balancing (assumed)
  • Sticky Sessions: Not required (stateless APIs)
  • WebSocket: Not detected (if needed, requires session affinity)

Health Checks:

  • Implementation:
    • /health endpoints (inferred)
    • heart_beat Celery task (api-async)
    • Gunicorn worker health
  • Frequency: Every 30 seconds (typical)
  • Actions: Auto-recovery, instance replacement

Circuit Breakers:

  • Status: Not implemented ⚠️
  • Need: External API calls (Facebook, Stripe), ML services
  • Recommendation: Implement with exponential backoff
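
A minimal sketch of a circuit breaker whose open period doubles on each consecutive trip, as one reading of the recommendation above. The class and parameter names are hypothetical, and the default thresholds are placeholders rather than tuned values.

```python
import time
from typing import Any, Callable


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, base_delay: float = 1.0,
                 max_delay: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.clock = clock
        self._failures = 0      # consecutive failures since the last success
        self._trips = 0         # consecutive trips, drives the backoff exponent
        self._open_until = 0.0  # calls are rejected until this time

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.clock() < self._open_until:
            raise CircuitOpenError("circuit open; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and resets the backoff.
        self._failures = 0
        self._trips = 0
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(self.base_delay * (2 ** self._trips), self.max_delay)
            self._open_until = self.clock() + delay
            self._trips += 1
```

After the open period elapses the next call acts as a trial: a failure reopens the circuit immediately with a doubled delay, while a success resets it. Wrapping Facebook, Stripe, and ML service calls this way keeps a failing dependency from tying up workers.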

Disaster Recovery​

Backup Strategy:

  • PostgreSQL:
    • Frequency: Daily full, hourly incremental (assumed)
    • Retention: 30 days
    • Method: pg_dump or AWS RDS automated backups
    • Testing: Not documented ⚠️
  • S3 Data:
    • Versioning enabled (assumed)
    • Cross-region replication (not verified) ⚠️
  • Application State:
    • Stateless design (good for DR)
  • Gaps:
    • No documented backup testing procedures ⚠️
    • No backup encryption verification

Recovery Time Objective (RTO):

  • Target: < 1 hour
  • Current: Unknown (needs DR testing) ⚠️
  • Components:
    • Database restore: 15-30 minutes
    • Service redeployment: 10-15 minutes
    • DNS propagation: 5-10 minutes
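
As a quick arithmetic check, the component estimates can be summed against the one-hour target. This sketch assumes the three steps run strictly one after another and excludes detection and decision time, which the list above does not cover; the names are illustrative.

```python
from typing import Dict, Tuple

RTO_TARGET_MINUTES = 60

# (best case, worst case) in minutes, from the component estimates above.
RECOVERY_STEPS: Dict[str, Tuple[int, int]] = {
    "database_restore": (15, 30),
    "service_redeployment": (10, 15),
    "dns_propagation": (5, 10),
}


def sequential_recovery_window(steps: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum best-case and worst-case durations, assuming steps run in sequence."""
    best = sum(low for low, _ in steps.values())
    worst = sum(high for _, high in steps.values())
    return best, worst


best_case, worst_case = sequential_recovery_window(RECOVERY_STEPS)
within_target = worst_case < RTO_TARGET_MINUTES
```

Under these assumptions the estimates total 30 to 55 minutes, which leaves only about five minutes of margin in the worst case, so the target depends heavily on fast incident detection.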

Recovery Point Objective (RPO):

  • Target: < 1 hour (hourly backups)
  • Current: Unknown ⚠️
  • Critical Data: Financial transactions, compliance results
  • Acceptable Loss: Minimal for compliance data

DR Testing:

  • Frequency: Not documented ⚠️ (should be quarterly)
  • Method: Not documented ⚠️
  • Last Test: Unknown ⚠️
  • Recommendation: Implement quarterly DR drills with runbooks

DR Runbooks:

  • Status: Not detected ⚠️ CRITICAL
  • Required:
    • Database failover procedure
    • Service restoration steps
    • Data recovery process
    • Communication plan
    • Rollback procedures

Monitoring & Observability​

Logging:

  • Tool: Python logging module (comprehensive across services)
  • Aggregation: Not centralized ⚠️ (needs ELK, Datadog, or CloudWatch)
  • Coverage:
    • complyai-api: 8 files with logging
    • api-async: 35 files (most comprehensive)
    • complyai-core: 9 files
    • Other services: 3+ files each
  • Retention: File-based (90 days assumed)
  • Structure: Unstructured logs (needs JSON logging)
  • Gaps:
    • No log aggregation platform ⚠️
    • No log-based alerting
    • No correlation IDs for distributed tracing
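
Two of the gaps above, unstructured logs and missing correlation IDs, can be addressed together with a JSON formatter that reads a per-request ID from a context variable. This is a sketch using only the standard library; `JsonFormatter`, `new_correlation_id`, and the field names are hypothetical.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled ("-" outside a request).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def new_correlation_id() -> str:
    """Generate and store an ID, for example at the start of each request."""
    value = uuid.uuid4().hex
    correlation_id.set(value)
    return value
```

To use it, attach the formatter to a handler with `handler.setFormatter(JsonFormatter())` and call `new_correlation_id()` in a request hook. Passing the same ID to downstream services in a request header would let an aggregation platform stitch one request's logs together.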

Metrics:

  • Frontend:
    • Web Vitals (performance monitoring)
    • FullStory (user session recording)
  • Backend: Not implemented ⚠️
  • Database: PostgreSQL stats (not centralized)
  • Infrastructure: Cloud provider metrics (AWS CloudWatch, GCP Monitoring)
  • Gaps:
    • No application metrics (request counts, latencies) ⚠️
    • No business metrics dashboard
    • No SLA tracking

Tracing:

  • Status: Not implemented ⚠️ CRITICAL
  • Need: Distributed tracing for microservices
  • Recommendation: Implement OpenTelemetry, Jaeger, or DataDog APM
  • Benefits: Request flow visualization, bottleneck identification

Alerting:

  • Tool: Slack webhooks (8 services)
  • Scope: Error notifications, task completions
  • Channels: Development team Slack
  • Issues:
    • No PagerDuty/OpsGenie for on-call ⚠️
    • No severity classification
    • No escalation procedures
    • Alert fatigue potential (Slack-only)

Observability Maturity:

  • Logging: 6/10 (comprehensive but not aggregated)
  • Metrics: 3/10 (minimal, frontend-only)
  • Tracing: 0/10 (not implemented)
  • Alerting: 4/10 (basic Slack notifications)
  • Overall: 3.25/10 (needs significant improvement)

Observability Roadmap:

  1. Phase 1: Centralized logging (ELK or Datadog), structured JSON logs
  2. Phase 2: Application metrics (Prometheus + Grafana or Datadog)
  3. Phase 3: Distributed tracing (OpenTelemetry)
  4. Phase 4: Advanced alerting (PagerDuty integration, SLA tracking)

Deployment Architecture​

Environments​

| Environment | Purpose | Configuration | Services | Data |
| --- | --- | --- | --- | --- |
| Development | Developer work | Single instance, SQLite potential | All services, docker-compose | Synthetic/test data |
| Staging | Pre-production testing | Production-like, smaller scale | All services | Anonymized production data |
| Production | Live customer system | Highly available, multi-AZ | All services | Live customer data |
| QA/Test | Automated testing | CI/CD ephemeral | Critical services | Test fixtures |

Environment Configuration:

  • Method: Environment variables, .env files
  • Issues:
    • Hardcoded values in staging (cms/staging.py) ⚠️
    • Inconsistent config management
  • Secrets: Google Cloud Secret Manager (partial), hardcoded (3 services) ⚠️

Infrastructure Differences:

  • Development: Local or single cloud VM, minimal resources
  • Staging: AWS/GCP, production topology, 50% capacity
  • Production: Multi-AZ, auto-scaling, high redundancy

CI/CD Pipeline​

┌─────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│             │     │              │     │             │     │              │
│  Developer  │────►│  Git Push    │────►│   GitHub    │────►│    Build     │
│  Commits    │     │  to Branch   │     │   Actions   │     │    & Test    │
│             │     │              │     │  Triggered  │     │              │
└─────────────┘     └──────────────┘     └─────────────┘     └──────────────┘
                                                                     │
                                                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           BUILD & TEST STAGE                            │
│                                                                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌────────────┐   │
│  │   Linting    │  │  Unit Tests  │  │ Integration  │  │   Build    │   │
│  │   (ESLint,   │  │   (pytest,   │  │    Tests     │  │   Docker   │   │
│  │   Python)    │  │    Jest)     │  │              │  │   Images   │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └────────────┘   │
│          │                 │                 │                │         │
│          └─────────────────┴───────┬─────────┴────────────────┘         │
│                                    │                                    │
│                                    ▼                                    │
│                           ┌─────────────────┐                           │
│                           │  Quality Gates  │                           │
│                           │  Coverage > X%  │                           │
│                           │  No Critical    │                           │
│                           └─────────────────┘                           │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            DEPLOYMENT STAGE                             │
│                                                                         │
│    ┌──────────────────────────────────────────────────────────────┐     │
│    │                      STAGING DEPLOYMENT                      │     │
│    │  • Deploy to staging environment                             │     │
│    │  • Run E2E tests (Cypress)                                   │     │
│    │  • Smoke tests                                               │     │
│    │  • Manual approval gate                                      │     │
│    └──────────────────────────────────────────────────────────────┘     │
│                                    │                                    │
│                                    ▼                                    │
│    ┌──────────────────────────────────────────────────────────────┐     │
│    │                    PRODUCTION DEPLOYMENT                     │     │
│    │  • Rolling deployment (or Blue-Green)                        │     │
│    │  • Health checks                                             │     │
│    │  • Automated rollback on failure                             │     │
│    │  • Slack notification                                        │     │
│    └──────────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘

Current CI/CD Status:

  • Platform: GitHub Actions (.github repository)
  • Trigger: Git push, pull request
  • Testing: pytest (backend), Jest (frontend), Cypress (E2E - configured)
  • Quality Gates:
    • Linting: ESLint (frontend), Python linters (backend)
    • Code formatting: Prettier (frontend), Husky hooks
    • Commit validation: Commitlint (conventional commits)
    • Test coverage: Gate is not meaningful at current coverage levels (8.5% avg) ⚠️
  • Deployment: Docker images, container deployment
  • Issues:
    • No documented deployment workflows ⚠️
    • Test coverage too low for reliable gates
    • No security scanning in pipeline ⚠️

Deployment Strategy​

  • Blue-Green
  • Canary (10% → 50% → 100%)
  • Rolling (likely current approach)
  • Recreate
  • A/B Testing

Rolling Deployment Process:

  1. Build and push Docker image
  2. Update 1 instance with new version
  3. Health check validation
  4. Continue to next instance
  5. Full fleet updated incrementally
  6. Automated rollback on failure
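
The rolling process above can be sketched as a loop that updates one instance at a time and restores the previous versions on the first failed health check. The `rolling_deploy` function and the dict-based fleet are hypothetical stand-ins for whatever orchestrator performs the real rollout.

```python
from typing import Callable, Dict


def rolling_deploy(fleet: Dict[str, str], new_version: str,
                   is_healthy: Callable[[str, str], bool]) -> bool:
    """Update instances one by one; roll the whole fleet back on failure.

    `fleet` maps instance name -> running version and is modified in place.
    `is_healthy(instance, version)` stands in for the health check.
    """
    previous = dict(fleet)  # snapshot used for rollback
    for instance in list(fleet):
        fleet[instance] = new_version
        if not is_healthy(instance, new_version):
            # Automated rollback: restore every instance to its prior version.
            fleet.clear()
            fleet.update(previous)
            return False
    return True
```

Updating a single instance per step keeps most of the fleet on the known-good version while each health check runs, which is what makes the rollout zero-downtime as long as at least two instances are running.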

Deployment Frequency:

  • Current: Unknown, likely weekly/bi-weekly
  • Target for Scale: Daily (continuous deployment)

Deployment Automation:

  • Containerization: Docker (Dockerfiles present)
  • Orchestration: Not explicitly documented (ECS, GKE, or Kubernetes)
  • Configuration: Environment variables, secrets management
  • Zero-Downtime: Rolling updates, health checks

Infrastructure as Code​

Status: Not explicitly detected ⚠️

Recommendations:

  • Tool: Terraform (preferred) or AWS CloudFormation/GCP Deployment Manager
  • Repository: Separate infra repo or monorepo /infrastructure directory
  • Coverage:
    • VPC and networking
    • Compute instances (ECS/GKE)
    • Databases (RDS/Cloud SQL)
    • Load balancers
    • S3 buckets, IAM roles
  • Versioning: Git-based, tagged releases
  • State Management: Terraform Cloud or S3 backend with locking

Current Gaps:

  • No IaC detected (manual provisioning risk) ⚠️
  • No infrastructure versioning
  • Difficult disaster recovery
  • Inconsistent environments

Cost Architecture​

Cost Breakdown (Estimated)​

| Component | Monthly Cost | % of Total | Optimization Opportunity |
| --- | --- | --- | --- |
| Compute (EC2/GCE) | $3,000 - $5,000 | 35% | Right-sizing instances, spot instances for dev/staging |
| Database (RDS/Cloud SQL) | $1,500 - $2,500 | 20% | Reserved instances, read replica optimization |
| ML Inference (GPU) | $2,000 - $4,000 | 30% | Model optimization, serverless inference, caching |
| Storage (S3/GCS) | $500 - $1,000 | 8% | Lifecycle policies, compression, tiering |
| Data Transfer | $300 - $600 | 5% | CloudFront/CDN, cross-region optimization |
| External APIs | $200 - $500 | 3% | Rate limiting, caching, batch requests |
| Monitoring/Logging | $100 - $300 | 2% | Log retention policies, sampling |
| Other (SQS, Lambda, etc.) | $200 - $400 | 2% | - |
| Total Estimated | $7,800 - $14,300 | 100% | Potential 30-40% savings |

Cost Drivers:

  1. ML GPU Instances: Second-largest cost category (30%, behind general compute at 35%), driven by always-on inference servers
  2. Compute Sprawl: 13 separate services, potential over-provisioning
  3. Cross-Cloud: Data transfer between AWS and GCP
  4. No Caching: Redundant computations and API calls

Cost Optimization Strategies​

Immediate (Month 1):

  1. ML Inference Optimization ($800-1,500/month savings)

    • Implement prediction caching (Redis)
    • Use CPU instances for lower-volume models
    • Batch inference requests
    • Implement model quantization (ONNX)
  2. Right-Sizing Instances ($500-1,000/month savings)

    • Audit instance utilization
    • Downsize over-provisioned instances
    • Use burstable instances (T3, E2) for variable workloads
  3. Reserved Instances ($400-800/month savings)

    • 1-year RDS reserved instances (40% discount)
    • 1-year compute reserved instances for production
    • Savings Plans for flexible compute

Short-term (Month 2-3):

  4. Service Consolidation ($600-1,200/month savings)

    • Merge ML services (3 → 1): Reduce 2 instance costs
    • Merge API services (2 → 1): Consolidate infrastructure
    • Reduce operational overhead
  5. Storage Optimization ($200-400/month savings)

    • S3 lifecycle policies (Standard → IA → Glacier)
    • Compress large files (images, videos)
    • Delete unused/old data
  6. Data Transfer Optimization ($150-300/month savings)

    • Minimize cross-cloud transfer (AWS ↔ GCP)
    • Use CloudFront/CDN for static assets
    • Compress API responses

Medium-term (Month 4-6):

  7. Serverless Migration ($1,000-2,000/month savings)

    • Migrate low-volume endpoints to Lambda/Cloud Functions
    • Use SageMaker/Vertex AI serverless inference
    • Pay only for actual usage
  8. Auto-Scaling Optimization ($400-800/month savings)

    • Aggressive scale-down during off-hours
    • Use spot instances for non-critical workloads
    • Implement predictive scaling
  9. Database Optimization ($300-600/month savings)

    • Query optimization (reduce IOPS)
    • Connection pooling tuning
    • Consider Aurora Serverless for variable loads

Total Potential Savings: $4,350-8,600/month (roughly 30-60% of the $14,300 upper spend estimate)
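
The savings total can be reproduced by summing the nine strategy ranges. The sketch below does that and expresses the result as a share of the upper monthly spend estimate; treating the savings as additive and comparing against the $14,300 upper bound are simplifying assumptions, since several strategies target the same costs.

```python
from typing import Dict, Tuple

# (low, high) monthly savings in USD, from the nine strategies above.
SAVINGS: Dict[str, Tuple[int, int]] = {
    "ml_inference_optimization": (800, 1500),
    "right_sizing": (500, 1000),
    "reserved_instances": (400, 800),
    "service_consolidation": (600, 1200),
    "storage_optimization": (200, 400),
    "data_transfer_optimization": (150, 300),
    "serverless_migration": (1000, 2000),
    "auto_scaling_optimization": (400, 800),
    "database_optimization": (300, 600),
}

UPPER_SPEND_ESTIMATE = 14300  # upper bound of the estimated monthly total


def total_range(items: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low and high ends of every range."""
    return (sum(low for low, _ in items.values()),
            sum(high for _, high in items.values()))


low_total, high_total = total_range(SAVINGS)
low_share = round(100 * low_total / UPPER_SPEND_ESTIMATE, 1)
high_share = round(100 * high_total / UPPER_SPEND_ESTIMATE, 1)
```

Keeping the ranges in one structure makes it easy to rerun the totals as individual estimates are refined after the instance utilization audit.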

Cost Monitoring:

  • Current: Cloud provider dashboards (AWS Cost Explorer, GCP Billing)
  • Recommendation:
    • Implement cost allocation tags by service
    • Set up budget alerts
    • Use CloudHealth or Cloudability for multi-cloud visibility
    • Create cost dashboards for engineering team visibility

Design Decisions (ADRs)​

ADR-001: Microservices Architecture with Musical Naming​

Status: Accepted (Historical)

Date: Unknown (pre-2024)

Context: ComplyAI needed to scale different components independently (ML models, API, async processing) while allowing parallel team development. The platform required flexibility to deploy and update services without impacting the entire system.

Decision: Implement microservices architecture with separate repositories for each service. Adopt "musical" naming convention (triangle, violin, maestro, gong, ipu) suggesting orchestrated workflows and harmony between services.

Consequences:

  • ✅ Positive:
    • Independent scaling of services (ML vs API)
    • Parallel team development
    • Technology flexibility (Flask + Django + React)
    • Fault isolation
    • Clear orchestration metaphor
  • ❌ Negative:
    • Service fragmentation (13 repos)
    • Operational complexity
    • Increased deployment overhead
    • Naming ambiguity (purpose unclear from name)
    • Higher infrastructure costs
    • Shared database coupling

Alternatives Considered:

  1. Monolithic architecture (rejected: scaling limitations)
  2. Modular monolith (rejected: deployment coupling)
  3. Functional naming (rejected in favor of thematic consistency)

Current Recommendation: Consolidate to 5-6 core services while maintaining microservices benefits


ADR-002: Dual API Pattern (Sync + Async)​

Status: Accepted (Historical)

Date: Unknown (pre-2024)

Context: ComplyAI needed to handle both real-time user requests (<200ms) and long-running operations (Facebook data fetch, ML inference, batch processing). Single API couldn't efficiently serve both patterns.

Decision: Implement two separate API services:

  • complyai-api: Synchronous REST API for real-time operations
  • api-async: Asynchronous API with Celery for background tasks

Consequences:

  • ✅ Positive:
    • Optimized performance for each pattern
    • Async processing doesn't block sync requests
    • Independent scaling
    • Clear separation of concerns
  • ❌ Negative:
    • Code duplication (shared models via complyai-core)
    • Shared database (tight coupling)
    • Deployment complexity (2 services vs 1)
    • Inconsistent patterns across APIs

Alternatives Considered:

  1. Single API with async task queue (rejected: mixing concerns)
  2. API Gateway with backend services (not evaluated initially)
  3. GraphQL with subscriptions (rejected: team expertise)

Current Recommendation: Consolidate into unified API Gateway with async capabilities built-in


ADR-003: Three Separate ML Inference Services​

Status: Accepted (Historical), Under Review

Date: Unknown (pre-2024)

Context: ML compliance checking required specialized models for different domains (text, image, policy analysis). Models had different resource requirements and update cycles.

Decision: Create three separate ML inference services (triangle, violin, ipu) each with identical tech stack (TensorFlow, PyTorch, Flask, chromadb) but different models.

Consequences:

  • ✅ Positive:
    • Model-specific optimization
    • Independent deployment/updates
    • Fault isolation (one model failure doesn't affect others)
    • Specialized resource allocation (GPU vs CPU)
  • ❌ Negative:
    • 3x infrastructure costs
    • Code duplication (551 + 357 + 418 = 1,326 LOC, mostly identical)
    • Operational overhead (3 services to monitor/deploy)
    • No centralized model registry
    • Inconsistent security (no auth)

Alternatives Considered:

  1. Single ML service with model routing (rejected: complexity concerns)
  2. Serverless inference (not evaluated: team expertise)
  3. Managed ML platform (partially adopted with Vertex AI in gong)

Current Recommendation: CONSOLIDATE into unified ML platform with model registry


ADR-004: Cross-Cloud Deployment (AWS + GCP)​

Status: Accepted (Historical), Under Review

Date: Unknown (pre-2024)

Context: ComplyAI started on one cloud provider but needed specific services from another (e.g., BigQuery for analytics, Vertex AI for managed ML).

Decision: Deploy across both AWS and GCP:

  • AWS: API services, S3 storage, SQS queues
  • GCP: Core engine, ML services, BigQuery, Vertex AI

Consequences:

  • ✅ Positive:
    • Best-of-breed services from each cloud
    • Vendor lock-in reduction
    • Access to GCP's AI/ML platforms
    • BigQuery for powerful analytics
  • ❌ Negative:
    • Increased complexity (dual cloud management)
    • Higher data transfer costs (inter-cloud)
    • Dual IAM/security management
    • Team needs expertise in both clouds
    • Disaster recovery complexity

Alternatives Considered:

  1. AWS-only with Redshift + SageMaker (rejected: GCP AI preference)
  2. GCP-only migration (rejected: AWS investment)
  3. Multi-cloud with abstraction layer (rejected: overhead)

Current Recommendation: Consolidate to single cloud or implement clear service boundaries


ADR-005: Shared Database Pattern​

Status: Accepted (Historical), Deprecated

Date: Unknown (pre-2024)

Context: complyai-api, api-async, and complyai-core needed access to the same data models. Team prioritized rapid development over strict service isolation.

Decision: Share a single PostgreSQL database across complyai-api, api-async, and complyai-core services. Use complyai-core for shared model definitions.

Consequences:

  • ✅ Positive:
    • No data synchronization needed
    • Shared transactions (ACID compliance)
    • Faster initial development
    • Simplified data access
  • ❌ Negative:
    • Tight coupling (microservices anti-pattern) ⚠️
    • Schema changes affect all services
    • Single point of failure
    • Scaling bottleneck
    • Can't independently deploy database migrations

Alternatives Considered:

  1. Database per service with API communication (rejected: development speed)
  2. Event sourcing with CQRS (rejected: complexity)
  3. GraphQL federation (not evaluated)

Current Status: DEPRECATED - Violates microservices principles

Recommendation: Migrate to database per service with API-based communication


ADR-006: Django/Wagtail CMS Alongside Flask Services​

Status: Accepted (Historical)

Date: Unknown (pre-2024)

Context: ComplyAI needed robust content management for regulatory content, documentation, and knowledge base. Wagtail CMS provided rich editing experience but required Django.

Decision: Introduce Django/Wagtail as separate service (complyai_cms) despite rest of backend using Flask.

Consequences:

  • ✅ Positive:
    • Powerful CMS with editorial workflow
    • Rich content editing experience
    • Version control and publishing workflow
    • Strong community and plugins
  • ❌ Negative:
    • Technology stack fragmentation (Flask + Django)
    • Different ORM patterns (SQLAlchemy vs Django ORM)
    • Separate deployment pipeline
    • Team needs Django expertise
    • Different security patterns

Alternatives Considered:

  1. Flask-Admin (rejected: insufficient CMS features)
  2. Headless CMS (Contentful, Strapi) (not evaluated initially)
  3. Custom Flask CMS (rejected: development time)

Current Recommendation: Evaluate headless CMS migration to unify tech stack


ADR-007: Frontend Technology Split (App vs Marketing)​

Status: Accepted (Historical)

Date: Unknown (pre-2024)

Context: ComplyAI needed both a feature-rich application (complyai-frontend) and a marketing website (www) with different requirements and update cycles.

Decision: Build two separate React applications:

  • complyai-frontend: TypeScript, Vite, complex state management (Zustand, TanStack Query)
  • www: JavaScript, Create React App, simpler marketing site

Consequences:

  • ✅ Positive:
    • Optimized for different use cases
    • Independent deployment/updates
    • Smaller bundle size for marketing site
    • Clear separation of concerns
  • ❌ Negative:
    • Code/component duplication
    • Inconsistent user experience potential
    • Duplicate dependencies/tooling
    • 2 deployment pipelines

Alternatives Considered:

  1. Monorepo with shared components (not evaluated initially)
  2. Unified app with marketing routes (rejected: bundle size)
  3. Next.js for both (not evaluated: team expertise)

Current Recommendation: Maintain separation but share component library via monorepo


ADR-008: Auth0 for Authentication​

Status: Accepted

Date: Unknown (2024 or earlier)

Context: ComplyAI needed enterprise-grade authentication with MFA, OAuth2, social logins, and compliance features without building custom auth system.

Decision: Adopt Auth0 as managed authentication provider for complyai-frontend.

Consequences:

  • ✅ Positive:
    • Enterprise-grade security out-of-box
    • MFA, OAuth2, OIDC support
    • Social login integrations
    • SOC 2 compliant
    • Reduces development time
    • Regular security updates
  • ❌ Negative:
    • Monthly cost per active user
    • Vendor lock-in
    • Less control over auth flow
    • Requires internet connectivity

Alternatives Considered:

  1. Custom Flask-Security (rejected: security risk, maintenance)
  2. AWS Cognito (rejected: Auth0 features preferred)
  3. Firebase Auth (rejected: GCP preference for managed services)

Current Status: Approved - Auth0 is an appropriate choice for the Series A stage

Technical Debt​

Known Issues​

1. Test Coverage Crisis ⚠️ CRITICAL

  • Issue: Average test coverage 8.5%, with 7 services at 0%
  • Impact: HIGH - Production bugs, risky deployments, slow development
  • Affected Services:
    • api-async: 0% (20,143 LOC) - HIGHEST RISK
    • complyai-violin: 0% (357 LOC)
    • complyai-gong: 0% (1,031 LOC)
    • complyai-api: 1.39% (15,630 LOC)
    • complyai_cms: 1.32% (2,771 LOC)
  • Mitigation: Manual testing, cautious deployments
  • Resolution Plan:
    • Month 1: Achieve 40% coverage for critical paths
    • Month 2: Achieve 60% for business logic
    • Month 3: Achieve 80% for core services
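
Because the services listed above differ greatly in size, a LOC-weighted figure can say more about exposure than a per-service average. The sketch below uses only the five services and figures listed above; the weighting approach is an assumption, and the result is not a platform-wide number because the remaining services are not included.

```python
# Coverage (%) and LOC as listed in the Affected Services bullets above.
services = {
    "api-async": (0.0, 20143),
    "complyai-violin": (0.0, 357),
    "complyai-gong": (0.0, 1031),
    "complyai-api": (1.39, 15630),
    "complyai_cms": (1.32, 2771),
}


def weighted_coverage(stats):
    """Return LOC-weighted coverage in percent."""
    total_loc = sum(loc for _, loc in stats.values())
    covered_loc = sum(pct / 100 * loc for pct, loc in stats.values())
    return 100 * covered_loc / total_loc


print(f"{weighted_coverage(services):.2f}%")  # about 0.64% for these five services
```

On these inputs, fewer than one line in a hundred is exercised by tests, which is why api-async dominates the risk picture.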

2. Hardcoded Secrets ⚠️ CRITICAL

  • Issue: Production secrets hardcoded in 3 services
  • Impact: CRITICAL - Security breach risk, audit failure
  • Locations:
    • complyai-maestro/config.py
    • complyai_cms/dev.py, staging.py
    • complyai-adhoc/slack_data/*.py
  • Mitigation: Limited access to repositories
  • Resolution Plan:
    • Week 1: Migrate all secrets to AWS Secrets Manager/GCP Secret Manager
    • Week 2: Implement automated secret scanning in CI/CD
    • Week 3: Security audit and rotation
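
The Week 2 scanning step could start from something as small as the following sketch. It is illustrative only: the two patterns and the function names are assumptions, and a dedicated scanner with a maintained rule set would be the more realistic choice for CI/CD.

```python
import re
from pathlib import Path

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    # AWS access key IDs start with "AKIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Identifier containing a sensitive word, assigned a quoted literal of 8+ chars.
    "hardcoded_credential": re.compile(
        r"(?i)\b\w*(password|secret|api_key|token)\w*\s*=\s*['\"][^'\"]{8,}['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, rule_name) pairs for suspicious lines."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, rule))
    return findings


def scan_tree(root):
    """Scan every .py file under root; returns {path: findings}."""
    results = {}
    for path in Path(root).rglob("*.py"):
        findings = scan_text(path.read_text(encoding="utf-8", errors="ignore"))
        if findings:
            results[str(path)] = findings
    return results
```

A CI job could call `scan_tree(".")` and fail the build when the result is non-empty. Values read from the environment, such as `os.environ["API_KEY"]`, are not flagged because no quoted literal follows the assignment.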

3. Missing ML Service Authentication ⚠️ HIGH

  • Issue: triangle, violin, ipu have no authentication
  • Impact: HIGH - Unauthorized access, resource abuse, data leakage
  • Affected: 3 ML services
  • Mitigation: Network-level isolation (VPC)
  • Resolution Plan:
    • Week 1-2: Implement JWT authentication for ML endpoints
    • Week 2-3: Add rate limiting
    • Week 4: API key management system
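
To make the JWT step concrete, here is a minimal sketch of signing and verifying an HS256 token with only the Python standard library. The function names and the choice of a shared secret are assumptions for illustration; a production service would more likely use a vetted JWT library and keys held in a secrets manager.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 JWT from the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url_encode(_sign(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims if signature and expiry are valid, else raise PermissionError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError as exc:
        raise PermissionError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise PermissionError("unsupported algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("bad signature")
    current = time.time() if now is None else now
    exp = claims.get("exp") if isinstance(claims, dict) else None
    if not isinstance(exp, (int, float)) or exp <= current:
        raise PermissionError("token expired or missing exp")
    return claims
```

An ML endpoint would call `verify_token` on the bearer token before running inference and reject the request on `PermissionError`. Pinning the accepted algorithm and comparing signatures in constant time are the two checks this sketch is meant to highlight.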

4. Insecure HTTP Communication ⚠️ HIGH

  • Issue: 21 occurrences of unencrypted HTTP communication across 7 services
  • Impact: HIGH - Man-in-the-middle attacks, data interception
  • Affected: 7 services (API, core, maestro, CMS, adhoc, triangle, violin, ipu)
  • Mitigation: VPC isolation, load balancer HTTPS termination
  • Resolution Plan:
    • Week 1: Audit all HTTP calls
    • Week 2: Enforce HTTPS for internal service communication
    • Week 3: Update configurations, remove HTTP fallbacks
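
The Week 1 audit can be partly automated. The following sketch lists plain `http://` URLs in a block of source text while ignoring local development hosts; the function name and the exclusion list are assumptions, and URLs assembled at runtime would not be caught.

```python
import re

HTTP_URL = re.compile(r"http://[^\s'\"<>)]+")
# Local development hosts are excluded from the report (an assumption).
LOCAL_HOSTS = ("http://localhost", "http://127.0.0.1")


def find_insecure_urls(text):
    """Return (line_number, url) pairs for non-local plain-HTTP URLs."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in HTTP_URL.finditer(line):
            url = match.group(0)
            if not url.startswith(LOCAL_HOSTS):
                findings.append((number, url))
    return findings
```

Running this over each repository gives a starting inventory that can be compared against the 21 occurrences already counted.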

5. Shared Database Anti-Pattern ⚠️ HIGH

  • Issue: PostgreSQL shared across complyai-api, api-async, core
  • Impact: HIGH - Tight coupling, scaling bottleneck, deployment risk
  • Affected: 3 core services
  • Mitigation: Careful schema migration coordination
  • Resolution Plan:
    • Month 1-2: Implement API-based inter-service communication
    • Month 3-4: Separate databases with data migration
    • Month 5: Remove direct database coupling

6. No Monitoring/Observability ⚠️ HIGH

  • Issue: No APM, distributed tracing, or centralized metrics
  • Impact: HIGH - Blind to performance issues, slow incident resolution
  • Affected: All services
  • Mitigation: Reactive debugging, log file analysis
  • Resolution Plan:
    • Month 1: Implement APM (Datadog, New Relic)
    • Month 2: Centralized logging (ELK, Datadog)
    • Month 3: Distributed tracing (OpenTelemetry)
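
Centralized logging is easier to adopt when every service already emits structured records. A minimal sketch using the standard `logging` module follows; the field names `service` and `request_id` are assumptions, not an existing ComplyAI convention.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            # Values passed via `extra=` become attributes on the record.
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A service would log with `logger.info("compliance check finished", extra={"service": "complyai-api", "request_id": "req-123"})`. Carrying the same `request_id` across services is what later lets log lines be correlated with traces.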

7. Missing API Documentation ⚠️ MEDIUM

  • Issue: No OpenAPI/Swagger specs for 13 services
  • Impact: MEDIUM - Slow integration, developer friction, errors
  • Affected: All API services
  • Mitigation: Code review, team knowledge
  • Resolution Plan:
    • Month 1: Generate OpenAPI specs for core APIs
    • Month 2: Interactive API documentation (Swagger UI)
    • Month 3: Auto-generate from code annotations
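
As an illustration of what a Month 1 deliverable might look like, the sketch below builds a minimal OpenAPI 3.0 document as a Python dictionary. The path `/compliance-checks` and the fields `ad_text` and `compliance_score` are hypothetical; the actual endpoints are not documented here.

```python
import json


def build_openapi_spec():
    """Return a minimal OpenAPI 3.0 document for one hypothetical endpoint."""
    return {
        "openapi": "3.0.3",
        "info": {"title": "ComplyAI API (illustrative)", "version": "0.1.0"},
        "paths": {
            "/compliance-checks": {  # hypothetical path
                "post": {
                    "summary": "Submit an ad creative for a compliance check",
                    "requestBody": {
                        "required": True,
                        "content": {"application/json": {"schema": {
                            "type": "object",
                            "required": ["ad_text"],
                            "properties": {"ad_text": {"type": "string"}},
                        }}},
                    },
                    "responses": {
                        "200": {
                            "description": "Compliance score for the creative",
                            "content": {"application/json": {"schema": {
                                "type": "object",
                                "properties": {
                                    "compliance_score": {"type": "number"},
                                },
                            }}},
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_openapi_spec(), indent=2))
```

The JSON output can be loaded into Swagger UI for the Month 2 step. Generating specs from code annotations (Month 3) would replace the hand-written dictionary.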

8. No Infrastructure as Code ⚠️ MEDIUM

  • Issue: Manual infrastructure provisioning
  • Impact: MEDIUM - Inconsistent environments, slow DR, config drift
  • Affected: All infrastructure
  • Mitigation: Documented manual steps
  • Resolution Plan:
    • Month 1: Terraform for new resources
    • Month 2-3: Import existing infrastructure to Terraform
    • Month 4: Full IaC with CI/CD integration

9. Service Fragmentation ⚠️ MEDIUM

  • Issue: 13 separate services, some redundant
  • Impact: MEDIUM - High operational overhead, cost, complexity
  • Affected: All services
  • Mitigation: Clear service boundaries, documentation
  • Resolution Plan:
    • Month 1: Consolidate ML services (3 → 1)
    • Month 2: Merge API services (2 → 1)
    • Month 3: Evaluate frontend consolidation
    • Target: 13 → 5-6 services

10. Cross-Cloud Complexity ⚠️ MEDIUM

  • Issue: AWS + GCP deployment increases complexity and cost
  • Impact: MEDIUM - Higher costs, complex DR, dual expertise needed
  • Affected: Infrastructure
  • Mitigation: Clear cloud service boundaries
  • Resolution Plan:
    • Month 1-2: Analyze cloud usage patterns
    • Month 3-4: Develop single-cloud migration plan
    • Month 5-6: Execute migration (if beneficial)

Improvement Roadmap​

Q4 2025 (Months 1-3) - Foundation & Critical Issues:

Month 1:

  • Remove all hardcoded secrets → Secrets Manager ⚠️ CRITICAL
  • Implement authentication for ML services ⚠️ HIGH
  • Fix insecure HTTP communication (21 occurrences) ⚠️ HIGH
  • Achieve 40% test coverage for critical paths ⚠️ CRITICAL
  • Implement APM and centralized logging ⚠️ HIGH
  • Generate OpenAPI specs for core APIs
  • Database read replicas for performance

Month 2:

  • Achieve 60% test coverage for business logic
  • Consolidate ML services (triangle + violin + ipu → unified)
  • Implement Redis caching layer
  • Start Terraform implementation
  • Security audit and penetration testing
  • Implement circuit breakers for external APIs

Month 3:

  • Achieve 80% test coverage for core services
  • Merge API services (complyai-api + api-async → unified)
  • Distributed tracing implementation
  • Complete IaC migration to Terraform
  • Implement database per service pattern
  • DR testing and runbook creation

Q1 2026 (Months 4-6) - Optimization & Scale:

Month 4:

  • Frontend consolidation evaluation (app + www)
  • Database sharding for scale
  • Implement GraphQL layer for frontend
  • Advanced caching strategies
  • Cost optimization (reserved instances, serverless)

Month 5:

  • Service mesh implementation (Istio, Linkerd)
  • Kubernetes migration planning
  • Comprehensive E2E test suite (Cypress)
  • ML model registry and versioning
  • Advanced monitoring and SLA tracking

Month 6:

  • Complete service consolidation (13 → 5-6 services)
  • Multi-region deployment
  • Automated performance testing
  • SOC 2 Type 1 certification
  • Series A technical readiness complete

Q2 2026 (Months 7-9) - Advanced Features:

  • SOC 2 Type 2 audit preparation
  • Advanced ML features (model explainability, A/B testing)
  • International expansion support (i18n, data residency)
  • Advanced analytics and BI integration
  • Platform API for third-party integrations

Q3 2026 (Months 10-12) - Enterprise Ready:

  • SOC 2 Type 2 certification
  • Enterprise features (SSO, SAML, advanced RBAC)
  • White-label capabilities
  • Advanced compliance reporting
  • Platform maturity for Series B

Risk Assessment​

Architecture Risks​

| Risk | Likelihood | Impact | Mitigation | Priority |
|---|---|---|---|---|
| Security breach from hardcoded secrets | High | Critical | Immediate migration to Secrets Manager, audit | 🔴 P0 |
| Production outage from untested code (8.5% coverage) | High | Critical | Aggressive test development, canary deployments | 🔴 P0 |
| ML service abuse (no auth) | Medium | High | Implement auth + rate limiting immediately | 🔴 P0 |
| Database failure (single PostgreSQL) | Medium | Critical | Implement read replicas, backup testing | 🟠 P1 |
| Service fragmentation causing deployment failures | Medium | High | Service consolidation roadmap | 🟠 P1 |
| Cross-cloud data transfer costs spike | Medium | Medium | Monitor costs, implement caching | 🟡 P2 |
| Scaling bottleneck from shared database | High | High | Database per service migration | 🟠 P1 |
| Vendor lock-in with Auth0 | Low | Medium | Document migration path, API abstraction | 🟢 P3 |
| Compliance failure (GDPR, SOC 2) | Medium | High | Security audit, compliance roadmap | 🟠 P1 |
| Team knowledge concentration (bus factor) | Medium | High | Documentation, knowledge sharing, cross-training | 🟠 P1 |
| Infrastructure drift (no IaC) | Medium | Medium | Terraform implementation | 🟡 P2 |
| API breaking changes affecting clients | Low | Medium | Versioning strategy, deprecation policy | 🟢 P3 |
| ML model performance degradation | Medium | Medium | Model monitoring, A/B testing, rollback capability | 🟡 P2 |
| Cost overrun from inefficient architecture | Medium | Medium | Cost optimization roadmap, monitoring | 🟡 P2 |

Dependencies & Single Points of Failure​

Single Points of Failure (SPOFs):

  1. PostgreSQL Database ⚠️ CRITICAL

    • Impact: Complete platform outage
    • Affected: complyai-api, api-async, core, CMS (4 services)
    • Current Mitigation: Daily backups (assumed)
    • Required:
      • Immediate: Set up read replicas with auto-failover
      • Short-term: Database per service
      • Long-term: Distributed or highly available database (e.g., CockroachDB, or PostgreSQL with Patroni)
  2. Auth0 Service ⚠️ HIGH

    • Impact: No user authentication, platform inaccessible
    • Affected: complyai-frontend, all authenticated endpoints
    • Current Mitigation: Auth0 SLA (99.99% uptime)
    • Required:
      • Session token caching for short-term access
      • Fallback authentication mechanism
      • Regular backup of user data
  3. AWS S3 (Model Storage) ⚠️ HIGH

    • Impact: ML inference failure, no model loading
    • Affected: triangle, violin, ipu (3 ML services)
    • Current Mitigation: S3 durability (99.999999999%); note that durability protects against data loss and is not an availability guarantee
    • Required:
      • Model caching on inference servers
      • Cross-region S3 replication
      • Fallback model versions
  4. Facebook Business API ⚠️ MEDIUM

    • Impact: No ad data sync, delayed compliance checks
    • Affected: api-async, adhoc
    • Current Mitigation: Retry logic, error handling
    • Required:
      • Circuit breaker pattern
      • Data caching for known accounts
      • Alternative data sources
  5. Stripe Payment API ⚠️ MEDIUM

    • Impact: Payment processing failure, subscription issues
    • Affected: complyai-api, core, frontend
    • Current Mitigation: Stripe SLA (99.99% uptime)
    • Required:
      • Webhook retry mechanism
      • Payment status reconciliation
      • Manual payment processing fallback
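
The circuit breaker pattern listed for the Facebook Business API can be sketched as follows. This is a minimal, single-threaded illustration; the class name, thresholds, and injectable clock are assumptions, and a real deployment would also need thread safety and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_ad_accounts, account_id)` where `fetch_ad_accounts` is a hypothetical client function, and fall back to cached data when `CircuitOpenError` is raised.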

Critical External Dependencies:

| Dependency | Service Owner | SLA | Fallback Strategy |
|---|---|---|---|
| Auth0 | Auth0 (Okta) | 99.99% | Session caching, grace period |
| Stripe | Stripe Inc. | 99.99% | Webhook retries, manual processing |
| Facebook API | Meta | Best effort | Cached data, scheduled retries |
| Google Vertex AI | GCP | 99.95% | Fallback to local models |
| BigQuery | GCP | 99.99% | Query caching, delayed analytics |
| AWS S3 | AWS | 99.99% | Cross-region replication, cache |
| PostgreSQL | Self-hosted | Unknown ⚠️ | Backups, failover (needed) |

Mitigation Priorities:

  1. P0 (This Week): PostgreSQL read replicas and failover
  2. P1 (This Month): Model caching, circuit breakers for external APIs
  3. P2 (Next Quarter): Multi-region deployment, service redundancy

Evolution & Roadmap​

Current State Assessment (SWOT)​

Strengths:

  • ✅ Modern Tech Stack: React 18, TypeScript, Python 3, TensorFlow/PyTorch
  • ✅ Microservices Foundation: Independent scaling and deployment
  • ✅ AI/ML Capabilities: Multiple specialized models for compliance analysis
  • ✅ Security Awareness: Encryption patterns, audit logging, Auth0
  • ✅ Active Development: Recent commits, engaged team
  • ✅ Good Documentation Coverage: 60%+ in several services
  • ✅ Best-in-Class Integrations: Stripe, Auth0, Facebook, Google Cloud
  • ✅ Managed Services: Leveraging cloud-native (Vertex AI, BigQuery)

Weaknesses:

  • ⚠️ Critical Test Gap: 8.5% average coverage, 7 services at 0%
  • ⚠️ Hardcoded Secrets: 3 services with production credentials in code
  • ⚠️ Service Fragmentation: 13 repos creating operational burden
  • ⚠️ Missing Auth: ML services exposed without authentication
  • ⚠️ Shared Database: Tight coupling between core services
  • ⚠️ No Observability: Missing APM, tracing, centralized metrics
  • ⚠️ Insecure Communication: 21 HTTP occurrences
  • ⚠️ No IaC: Manual infrastructure, config drift risk
  • ⚠️ Cross-Cloud Complexity: AWS + GCP operational overhead
  • ⚠️ Missing API Docs: No OpenAPI/Swagger specifications

Opportunities:

  • 💡 Service Consolidation: 13 → 5-6 services (30-40% cost savings)
  • 💡 ML Platform: Unified ML service with model registry
  • 💡 API Gateway: Single entry point, better security/observability
  • 💡 Serverless ML: Cost-effective inference for variable loads
  • 💡 GraphQL: Flexible frontend queries, reduced API calls
  • 💡 Platform API: Third-party integrations, ecosystem growth
  • 💡 International Expansion: Multi-language, data residency
  • 💡 Enterprise Features: SSO, SAML, advanced RBAC for upmarket
  • 💡 Compliance Certifications: SOC 2, ISO 27001 for enterprise sales
  • 💡 ML Marketplace: Pluggable compliance models for different industries

Threats:

  • 🚨 Security Breach: Hardcoded secrets, missing auth (investor concern)
  • 🚨 Production Outage: Low test coverage risk (customer churn)
  • 🚨 Scaling Failure: Shared database bottleneck (growth blocker)
  • 🚨 Technical Debt: Slowing development velocity (competitive risk)
  • 🚨 Talent Retention: Complexity burden (team burnout)
  • 🚨 Cost Overrun: Inefficient architecture (runway impact)
  • 🚨 Compliance Failure: Missing SOC 2 (enterprise sales blocker)
  • 🚨 Vendor Lock-in: Auth0, cross-cloud dependencies
  • 🚨 API Breaking Changes: No versioning strategy (customer impact)
  • 🚨 Data Loss: No DR testing (existential risk)

Future State Vision (12 months)​

Target Architecture (Q4 2026):

┌──────────────────────────────────────────────────────────────────────┐
│                         UNIFIED WEB PLATFORM                         │
│                  (React/Next.js - App + Marketing)                   │
│                      Auth0 + Stripe + Analytics                      │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          API GATEWAY LAYER                           │
│                 (GraphQL + REST with Rate Limiting)                  │
│               Authentication + Authorization + Logging               │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         MICROSERVICES LAYER                          │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │
│ │  Compliance  │ │ ML Platform  │ │Orchestration │ │Content/Policy│  │
│ │   Service    │ │  (Unified)   │ │   Service    │ │   Service    │  │
│ │    (Core)    │ │              │ │  (Maestro)   │ │    (Gong)    │  │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                              DATA LAYER                              │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐  │
│ │  PostgreSQL  │ │    Redis     │ │  Vector DB   │ │   BigQuery   │  │
│ │  (Per Svc)   │ │   (Cache)    │ │ (Embeddings) │ │ (Analytics)  │  │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Key Characteristics:

  • 5 Core Services: Consolidated from 13
  • API Gateway: Kong, AWS API Gateway, or GCP Apigee
  • Unified ML Platform: Single service with model registry
  • Database Per Service: Proper microservices isolation
  • Comprehensive Observability: OpenTelemetry, Datadog/New Relic
  • Infrastructure as Code: 100% Terraform-managed
  • 80%+ Test Coverage: Across all services
  • SOC 2 Type 2 Certified: Enterprise-ready security
  • Multi-Region: AWS/GCP with active-active DR
  • Platform API: Third-party integrations enabled

Technology Evolution:

  • Frontend: Migrate to Next.js for SSR, better SEO, unified app
  • API: GraphQL gateway with REST fallback
  • ML: Serverless inference (SageMaker/Vertex AI) + GPU when needed
  • Cache: Redis Cluster with persistence
  • Monitoring: OpenTelemetry → Datadog/New Relic + Grafana
  • IaC: Terraform with Atlantis for automation
  • CI/CD: GitHub Actions with advanced workflows, security scanning

Migration Path​

Phase 1: Foundation (Months 1-3) - "Security & Stability"

Goals: Fix critical security issues, establish observability, improve reliability

Deliverables:

  1. ✅ Remove all hardcoded secrets → AWS Secrets Manager/GCP Secret Manager
  2. ✅ Implement ML service authentication (JWT + rate limiting)
  3. ✅ Fix all insecure HTTP communication (enforce HTTPS)
  4. ✅ Achieve 60% test coverage for critical paths
  5. ✅ Implement APM and distributed tracing (Datadog/OpenTelemetry)
  6. ✅ Centralized logging (ELK stack or Datadog)
  7. ✅ Database read replicas with auto-failover
  8. ✅ Redis caching layer for API and ML predictions
  9. ✅ Generate OpenAPI specs for all APIs
  10. ✅ DR testing and runbook creation
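
The caching deliverable for ML predictions follows the cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs without external services; the class, key format, and TTL value are assumptions.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self._ttl, value)


def cache_key(payload):
    """Derive a stable key from the request payload."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return "prediction:" + hashlib.sha256(canonical).hexdigest()


def cached_predict(cache, predict, payload):
    """Cache-aside: return a cached result, else compute and store it."""
    key = cache_key(payload)
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = predict(payload)
    cache.set(key, result)
    return result
```

Hashing the canonical JSON of the payload means identical ad content maps to the same key regardless of field order. A result of `None` is treated as a miss in this sketch, so a predictor that can legitimately return `None` would need a sentinel.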

Success Metrics:

  • Zero hardcoded secrets detected by automated scanning
  • 100% HTTPS enforcement (0 HTTP violations)
  • 60%+ test coverage (up from 8.5%)
  • < 200ms P95 API response time with observability
  • 99.9% uptime with monitoring evidence
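
The P95 target is a percentile, not an average, so it is worth being explicit about how it would be computed. The sketch below uses the nearest-rank method on made-up latency samples; both the sample values and the choice of method are assumptions, and APM tools may interpolate differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up API latencies in milliseconds (20 samples).
latencies_ms = [120, 95, 180, 210, 130, 150, 99, 175, 160, 140,
                110, 105, 190, 125, 135, 145, 155, 165, 170, 450]

p95 = percentile(latencies_ms, 95)
print(f"P95 = {p95} ms, target met: {p95 < 200}")
```

For these samples the P95 is 210 ms, which misses the 200 ms target even though most requests are considerably faster.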

Phase 2: Consolidation (Months 4-6) - "Simplification & Scale"

Goals: Reduce service count, unify architecture, prepare for scale

Deliverables:

  1. ✅ Consolidate ML services: triangle + violin + ipu → Unified ML Platform
    • Single codebase with model registry
    • Model versioning and A/B testing
    • Serverless inference option
  2. ✅ Merge API services: complyai-api + api-async → Unified API Gateway
    • GraphQL + REST support
    • Built-in async capabilities (Celery)
    • Rate limiting and auth at gateway
  3. ✅ Migrate to database per service pattern
    • Separate databases for each service
    • API-based inter-service communication
    • Event-driven patterns for data sync
  4. ✅ Infrastructure as Code: 100% Terraform coverage
  5. ✅ Service mesh implementation (Istio/Linkerd)
  6. ✅ Achieve 80% test coverage across all services
  7. ✅ Frontend consolidation evaluation (app + www)
  8. ✅ Security audit and penetration testing
  9. ✅ SOC 2 Type 1 certification initiated
  10. ✅ Cost optimization: 30%+ reduction through consolidation

Success Metrics:

  • Service count: 13 → 7 (about 46% reduction)
  • Infrastructure cost: 30%+ reduction
  • Deployment time: 50% faster
  • Test coverage: 80%+ (up from 60%)
  • SOC 2 Type 1 audit passed

Phase 3: Enterprise Ready (Months 7-9) - "Compliance & Features"

Goals: Enterprise features, compliance certifications, platform maturity

Deliverables:

  1. ✅ SOC 2 Type 2 certification
  2. ✅ Enterprise SSO (SAML, SCIM provisioning)
  3. ✅ Advanced RBAC and audit trails
  4. ✅ Multi-region deployment (active-active)
  5. ✅ Platform API for third-party integrations
  6. ✅ Comprehensive E2E test suite (Cypress)
  7. ✅ Advanced ML features:
    • Model explainability (SHAP, LIME)
    • Drift detection and retraining automation
    • Custom model upload for enterprise
  8. ✅ Frontend migration to Next.js (unified app + marketing)
  9. ✅ GraphQL API gateway fully operational
  10. ✅ Final consolidation: 7 → 5 services

Success Metrics:

  • SOC 2 Type 2 certified
  • Enterprise features: SSO, SCIM, advanced RBAC
  • Multi-region: < 10ms cross-region latency
  • API ecosystem: 5+ third-party integrations
  • Service count: 5 core services (final target)

Phase 4: Platform & Scale (Months 10-12) - "Growth Enablement"

Goals: Platform scalability, international expansion, Series B readiness

Deliverables:

  1. ✅ International expansion:
    • Multi-language support (i18n)
    • Data residency compliance (EU, APAC)
    • Regional compliance models
  2. ✅ ML Marketplace:
    • Industry-specific compliance models
    • Partner model integrations
    • Custom model training service
  3. ✅ Advanced Analytics & BI:
    • Real-time dashboards
    • Predictive compliance insights
    • Customer success metrics
  4. ✅ White-label capabilities for enterprise
  5. ✅ Kubernetes migration (if not already done)
  6. ✅ Advanced auto-scaling and cost optimization
  7. ✅ Chaos engineering and resiliency testing
  8. ✅ Performance: 10K concurrent users, < 100ms P95
  9. ✅ Documentation: Comprehensive platform docs, API guides
  10. ✅ Series B technical due diligence package complete

Success Metrics:

  • 10K+ concurrent users supported
  • Multi-region: 3+ regions (US, EU, APAC)
  • Platform API: 20+ third-party integrations
  • Revenue: API/platform revenue stream
  • Series B ready: All technical gates passed

Migration Risks & Mitigation:

| Risk | Impact | Mitigation |
|---|---|---|
| Service consolidation causes downtime | High | Blue-green deployments, feature flags, gradual migration |
| Database migration data loss | Critical | Comprehensive backup, migration rehearsals, rollback plan |
| Breaking changes for clients | Medium | API versioning, deprecation notices, backward compatibility |
| Team bandwidth for migration | High | Hire contractors for feature development, dedicate team to migration |
| Cost spike during transition | Medium | Gradual migration, parallel run minimization, cost monitoring |
| Security regression | Critical | Security testing at each phase, penetration testing, audit gates |
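
The "feature flags, gradual migration" mitigation is often implemented as a deterministic percentage rollout. The sketch below hashes a flag name and an entity ID into a stable bucket from 0 to 99; the function names and the flag name used in the usage note are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name, entity_id):
    """Map (flag, entity) to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{entity_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag_name, entity_id, rollout_percent):
    """True when the entity falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, entity_id) < rollout_percent
```

For example, `is_enabled("unified-api", account_id, 10)` would route roughly a tenth of accounts to the consolidated service. Because the bucket is stable, raising the percentage only adds accounts and never flips an already migrated account back.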

References​

Internal Documents:

API Documentation (To Be Created):

  • OpenAPI/Swagger specifications (NEEDED ⚠️)
  • GraphQL schema documentation (FUTURE)
  • Integration guides (NEEDED ⚠️)

Operational Runbooks (To Be Created):

  • Service deployment procedures (NEEDED ⚠️)
  • Disaster recovery playbooks (NEEDED ⚠️ CRITICAL)
  • Incident response guides (NEEDED ⚠️)
  • Security incident procedures (NEEDED ⚠️)

Appendices​

Glossary​

Architecture Terms:

  • Microservices: Architectural style structuring application as collection of loosely coupled services
  • API Gateway: Single entry point for all client requests, routing to appropriate microservice
  • Circuit Breaker: Design pattern preventing cascading failures in distributed systems
  • CQRS: Command Query Responsibility Segregation - separating read/write operations
  • Event Sourcing: Storing state changes as sequence of events
  • Service Mesh: Infrastructure layer handling service-to-service communication
  • Blue-Green Deployment: Deployment strategy using two identical environments
  • Canary Deployment: Gradual rollout to subset of users before full deployment

ComplyAI-Specific Terms:

  • Musical Services: ComplyAI's pattern of naming services after musical terms (triangle, violin, maestro, gong, ipu)
  • Sync API: complyai-api - handles real-time synchronous requests
  • Async API: api-async - handles background/long-running asynchronous tasks
  • Core Engine: complyai-core - shared business logic and models
  • ML Triad: triangle, violin, ipu - three specialized ML inference services
  • Orchestrator: complyai-maestro - coordinates multi-service workflows
  • Policy Engine: complyai-gong - AI-powered policy analysis using Vertex AI
  • Compliance Score: AI-generated score indicating ad compliance level
  • Ad Creative: Advertising content (images, video, text) submitted for compliance checking

Technology Terms:

  • TensorFlow/PyTorch: Deep learning frameworks for ML model training and inference
  • ChromaDB (chromadb): Vector database for storing ML embeddings and similarity search
  • Celery: Distributed task queue for async processing
  • Gunicorn: Python WSGI HTTP server for running Flask applications
  • Wagtail: Django-based content management system
  • Auth0: Managed authentication and authorization platform
  • Vertex AI: Google Cloud's managed ML platform
  • BigQuery: Google Cloud's data warehouse for analytics

Compliance Terms:

  • GDPR: General Data Protection Regulation (EU data privacy)
  • SOC 2: System and Organization Controls 2 (formerly Service Organization Control 2; security audit framework)
  • PCI DSS: Payment Card Industry Data Security Standard
  • Audit Trail: Chronological record of system activities for compliance
  • Data Residency: Requirement for data to be stored in specific geographic location
  • Right to Erasure: GDPR right for users to have personal data deleted

Metrics Collection​

How Metrics Are Collected:

Application Metrics (To Be Implemented):

  • Method: Prometheus exporters or Datadog APM agents
  • Metrics: Request counts, latency histograms, error rates, throughput
  • Storage: Time-series database (Prometheus/Datadog)
  • Visualization: Grafana dashboards or Datadog UI
  • Location: To be centralized at /metrics endpoints

Infrastructure Metrics (Current):

  • AWS: CloudWatch metrics (EC2, RDS, S3, SQS)
  • GCP: Cloud Monitoring (GCE, BigQuery, Vertex AI)
  • Access: AWS Console, GCP Console, CLI tools
  • Dashboards: Cloud provider dashboards (limited)

Frontend Metrics (Current):

  • Tool: FullStory (user sessions), Web Vitals
  • Metrics: Page load time, FCP, LCP, CLS, FID, TTI
  • Access: FullStory dashboard, browser DevTools
  • Location: Client-side instrumentation

Business Metrics (To Be Implemented):

  • Source: Application events, database queries
  • Metrics: User signups, compliance checks performed, subscriptions, revenue
  • Dashboards: To be built in BI tool (Looker, Tableau, Metabase)

Where to Find Metrics:

  1. Cloud Dashboards: AWS CloudWatch, GCP Monitoring Console
  2. Application Logs: Service log files (needs centralization)
  3. FullStory: User behavior and frontend performance
  4. Database: PostgreSQL stats, slow query logs
  5. Future: Datadog/New Relic unified dashboard (recommended)

Contact Information​

| Role | Name | Contact | Responsibility |
|---|---|---|---|
| Solutions Architect | SkaFld Studio Team | - | Architecture documentation, Series A readiness |
| Technical Lead | [To Be Filled] | - | Architecture decisions, technical strategy |
| Engineering Manager | [To Be Filled] | - | Team coordination, delivery |
| DevOps Lead | [To Be Filled] | - | Infrastructure, CI/CD, deployments |
| Security Lead | [To Be Filled] | - | Security architecture, compliance |
| ML Lead | [To Be Filled] | - | ML models, inference optimization |
| Product Manager | [To Be Filled] | - | Product roadmap, requirements |

Document Metadata​

Review Schedule​

  • Next Review Date: 2025-11-02 (Monthly)
  • Review Frequency: Monthly during migration phases, then quarterly
  • Review Triggers:
    • Major architectural changes
    • Post-incident reviews
    • Pre-fundraising updates
    • Quarterly business reviews

Change Log​

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-10-02 | SkaFld Studio Team | Initial comprehensive current architecture documentation based on codebase analysis |

Tags​

#architecture #complyai #current-state #microservices #python #ml-services #compliance #series-a-readiness #technical-analysis #flask #django #react #tensorflow #pytorch #aws #gcp #security-audit #technical-debt #consolidation-roadmap