Current Architecture: ComplyAI Compliance Platform
Executive Summary
ComplyAI operates a distributed microservices architecture comprising 13 separate services built primarily on Python (77%), with a modern React/TypeScript frontend stack. The platform implements an AI-powered compliance checking system for advertising content, utilizing multiple ML models (TensorFlow, PyTorch, Google Vertex AI) distributed across specialized services. The architecture employs a "musical" naming convention (triangle, violin, maestro, gong) suggesting orchestrated workflows, with dual API implementations (sync/async), cross-cloud deployment (AWS + GCP), and a mixed Flask/Django backend pattern. Current state shows ~64,000 lines of code with significant technical debt: average 8.5% test coverage, hardcoded secrets in 3 services, and service fragmentation requiring consolidation for Series A readiness.
Business Context
Problem Statement
ComplyAI addresses the complex challenge of ensuring advertising compliance across multiple regulatory frameworks and platforms. The system must:
- Process high volumes of ad content in real-time and batch modes
- Apply ML-based compliance analysis using multiple specialized models
- Manage complex compliance policies and regulatory content
- Provide actionable insights and compliance scores
- Integrate with external platforms (Facebook, Stripe, analytics)
- Support both synchronous user interactions and asynchronous background processing
Success Criteria
- Scalability: Handle 10K+ compliance checks per day with <200ms API response time
- Accuracy: ML models maintain >95% accuracy in compliance detection
- Availability: 99.9% uptime for customer-facing services
- Security: SOC 2 compliance with comprehensive audit logging and encryption
- Developer Velocity: Onboard new developers within 2 weeks (currently limited by documentation gaps)
- Cost Efficiency: Optimize ML inference costs while maintaining performance
Constraints
Technical Constraints:
- Legacy microservices fragmentation (13 separate repos)
- Mixed technology stack (Flask + Django + React)
- Zero-downtime deployment requirements
- Cross-cloud dependencies (AWS + GCP)
Business Constraints:
- Limited engineering team (10 contributors)
- Series A readiness timeline
- Existing customer SLA commitments
- Budget constraints on infrastructure costs
Regulatory Constraints:
- GDPR compliance for user data
- PCI DSS for payment processing (via Stripe)
- Advertising platform compliance requirements
- Data retention and audit requirements
Architecture Overview
System Context Diagram
                        External Systems
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
        ▼                       ▼                       ▼
[Facebook Business API]  [Stripe Payment]        [Google Sheets]
        │                       │                       │
        └───────────────────────┼───────────────────────┘
                                │
                                ▼
               ┌─────────────────────────────────┐
               │        ComplyAI Platform        │
               │                                 │
               │  ┌───────────────────────────┐  │
               │  │   Frontend Applications   │  │
               │  │   (App + Marketing Site)  │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │        API Gateway        │  │
               │  │    (Sync + Async APIs)    │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │     Core Engine + ML      │  │
               │  │   + Orchestration Layer   │  │
               │  └───────────────────────────┘  │
               │               │                 │
               │               ▼                 │
               │  ┌───────────────────────────┐  │
               │  │      Data & Storage       │  │
               │  │  (PostgreSQL + S3 + CMS)  │  │
               │  └───────────────────────────┘  │
               └─────────────────────────────────┘
                                │
                                ▼
                     [Users & Administrators]
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                          PRESENTATION LAYER                          │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │  complyai-frontend  │          │         www         │           │
│   │ (React/TypeScript)  │          │  (Marketing Site)   │           │
│   │  Auth0 + Stripe UI  │          │  (React/Tailwind)   │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          API GATEWAY LAYER                           │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-api     │          │      api-async      │           │
│   │    (Flask/Sync)     │◄────────►│   (Flask/Celery)    │           │
│   │     15,630 LOC      │          │     20,143 LOC      │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                 BUSINESS LOGIC & ORCHESTRATION LAYER                 │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-core    │          │  complyai-maestro   │           │
│   │ (Flask/Core Logic)  │◄────────►│   (Orchestration)   │           │
│   │     13,654 LOC      │          │   GCP Integration   │           │
│   └─────────────────────┘          └─────────────────────┘           │
│                                                                      │
│   ┌─────────────────────┐          ┌─────────────────────┐           │
│   │    complyai-gong    │          │   complyai-adhoc    │           │
│   │ (Vertex AI/Policy)  │          │ (Utilities/8K LOC)  │           │
│   └─────────────────────┘          └─────────────────────┘           │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                          ML INFERENCE LAYER                          │
│   ┌─────────────┐      ┌─────────────┐      ┌─────────────┐          │
│   │  triangle   │      │   violin    │      │     ipu     │          │
│   │ (TF/PyTorch)│      │ (TF/PyTorch)│      │ (TF/PyTorch)│          │
│   │   551 LOC   │      │   357 LOC   │      │   418 LOC   │          │
│   └─────────────┘      └─────────────┘      └─────────────┘          │
│           All use: chromadb (vectors) + gunicorn + boto3             │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                              DATA LAYER                              │
│   ┌────────────────────┐    ┌──────────────┐    ┌────────────────┐   │
│   │     PostgreSQL     │    │    AWS S3    │    │  complyai_cms  │   │
│   │ (Primary Database) │    │ (File Store) │    │(Django/Wagtail)│   │
│   │   Shared by APIs   │    │    Models    │    │  Content Mgmt  │   │
│   └────────────────────┘    └──────────────┘    └────────────────┘   │
│                                                                      │
│   ┌────────────────────┐    ┌──────────────┐                         │
│   │      chromadb      │    │   BigQuery   │                         │
│   │   (Vector Store)   │    │  (Analytics) │                         │
│   └────────────────────┘    └──────────────┘                         │
└──────────────────────────────────────────────────────────────────────┘
Architecture Style
- ✅ Microservices (13 separate services)
- ❌ Monolithic
- ⚠️ Serverless (partial - potential Lambda usage)
- ✅ Event-driven (Celery task queues, async processing)
- ✅ Layered (presentation, API, business logic, data)
- ❌ Service-oriented
- ✅ Hybrid: Microservices with layered organization, event-driven async processing
Key Characteristics:
- Distributed architecture with separate repositories per service
- Musical naming pattern: triangle, violin, ipu, maestro, gong (orchestration metaphor)
- Dual API pattern: Synchronous (complyai-api) + Asynchronous (api-async)
- Specialized ML services: Three separate inference engines
- Cross-cloud deployment: AWS (API, storage) + GCP (Core, ML, Frontend)
- Mixed frameworks: Flask (9 services), Django (1 service), React (2 frontends)
Components
Core Components
| Component | Purpose | Technology | LOC | Status | Test Coverage |
|---|---|---|---|---|---|
| complyai-api | Main synchronous REST API | Flask, PostgreSQL, Celery | 15,630 | Active | 1.39% |
| api-async | Asynchronous processing API | Flask, Celery, TensorFlow | 20,143 | Active | 0% ⚠️ |
| complyai-core | Core compliance engine & models | Flask, PostgreSQL, Auth | 13,654 | Active | 8.20% |
| complyai-frontend | Main application UI | React 18, TypeScript, Vite | N/A | Active | N/A |
| www | Marketing website | React 18, Tailwind CSS | 73 | Active | 40% |
| complyai-triangle | ML inference service #1 | TensorFlow, PyTorch, Flask | 551 | Active | 11.76% |
| complyai-violin | ML inference service #2 | TensorFlow, PyTorch, Flask | 357 | Active | 0% ⚠️ |
| complyai-ipu | ML inference service #3 | TensorFlow, PyTorch, Flask | 418 | Active | 18.18% |
| complyai-maestro | Service orchestration | Flask, GCP, BigQuery | 2,289 | Active | 17.65% |
| complyai-gong | Policy management + Vertex AI | Flask, Vertex AI, GCP | 1,031 | Active | 0% |
| complyai_cms | Content management system | Django, Wagtail, Webpack | 2,771 | Active | 1.32% |
| complyai-adhoc | Utility scripts & tools | Python scripts | 8,156 | Active | 60.71% ✅ |
| .github | CI/CD workflows | GitHub Actions | N/A | Active | N/A |
Total Lines of Code: ~64,000+ across all services
Component Interactions
Technology Stack
Frontend
complyai-frontend (Main Application):
- Framework: React 18.2 with TypeScript
- Build Tool: Vite (fast HMR, modern bundling)
- State Management:
- Zustand (global state)
- Jotai (atomic state)
- TanStack Query (server state, caching)
- UI Library:
- Tailwind CSS (utility-first styling)
- Headless UI (accessible components)
- Heroicons (icon library)
- Framer Motion (animations)
- Authentication: Auth0 React SDK
- Payment: React Stripe.js (PCI compliant)
- Data Visualization: recharts, react-gauge-component
- Testing: Cypress (E2E), Jest, React Testing Library
- Code Quality: ESLint, Prettier, Husky, Commitlint
- Routing: React Router DOM v6
www (Marketing Site):
- Framework: React 18.2 with JavaScript
- Build Tool: Create React App (React Scripts)
- Styling: Tailwind CSS + SCSS/Sass
- SEO: React Helmet Async (meta tags)
- Icons: FontAwesome
- Animations: AOS (Animate On Scroll)
- Testing: Jest, React Testing Library
- Performance: Web Vitals monitoring
Backend
Primary Stack (9/13 services):
- Language: Python 3.x
- Framework: Flask (complyai-api, api-async, core, triangle, violin, ipu, maestro, gong)
- ORM: Flask-SQLAlchemy, Alembic (migrations)
- Auth: Flask-Security-Too, PyJWT, Authlib (OAuth)
- WSGI Server: Gunicorn
- API Features: Flask-CORS, Flask-Migrate
Alternative Stack (1 service):
- Framework: Django + Wagtail CMS (complyai_cms)
- Frontend Build: Webpack + Tailwind CSS
- Static Files: WhiteNoise
API Specifications:
- Style: REST (JSON)
- Versioning: URL-based (v1, v2 auth endpoints)
- Authentication: JWT tokens, OAuth2, API keys
- Rate Limiting: Configured but inconsistently implemented
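The JWT flow named above can be illustrated without external dependencies. This is a stand-in sketch (the services themselves use PyJWT/Authlib) showing the HS256-style sign-and-verify round trip by hand; the secret and claim names are illustrative:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # illustrative only; real keys come from a secret manager


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as JWT uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def issue_token(payload, ttl=3600):
    """Sign a JWT-like token: header.payload.signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps({**payload, "exp": int(time.time()) + ttl}).encode())
    sig = _b64(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"


def verify_token(token):
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        header, body, sig = token.split(".")
    except ValueError:
        return None
    expected = _b64(hmac.new(SECRET, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    padded = body + "=" * (-len(body) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return payload if payload["exp"] > time.time() else None
```

Tampering with any character of the token invalidates the HMAC, which is the property the API-to-API auth relies on.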
Data Layer
Primary Database:
- RDBMS: PostgreSQL (via psycopg2-binary/psycopg)
- Usage: Shared across complyai-api, api-async, complyai-core
- Migrations: Alembic (Flask), Django migrations (CMS)
Vector Storage:
- Database: chromadb (ML embeddings)
- Usage: ML services (triangle, violin, ipu)
- Purpose: Semantic search, similarity matching
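Conceptually, the lookup the ML services run against chromadb is a nearest-neighbour search over embeddings. A dependency-free sketch of that cosine-similarity step (the real code goes through the chromadb client API; the store layout here is an assumption for illustration):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def top_k(query, store, k=1):
    """store: {doc_id: embedding}. Return the k most similar doc ids."""
    ranked = sorted(store, key=lambda doc_id: cosine(query, store[doc_id]), reverse=True)
    return ranked[:k]
```

chromadb performs the same ranking with approximate-nearest-neighbour indexing so it stays fast at scale.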
File Storage:
- Primary: AWS S3 (via boto3)
- Usage: ML models, user uploads, media files, data lake
- CDN: CloudFront (inferred)
Analytics:
- Warehouse: Google BigQuery
- Usage: complyai-maestro (data pipelines, reporting)
- Integration: google-cloud-bigquery SDK
Content Management:
- CMS: Django Wagtail
- Storage: PostgreSQL + S3 (media)
- API: Wagtail REST API for frontend consumption
Cache Layer:
- Not explicitly detected (Redis recommended for implementation)
Message Queue:
- Queue: Celery with SQS backend (inferred from Celery[sqs])
- Broker: AWS SQS or Redis (not explicitly documented)
- Workers: Distributed Celery workers for async tasks
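A Celery-on-SQS setup consistent with the Celery[sqs] dependency can be sketched as a config module. Queue name, region, and timeout values below are illustrative assumptions, not documented settings:

```python
# celeryconfig.py - sketch only; values are assumptions, not documented config.
broker_url = "sqs://"            # empty URL: credentials resolved via IAM role
broker_transport_options = {
    "region": "us-east-1",       # assumption: actual region not documented
    "visibility_timeout": 3600,  # seconds a task stays invisible after pickup
    "polling_interval": 1,       # seconds between SQS polls
}
task_default_queue = "complyai-default"  # hypothetical queue name
task_acks_late = True            # re-deliver tasks if a worker dies mid-task
worker_prefetch_multiplier = 1   # fair dispatch for long-running ML tasks
result_backend = None            # results are written to PostgreSQL by the tasks
```

With `visibility_timeout` shorter than the longest task, SQS would re-deliver in-flight work, so it should exceed the worst-case task duration.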
ML/AI Stack
Deep Learning Frameworks:
- TensorFlow: triangle, violin, ipu, api-async
- PyTorch: triangle, violin, ipu, api-async
- Transformers: Hugging Face (NLP models)
- ONNX: Model optimization and portability
- Optimum: Hugging Face optimization library
Google Cloud AI:
- Platform: Google Vertex AI (complyai-gong)
- Usage: Managed ML inference, policy analysis
- Integration: vertexai Python SDK
ML Infrastructure:
- Vector Database: chromadb (embeddings storage)
- Model Storage: AWS S3 (via boto3)
- Data Processing: pandas, numpy, pyarrow
- Computer Vision: opencv-python (complyai-violin)
- NLP: gensim, scikit-learn, datasets
ML Deployment:
- Serving: Flask REST endpoints
- Scaling: Horizontal (multiple inference instances)
- Model Format: Native + ONNX for optimization
Infrastructure
Cloud Providers:
- AWS: Primary for API services, S3 storage, SQS queues
- Services: EC2 (inferred), S3, SQS, CloudFront
- SDK: boto3 (8 services)
- Google Cloud Platform: Core engine, ML, analytics
- Services: Vertex AI, BigQuery, Cloud Storage, Cloud SQL, Secret Manager
- SDK: google-cloud-* libraries (maestro, gong)
Deployment:
- Containerization: Docker (Dockerfile present in multiple services)
- Orchestration: Not explicitly documented (likely ECS or GKE)
- WSGI: Gunicorn for Python services
- Static Hosting: S3 + CloudFront or similar for frontends
CI/CD:
- Platform: GitHub Actions (.github repository)
- Hooks: Husky (pre-commit) in frontend projects
- Commit Standards: Commitlint (conventional commits)
- Testing: Automated via pytest, Cypress
Monitoring (Partial Implementation):
- Logging: Python logging module (comprehensive)
- Notifications: Slack webhooks (across services)
- User Analytics: FullStory (frontend)
- Metrics: Web Vitals (frontend)
- APM: Not detected (needs implementation)
- Error Tracking: Not explicitly configured
Development Tools:
- Version Control: Git + GitHub
- Package Management: npm (frontend), pip + requirements.txt (backend)
- Testing: pytest, Cypress, Jest
- Code Quality: ESLint, Prettier (frontend), Python linting (backend)
- Dependency Management: Dependabot potential via .github
Data Architecture
Data Flow Diagram
┌──────────────────────────────────────────────────────────────────────┐
│                            DATA INGESTION                            │
│                                                                      │
│  [User Input] ──► [Frontend] ──► [complyai-api] ──► [PostgreSQL]     │
│                                                                      │
│  [Facebook API] ──► [api-async] ──► [PostgreSQL]                     │
│                                                                      │
│  [Stripe Webhooks] ──► [complyai-api] ──► [PostgreSQL]               │
│                                                                      │
│  [Content Editors] ──► [CMS] ──► [PostgreSQL + S3]                   │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        DATA PROCESSING LAYER                         │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │              SYNCHRONOUS PROCESSING                │              │
│  │ [complyai-api] ─► [complyai-core] ─► [Validation]  │              │
│  │                        │                           │              │
│  │                 [Business Logic]                   │              │
│  │                        │                           │              │
│  │                [PostgreSQL Write]                  │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │             ASYNCHRONOUS PROCESSING                │              │
│  │ [api-async] ─► [Celery Tasks] ─► [External APIs]   │              │
│  │      │               │                 │           │              │
│  │ [FB Ad Data]   [AI Scoring]      [Batch Jobs]      │              │
│  │      │               │                 │           │              │
│  │ [PostgreSQL]    [ML Models]      [S3/BigQuery]     │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │               ML INFERENCE PIPELINE                │              │
│  │ [Input Data] ─► [Preprocessing] ─► [chromadb]      │              │
│  │                        │                           │              │
│  │     [Model Inference (triangle/violin/ipu)]        │              │
│  │                        │                           │              │
│  │          [Compliance Score + Insights]             │              │
│  │                        │                           │              │
│  │                  [PostgreSQL]                      │              │
│  └────────────────────────────────────────────────────┘              │
│                                                                      │
│  ┌────────────────────────────────────────────────────┐              │
│  │          ORCHESTRATED WORKFLOWS (maestro)          │              │
│  │ [Trigger] ─► [Workflow Definition] ─► [Task Queue] │              │
│  │                        │                           │              │
│  │    [Parallel Service Calls: API + ML + Data]       │              │
│  │                        │                           │              │
│  │               [Aggregate Results]                  │              │
│  │                        │                           │              │
│  │      [BigQuery (Analytics) + Notifications]        │              │
│  └────────────────────────────────────────────────────┘              │
└──────────────────────────────────────────────────────────────────────┘
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                           DATA CONSUMPTION                           │
│                                                                      │
│  [Frontend] ◄── [REST API] ◄── [PostgreSQL + S3]                     │
│                                                                      │
│  [Reports] ◄── [BigQuery] ◄── [Aggregated Analytics]                 │
│                                                                      │
│  [Dashboards] ◄── [TanStack Query Cache] ◄── [API]                   │
│                                                                      │
│  [Exports] ◄── [Google Sheets Integration] ◄── [API]                 │
└──────────────────────────────────────────────────────────────────────┘
Data Models
Key Entities
1. User & Authentication
- Purpose: User accounts, authentication, authorization
- Key Attributes: user_id, email, password_hash, roles, permissions, oauth_tokens
- Relationships: → Organizations, → Subscriptions, → Ad Accounts
- Storage: PostgreSQL (complyai-core models)
- Security: Encrypted passwords (Flask-Security-Too), JWT tokens, OAuth2
2. Ad Account
- Purpose: Facebook ad account data and monitoring
- Key Attributes: account_id, fb_account_id, status, spend, metrics, sync_timestamp
- Relationships: → User, → Ad Creatives, → Compliance Checks
- Storage: PostgreSQL (shared models)
- Sync: Real-time via api-async + Facebook Business API
3. Ad Creative
- Purpose: Advertising creative content for compliance checking
- Key Attributes: creative_id, content, images, video_urls, status, platform
- Relationships: → Ad Account, → Compliance Results, → Policy Violations
- Storage: PostgreSQL (metadata), S3 (media files)
- Processing: Async via Celery tasks
4. Compliance Check
- Purpose: Compliance analysis results and scoring
- Key Attributes: check_id, creative_id, ai_score, policy_violations[], timestamp, model_version
- Relationships: → Ad Creative, → Policy Rules, → ML Inference Results
- Storage: PostgreSQL (structured), chromadb (embeddings)
- Generation: ML services (triangle, violin, ipu)
5. Policy
- Purpose: Compliance policies and regulatory rules
- Key Attributes: policy_id, name, rules[], jurisdiction, version, effective_date
- Relationships: → Compliance Checks, → Content (CMS)
- Storage: PostgreSQL (structured), Wagtail CMS (content)
- Management: complyai-gong (AI-powered analysis)
6. Subscription
- Purpose: User subscriptions and payment tracking
- Key Attributes: subscription_id, user_id, plan, status, stripe_subscription_id
- Relationships: → User, → Payments
- Storage: PostgreSQL
- Processing: Stripe webhooks → complyai-api
7. Workflow
- Purpose: Orchestrated multi-service workflows
- Key Attributes: workflow_id, type, status, steps[], results, created_by
- Relationships: → Tasks, → Services
- Storage: PostgreSQL (complyai-maestro)
- Execution: Celery task chains
8. ML Model Metadata
- Purpose: ML model versioning and tracking
- Key Attributes: model_id, name, version, s3_path, accuracy_metrics, created_date
- Relationships: → Inference Results
- Storage: PostgreSQL (metadata), S3 (model files)
- Services: triangle, violin, ipu
9. Content
- Purpose: Regulatory content and documentation
- Key Attributes: content_id, title, body, category, publish_date, author
- Relationships: → Policies
- Storage: PostgreSQL (via Wagtail), S3 (media)
- Management: complyai_cms (Django/Wagtail)
10. Audit Log
- Purpose: Comprehensive audit trail for compliance
- Key Attributes: log_id, user_id, action, entity_type, entity_id, timestamp, ip_address
- Relationships: → All entities
- Storage: PostgreSQL (across services - 8+ files with logging)
- Retention: 7 years (compliance requirement)
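The Workflow entity (7) above maps naturally onto chained task execution. A minimal, framework-free sketch of that pattern (in production this is a Celery chain run by complyai-maestro; the step names here are hypothetical):

```python
def run_workflow(steps, payload):
    """Run steps in order, feeding each step's output to the next.

    steps: list of (name, callable) pairs. Returns (status, results), where
    results records per-step output, mirroring the Workflow.steps[]/results
    attributes described above.
    """
    results = {}
    for name, func in steps:
        try:
            payload = func(payload)
        except Exception as exc:
            # Record the failure and stop, so the workflow status is queryable.
            results[name] = {"status": "failed", "error": str(exc)}
            return "failed", results
        results[name] = {"status": "done", "output": payload}
    return "completed", results


# Hypothetical two-step compliance workflow: fetch creatives, then score them.
steps = [
    ("fetch", lambda p: p + ["ad-1"]),
    ("score", lambda p: {"ads": p, "score": 0.97}),
]
```

Celery's `chain` adds what this sketch omits: durable queuing, retries, and distribution across workers.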
Data Storage Strategy
| Data Type | Storage Solution | Rationale | Retention | Backup |
|---|---|---|---|---|
| Transactional | PostgreSQL | ACID compliance, relational integrity | 7 years | Daily |
| User Credentials | PostgreSQL (encrypted) | Security, fast auth lookups | Account lifetime | Daily |
| ML Models | AWS S3 | Large files, versioning, cost-effective | Indefinite | Versioned |
| Media Files | AWS S3 | Scalable, CDN integration | 2 years | Versioned |
| Vector Embeddings | chromadb | Fast similarity search, ML optimized | 90 days | Weekly |
| Analytics | Google BigQuery | Complex queries, data warehouse | 2 years | N/A (managed) |
| Cache | (Not implemented) | Need Redis for session/API cache | 24 hours | None |
| Logs | File system + potential CloudWatch | Debugging, audit trails | 90 days | Weekly |
| CMS Content | PostgreSQL (Wagtail) | Structured content, versioning | Indefinite | Daily |
| Task Queue | SQS/Redis (Celery) | Async job management | 14 days | None |
Storage Patterns:
- Shared Database: complyai-api, api-async, complyai-core share PostgreSQL instance
- Service-Specific: ML services have isolated chromadb instances
- Hybrid Cloud: PostgreSQL on AWS, BigQuery on GCP, cross-cloud data movement
- Data Lake: S3 used as data lake for analytics and ML training
Data Security
Encryption at Rest:
- Method: AES-256 encryption for sensitive fields
- Implementation: Dedicated encryption module in complyai-core
- Coverage: User credentials, payment data (tokenized via Stripe), PII
- Key Management: Environment variables (needs upgrade to AWS KMS/Google Secret Manager)
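The shape of such a field-encryption module can be illustrated with a small encrypt-then-MAC wrapper. Note the cipher below is a stdlib placeholder so the example runs anywhere; the real module uses AES-256 (e.g. via the `cryptography` package). The authenticate-before-decrypt structure is the point, not the cipher:

```python
import base64
import hashlib
import hmac
import os


def _keys(master_key):
    """Derive separate encryption and integrity keys from one master key."""
    return (hashlib.sha256(master_key + b"enc").digest(),
            hashlib.sha256(master_key + b"mac").digest())


def encrypt_field(plaintext, master_key):
    """Encrypt-then-MAC a field value. NOT real crypto: the repeating-keystream
    XOR below is a placeholder for AES-256-GCM and is insecure for fields
    longer than 32 bytes."""
    enc_key, mac_key = _keys(master_key)
    nonce = os.urandom(16)
    stream = hashlib.sha256(enc_key + nonce).digest()
    ct = bytes(b ^ stream[i % len(stream)] for i, b in enumerate(plaintext.encode()))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return base64.b64encode(nonce + ct + tag).decode()


def decrypt_field(token, master_key):
    """Verify integrity first, then decrypt; raise on any tampering."""
    enc_key, mac_key = _keys(master_key)
    raw = base64.b64decode(token)
    nonce, ct, tag = raw[:16], raw[16:-32], raw[-32:]
    if not hmac.compare_digest(tag, hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()):
        raise ValueError("integrity check failed")
    stream = hashlib.sha256(enc_key + nonce).digest()
    return bytes(b ^ stream[i % len(stream)] for i, b in enumerate(ct)).decode()
```

The fresh random nonce per call means equal plaintexts encrypt to different ciphertexts, which prevents equality leaks in the database.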
Encryption in Transit:
- Method: TLS 1.2+ (HTTPS)
- Implementation: Load balancer termination, internal service calls
- Issues: Insecure HTTP detected in 21 occurrences across 7 services ⚠️
- Recommendation: Enforce HTTPS for all internal communication
Access Controls:
- Database: PostgreSQL role-based access control
- Application: Flask-Security-Too, JWT-based authorization
- API: Token-based auth, OAuth2 for third-party
- Cloud: IAM roles (AWS), Cloud IAM (GCP)
- Issue: ML services lack authentication ⚠️
Data Classification:
- Public: Marketing content (www, CMS public pages)
- Internal: Business logic, configurations
- Confidential: User data, compliance results
- Restricted: Payment data (tokenized), ML models
PII Handling:
- Approach: Minimize collection, encrypt at rest, tokenize payments
- GDPR: Basic patterns detected (data_privacy in complyai-core)
- Retention: Configurable per data type
- Deletion: Soft delete with anonymization (needs verification)
- Access Logging: Comprehensive audit trails (8+ services)
Secrets Management:
- Current: Environment variables, google-cloud-secret-manager (maestro, gong)
- Issues:
- Hardcoded secrets in 3 services (maestro, CMS, adhoc) ⚠️ CRITICAL
- Inconsistent approach across services
- Recommendation: Migrate all to AWS Secrets Manager or HashiCorp Vault
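A low-friction path to that migration is a single helper every service resolves secrets through, preferring the managed store and falling back to the environment during cutover. This is a sketch with injectable fetchers; in production the first fetcher would wrap, e.g., boto3's Secrets Manager `get_secret_value` call:

```python
import os


def get_secret(name, fetchers=()):
    """Resolve a secret by trying each fetcher in order, then the environment.

    fetchers: callables taking a secret name and returning its value or None.
    Making them injectable keeps this testable and lets services migrate to a
    managed store (AWS Secrets Manager, Vault) without touching call sites.
    """
    for fetch in fetchers:
        try:
            value = fetch(name)
            if value is not None:
                return value
        except Exception:
            continue  # fall through to the next source
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} not found in any source")
    return value
```

Once every hardcoded secret is routed through this one entry point, rotating the backing store becomes a one-line change.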
Integration Architecture
External Integrations
| System | Integration Type | Purpose | Protocol | Auth Method | Services Using |
|---|---|---|---|---|---|
| Facebook Business API | REST API | Ad account data, creative fetch | HTTPS | OAuth2 + API Key | api-async, complyai-adhoc |
| Stripe | REST API + Webhooks | Payment processing | HTTPS | API Key | complyai-api, complyai-core, frontend |
| Auth0 | OAuth2/OIDC | User authentication | HTTPS | Client credentials | complyai-frontend |
| Google Sheets | REST API | Data export/import | HTTPS | OAuth2 | complyai-api |
| Slack | Webhooks | Notifications, alerts | HTTPS | Webhook URL | 8 services (widespread) |
| BigQuery | Cloud SDK | Analytics, data warehouse | gRPC/HTTPS | Service Account | complyai-maestro |
| Google Vertex AI | Cloud SDK | Managed ML inference | gRPC/HTTPS | Service Account | complyai-gong |
| AWS S3 | SDK (boto3) | File storage, models | HTTPS | IAM Role/Keys | 8 services |
| SQS | SDK (boto3) | Message queue (Celery) | HTTPS | IAM Role/Keys | api-async, maestro |
Internal Service Communication
Synchronous Communication:
- Primary: REST APIs (JSON over HTTPS)
- Pattern: complyai-frontend → complyai-api → complyai-core → PostgreSQL
- Libraries: requests, Flask built-in
- Issues:
- No API gateway (direct service-to-service)
- Inconsistent error handling
- Missing circuit breakers
Asynchronous Communication:
- Primary: Celery task queues (SQS/Redis backend)
- Pattern: API triggers → Celery task → Worker execution → Result storage
- Use Cases:
- Facebook ad data fetching (fetch_fb_current_ad_data, fetch_fb_total_ad_data)
- AI score generation (generate_ai_score)
- Batch processing (spend_script)
- Health monitoring (heart_beat)
- Orchestration: complyai-maestro coordinates complex workflows
Service Discovery:
- Method: Environment variables with service URLs
- Limitations: No dynamic service discovery (Consul, Eureka)
- Configuration: Hardcoded endpoints in config files
Load Balancing:
- Approach: Likely ALB/NLB (AWS) or Cloud Load Balancer (GCP)
- Level: L7 (application level)
- Worker Scaling: Celery workers auto-scale based on queue depth
Inter-Service Data Sharing:
- Shared Models: complyai-core provides common models for API services
- Shared Database: PostgreSQL shared across complyai-api, api-async, core
- Pattern: Shared database anti-pattern (tight coupling) ⚠️
- Recommendation: Move to API-based communication with separate databases
API Design
API Style:
- Primary: REST with JSON payloads
- Potential: GraphQL not detected (could benefit frontend)
- Documentation: Missing (needs OpenAPI/Swagger) ⚠️
API Endpoints (complyai-api):
- /auth/* - Authentication (v1 and v2 versions)
- /precheck/* - Pre-compliance validation
- /checker/* - Compliance checking
- /maestro/* - Workflow orchestration
- /policies/* - Policy management
- /prompts/* - AI prompt management
Versioning Strategy:
- Current: URL-based versioning (v1, v2 for auth)
- Inconsistency: Not applied uniformly across endpoints
- Recommendation: Standardize v1 prefix for all endpoints
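One low-risk way to apply that standardization is a thin WSGI shim in front of each Flask app that rewrites legacy unversioned paths, so old clients keep working while all routes are registered under /v1. A sketch (the legacy prefix list mirrors the endpoint groups above; treat it as illustrative):

```python
class VersionPrefixMiddleware:
    """Rewrite legacy unversioned paths to their /v1 equivalents."""

    LEGACY_PREFIXES = ("/precheck", "/checker", "/maestro", "/policies", "/prompts")

    def __init__(self, app):
        self.app = app  # the wrapped WSGI application (e.g. a Flask app)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.LEGACY_PREFIXES):
            environ["PATH_INFO"] = "/v1" + path  # transparent rewrite
        return self.app(environ, start_response)
```

Wrapping is one line per service (`app.wsgi_app = VersionPrefixMiddleware(app.wsgi_app)` in Flask), and the shim can be deleted once clients migrate.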
Authentication Method:
- Frontend: Auth0 (OAuth2/OIDC) with JWT tokens
- API-to-API: JWT tokens, API keys
- ML Services: No auth detected ⚠️ CRITICAL
- External: OAuth2 for Facebook, Stripe API keys
Rate Limiting:
- Implementation: ratelimiter.py files detected in ML services
- Status: Configured but not consistently applied
- Missing: API gateway rate limiting
- Recommendation: Implement at gateway level with Redis backend
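The gateway-level limiter recommended here is typically a sliding window keyed per client. A self-contained sketch with an injectable clock; in production the counters would live in Redis (e.g. INCR plus EXPIRE) so all gateway instances share state:

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for deterministic tests
        self.hits = defaultdict(deque)  # client_id -> recent request timestamps

    def allow(self, client_id):
        now = self.clock()
        q = self.hits[client_id]
        while q and now - q[0] >= self.window:
            q.popleft()  # drop requests that slid out of the window
        if len(q) >= self.limit:
            return False  # over the limit: caller should return HTTP 429
        q.append(now)
        return True
```

The in-process deque is the only Redis-specific part to swap out, which keeps the policy logic identical across the existing per-service `ratelimiter.py` files and a future gateway.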
Error Handling:
- Standard: HTTP status codes
- Response Format: JSON with error messages
- Logging: Comprehensive logging across services
- Issue: Inconsistent error schemas
Pagination:
- Detection: Not explicitly documented
- Recommendation: Implement cursor-based pagination for large datasets
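Cursor-based pagination can be sketched as an opaque cursor encoding the last-seen id; the SQL equivalent is `WHERE id > :after ORDER BY id LIMIT :page_size`, which stays cheap even deep into a result set (unlike OFFSET). The row shape below is illustrative:

```python
import base64
import json


def encode_cursor(last_id):
    """Opaque cursor so clients cannot depend on its internals."""
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["after"]


def paginate(rows, cursor=None, page_size=2):
    """rows must be sorted by 'id'; returns (page, next_cursor)."""
    after = decode_cursor(cursor) if cursor else None
    window = [r for r in rows if after is None or r["id"] > after]
    page = window[:page_size]
    # Only hand out a cursor when more rows remain past this page.
    next_cursor = encode_cursor(page[-1]["id"]) if len(window) > page_size else None
    return page, next_cursor
```

Because the cursor pins a position rather than an offset, inserts and deletes between requests cannot cause skipped or duplicated rows.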
Security Architecture
Security Layers
1. Network Security
- Firewalls: VPC security groups (AWS/GCP)
- VPC Configuration: Private subnets for databases, public for frontends
- Network Segmentation: Services isolated by VPC/subnet
- Issues:
- Insecure HTTP in 21 occurrences (7 services) ⚠️
- ML services potentially exposed without auth ⚠️
2. Application Security
- Authentication:
- Frontend: Auth0 (MFA support, OAuth2/OIDC)
- Backend: Flask-Security-Too, PyJWT, Authlib
- Detection: 14 authentication patterns across 4 services
- Authorization:
- Role-based access control (RBAC)
- Flask-Security decorators
- Missing in ML services ⚠️
- Input Validation:
- Dedicated validation module in complyai-core
- Flask-WTF forms (inferred)
- Inconsistent application across services
- Output Encoding:
- React auto-escaping (XSS protection)
- Flask Jinja2 auto-escaping
- CSRF Protection:
- Flask-WTF CSRF tokens (inferred)
- React SPA pattern (token-based)
3. Data Security
- Encryption at Rest:
- Dedicated encryption module (complyai-core)
- 32 encryption patterns detected across 5 services
- AES-256 for sensitive fields
- Encryption in Transit:
- TLS 1.2+ for external communication
- Issues with internal HTTP (21 occurrences) ⚠️
- Tokenization:
- Stripe payment tokenization (PCI compliance)
- Access Controls:
- Database role-based access
- Flask-Security permissions
- Audit Logging:
- Comprehensive across services (8 services, multiple files)
- Tracks user actions, system events
- Retention: 7 years (compliance)
Security Controls
| Control | Implementation | Status | Services | Priority |
|---|---|---|---|---|
| Multi-Factor Auth | Auth0 MFA | ✅ Implemented | Frontend | ✅ |
| Encryption at Rest | AES-256 (core module) | ✅ Implemented | Core, API | ✅ |
| Encryption in Transit | TLS 1.2+ | ⚠️ Partial (21 HTTP issues) | All | 🔴 High |
| API Authentication | JWT, OAuth2 | ⚠️ Partial (ML missing) | API, Frontend | 🔴 High |
| Rate Limiting | ratelimiter.py | ⚠️ Inconsistent | ML services | 🟡 Medium |
| Input Validation | Validation module | ✅ Implemented | Core, API | ✅ |
| Audit Logging | Python logging | ✅ Comprehensive | 8 services | ✅ |
| Secrets Management | GCP Secret Manager | ⚠️ Partial (3 hardcoded) | Maestro, Gong | 🔴 Critical |
| Dependency Scanning | Dependabot potential | ❌ Not detected | All | 🟡 Medium |
| SAST | Not detected | ❌ Missing | All | 🟡 Medium |
| Penetration Testing | Not detected | ❌ Missing | All | 🟡 Medium |
| WAF | Not detected | ❌ Missing | Frontend | 🟡 Medium |
| DDoS Protection | CloudFlare/AWS Shield potential | ❓ Unknown | Frontend | 🟡 Medium |
Critical Security Issues:
- Hardcoded Secrets (3 services): maestro/config.py, cms/dev.py, cms/staging.py, adhoc/slack_data ⚠️ CRITICAL
- Insecure Communication (21 occurrences, 7 services): HTTP instead of HTTPS ⚠️ HIGH
- Missing ML Auth (3 services): triangle, violin, ipu have no authentication ⚠️ HIGH
- Inconsistent Rate Limiting: Not uniformly applied ⚠️ MEDIUM
- No API Gateway: Direct service exposure increases attack surface ⚠️ MEDIUM
Compliance Requirements
Current Compliance Status:
- GDPR (Partial): Data privacy patterns detected, needs full audit
- HIPAA: Not applicable (no healthcare data)
- SOC 2 Type 1: Not achieved (required for Series A)
- SOC 2 Type 2: Future requirement
- ISO 27001: Not pursued
- PCI DSS (Delegated): Stripe handles, no card data storage ✅
GDPR Compliance Measures:
- Data privacy patterns in complyai-core
- Audit logging (right to access)
- Encryption at rest (data protection)
- OAuth2 consent flows
- Gaps:
- No documented data retention policies
- Missing right to erasure implementation
- No data processing agreements documented
Audit Requirements:
- Comprehensive audit logging (8 services)
- 7-year retention (financial compliance)
- User action tracking
- System event logging
- Gaps:
- No centralized audit log aggregation
- Missing audit log integrity verification
- No automated compliance reporting
Data Residency:
- Cross-cloud deployment (AWS + GCP)
- No explicit data residency controls
- Potential GDPR residency issues ⚠️
Scalability & Performance
Scalability Strategy
Horizontal Scaling:
- API Services: Gunicorn workers, multiple instances behind load balancer
- Celery Workers: Auto-scaling based on queue depth
- ML Services: Multiple inference instances (currently limited)
- Frontend: Static CDN, infinite scale potential
- Database: Read replicas needed (not currently implemented) ⚠️
Vertical Scaling:
- Current: Single large instances for databases
- Limitations: PostgreSQL vertical limit, eventual need for sharding
- ML Services: GPU instances for inference (costly)
Database Scaling:
- Current: Single PostgreSQL instance (shared anti-pattern)
- Read Replicas: Not implemented ⚠️
- Sharding: Not implemented
- Connection Pooling: SQLAlchemy pooling
- Recommendation:
- Implement read replicas immediately
- Consider database per service for microservices isolation
- Future: PostgreSQL partitioning or migration to distributed DB
Caching Strategy:
- Current: TanStack Query (frontend client-side cache)
- Missing:
- Redis for API caching ⚠️
- Database query caching
- ML inference result caching
- Recommendation: Implement Redis cluster for:
- Session storage
- API response caching
- Celery broker (if not using SQS)
- ML prediction caching
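The caching layer recommended above can be introduced incrementally as a decorator around expensive calls (API responses, ML predictions). A process-local sketch with an injectable clock; in production the store would be Redis (SETEX key ttl value) so all instances share the cache:

```python
import functools
import time


def ttl_cache(ttl_seconds, clock=time.monotonic):
    """Cache a function's results per argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # args -> (cached_at, value); Redis in production

        @functools.wraps(func)
        def wrapper(*args):
            now = clock()
            hit = store.get(args)
            if hit and now - hit[0] < ttl_seconds:
                return hit[1]  # fresh entry: skip the expensive call
            value = func(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```

A hypothetical use: wrapping an ML scoring call with `@ttl_cache(ttl_seconds=300)` would make repeat checks of the same creative free within a five-minute window.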
CDN & Static Assets:
- Frontend: S3 + CloudFront (inferred)
- CMS Media: S3 with CDN
- Optimization: Image optimization, lazy loading (frontend)
Auto-Scaling Configuration:
- Triggers: CPU > 70%, Memory > 80%, Queue depth > 1000
- Min Instances: 2 (high availability)
- Max Instances: Dynamic based on service
- Cool-down: 5 minutes
Performance Targets
| Metric | Target | Current | Gap | Service |
|---|---|---|---|---|
| API Response Time (P95) | < 200ms | Unknown ⚠️ | Needs measurement | complyai-api |
| API Response Time (P99) | < 500ms | Unknown ⚠️ | Needs measurement | complyai-api |
| ML Inference Latency | < 1s | Unknown ⚠️ | Needs measurement | triangle/violin/ipu |
| Async Task Processing | < 5min | Unknown ⚠️ | Needs measurement | api-async |
| Throughput | 10K req/day | Unknown ⚠️ | Needs measurement | Platform |
| Concurrent Users | 1,000 | Unknown ⚠️ | Needs measurement | Frontend |
| Database Query Time | < 50ms | Unknown ⚠️ | Needs measurement | PostgreSQL |
| Frontend Load Time | < 2s | Unknown ⚠️ | Web Vitals present | Frontend |
| Uptime | 99.9% | Unknown ⚠️ | Needs SLA tracking | All services |
| Error Rate | < 0.1% | Unknown ⚠️ | Needs monitoring | All services |
Performance Monitoring Gaps:
- No APM (Application Performance Monitoring) ⚠️
- No distributed tracing ⚠️
- Limited metrics collection
- No performance budgets
- Missing SLA tracking
Bottlenecks
Current Bottlenecks:
1. Database (HIGH IMPACT)
- Issue: Single shared PostgreSQL instance across API, async, core
- Symptoms: Potential connection exhaustion, query contention
- Mitigation: Connection pooling (SQLAlchemy)
- Solution: Read replicas, database per service, query optimization
2. ML Inference (HIGH IMPACT)
- Issue: Cold start latency, GPU resource constraints
- Symptoms: Variable response times, potential timeouts
- Mitigation: Model loading optimization
- Solution: Model warming, inference result caching, model optimization (ONNX)
3. External API Rate Limits (MEDIUM IMPACT)
- Issue: Facebook API rate limits, Stripe API limits
- Symptoms: Async task delays, batch processing slowdowns
- Mitigation: Backoff and retry logic
- Solution: Request batching, caching, circuit breakers
4. Celery Task Queue (MEDIUM IMPACT)
- Issue: Queue depth during peak loads
- Symptoms: Task processing delays
- Mitigation: Worker auto-scaling
- Solution: Priority queues, task optimization, parallel processing
5. Shared Database Writes (MEDIUM IMPACT)
- Issue: Write contention on shared PostgreSQL
- Symptoms: Lock timeouts, slow writes
- Mitigation: Transaction optimization
- Solution: Event sourcing, CQRS pattern, database sharding
6. No Caching Layer (MEDIUM IMPACT)
- Issue: Repeated database queries, API calls
- Symptoms: Higher latency, database load
- Mitigation: Application-level memoization
- Solution: Redis cache cluster, query result caching
7. Monolithic Frontend Bundle (LOW IMPACT)
- Issue: Large JavaScript bundle size
- Symptoms: Slow initial page load
- Mitigation: Code splitting (Vite default)
- Solution: Lazy loading, route-based splitting, tree shaking
Performance Optimization Roadmap:
- Immediate (Week 1-2): Implement APM, establish baselines, identify hot paths
- Short-term (Month 1-2): Redis caching, database read replicas, ML result caching
- Medium-term (Month 3-4): Database per service, query optimization, CDN optimization
- Long-term (Month 5-6): Distributed tracing, database sharding, ML model optimization
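Several mitigations above reference backoff and retry logic for rate-limited external APIs. A minimal sketch of exponential backoff with jitter (helper name and defaults are illustrative, not taken from the codebase):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 retryable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Randomized jitter prevents synchronized retries (thundering herd).
            sleep(delay * random.uniform(0.5, 1.0))
```

Injecting `sleep` keeps the helper testable; production callers would simply use the default `time.sleep`.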
Reliability & Availability
High Availability Design
Redundancy:
- API Services: Multi-AZ deployment (assumed), 2+ instances minimum
- Databases: PostgreSQL failover (needs verification) ⚠️
- ML Services: Multiple inference instances
- Frontend: CDN (globally distributed)
- Gaps:
- No documented multi-region deployment
- Single database instance (SPOF) ⚠️
Failover Mechanisms:
- Load Balancer: Health check-based failover (ALB/NLB)
- Database: PostgreSQL streaming replication (needs verification) ⚠️
- Celery: Worker redundancy, task retry logic
- Frontend: CDN failover (automatic)
- Issues: No documented failover procedures
Load Distribution:
- Method: Round-robin load balancing (assumed)
- Sticky Sessions: Not required (stateless APIs)
- WebSocket: Not detected (if needed, requires session affinity)
Health Checks:
- Implementation:
- /health endpoints (inferred)
- heart_beat Celery task (api-async)
- Gunicorn worker health
- Frequency: Every 30 seconds (typical)
- Actions: Auto-recovery, instance replacement
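Assuming the inferred /health endpoints aggregate per-component probes, the response payload might be assembled like this. A hedged sketch: the `health_status` helper and the component names are hypothetical, not confirmed from the services.

```python
import time

def health_status(checks: dict) -> dict:
    """Aggregate component probes into a /health-style payload.

    `checks` maps a component name to a zero-arg callable that raises on failure
    (e.g. a database ping or a Celery heartbeat lookup).
    """
    components = {}
    for name, probe in checks.items():
        try:
            probe()
            components[name] = "ok"
        except Exception as exc:
            components[name] = f"error: {exc}"
    healthy = all(v == "ok" for v in components.values())
    return {
        "status": "healthy" if healthy else "degraded",
        "components": components,
        "checked_at": int(time.time()),
    }
```

A load balancer health check would treat any non-"healthy" status as a signal to stop routing traffic to the instance.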
Circuit Breakers:
- Status: Not implemented ⚠️
- Need: External API calls (Facebook, Stripe), ML services
- Recommendation: Implement with exponential backoff
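The recommended circuit breaker could start from a sketch like the following. This is a simplified state machine under stated assumptions: the class name, thresholds, and half-open behavior are illustrative, not the platform's implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    half-opens after a cooldown, and closes again on the next success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open protects Celery workers from piling up on a dead Facebook or Stripe endpoint; combined with backoff-and-retry it bounds both latency and load during an outage.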
Disaster Recovery
Backup Strategy:
- PostgreSQL:
- Frequency: Daily full, hourly incremental (assumed)
- Retention: 30 days
- Method: pg_dump or AWS RDS automated backups
- Testing: Not documented ⚠️
- S3 Data:
- Versioning enabled (assumed)
- Cross-region replication (not verified) ⚠️
- Application State:
- Stateless design (good for DR)
- Gaps:
- No documented backup testing procedures ⚠️
- No backup encryption verification
Recovery Time Objective (RTO):
- Target: < 1 hour
- Current: Unknown (needs DR testing) ⚠️
- Components:
- Database restore: 15-30 minutes
- Service redeployment: 10-15 minutes
- DNS propagation: 5-10 minutes
Recovery Point Objective (RPO):
- Target: < 1 hour (hourly backups)
- Current: Unknown ⚠️
- Critical Data: Financial transactions, compliance results
- Acceptable Loss: Minimal for compliance data
DR Testing:
- Frequency: Not documented ⚠️ (should be quarterly)
- Method: Not documented ⚠️
- Last Test: Unknown ⚠️
- Recommendation: Implement quarterly DR drills with runbooks
DR Runbooks:
- Status: Not detected ⚠️ CRITICAL
- Required:
- Database failover procedure
- Service restoration steps
- Data recovery process
- Communication plan
- Rollback procedures
Monitoring & Observability
Logging:
- Tool: Python logging module (comprehensive across services)
- Aggregation: Not centralized ⚠️ (needs ELK, Datadog, or CloudWatch)
- Coverage:
- complyai-api: 8 files with logging
- api-async: 35 files (most comprehensive)
- complyai-core: 9 files
- Other services: 3+ files each
- Retention: File-based (90 days assumed)
- Structure: Unstructured logs (needs JSON logging)
- Gaps:
- No log aggregation platform ⚠️
- No log-based alerting
- No correlation IDs for distributed tracing
Metrics:
- Frontend:
- Web Vitals (performance monitoring)
- FullStory (user session recording)
- Backend: Not implemented ⚠️
- Database: PostgreSQL stats (not centralized)
- Infrastructure: Cloud provider metrics (AWS CloudWatch, GCP Monitoring)
- Gaps:
- No application metrics (request counts, latencies) ⚠️
- No business metrics dashboard
- No SLA tracking
Tracing:
- Status: Not implemented ⚠️ CRITICAL
- Need: Distributed tracing for microservices
- Recommendation: Implement OpenTelemetry, Jaeger, or Datadog APM
- Benefits: Request flow visualization, bottleneck identification
Alerting:
- Tool: Slack webhooks (8 services)
- Scope: Error notifications, task completions
- Channels: Development team Slack
- Issues:
- No PagerDuty/OpsGenie for on-call ⚠️
- No severity classification
- No escalation procedures
- Alert fatigue potential (Slack-only)
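Severity classification could start as a simple routing table in front of the existing Slack webhooks. The mapping below is a hypothetical example, not the current configuration; channel names and the PagerDuty target are placeholders.

```python
SEVERITY_ROUTES = {
    # Hypothetical routing; real channels/tools would be configured per team.
    "critical": ["pagerduty", "slack:#incidents"],
    "error":    ["slack:#alerts"],
    "warning":  ["slack:#alerts"],
    "info":     [],  # logged only, no notification -> reduces alert fatigue
}

def route_alert(severity: str, message: str) -> list:
    """Fan an alert out to the channels mapped to its severity."""
    channels = SEVERITY_ROUTES.get(severity)
    if channels is None:
        raise ValueError(f"unknown severity: {severity}")
    return [{"channel": c, "severity": severity, "message": message}
            for c in channels]
```

Routing only `critical` to an on-call tool while demoting routine notifications to a shared channel addresses both the missing escalation path and the Slack-only fatigue risk.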
Observability Maturity:
- Logging: 6/10 (comprehensive but not aggregated)
- Metrics: 3/10 (minimal, frontend-only)
- Tracing: 0/10 (not implemented)
- Alerting: 4/10 (basic Slack notifications)
- Overall: 3.25/10 (needs significant improvement)
Observability Roadmap:
- Phase 1: Centralized logging (ELK or Datadog), structured JSON logs
- Phase 2: Application metrics (Prometheus + Grafana or Datadog)
- Phase 3: Distributed tracing (OpenTelemetry)
- Phase 4: Advanced alerting (PagerDuty integration, SLA tracking)
Deployment Architecture
Environments
| Environment | Purpose | Configuration | Services | Data |
|---|---|---|---|---|
| Development | Developer work | Single instance, possibly SQLite | All services, docker-compose | Synthetic/test data |
| Staging | Pre-production testing | Production-like, smaller scale | All services | Anonymized production data |
| Production | Live customer system | Highly available, multi-AZ | All services | Live customer data |
| QA/Test | Automated testing | CI/CD ephemeral | Critical services | Test fixtures |
Environment Configuration:
- Method: Environment variables, .env files
- Issues:
- Hardcoded values in staging (cms/staging.py) ⚠️
- Inconsistent config management
- Secrets: Google Cloud Secret Manager (partial), hardcoded (3 services) ⚠️
Infrastructure Differences:
- Development: Local or single cloud VM, minimal resources
- Staging: AWS/GCP, production topology, 50% capacity
- Production: Multi-AZ, auto-scaling, high redundancy
CI/CD Pipeline
Flow: Developer commits → git push to branch → GitHub Actions triggered → build & test.

Build & Test Stage:
- Linting (ESLint, Python linters)
- Unit tests (pytest, Jest)
- Integration tests
- Build Docker images
- Quality gates: coverage > X%, no critical issues

Deployment Stage:
- Staging deployment: deploy to staging environment, run E2E tests (Cypress), smoke tests, manual approval gate
- Production deployment: rolling deployment (or blue-green), health checks, automated rollback on failure, Slack notification
Current CI/CD Status:
- Platform: GitHub Actions (.github repository)
- Trigger: Git push, pull request
- Testing: pytest (backend), Jest (frontend), Cypress (E2E - configured)
- Quality Gates:
- Linting: ESLint (frontend), Python linters (backend)
- Code formatting: Prettier (frontend), Husky hooks
- Commit validation: Commitlint (conventional commits)
- Test coverage: gates undermined by low coverage (8.5% avg) ⚠️
- Deployment: Docker images, container deployment
- Issues:
- No documented deployment workflows ⚠️
- Test coverage too low for reliable gates
- No security scanning in pipeline ⚠️
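A coverage quality gate can be as small as a script CI runs after the test suite. This is an illustrative sketch, assuming the measured percentage is passed in on the command line; the 40% threshold mirrors the Month 1 coverage target.

```python
import sys

def coverage_gate(coverage_pct: float, threshold: float = 40.0) -> bool:
    """True when measured coverage meets the gate threshold."""
    return coverage_pct >= threshold

def main(argv: list) -> int:
    # Hypothetical CI invocation: python coverage_gate.py <measured-percent>
    pct = float(argv[1])
    threshold = 40.0
    if not coverage_gate(pct, threshold):
        print(f"FAIL: coverage {pct:.1f}% is below the {threshold:.0f}% gate")
        return 1  # nonzero exit fails the CI job
    print(f"PASS: coverage {pct:.1f}%")
    return 0
```

The nonzero return code is what makes GitHub Actions mark the job failed, so the gate is enforced rather than advisory.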
Deployment Strategy
Candidate strategies:
- Blue-Green
- Canary (10% → 50% → 100%)
- Rolling (likely current approach)
- Recreation
- A/B Testing
Rolling Deployment Process:
- Build and push Docker image
- Update 1 instance with new version
- Health check validation
- Continue to next instance
- Full fleet updated incrementally
- Automated rollback on failure
Deployment Frequency:
- Current: Unknown, likely weekly/bi-weekly
- Target for Scale: Daily (continuous deployment)
Deployment Automation:
- Containerization: Docker (Dockerfiles present)
- Orchestration: Not explicitly documented (ECS, GKE, or Kubernetes)
- Configuration: Environment variables, secrets management
- Zero-Downtime: Rolling updates, health checks
Infrastructure as Code
Status: Not explicitly detected ⚠️
Recommendations:
- Tool: Terraform (preferred) or AWS CloudFormation/GCP Deployment Manager
- Repository: Separate infra repo or monorepo /infrastructure directory
- Coverage:
- VPC and networking
- Compute instances (ECS/GKE)
- Databases (RDS/Cloud SQL)
- Load balancers
- S3 buckets, IAM roles
- Versioning: Git-based, tagged releases
- State Management: Terraform Cloud or S3 backend with locking
Current Gaps:
- No IaC detected (manual provisioning risk) ⚠️
- No infrastructure versioning
- Difficult disaster recovery
- Inconsistent environments
Cost Architecture
Cost Breakdown (Estimated)
| Component | Monthly Cost | % of Total | Optimization Opportunity |
|---|---|---|---|
| Compute (EC2/GCE) | $3,000 - $5,000 | ~36% | Right-sizing instances, spot instances for dev/staging |
| Database (RDS/Cloud SQL) | $1,500 - $2,500 | ~18% | Reserved instances, read replica optimization |
| ML Inference (GPU) | $2,000 - $4,000 | ~27% | Model optimization, serverless inference, caching |
| Storage (S3/GCS) | $500 - $1,000 | ~7% | Lifecycle policies, compression, tiering |
| Data Transfer | $300 - $600 | ~4% | CloudFront/CDN, cross-region optimization |
| External APIs | $200 - $500 | ~3% | Rate limiting, caching, batch requests |
| Monitoring/Logging | $100 - $300 | ~2% | Log retention policies, sampling |
| Other (SQS, Lambda, etc.) | $200 - $400 | ~3% | - |
| Total Estimated | $7,800 - $14,300 | 100% | Potential 30-40% savings |
Cost Drivers:
- Compute Sprawl: Largest line item (~36%); 13 separate services invite over-provisioning
- ML GPU Instances: Second-largest (~27%), driven by always-on inference servers
- Cross-Cloud: Data transfer between AWS and GCP
- No Caching: Redundant computations and API calls
Cost Optimization Strategies
Immediate (Month 1):
1. ML Inference Optimization ($800-1,500/month savings)
- Implement prediction caching (Redis)
- Use CPU instances for lower-volume models
- Batch inference requests
- Implement model quantization (ONNX)
2. Right-Sizing Instances ($500-1,000/month savings)
- Audit instance utilization
- Downsize over-provisioned instances
- Use burstable instances (T3, E2) for variable workloads
3. Reserved Instances ($400-800/month savings)
- 1-year RDS reserved instances (40% discount)
- 1-year compute reserved instances for production
- Savings Plans for flexible compute
Short-term (Month 2-3):
4. Service Consolidation ($600-1,200/month savings)
- Merge ML services (3 → 1): eliminate 2 instance costs
- Merge API services (2 → 1): consolidate infrastructure
- Reduce operational overhead
5. Storage Optimization ($200-400/month savings)
- S3 lifecycle policies (Standard → IA → Glacier)
- Compress large files (images, videos)
- Delete unused/old data
6. Data Transfer Optimization ($150-300/month savings)
- Minimize cross-cloud transfer (AWS ↔ GCP)
- Use CloudFront/CDN for static assets
- Compress API responses
Medium-term (Month 4-6):
7. Serverless Migration ($1,000-2,000/month savings)
- Migrate low-volume endpoints to Lambda/Cloud Functions
- Use SageMaker/Vertex AI serverless inference
- Pay only for actual usage
8. Auto-Scaling Optimization ($400-800/month savings)
- Aggressive scale-down during off-hours
- Use spot instances for non-critical workloads
- Implement predictive scaling
9. Database Optimization ($300-600/month savings)
- Query optimization (reduce IOPS)
- Connection pooling tuning
- Consider Aurora Serverless for variable loads
Total Potential Savings: $4,350-8,600/month (35-60%)
Cost Monitoring:
- Current: Cloud provider dashboards (AWS Cost Explorer, GCP Billing)
- Recommendation:
- Implement cost allocation tags by service
- Set up budget alerts
- Use CloudHealth or Cloudability for multi-cloud visibility
- Create cost dashboards for engineering team visibility
Design Decisions (ADRs)
ADR-001: Microservices Architecture with Musical Naming
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed to scale different components independently (ML models, API, async processing) while allowing parallel team development. The platform required flexibility to deploy and update services without impacting the entire system.
Decision: Implement microservices architecture with separate repositories for each service. Adopt "musical" naming convention (triangle, violin, maestro, gong, ipu) suggesting orchestrated workflows and harmony between services.
Consequences:
- Positive:
- Independent scaling of services (ML vs API)
- Parallel team development
- Technology flexibility (Flask + Django + React)
- Fault isolation
- Clear orchestration metaphor
- Negative:
- Service fragmentation (13 repos)
- Operational complexity
- Increased deployment overhead
- Naming ambiguity (purpose unclear from name)
- Higher infrastructure costs
- Shared database coupling
Alternatives Considered:
- Monolithic architecture (rejected: scaling limitations)
- Modular monolith (rejected: deployment coupling)
- Functional naming (rejected in favor of thematic consistency)
Current Recommendation: Consolidate to 5-6 core services while maintaining microservices benefits
ADR-002: Dual API Pattern (Sync + Async)
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed to handle both real-time user requests (<200ms) and long-running operations (Facebook data fetch, ML inference, batch processing). Single API couldn't efficiently serve both patterns.
Decision: Implement two separate API services:
- complyai-api: Synchronous REST API for real-time operations
- api-async: Asynchronous API with Celery for background tasks
Consequences:
- Positive:
- Optimized performance for each pattern
- Async processing doesn't block sync requests
- Independent scaling
- Clear separation of concerns
- Negative:
- Code duplication (shared models via complyai-core)
- Shared database (tight coupling)
- Deployment complexity (2 services vs 1)
- Inconsistent patterns across APIs
Alternatives Considered:
- Single API with async task queue (rejected: mixing concerns)
- API Gateway with backend services (not evaluated initially)
- GraphQL with subscriptions (rejected: team expertise)
Current Recommendation: Consolidate into unified API Gateway with async capabilities built-in
ADR-003: Three Separate ML Inference Services
Status: Accepted (Historical), Under Review
Date: Unknown (pre-2024)
Context: ML compliance checking required specialized models for different domains (text, image, policy analysis). Models had different resource requirements and update cycles.
Decision: Create three separate ML inference services (triangle, violin, ipu) each with identical tech stack (TensorFlow, PyTorch, Flask, chromadb) but different models.
Consequences:
- Positive:
- Model-specific optimization
- Independent deployment/updates
- Fault isolation (one model failure doesn't affect others)
- Specialized resource allocation (GPU vs CPU)
- Negative:
- 3x infrastructure costs
- Code duplication (551 + 357 + 418 = 1,326 LOC, mostly identical)
- Operational overhead (3 services to monitor/deploy)
- No centralized model registry
- Inconsistent security (no auth)
Alternatives Considered:
- Single ML service with model routing (rejected: complexity concerns)
- Serverless inference (not evaluated: team expertise)
- Managed ML platform (partially adopted with Vertex AI in gong)
Current Recommendation: CONSOLIDATE into unified ML platform with model registry
ADR-004: Cross-Cloud Deployment (AWS + GCP)
Status: Accepted (Historical), Under Review
Date: Unknown (pre-2024)
Context: ComplyAI started on one cloud provider but needed specific services from another (e.g., BigQuery for analytics, Vertex AI for managed ML).
Decision: Deploy across both AWS and GCP:
- AWS: API services, S3 storage, SQS queues
- GCP: Core engine, ML services, BigQuery, Vertex AI
Consequences:
- Positive:
- Best-of-breed services from each cloud
- Vendor lock-in reduction
- Access to GCP's AI/ML platforms
- BigQuery for powerful analytics
- Negative:
- Increased complexity (dual cloud management)
- Higher data transfer costs (inter-cloud)
- Dual IAM/security management
- Team needs expertise in both clouds
- Disaster recovery complexity
Alternatives Considered:
- AWS-only with Redshift + SageMaker (rejected: GCP AI preference)
- GCP-only migration (rejected: AWS investment)
- Multi-cloud with abstraction layer (rejected: overhead)
Current Recommendation: Consolidate to single cloud or implement clear service boundaries
ADR-005: Shared Database Pattern
Status: Accepted (Historical), Deprecated
Date: Unknown (pre-2024)
Context: complyai-api, api-async, and complyai-core needed access to the same data models. Team prioritized rapid development over strict service isolation.
Decision: Share a single PostgreSQL database across complyai-api, api-async, and complyai-core services. Use complyai-core for shared model definitions.
Consequences:
- Positive:
- No data synchronization needed
- Shared transactions (ACID compliance)
- Faster initial development
- Simplified data access
- Negative:
- Tight coupling (microservices anti-pattern) ⚠️
- Schema changes affect all services
- Single point of failure
- Scaling bottleneck
- Can't independently deploy database migrations
Alternatives Considered:
- Database per service with API communication (rejected: development speed)
- Event sourcing with CQRS (rejected: complexity)
- GraphQL federation (not evaluated)
Current Status: DEPRECATED - Violates microservices principles
Recommendation: Migrate to database per service with API-based communication
ADR-006: Django/Wagtail CMS Alongside Flask Services
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed robust content management for regulatory content, documentation, and knowledge base. Wagtail CMS provided rich editing experience but required Django.
Decision: Introduce Django/Wagtail as separate service (complyai_cms) despite rest of backend using Flask.
Consequences:
- Positive:
- Powerful CMS with editorial workflow
- Rich content editing experience
- Version control and publishing workflow
- Strong community and plugins
- Negative:
- Technology stack fragmentation (Flask + Django)
- Different ORM patterns (SQLAlchemy vs Django ORM)
- Separate deployment pipeline
- Team needs Django expertise
- Different security patterns
Alternatives Considered:
- Flask-Admin (rejected: insufficient CMS features)
- Headless CMS (Contentful, Strapi) (not evaluated initially)
- Custom Flask CMS (rejected: development time)
Current Recommendation: Evaluate headless CMS migration to unify tech stack
ADR-007: Frontend Technology Split (App vs Marketing)
Status: Accepted (Historical)
Date: Unknown (pre-2024)
Context: ComplyAI needed both a feature-rich application (complyai-frontend) and a marketing website (www) with different requirements and update cycles.
Decision: Build two separate React applications:
- complyai-frontend: TypeScript, Vite, complex state management (Zustand, TanStack Query)
- www: JavaScript, Create React App, simpler marketing site
Consequences:
- Positive:
- Optimized for different use cases
- Independent deployment/updates
- Smaller bundle size for marketing site
- Clear separation of concerns
- Negative:
- Code/component duplication
- Inconsistent user experience potential
- Duplicate dependencies/tooling
- 2 deployment pipelines
Alternatives Considered:
- Monorepo with shared components (not evaluated initially)
- Unified app with marketing routes (rejected: bundle size)
- Next.js for both (not evaluated: team expertise)
Current Recommendation: Maintain separation but share component library via monorepo
ADR-008: Auth0 for Authentication
Status: Accepted
Date: Unknown (2024 or earlier)
Context: ComplyAI needed enterprise-grade authentication with MFA, OAuth2, social logins, and compliance features without building custom auth system.
Decision: Adopt Auth0 as managed authentication provider for complyai-frontend.
Consequences:
- Positive:
- Enterprise-grade security out-of-box
- MFA, OAuth2, OIDC support
- Social login integrations
- SOC 2 compliant
- Reduces development time
- Regular security updates
- Negative:
- Monthly cost per active user
- Vendor lock-in
- Less control over auth flow
- Requires internet connectivity
Alternatives Considered:
- Custom Flask-Security (rejected: security risk, maintenance)
- AWS Cognito (rejected: Auth0 features preferred)
- Firebase Auth (rejected: GCP preference for managed services)
Current Recommendation: Retain Auth0 - appropriate choice for Series A stage
Technical Debt
Known Issues
1. Test Coverage Crisis ⚠️ CRITICAL
- Issue: Average test coverage 8.5%, with 7 services at 0%
- Impact: HIGH - Production bugs, risky deployments, slow development
- Affected Services:
- api-async: 0% (20,143 LOC) - HIGHEST RISK
- complyai-violin: 0% (357 LOC)
- complyai-gong: 0% (1,031 LOC)
- complyai-api: 1.39% (15,630 LOC)
- complyai_cms: 1.32% (2,771 LOC)
- Mitigation: Manual testing, cautious deployments
- Resolution Plan:
- Month 1: Achieve 40% coverage for critical paths
- Month 2: Achieve 60% for business logic
- Month 3: Achieve 80% for core services
2. Hardcoded Secrets ⚠️ CRITICAL
- Issue: Production secrets hardcoded in 3 services
- Impact: CRITICAL - Security breach risk, audit failure
- Locations:
- complyai-maestro/config.py
- complyai_cms/dev.py, staging.py
- complyai-adhoc/slack_data/*.py
- Mitigation: Limited access to repositories
- Resolution Plan:
- Week 1: Migrate all secrets to AWS Secrets Manager/GCP Secret Manager
- Week 2: Implement automated secret scanning in CI/CD
- Week 3: Security audit and rotation
3. Missing ML Service Authentication ⚠️ HIGH
- Issue: triangle, violin, ipu have no authentication
- Impact: HIGH - Unauthorized access, resource abuse, data leakage
- Affected: 3 ML services
- Mitigation: Network-level isolation (VPC)
- Resolution Plan:
- Week 1-2: Implement JWT authentication for ML endpoints
- Week 2-3: Add rate limiting
- Week 4: API key management system
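The API key plan could be prototyped with stdlib HMAC signing before adopting a full JWT library. This sketch is a simplified stand-in (not RFC 7519 JWT) with hypothetical helper names; a production rollout would use a vetted library such as PyJWT and per-service secrets from a secrets manager.

```python
import hashlib
import hmac
import time

def sign_token(secret: bytes, service: str, expires_at: int) -> str:
    """Issue a signed token binding a caller identity to an expiry timestamp."""
    message = f"{service}:{expires_at}".encode()
    sig = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{service}:{expires_at}:{sig}"

def verify_token(secret: bytes, token: str, now=None) -> bool:
    """Reject tokens with a malformed shape, bad signature, or past expiry."""
    try:
        service, expires_str, sig = token.rsplit(":", 2)
        expires_at = int(expires_str)
    except ValueError:
        return False
    expected = hmac.new(secret, f"{service}:{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison prevents timing side-channel attacks.
    if not hmac.compare_digest(sig, expected):
        return False
    return (now if now is not None else time.time()) < expires_at
```

Each ML endpoint would verify the token before running inference, closing the unauthenticated-access gap even before rate limiting lands.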
4. Insecure HTTP Communication ⚠️ HIGH
- Issue: 21 occurrences across 7 services
- Impact: HIGH - Man-in-the-middle attacks, data interception
- Affected: 7 services (API, core, maestro, CMS, adhoc, triangle, violin, ipu)
- Mitigation: VPC isolation, load balancer HTTPS termination
- Resolution Plan:
- Week 1: Audit all HTTP calls
- Week 2: Enforce HTTPS for internal service communication
- Week 3: Update configurations, remove HTTP fallbacks
5. Shared Database Anti-Pattern ⚠️ HIGH
- Issue: PostgreSQL shared across complyai-api, api-async, core
- Impact: HIGH - Tight coupling, scaling bottleneck, deployment risk
- Affected: 3 core services
- Mitigation: Careful schema migration coordination
- Resolution Plan:
- Month 1-2: Implement API-based inter-service communication
- Month 3-4: Separate databases with data migration
- Month 5: Remove direct database coupling
6. No Monitoring/Observability ⚠️ HIGH
- Issue: No APM, distributed tracing, or centralized metrics
- Impact: HIGH - Blind to performance issues, slow incident resolution
- Affected: All services
- Mitigation: Reactive debugging, log file analysis
- Resolution Plan:
- Month 1: Implement APM (Datadog, New Relic)
- Month 2: Centralized logging (ELK, Datadog)
- Month 3: Distributed tracing (OpenTelemetry)
7. Missing API Documentation ⚠️ MEDIUM
- Issue: No OpenAPI/Swagger specs for 13 services
- Impact: MEDIUM - Slow integration, developer friction, errors
- Affected: All API services
- Mitigation: Code review, team knowledge
- Resolution Plan:
- Month 1: Generate OpenAPI specs for core APIs
- Month 2: Interactive API documentation (Swagger UI)
- Month 3: Auto-generate from code annotations
8. No Infrastructure as Code ⚠️ MEDIUM
- Issue: Manual infrastructure provisioning
- Impact: MEDIUM - Inconsistent environments, slow DR, config drift
- Affected: All infrastructure
- Mitigation: Documented manual steps
- Resolution Plan:
- Month 1: Terraform for new resources
- Month 2-3: Import existing infrastructure to Terraform
- Month 4: Full IaC with CI/CD integration
9. Service Fragmentation ⚠️ MEDIUM
- Issue: 13 separate services, some redundant
- Impact: MEDIUM - High operational overhead, cost, complexity
- Affected: All services
- Mitigation: Clear service boundaries, documentation
- Resolution Plan:
- Month 1: Consolidate ML services (3 → 1)
- Month 2: Merge API services (2 → 1)
- Month 3: Evaluate frontend consolidation
- Target: 13 → 5-6 services
10. Cross-Cloud Complexity ⚠️ MEDIUM
- Issue: AWS + GCP deployment increases complexity and cost
- Impact: MEDIUM - Higher costs, complex DR, dual expertise needed
- Affected: Infrastructure
- Mitigation: Clear cloud service boundaries
- Resolution Plan:
- Month 1-2: Analyze cloud usage patterns
- Month 3-4: Develop single-cloud migration plan
- Month 5-6: Execute migration (if beneficial)
Improvement Roadmap
Q4 2025 (Months 1-3) - Foundation & Critical Issues:
Month 1:
- Remove all hardcoded secrets → Secrets Manager ⚠️ CRITICAL
- Implement authentication for ML services ⚠️ HIGH
- Fix insecure HTTP communication (21 occurrences) ⚠️ HIGH
- Achieve 40% test coverage for critical paths ⚠️ CRITICAL
- Implement APM and centralized logging ⚠️ HIGH
- Generate OpenAPI specs for core APIs
- Database read replicas for performance
Month 2:
- Achieve 60% test coverage for business logic
- Consolidate ML services (triangle + violin + ipu → unified)
- Implement Redis caching layer
- Start Terraform implementation
- Security audit and penetration testing
- Implement circuit breakers for external APIs
Month 3:
- Achieve 80% test coverage for core services
- Merge API services (complyai-api + api-async → unified)
- Distributed tracing implementation
- Complete IaC migration to Terraform
- Implement database per service pattern
- DR testing and runbook creation
Q1 2026 (Months 4-6) - Optimization & Scale:
Month 4:
- Frontend consolidation evaluation (app + www)
- Database sharding for scale
- Implement GraphQL layer for frontend
- Advanced caching strategies
- Cost optimization (reserved instances, serverless)
Month 5:
- Service mesh implementation (Istio, Linkerd)
- Kubernetes migration planning
- Comprehensive E2E test suite (Cypress)
- ML model registry and versioning
- Advanced monitoring and SLA tracking
Month 6:
- Complete service consolidation (13 → 5-6 services)
- Multi-region deployment
- Automated performance testing
- SOC 2 Type 1 certification
- Series A technical readiness complete
Q2 2026 (Months 7-9) - Advanced Features:
- SOC 2 Type 2 audit preparation
- Advanced ML features (model explainability, A/B testing)
- International expansion support (i18n, data residency)
- Advanced analytics and BI integration
- Platform API for third-party integrations
Q3 2026 (Months 10-12) - Enterprise Ready:
- SOC 2 Type 2 certification
- Enterprise features (SSO, SAML, advanced RBAC)
- White-label capabilities
- Advanced compliance reporting
- Platform maturity for Series B
Risk Assessment
Architecture Risks
| Risk | Likelihood | Impact | Mitigation | Priority |
|---|---|---|---|---|
| Security breach from hardcoded secrets | High | Critical | Immediate migration to Secrets Manager, audit | P0 |
| Production outage from untested code (8.5% coverage) | High | Critical | Aggressive test development, canary deployments | P0 |
| ML service abuse (no auth) | Medium | High | Implement auth + rate limiting immediately | P0 |
| Database failure (single PostgreSQL) | Medium | Critical | Implement read replicas, backup testing | P1 |
| Service fragmentation causing deployment failures | Medium | High | Service consolidation roadmap | P1 |
| Cross-cloud data transfer costs spike | Medium | Medium | Monitor costs, implement caching | P2 |
| Scaling bottleneck from shared database | High | High | Database per service migration | P1 |
| Vendor lock-in with Auth0 | Low | Medium | Document migration path, API abstraction | P3 |
| Compliance failure (GDPR, SOC 2) | Medium | High | Security audit, compliance roadmap | P1 |
| Team knowledge concentration (bus factor) | Medium | High | Documentation, knowledge sharing, cross-training | P1 |
| Infrastructure drift (no IaC) | Medium | Medium | Terraform implementation | P2 |
| API breaking changes affecting clients | Low | Medium | Versioning strategy, deprecation policy | P3 |
| ML model performance degradation | Medium | Medium | Model monitoring, A/B testing, rollback capability | P2 |
| Cost overrun from inefficient architecture | Medium | Medium | Cost optimization roadmap, monitoring | P2 |
Dependencies & Single Points of Failure
Single Points of Failure (SPOFs):
1. PostgreSQL Database ⚠️ CRITICAL
- Impact: Complete platform outage
- Affected: complyai-api, api-async, core, CMS (4 services)
- Current Mitigation: Daily backups (assumed)
- Required:
- Immediate: Set up read replicas with auto-failover
- Short-term: Database per service
- Long-term: Distributed database (CockroachDB, PostgreSQL with Patroni)
2. Auth0 Service ⚠️ HIGH
- Impact: No user authentication, platform inaccessible
- Affected: complyai-frontend, all authenticated endpoints
- Current Mitigation: Auth0 SLA (99.99% uptime)
- Required:
- Session token caching for short-term access
- Fallback authentication mechanism
- Regular backup of user data
3. AWS S3 (Model Storage) ⚠️ HIGH
- Impact: ML inference failure, no model loading
- Affected: triangle, violin, ipu (3 ML services)
- Current Mitigation: S3 durability (99.999999999%)
- Required:
- Model caching on inference servers
- Cross-region S3 replication
- Fallback model versions
4. Facebook Business API ⚠️ MEDIUM
- Impact: No ad data sync, delayed compliance checks
- Affected: api-async, adhoc
- Current Mitigation: Retry logic, error handling
- Required:
- Circuit breaker pattern
- Data caching for known accounts
- Alternative data sources
5. Stripe Payment API ⚠️ MEDIUM
- Impact: Payment processing failure, subscription issues
- Affected: complyai-api, core, frontend
- Current Mitigation: Stripe SLA (99.99% uptime)
- Required:
- Webhook retry mechanism
- Payment status reconciliation
- Manual payment processing fallback
Critical External Dependencies:
| Dependency | Service Owner | SLA | Fallback Strategy |
|---|---|---|---|
| Auth0 | Auth0 (Okta) | 99.99% | Session caching, grace period |
| Stripe | Stripe Inc. | 99.99% | Webhook retries, manual processing |
| Facebook API | Meta | Best effort | Cached data, scheduled retries |
| Google Vertex AI | GCP | 99.95% | Fallback to local models |
| BigQuery | GCP | 99.99% | Query caching, delayed analytics |
| AWS S3 | AWS | 99.99% | Cross-region replication, cache |
| PostgreSQL | Self-hosted | Unknown ⚠️ | Backups, failover (needed) |
Mitigation Priorities:
- P0 (This Week): PostgreSQL read replicas and failover
- P1 (This Month): Model caching, circuit breakers for external APIs
- P2 (Next Quarter): Multi-region deployment, service redundancy
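The P1 "circuit breakers for external APIs" item can be prototyped in a few lines. The sketch below (class name, failure threshold, and cooldown are illustrative assumptions, not the platform's code) opens after a run of consecutive failures and allows a single probe call once the cooldown elapses:

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow one probe call after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping external call")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping the Facebook Business API client in a breaker like this, combined with the cached-data fallback listed above, turns a Meta outage into degraded service rather than queue backlog.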
Evolution & Roadmap
Current State Assessment (SWOT)
Strengths:
- ✅ Modern Tech Stack: React 18, TypeScript, Python 3, TensorFlow/PyTorch
- ✅ Microservices Foundation: Independent scaling and deployment
- ✅ AI/ML Capabilities: Multiple specialized models for compliance analysis
- ✅ Security Awareness: Encryption patterns, audit logging, Auth0
- ✅ Active Development: Recent commits, engaged team
- ✅ Good Documentation Coverage: 60%+ in several services
- ✅ Best-in-Class Integrations: Stripe, Auth0, Facebook, Google Cloud
- ✅ Managed Services: Leveraging cloud-native (Vertex AI, BigQuery)
Weaknesses:
- ⚠️ Critical Test Gap: 8.5% average coverage, 7 services at 0%
- ⚠️ Hardcoded Secrets: 3 services with production credentials in code
- ⚠️ Service Fragmentation: 13 repos creating operational burden
- ⚠️ Missing Auth: ML services exposed without authentication
- ⚠️ Shared Database: Tight coupling between core services
- ⚠️ No Observability: Missing APM, tracing, centralized metrics
- ⚠️ Insecure Communication: 21 occurrences of plain HTTP
- ⚠️ No IaC: Manual infrastructure, config drift risk
- ⚠️ Cross-Cloud Complexity: AWS + GCP operational overhead
- ⚠️ Missing API Docs: No OpenAPI/Swagger specifications
Opportunities:
- 💡 Service Consolidation: 13 → 5-6 services (30-40% cost savings)
- 💡 ML Platform: Unified ML service with model registry
- 💡 API Gateway: Single entry point, better security/observability
- 💡 Serverless ML: Cost-effective inference for variable loads
- 💡 GraphQL: Flexible frontend queries, reduced API calls
- 💡 Platform API: Third-party integrations, ecosystem growth
- 💡 International Expansion: Multi-language, data residency
- 💡 Enterprise Features: SSO, SAML, advanced RBAC for upmarket
- 💡 Compliance Certifications: SOC 2, ISO 27001 for enterprise sales
- 💡 ML Marketplace: Pluggable compliance models for different industries
Threats:
- 🚨 Security Breach: Hardcoded secrets, missing auth (investor concern)
- 🚨 Production Outage: Low test coverage risk (customer churn)
- 🚨 Scaling Failure: Shared database bottleneck (growth blocker)
- 🚨 Technical Debt: Slowing development velocity (competitive risk)
- 🚨 Talent Retention: Complexity burden (team burnout)
- 🚨 Cost Overrun: Inefficient architecture (runway impact)
- 🚨 Compliance Failure: Missing SOC 2 (enterprise sales blocker)
- 🚨 Vendor Lock-in: Auth0, cross-cloud dependencies
- 🚨 API Breaking Changes: No versioning strategy (customer impact)
- 🚨 Data Loss: No DR testing (existential risk)
Future State Vision (12 months)
Target Architecture (Q4 2026):
┌─────────────────────────────────────────────────────────────────────┐
│                        UNIFIED WEB PLATFORM                         │
│                  (React/Next.js - App + Marketing)                  │
│                     Auth0 + Stripe + Analytics                      │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          API GATEWAY LAYER                          │
│                 (GraphQL + REST with Rate Limiting)                 │
│             Authentication + Authorization + Logging                │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         MICROSERVICES LAYER                         │
│  ┌────────────┐ ┌────────────┐ ┌─────────────┐ ┌──────────────┐     │
│  │ Compliance │ │ML Platform │ │Orchestration│ │Content/Policy│     │
│  │  Service   │ │ (Unified)  │ │   Service   │ │   Service    │     │
│  │   (Core)   │ │            │ │  (Maestro)  │ │    (Gong)    │     │
│  └────────────┘ └────────────┘ └─────────────┘ └──────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             DATA LAYER                              │
│  ┌────────────┐ ┌────────────┐ ┌─────────────┐ ┌────────────┐       │
│  │ PostgreSQL │ │   Redis    │ │  Vector DB  │ │  BigQuery  │       │
│  │ (Per Svc)  │ │  (Cache)   │ │(Embeddings) │ │(Analytics) │       │
│  └────────────┘ └────────────┘ └─────────────┘ └────────────┘       │
└─────────────────────────────────────────────────────────────────────┘
Key Characteristics:
- 5 Core Services: Consolidated from 13
- API Gateway: Kong, AWS API Gateway, or GCP Apigee
- Unified ML Platform: Single service with model registry
- Database Per Service: Proper microservices isolation
- Comprehensive Observability: OpenTelemetry, Datadog/New Relic
- Infrastructure as Code: 100% Terraform-managed
- 80%+ Test Coverage: Across all services
- SOC 2 Type 2 Certified: Enterprise-ready security
- Multi-Region: AWS/GCP with active-active DR
- Platform API: Third-party integrations enabled
Technology Evolution:
- Frontend: Migrate to Next.js for SSR, better SEO, unified app
- API: GraphQL gateway with REST fallback
- ML: Serverless inference (SageMaker/Vertex AI) + GPU when needed
- Cache: Redis Cluster with persistence
- Monitoring: OpenTelemetry → Datadog/New Relic + Grafana
- IaC: Terraform with Atlantis for automation
- CI/CD: GitHub Actions with advanced workflows, security scanning
Migration Path
Phase 1: Foundation (Months 1-3) - "Security & Stability"
Goals: Fix critical security issues, establish observability, improve reliability
Deliverables:
- ✅ Remove all hardcoded secrets → AWS Secrets Manager/GCP Secret Manager
- ✅ Implement ML service authentication (JWT + rate limiting)
- ✅ Fix all insecure HTTP communication (enforce HTTPS)
- ✅ Achieve 60% test coverage for critical paths
- ✅ Implement APM and distributed tracing (Datadog/OpenTelemetry)
- ✅ Centralized logging (ELK stack or Datadog)
- ✅ Database read replicas with auto-failover
- ✅ Redis caching layer for API and ML predictions
- ✅ Generate OpenAPI specs for all APIs
- ✅ DR testing and runbook creation
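The secrets-removal deliverable typically means every service loads credentials at startup through a helper like the one sketched below; the AWS call (`get_secret_value`) is the real Secrets Manager API, but the function name, env-var naming scheme, and dev fallback are assumptions for illustration:

```python
import json
import os


def get_secret(name: str) -> dict:
    """Load a secret from AWS Secrets Manager; fall back to an env var for local dev."""
    env_key = name.upper().replace("/", "_").replace("-", "_")
    local = os.environ.get(env_key)
    if local is not None:
        return json.loads(local)  # dev-only fallback; never commit real values
    import boto3  # deferred import: local development needs no AWS credentials
    client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=name)
    return json.loads(resp["SecretString"])
```

Paired with an automated scanner (e.g. gitleaks or trufflehog) in CI, this is what the "zero hardcoded secrets detected" success metric below would verify.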
Success Metrics:
- Zero hardcoded secrets detected by automated scanning
- 100% HTTPS enforcement (0 HTTP violations)
- 60%+ test coverage (up from 8.5%)
- < 200ms P95 API response time with observability
- 99.9% uptime with monitoring evidence
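The "Redis caching layer for API and ML predictions" deliverable is a cache-aside pattern: key predictions by a hash of the creative content and skip inference on repeats. A minimal sketch, assuming a redis-py-compatible client (the key prefix, TTL, and function names are illustrative):

```python
import hashlib
import json


def cached_prediction(redis_client, model_fn, creative_text: str, ttl: int = 3600):
    """Cache-aside: reuse a prior compliance result when the same creative recurs."""
    key = "pred:" + hashlib.sha256(creative_text.encode()).hexdigest()
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)  # cache hit: no inference cost
    result = model_fn(creative_text)  # expensive ML inference
    redis_client.setex(key, ttl, json.dumps(result))  # expire stale scores
    return result
```

Because ad creatives are frequently resubmitted unchanged, even a short TTL can cut inference spend noticeably, which also feeds the cost-efficiency goal.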
Phase 2: Consolidation (Months 4-6) - "Simplification & Scale"
Goals: Reduce service count, unify architecture, prepare for scale
Deliverables:
- ✅ Consolidate ML services: triangle + violin + ipu → Unified ML Platform
  - Single codebase with model registry
  - Model versioning and A/B testing
  - Serverless inference option
- ✅ Merge API services: complyai-api + api-async → Unified API Gateway
  - GraphQL + REST support
  - Built-in async capabilities (Celery)
  - Rate limiting and auth at gateway
- ✅ Migrate to database-per-service pattern
  - Separate databases for each service
  - API-based inter-service communication
  - Event-driven patterns for data sync
- ✅ Infrastructure as Code: 100% Terraform coverage
- ✅ Service mesh implementation (Istio/Linkerd)
- ✅ Achieve 80% test coverage across all services
- ✅ Frontend consolidation evaluation (app + www)
- ✅ Security audit and penetration testing
- ✅ SOC 2 Type 1 certification initiated
- ✅ Cost optimization: 30%+ reduction through consolidation
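The "event-driven patterns for data sync" item can be prototyped with a trivial publish/subscribe abstraction before committing to a broker; in production this would sit on SQS, Pub/Sub, or Kafka rather than in-process, and the topic name and handler below are purely illustrative:

```python
from collections import defaultdict


class EventBus:
    """In-process stand-in for a broker (SQS/PubSub/Kafka) to show the pattern."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict):
        for handler in self._subscribers[topic]:
            handler(payload)
```

Each service maintains its own read model from the events it subscribes to, which is what makes the database-per-service split possible without cross-service SQL.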
Success Metrics:
- Service count: 13 → 7 (nearly 50% reduction)
- Infrastructure cost: 30%+ reduction
- Deployment time: 50% faster
- Test coverage: 80%+ (up from 60%)
- SOC 2 Type 1 audit passed
Phase 3: Enterprise Ready (Months 7-9) - "Compliance & Features"
Goals: Enterprise features, compliance certifications, platform maturity
Deliverables:
- ✅ SOC 2 Type 2 certification
- ✅ Enterprise SSO (SAML, SCIM provisioning)
- ✅ Advanced RBAC and audit trails
- ✅ Multi-region deployment (active-active)
- ✅ Platform API for third-party integrations
- ✅ Comprehensive E2E test suite (Cypress)
- ✅ Advanced ML features:
  - Model explainability (SHAP, LIME)
  - Drift detection and retraining automation
  - Custom model upload for enterprise
- ✅ Frontend migration to Next.js (unified app + marketing)
- ✅ GraphQL API gateway fully operational
- ✅ Final consolidation: 7 → 5 services
Success Metrics:
- SOC 2 Type 2 certified
- Enterprise features: SSO, SCIM, advanced RBAC
- Multi-region: < 10ms cross-region latency
- API ecosystem: 5+ third-party integrations
- Service count: 5 core services (final target)
Phase 4: Platform & Scale (Months 10-12) - "Growth Enablement"
Goals: Platform scalability, international expansion, Series B readiness
Deliverables:
- ✅ International expansion:
  - Multi-language support (i18n)
  - Data residency compliance (EU, APAC)
  - Regional compliance models
- ✅ ML Marketplace:
  - Industry-specific compliance models
  - Partner model integrations
  - Custom model training service
- ✅ Advanced Analytics & BI:
  - Real-time dashboards
  - Predictive compliance insights
  - Customer success metrics
- ✅ White-label capabilities for enterprise
- ✅ Kubernetes migration (if not already done)
- ✅ Advanced auto-scaling and cost optimization
- ✅ Chaos engineering and resiliency testing
- ✅ Performance: 10K concurrent users, < 100ms P95
- ✅ Documentation: Comprehensive platform docs, API guides
- ✅ Series B technical due diligence package complete
Success Metrics:
- 10K+ concurrent users supported
- Multi-region: 3+ regions (US, EU, APAC)
- Platform API: 20+ third-party integrations
- Revenue: API/platform revenue stream
- Series B ready: All technical gates passed
Migration Risks & Mitigation:
| Risk | Impact | Mitigation |
|---|---|---|
| Service consolidation causes downtime | High | Blue-green deployments, feature flags, gradual migration |
| Database migration data loss | Critical | Comprehensive backup, migration rehearsals, rollback plan |
| Breaking changes for clients | Medium | API versioning, deprecation notices, backward compatibility |
| Team bandwidth for migration | High | Hire contractors for feature development, dedicate team to migration |
| Cost spike during transition | Medium | Gradual migration, parallel run minimization, cost monitoring |
| Security regression | Critical | Security testing at each phase, penetration testing, audit gates |
References
Related Documentation
Internal Documents:
API Documentation (To Be Created):
- OpenAPI/Swagger specifications (NEEDED ⚠️)
- GraphQL schema documentation (FUTURE)
- Integration guides (NEEDED ⚠️)
Operational Runbooks (To Be Created):
- Service deployment procedures (NEEDED ⚠️)
- Disaster recovery playbooks (NEEDED ⚠️ CRITICAL)
- Incident response guides (NEEDED ⚠️)
- Security incident procedures (NEEDED ⚠️)
External Resources
Technology Documentation:
- Flask Documentation
- Django/Wagtail CMS
- React Documentation
- TensorFlow, PyTorch
- Google Vertex AI
- Auth0 Documentation
- Stripe API
Best Practices & Patterns:
- Microservices Patterns (Chris Richardson)
- 12-Factor App Methodology
- Google SRE Books
- AWS Well-Architected Framework
- GCP Architecture Framework
Compliance Frameworks:
Architecture Decision Records:
- ADR GitHub - ADR methodology and examples
Appendices
Glossary
Architecture Terms:
- Microservices: Architectural style structuring application as collection of loosely coupled services
- API Gateway: Single entry point for all client requests, routing to appropriate microservice
- Circuit Breaker: Design pattern preventing cascading failures in distributed systems
- CQRS: Command Query Responsibility Segregation - separating read/write operations
- Event Sourcing: Storing state changes as sequence of events
- Service Mesh: Infrastructure layer handling service-to-service communication
- Blue-Green Deployment: Deployment strategy using two identical environments
- Canary Deployment: Gradual rollout to subset of users before full deployment
ComplyAI-Specific Terms:
- Musical Services: ComplyAI's pattern of naming services after musical terms (triangle, violin, maestro, gong, ipu)
- Sync API: complyai-api - handles real-time synchronous requests
- Async API: api-async - handles background/long-running asynchronous tasks
- Core Engine: complyai-core - shared business logic and models
- ML Triad: Triangle, violin, ipu - three specialized ML inference services
- Orchestrator: complyai-maestro - coordinates multi-service workflows
- Policy Engine: complyai-gong - AI-powered policy analysis using Vertex AI
- Compliance Score: AI-generated score indicating ad compliance level
- Ad Creative: Advertising content (images, video, text) submitted for compliance checking
Technology Terms:
- TensorFlow/PyTorch: Deep learning frameworks for ML model training and inference
- chromadb: Vector database for storing ML embeddings and similarity search
- Celery: Distributed task queue for async processing
- Gunicorn: Python WSGI HTTP server for running Flask applications
- Wagtail: Django-based content management system
- Auth0: Managed authentication and authorization platform
- Vertex AI: Google Cloud's managed ML platform
- BigQuery: Google Cloud's data warehouse for analytics
Compliance Terms:
- GDPR: General Data Protection Regulation (EU data privacy)
- SOC 2: Service Organization Control 2 (security audit framework)
- PCI DSS: Payment Card Industry Data Security Standard
- Audit Trail: Chronological record of system activities for compliance
- Data Residency: Requirement for data to be stored in specific geographic location
- Right to Erasure: GDPR right for users to have personal data deleted
Metrics Collection
How Metrics Are Collected:
Application Metrics (To Be Implemented):
- Method: Prometheus exporters or Datadog APM agents
- Metrics: Request counts, latency histograms, error rates, throughput
- Storage: Time-series database (Prometheus/Datadog)
- Visualization: Grafana dashboards or Datadog UI
- Location: To be centralized at `/metrics` endpoints
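Before a Prometheus exporter or Datadog agent is wired in, the latency-histogram idea can be illustrated with a stdlib sketch that buckets observations the way Prometheus `le` (less-or-equal) labels do. The bucket boundaries below are assumptions, chosen so one edge sits at the platform's <200ms SLO:

```python
import bisect
from collections import Counter


class LatencyHistogram:
    """Stdlib stand-in for a Prometheus histogram: count observations per bucket."""

    BUCKETS = [0.05, 0.1, 0.2, 0.5, 1.0]  # seconds; 0.2 matches the <200ms SLO

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, seconds: float):
        # bisect_left puts a value exactly on a boundary into that bucket (le semantics)
        i = bisect.bisect_left(self.BUCKETS, seconds)
        label = f"le={self.BUCKETS[i]}" if i < len(self.BUCKETS) else "le=+Inf"
        self.counts[label] += 1
        self.total += 1
```

A real `prometheus_client.Histogram` adds cumulative bucket counts and a `/metrics` exposition endpoint, but the bucketing behavior is the same.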
Infrastructure Metrics (Current):
- AWS: CloudWatch metrics (EC2, RDS, S3, SQS)
- GCP: Cloud Monitoring (GCE, BigQuery, Vertex AI)
- Access: AWS Console, GCP Console, CLI tools
- Dashboards: Cloud provider dashboards (limited)
Frontend Metrics (Current):
- Tool: FullStory (user sessions), Web Vitals
- Metrics: Page load time, FCP, LCP, CLS, FID, TTI
- Access: FullStory dashboard, browser DevTools
- Location: Client-side instrumentation
Business Metrics (To Be Implemented):
- Source: Application events, database queries
- Metrics: User signups, compliance checks performed, subscriptions, revenue
- Dashboards: To be built in BI tool (Looker, Tableau, Metabase)
Where to Find Metrics:
- Cloud Dashboards: AWS CloudWatch, GCP Monitoring Console
- Application Logs: Service log files (needs centralization)
- FullStory: User behavior and frontend performance
- Database: PostgreSQL stats, slow query logs
- Future: Datadog/New Relic unified dashboard (recommended)
Contact Information
| Role | Name | Contact | Responsibility |
|---|---|---|---|
| Solutions Architect | SkaFld Studio Team | - | Architecture documentation, Series A readiness |
| Technical Lead | [To Be Filled] | - | Architecture decisions, technical strategy |
| Engineering Manager | [To Be Filled] | - | Team coordination, delivery |
| DevOps Lead | [To Be Filled] | - | Infrastructure, CI/CD, deployments |
| Security Lead | [To Be Filled] | - | Security architecture, compliance |
| ML Lead | [To Be Filled] | - | ML models, inference optimization |
| Product Manager | [To Be Filled] | - | Product roadmap, requirements |
Document Metadata
Review Schedule
- Next Review Date: 2025-11-02 (Monthly)
- Review Frequency: Monthly during migration phases, then quarterly
- Review Triggers:
- Major architectural changes
- Post-incident reviews
- Pre-fundraising updates
- Quarterly business reviews
Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-10-02 | SkaFld Studio Team | Initial comprehensive current architecture documentation based on codebase analysis |
Tags
#architecture #complyai #current-state #microservices #python #ml-services #compliance #series-a-readiness #technical-analysis #flask #django #react #tensorflow #pytorch #aws #gcp #security-audit #technical-debt #consolidation-roadmap