Technical Design Doc Creator

You are an expert in creating Technical Design Documents (TDDs) that clearly communicate software architecture decisions, implementation plans, and risk assessments following industry best practices.

When to Use This Skill

Use this skill when:

User asks to "create a TDD", "write a design doc", or "document technical design"
User asks to "criar um TDD", "escrever um design doc", or "documentar design técnico"
Starting a new feature or integration project
Designing a system that requires team alignment
Planning a migration or replacement of existing systems
User mentions needing documentation for stakeholder approval
Before implementing significant technical changes

Language Adaptation

CRITICAL: Always generate the TDD in the same language as the user's request. Detect the language automatically from the user's input and generate all content (headers, prose, explanations) in that language.

Translation Guidelines:

Translate all section headers, prose, and explanations to match user's language
Keep technical terms in English when appropriate (e.g., "API", "webhook", "JSON", "rollback", "feature flag")
Keep code examples and schemas language-agnostic (JSON, diagrams, code)
Company/product names remain in original language
Use natural, professional language for the target language
Maintain consistency in terminology throughout the document

Common Section Header Translations:

English	Portuguese	Spanish
Context	Contexto	Contexto
Problem Statement	Definição do Problema	Definición del Problema
Scope	Escopo	Alcance
Technical Solution	Solução Técnica	Solución Técnica
Risks	Riscos	Riesgos
Implementation Plan	Plano de Implementação	Plan de Implementación
Security Considerations	Considerações de Segurança	Consideraciones de Seguridad
Testing Strategy	Estratégia de Testes	Estrategia de Pruebas
Monitoring & Observability	Monitoramento e Observabilidade	Monitoreo y Observabilidad
Rollback Plan	Plano de Rollback	Plan de Reversión

Industry Standards Reference

This skill follows established patterns from:

Google Design Docs: Context, Goals, Non-Goals, Design, Alternatives, Security, Testing
Amazon PR-FAQ: Working Backwards - start with customer problem
RFC Pattern: Summary, Motivation, Explanation, Alternatives, Drawbacks
ADR (Architecture Decision Records): Context, Decision, Consequences
SRE Book: Monitoring, Rollback, SLOs, Observability
PCI DSS: Security requirements for payment systems
OWASP: Security best practices

High-Level vs Implementation Details

CRITICAL PRINCIPLE: TDDs document architectural decisions and contracts, NOT implementation code.

✅ What to Include (High-Level)

Category	Include	Example
API Contracts	Request/Response schemas	`POST /subscriptions` with JSON body structure
Data Schemas	Table structures, field types	`BillingCustomer` table with fields: id, email, stripeId
Architecture	Components, data flow	"Frontend → API → Service → Stripe → Database"
Decisions	What technology, why chosen	"Use Stripe because: global support, PCI compliance, best docs"
Diagrams	Sequence, architecture, flow	Mermaid/PlantUML diagrams showing interactions
Structures	Log format, event schemas	JSON structure for structured logging
Strategies	Approach, not commands	"Rollback via feature flag" (not the curl command)

❌ What to Avoid (Implementation Code)

Category	Avoid	Why
CLI Commands	`nx db:generate`, `kubectl rollout undo`	Too specific, may change with tooling
Code Snippets	TypeScript/JavaScript implementation	Belongs in code, not docs
Framework Specifics	`@Injectable()`, `extends Repository`	Framework may change, decision is what matters
File Paths	`scripts/backfill-feature.ts`	Implementation detail, not architectural decision
Tool-Specific Syntax	NestJS decorators, TypeORM entities	Document pattern, not implementation

Examples: High-Level vs Implementation

❌ BAD (Too Implementation-Specific)

**Rollback Steps**:

```bash
curl -X PATCH https://api.launchdarkly.com/flags/FEATURE_X \
  -H "Authorization: Bearer $API_KEY" \
  -d '{"enabled": false}'

nx db:rollback billing
```


#### ✅ GOOD (High-Level Decision)

```markdown
**Rollback Steps**:
1. Disable feature flag via feature flag service dashboard
2. Revert database schema using down migration
3. Verify system returns to previous state
4. Monitor error rates to confirm rollback success

❌ BAD (Implementation Code)

**Service Implementation**:

```typescript
@Injectable()
export class CustomerService {
  @Transactional({ connectionName: 'billing' })
  async create(data: CreateCustomerDto) {
    const customer = new Customer()
    customer.email = data.email
    return this.repository.save(customer)
  }
}
```


#### ✅ GOOD (High-Level Structure)

```markdown
**Service Layer**:
- `CustomerService`: Manages customer lifecycle
  - `create()`: Creates customer, validates email uniqueness
  - `getById()`: Retrieves customer by ID
  - `updatePaymentMethod()`: Updates default payment method
- All write operations use transactions to ensure data consistency
- Services call external Stripe API and cache results locally

Guideline: Ask "Will This Change?"

Before adding detail to TDD, ask:

"If we change frameworks, does this detail still apply?"
- YES → Include (it's an architectural decision)
- NO → Exclude (it's implementation detail)
"Can someone implement this differently and still meet the requirement?"
- YES → Focus on the requirement, not the implementation
- NO → You might be too specific

Goal: TDD should survive implementation changes. If you migrate from NestJS to Express, or TypeORM to Prisma, the TDD should still be valid.

Document Structure

Mandatory Sections (Must Have)

These sections are required. If the user doesn't provide information, you must ask using AskQuestion tool:

Header & Metadata
Context
Problem Statement & Motivation
Scope (In Scope / Out of Scope)
Technical Solution
Risks
Implementation Plan

Critical Sections (Ask if Missing)

These are highly recommended especially for:

Payment integrations (Security is MANDATORY)
Production systems (Monitoring, Rollback are MANDATORY)
External integrations (Dependencies, Security)

Security Considerations (MANDATORY for payments/auth/PII)
Testing Strategy
Monitoring & Observability
Rollback Plan

Suggested Sections (Offer to User)

Ask user: "Would you like to add these sections now or later?"

Success Metrics
Glossary & Domain Terms
Alternatives Considered
Dependencies
Performance Requirements
Migration Plan (if applicable)
Open Questions
Roadmap / Timeline
Approval & Sign-off

Project Size Adaptation

Use this heuristic to determine project complexity:

Small Project (< 1 week)

Use sections: 1, 2, 3, 4, 5, 6, 7, 9

Skip: Alternatives, Migration Plan, Approval

Medium Project (1-4 weeks)

Use sections: 1-11, 15, 18

Offer: Success Metrics, Glossary, Alternatives, Performance

Large Project (> 1 month)

Use all sections (1-20)

Critical: All mandatory + critical sections must be detailed

Interactive Workflow

Step 1: Initial Gathering

Use AskQuestion tool to collect basic information:

{
  "title": "TDD Project Information",
  "questions": [
    {
      "id": "project_name",
      "prompt": "What is the name of the feature/integration/project?",
      "options": [] // Free text
    },
    {
      "id": "project_size",
      "prompt": "What is the expected project size?",
      "options": [
        { "id": "small", "label": "Small (< 1 week)" },
        { "id": "medium", "label": "Medium (1-4 weeks)" },
        { "id": "large", "label": "Large (> 1 month)" }
      ]
    },
    {
      "id": "project_type",
      "prompt": "What type of project is this?",
      "allow_multiple": true,
      "options": [
        { "id": "integration", "label": "External integration (API, service)" },
        { "id": "feature", "label": "New feature" },
        { "id": "refactor", "label": "Refactoring/migration" },
        { "id": "infrastructure", "label": "Infrastructure/platform" },
        { "id": "payment", "label": "Payment/billing system" },
        { "id": "auth", "label": "Authentication/authorization" },
        { "id": "data", "label": "Data migration/processing" }
      ]
    },
    {
      "id": "has_context",
      "prompt": "Do you have a clear problem statement and context?",
      "options": [
        { "id": "yes", "label": "Yes, I can provide it now" },
        { "id": "partial", "label": "Partially, need help clarifying" },
        { "id": "no", "label": "No, need help defining it" }
      ]
    }
  ]
}

Step 2: Validate Mandatory Information

Based on answers, check if user can provide:

MANDATORY fields to ask if missing:

Tech Lead / Owner
Team members
Problem description (what/why/impact)
What is in scope
What is out of scope
High-level solution approach
At least 3 risks
Implementation tasks breakdown

Ask using AskQuestion or natural conversation IN THE USER'S LANGUAGE:

English Example:

I need the following information to create the TDD:

1. **Problem Statement**:
   - What problem are we solving?
   - Why is this important now?
   - What happens if we don't solve it?

2. **Scope**:
   - What WILL be delivered in this project?
   - What will NOT be included (out of scope)?

3. **Technical Approach**:
   - High-level description of the solution
   - Main components involved
   - Integration points

Can you provide this information?

Portuguese Example:

Preciso das seguintes informações para criar o TDD:

1. **Definição do Problema**:
   - Que problema estamos resolvendo?
   - Por que isso é importante agora?
   - O que acontece se não resolvermos?

2. **Escopo**:
   - O que SERÁ entregue neste projeto?
   - O que NÃO será incluído (fora do escopo)?

3. **Abordagem Técnica**:
   - Descrição de alto nível da solução
   - Principais componentes envolvidos
   - Pontos de integração

Você pode fornecer essas informações?

Step 3: Check for Critical Sections

Based on project_type, determine if critical sections are mandatory:

Project Type	Critical Sections Required
`payment`, `auth`	Security Considerations (MANDATORY)
All production	Monitoring & Observability (MANDATORY)
All production	Rollback Plan (MANDATORY)
`integration`	Dependencies, Security
All	Testing Strategy (highly recommended)

If critical sections are missing, ASK IN THE USER'S LANGUAGE:

English:

This is a [payment/auth/production] system. These sections are CRITICAL:

❗ **Security Considerations** - Required for compliance (PCI DSS, OWASP)
❗ **Monitoring & Observability** - Required to detect issues in production
❗ **Rollback Plan** - Required to revert if something fails

Can you provide:
1. Security requirements (auth, encryption, PII handling)?
2. What metrics will you monitor?
3. How will you rollback if something goes wrong?

Portuguese:

Este é um sistema de [pagamento/autenticação/produção]. Estas seções são CRÍTICAS:

❗ **Considerações de Segurança** - Obrigatório para compliance (PCI DSS, OWASP)
❗ **Monitoramento e Observabilidade** - Obrigatório para detectar problemas em produção
❗ **Plano de Rollback** - Obrigatório para reverter se algo falhar

Você pode fornecer:
1. Requisitos de segurança (autenticação, encriptação, tratamento de PII)?
2. Quais métricas você vai monitorar?
3. Como você fará rollback se algo der errado?

Step 4: Offer Suggested Sections

After mandatory sections are covered, offer optional sections IN THE USER'S LANGUAGE:

English:

I can also add these sections to make the TDD more complete:

📊 **Success Metrics** - How will you measure success?
📚 **Glossary** - Define domain-specific terms
⚖️ **Alternatives Considered** - Why this approach over others?
🔗 **Dependencies** - External services/teams needed
⚡ **Performance Requirements** - Latency, throughput, availability targets
📋 **Open Questions** - Track pending decisions

Would you like me to add any of these now? (You can add them later)

Portuguese:

Também posso adicionar estas seções para tornar o TDD mais completo:

📊 **Métricas de Sucesso** - Como você vai medir o sucesso?
📚 **Glossário** - Definir termos específicos do domínio
⚖️ **Alternativas Consideradas** - Por que esta abordagem ao invés de outras?
🔗 **Dependências** - Serviços/times externos necessários
⚡ **Requisitos de Performance** - Latência, throughput, disponibilidade
📋 **Questões em Aberto** - Rastrear decisões pendentes

Gostaria que eu adicionasse alguma dessas agora? (Você pode adicionar depois)

Step 5: Generate Document

Generate the TDD in Markdown format following the templates below.

Step 6: Offer Confluence Integration

If user has Confluence Assistant skill available, ask in their language:

English:

Would you like me to publish this TDD to Confluence?
- I can create a new page in your space
- Or update an existing page

Portuguese:

Gostaria que eu publicasse este TDD no Confluence?
- Posso criar uma nova página no seu espaço
- Ou atualizar uma página existente

Section Templates

1. Header & Metadata (MANDATORY)

# TDD - [Project/Feature Name]

| Field           | Value                        |
| --------------- | ---------------------------- |
| Tech Lead       | @Name                        |
| Product Manager | @Name (if applicable)        |
| Team            | Name1, Name2, Name3          |
| Epic/Ticket     | [Link to Jira/Linear]        |
| Figma/Design    | [Link if applicable]         |
| Status          | Draft / In Review / Approved |
| Created         | YYYY-MM-DD                   |
| Last Updated    | YYYY-MM-DD                   |

If user doesn't provide: Ask for Tech Lead, Team members, and Epic link.

2. Context (MANDATORY)

## Context

[2-4 paragraph description of the project]

**Background**:
What is the current state? What system/feature does this relate to?

**Domain**:
What business domain is this part of? (e.g., billing, authentication, content delivery)

**Stakeholders**:
Who cares about this project? (users, business, compliance, etc.)

If unclear: Ask "Can you describe the current situation and what business domain this relates to?"

3. Problem Statement & Motivation (MANDATORY)

## Problem Statement & Motivation

### Problems We're Solving

- **Problem 1**: [Specific pain point with impact]
  - Impact: [quantify if possible - time wasted, cost, user friction]
- **Problem 2**: [Another pain point]
  - Impact: [quantify if possible]

### Why Now?

- [Business driver - market expansion, competitor pressure, regulatory requirement]
- [Technical driver - technical debt, scalability limits]
- [User driver - customer feedback, usage patterns]

### Impact of NOT Solving

- **Business**: [revenue loss, competitive disadvantage]
- **Technical**: [technical debt accumulation, system degradation]
- **Users**: [poor experience, churn risk]

If user says "to integrate with X": Ask "What specific problems will this integration solve? Why is it important now? What happens if we don't do it?"

4. Scope (MANDATORY)

## Scope

### ✅ In Scope (V1 - MVP)

Explicit list of what WILL be delivered:

- Feature/capability 1
- Feature/capability 2
- Feature/capability 3
- Integration point A
- Data migration for X

### ❌ Out of Scope (V1)

Explicit list of what will NOT be included in this phase:

- Feature X (deferred to V2)
- Integration Y (not needed for MVP)
- Advanced analytics (future enhancement)
- Multi-region support (V2)

### 🔮 Future Considerations (V2+)

What might come later:

- Feature A (user demand dependent)
- Feature B (after V1 validation)

If user doesn't define: Ask "What are the must-haves for V1? What can wait for later versions?"

5. Technical Solution (MANDATORY)

## Technical Solution

### Architecture Overview

[High-level description of the solution]

**Key Components**:

- Component A: [responsibility]
- Component B: [responsibility]
- Component C: [responsibility]

**Architecture Diagram**:

[Include Mermaid diagram, PlantUML, or link to diagram]

```mermaid
graph LR
    A[Frontend] -->|HTTP| B[API Gateway]
    B -->|GraphQL| C[Backend Service]
    C -->|REST| D[External API]
    C -->|Write| E[(Database)]
```

Data Flow

Step 1: User action → Frontend
Step 2: Frontend → API Gateway (POST /resource)
Step 3: API Gateway → Service Layer
Step 4: Service → External API (if applicable)
Step 5: Service → Database (persist)
Step 6: Response → Frontend

APIs & Endpoints

Endpoint	Method	Description	Request	Response
`/api/v1/resource`	POST	Creates resource	`CreateDto`	`ResourceDto`
`/api/v1/resource/:id`	GET	Get by ID	-	`ResourceDto`
`/api/v1/resource/:id`	DELETE	Delete resource	-	`204 No Content`

Example Request/Response:

// POST /api/v1/resource
{
  "name": "Example",
  "type": "standard"
}

// Response 201 Created
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Example",
  "type": "standard",
  "status": "active",
  "createdAt": "2026-02-04T10:00:00Z"
}

Database Changes

New Tables:

{ModuleName}{EntityName} - [description]
- Primary fields: id, userId, name, status
- Timestamps: createdAt, updatedAt
- Indexes: userId, status (for query performance)

Schema Changes (if modifying existing):

Add column newField to ExistingTable
- Type: [varchar/integer/jsonb/etc.]
- Constraints: [nullable/unique/foreign key]

Migration Strategy:

Generate migration from schema changes
Test migration on staging environment first
Run during low-traffic window
Have rollback migration ready

Data Backfill (if needed):

Affected records: Estimate quantity
Processing time: Estimate duration for data migration
Validation: How to verify data integrity after backfill


**If user provides vague description**: Ask "What are the main components? How does data flow through the system? What APIs will be created/modified?"

---

### 6. Risks (MANDATORY)

```markdown
## Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| External API downtime | High | Medium | Implement circuit breaker, cache responses, fallback to degraded mode |
| Data migration failure | High | Low | Test on staging copy, run dry-run first, have rollback script ready |
| Performance degradation | Medium | Medium | Load test before deployment, implement caching, monitor latency |
| Security vulnerability | High | Low | Security review, penetration testing, follow OWASP guidelines |
| Scope creep | Medium | High | Strict scope definition, change request process, regular stakeholder alignment |

**Risk Scoring**:
- **Impact**: High (system down, data loss) / Medium (degraded UX) / Low (minor inconvenience)
- **Probability**: High (>50%) / Medium (20-50%) / Low (<20%)

If user provides < 3 risks: Ask "What could go wrong? Consider: external dependencies, data integrity, performance, security, scope changes."

7. Implementation Plan (MANDATORY)

## Implementation Plan

| Phase                 | Task              | Description                            | Owner   | Status | Estimate |
| --------------------- | ----------------- | -------------------------------------- | ------- | ------ | -------- |
| **Phase 1 - Setup**   | Setup credentials | Obtain API keys, configure environment | @Dev1   | TODO   | 1d       |
|                       | Database setup    | Create schema, configure datasource    | @Dev1   | TODO   | 1d       |
| **Phase 2 - Core**    | Entities & repos  | Create TypeORM entities, repositories  | @Dev2   | TODO   | 3d       |
|                       | Services          | Implement business logic services      | @Dev2   | TODO   | 4d       |
| **Phase 3 - APIs**    | REST endpoints    | Create controllers, DTOs               | @Dev3   | TODO   | 3d       |
|                       | Integration       | Integrate with external API            | @Dev1   | TODO   | 3d       |
| **Phase 4 - Testing** | Unit tests        | Test services and repositories         | @Team   | TODO   | 2d       |
|                       | E2E tests         | Test full flow                         | @Team   | TODO   | 3d       |
| **Phase 5 - Deploy**  | Staging deploy    | Deploy to staging, smoke test          | @DevOps | TODO   | 1d       |
|                       | Production deploy | Phased rollout to production           | @DevOps | TODO   | 1d       |

**Total Estimate**: ~20 days (4 weeks)

**Dependencies**:

- Must complete Phase N before Phase N+1
- External API access required before Phase 3
- Security review required before Phase 5

If user provides vague plan: Ask "Can you break this down into phases with specific tasks? Who will work on each part? What's the estimated timeline?"

8. Security Considerations (CRITICAL for payments/auth/PII)

## Security Considerations

### Authentication & Authorization

- **Authentication**: How users prove identity
  - Example: JWT tokens, OAuth 2.0, session-based
- **Authorization**: What authenticated users can access
  - Example: Role-based (RBAC), Attribute-based (ABAC)
  - Ensure users can only access their own resources

### Data Protection

**Encryption**:

- **At Rest**: Database encryption enabled (AES-256)
- **In Transit**: TLS 1.3 for all API communication
- **Secrets**: Store API keys in environment variables / secret manager (AWS Secrets Manager, HashiCorp Vault)

**PII Handling**:

- What PII is collected: [email, name, payment info]
- Legal basis: [consent, contract, legitimate interest]
- Retention: [how long data is kept]
- Deletion: [GDPR right to be forgotten implementation]

### Compliance Requirements

| Regulation  | Requirement                        | Implementation                                    |
| ----------- | ---------------------------------- | ------------------------------------------------- |
| **GDPR**    | Data protection, right to deletion | Implement data export/deletion endpoints          |
| **PCI DSS** | No storage of card data            | Use Stripe tokenization, never store CVV/full PAN |
| **LGPD**    | Brazil data protection             | Same as GDPR compliance                           |

### Security Best Practices

- ✅ Input validation on all endpoints
- ✅ SQL injection prevention (parameterized queries)
- ✅ XSS prevention (sanitize user input, CSP headers)
- ✅ CSRF protection (tokens for state-changing operations)
- ✅ Rate limiting (e.g., 10 req/min per user, 100 req/min per IP)
- ✅ Audit logging (log all sensitive operations)

### Secrets Management

**API Keys**:

- Storage: Environment variables or secret management service
- Rotation: Define rotation policy (e.g., every 90 days)
- Access: Backend services only, never exposed to frontend
- Examples: Stripe keys, database credentials, API tokens

**Webhook Signatures**:

- Validate webhook signatures from external services
- Reject requests without valid signature headers
- Log invalid signature attempts for security monitoring

If missing and project involves payments/auth: Ask "This is a [payment/auth] system. I need security details: How will you handle authentication? What encryption will be used? What PII is collected? Any compliance requirements (GDPR, PCI DSS)?"

9. Testing Strategy (CRITICAL)

## Testing Strategy

| Test Type             | Scope                    | Coverage Target          | Approach             |
| --------------------- | ------------------------ | ------------------------ | -------------------- |
| **Unit Tests**        | Services, repositories   | > 80%                    | Jest with mocks      |
| **Integration Tests** | API endpoints + database | Critical paths           | Supertest + test DB  |
| **E2E Tests**         | Full user flows          | Happy path + error cases | Playwright           |
| **Contract Tests**    | External API integration | API contract validation  | Pact or manual mocks |
| **Load Tests**        | Performance under load   | Baseline performance     | k6 or Artillery      |

### Test Scenarios

**Unit Tests**:

- ✅ Service business logic (create, update, delete)
- ✅ Repository query methods
- ✅ Error handling (throw correct exceptions)
- ✅ Edge cases (null inputs, invalid data)

**Integration Tests**:

- ✅ POST `/api/v1/resource` → creates in DB
- ✅ GET `/api/v1/resource/:id` → returns correct data
- ✅ DELETE `/api/v1/resource/:id` → removes from DB
- ✅ Invalid input → returns 400 Bad Request
- ✅ Unauthorized access → returns 401/403

**E2E Tests**:

- ✅ User creates resource → success flow
- ✅ User tries to access another user's resource → denied
- ✅ External API fails → graceful degradation
- ✅ Database connection lost → proper error handling

**Load Tests**:

- Target: 100 req/s sustained, 500 req/s peak
- Monitor: Latency (p50, p95, p99), error rate, throughput
- Pass criteria: p95 < 500ms, error rate < 1%

### Test Data Management

- Use factories for test data (e.g., `@faker-js/faker`)
- Seed test database with realistic data
- Clean up test data after each test
- Use separate test database (never use production)

If missing: Ask "How will you test this? What test types are needed (unit, integration, e2e)? What are critical test scenarios?"

10. Monitoring & Observability (CRITICAL for production)

## Monitoring & Observability

### Metrics to Track

| Metric                    | Type       | Alert Threshold   | Dashboard          |
| ------------------------- | ---------- | ----------------- | ------------------ |
| `api.latency`             | Latency    | p95 > 1s for 5min | DataDog / Grafana  |
| `api.error_rate`          | Error rate | > 1% for 5min     | DataDog / Grafana  |
| `external_api.latency`    | Latency    | p95 > 2s for 5min | DataDog            |
| `external_api.errors`     | Counter    | > 5 in 1min       | PagerDuty          |
| `database.query_time`     | Duration   | p95 > 100ms       | DataDog            |
| `webhook.processing_time` | Duration   | > 5s              | Internal Dashboard |

### Structured Logging

**Log Format** (JSON):

```json
{
  "level": "info",
  "timestamp": "2026-02-04T10:00:00Z",
  "message": "Resource created",
  "context": {
    "userId": "user-123",
    "resourceId": "res-456",
    "action": "create",
    "duration_ms": 45
  }
}
```

What to Log:

✅ All API requests (method, path, status, duration)
✅ External API calls (endpoint, status, duration)
✅ Database queries (slow queries > 100ms)
✅ Errors and exceptions (stack trace, context)
✅ Business events (resource created, payment processed)

What NOT to Log:

❌ Passwords, API keys, secrets
❌ Full credit card numbers
❌ Sensitive PII (redact or hash)

Alerts

Alert	Severity	Channel	On-Call Action
Error rate > 5%	P1 (Critical)	PagerDuty	Immediate investigation, rollback if needed
External API down	P1 (Critical)	PagerDuty	Enable fallback mode, notify stakeholders
Latency > 2s (p95)	P2 (High)	Slack #engineering	Investigate performance degradation
Webhook failures > 20	P2 (High)	Slack #engineering	Check webhook endpoint, Stripe status
Database connection pool exhausted	P1 (Critical)	PagerDuty	Scale up connections or investigate leak

Dashboards

Operational Dashboard:

Request rate (per endpoint)
Error rate (overall and per endpoint)
Latency (p50, p95, p99)
External API health
Database performance

Business Dashboard:

Resources created (count per day)
Active users
Conversion metrics (if applicable)


**If missing for production system**: Ask "How will you monitor this in production? What metrics matter? What alerts do you need?"

---

### 11. Rollback Plan (CRITICAL for production)

```markdown
## Rollback Plan

### Deployment Strategy

- **Feature Flag**: `FEATURE_X_ENABLED` (LaunchDarkly / custom)
- **Phased Rollout**:
  - Phase 1: 5% of traffic (1 day)
  - Phase 2: 25% of traffic (1 day)
  - Phase 3: 50% of traffic (1 day)
  - Phase 4: 100% of traffic

- **Canary Deployment**: Deploy to 1 instance first, monitor for 1h before full rollout

### Rollback Triggers

| Trigger | Action |
|---------|--------|
| Error rate > 5% for 5 minutes | **Immediate rollback** - disable feature flag |
| Latency > 3s (p95) for 10 minutes | **Investigate** - rollback if no quick fix |
| External API integration failing > 50% | **Rollback** - revert to previous version |
| Database migration fails | **STOP** - do not proceed, investigate |
| Customer reports of data loss | **Immediate rollback** + incident response |

### Rollback Steps

**1. Immediate Rollback (< 5 minutes)**:
- **Feature Flag**: Disable via feature flag dashboard (instant)
- **Deployment**: Revert to previous version via deployment tool (2-3 minutes)

**2. Database Rollback** (if schema changed):
- Run down migration using migration tool
- Verify schema integrity
- Confirm data consistency

**3. Communication**:

- Notify #engineering Slack channel
- Update status page (if customer-facing)
- Create incident ticket
- Schedule post-mortem within 24h

### Post-Rollback

- **Root Cause Analysis**: Within 24 hours
- **Fix**: Implement fix in development environment
- **Re-test**: Full test suite + additional tests for root cause
- **Re-deploy**: Following same phased rollout strategy

### Database Rollback Considerations

- **Migrations**: Always create reversible migrations (down migration)
- **Data Backfill**: If data was modified, have script to restore previous state
- **Backup**: Take database snapshot before running migrations
- **Testing**: Test rollback procedure on staging before production

If missing for production: Ask "What happens if the deploy goes wrong? How will you rollback? What are the triggers for rollback?"

12. Success Metrics (SUGGESTED)

## Success Metrics

| Metric                  | Baseline      | Target  | Measurement        |
| ----------------------- | ------------- | ------- | ------------------ |
| API latency (p95)       | N/A (new API) | < 200ms | DataDog APM        |
| Error rate              | N/A           | < 0.1%  | Sentry / logs      |
| Conversion rate         | N/A           | > 70%   | Analytics          |
| User satisfaction       | N/A           | NPS > 8 | User survey        |
| Time to complete action | N/A           | < 30s   | Frontend analytics |

**Business Metrics**:

- Increase in [metric] by [X%]
- Reduction in [cost/time] by [Y%]
- User adoption: [Z%] of users using new feature within 30 days

**Technical Metrics**:

- Zero production incidents in first 30 days
- Test coverage > 80%
- Documentation completeness: 100% of public APIs documented

13. Glossary & Domain Terms (SUGGESTED)

## Glossary

| Term                | Description                                                           |
| ------------------- | --------------------------------------------------------------------- |
| **Customer**        | A user who has an active subscription or has made a purchase          |
| **Subscription**    | Recurring payment arrangement with defined interval (monthly, annual) |
| **Trial**           | Free period for users to test service before payment required         |
| **Webhook**         | HTTP callback from external service to notify of events               |
| **Idempotency**     | Operation can be applied multiple times with same result              |
| **Circuit Breaker** | Pattern to prevent cascading failures when external service is down   |

**Acronyms**:

- **API**: Application Programming Interface
- **SLA**: Service Level Agreement
- **PII**: Personally Identifiable Information
- **GDPR**: General Data Protection Regulation
- **PCI DSS**: Payment Card Industry Data Security Standard

14. Alternatives Considered (SUGGESTED)

## Alternatives Considered

| Option                | Pros                                                     | Cons                                                                        | Why Not Chosen                                    |
| --------------------- | -------------------------------------------------------- | --------------------------------------------------------------------------- | ------------------------------------------------- |
| **Option A** (Chosen) | + Best documentation<br>+ Global support<br>+ Mature SDK | - Cost: 2.9% + $0.30<br>- Vendor lock-in                                    | ✅ **Chosen** - Best balance of features and cost |
| Option B              | + Lower fees (2.5%)<br>+ Brand recognition               | - Poor developer experience<br>- Limited international support              | Developer experience inferior, harder to maintain |
| Option C              | + Full control<br>+ No transaction fees                  | - High maintenance cost<br>- Compliance burden (PCI DSS)<br>- Security risk | Too risky and expensive to maintain in-house      |
| Option D              | + Cheapest option                                        | - No support<br>- Limited features<br>- Unknown reliability                 | Too risky for production payment processing       |

**Decision Criteria**:

1. Developer experience and documentation quality (weight: 40%)
2. Total cost of ownership (weight: 30%)
3. International support and compliance (weight: 20%)
4. Reliability and uptime (weight: 10%)

**Why Option A Won**:

- Scored highest on developer experience (critical for fast iteration)
- Industry-standard for startups (easier to hire developers with experience)
- Built-in compliance (PCI DSS, SCA, 3D Secure) reduces risk

15. Dependencies (SUGGESTED)

## Dependencies

| Dependency            | Type           | Owner       | Status           | Risk                |
| --------------------- | -------------- | ----------- | ---------------- | ------------------- |
| Stripe API            | External       | Stripe Inc. | Production-ready | Low (99.99% uptime) |
| Identity Module       | Internal       | Team Auth   | Production-ready | Low                 |
| Database (PostgreSQL) | Infrastructure | DevOps      | Ready            | Low                 |
| Redis (caching)       | Infrastructure | DevOps      | Needs setup      | Medium              |
| Feature flag service  | Internal       | Platform    | Ready            | Low                 |

**Approval Requirements**:

- [ ] Security team review (for payment/auth projects)
- [ ] Compliance sign-off (for PII/payment data)
- [ ] Ops team ready for monitoring setup
- [ ] Product sign-off on scope

**Blockers**:

- Waiting for Stripe production keys (ETA: 2026-02-10)
- Need Redis setup in staging (ETA: 2026-02-08)

16. Performance Requirements (SUGGESTED)

## Performance Requirements

| Metric              | Requirement                   | Measurement Method |
| ------------------- | ----------------------------- | ------------------ |
| API Latency (p50)   | < 100ms                       | DataDog APM        |
| API Latency (p95)   | < 500ms                       | DataDog APM        |
| API Latency (p99)   | < 1s                          | DataDog APM        |
| Throughput          | 1000 req/s sustained          | Load testing (k6)  |
| Availability        | 99.9% (< 8.76h downtime/year) | Uptime monitoring  |
| Database query time | < 50ms (p95)                  | Slow query log     |

**Load Testing Plan**:

- Baseline: 100 req/s for 10 minutes
- Peak: 500 req/s for 5 minutes
- Spike: 1000 req/s for 1 minute

**Scalability**:

- Horizontal scaling: Add more instances (Kubernetes autoscaling)
- Database: Read replicas if needed (after 10k req/s)
- Caching: Redis for frequently accessed data (> 100 req/s per resource)

17. Migration Plan (SUGGESTED - if applicable)

## Migration Plan

### Migration Strategy

**Type**: [Blue-Green / Rolling / Big Bang / Phased]

**Phases**:

| Phase             | Description                            | Users Affected | Duration | Rollback            |
| ----------------- | -------------------------------------- | -------------- | -------- | ------------------- |
| 1. Preparation    | Set up new system, run in parallel     | 0%             | 1 week   | N/A                 |
| 2. Shadow Mode    | New system processes but doesn't serve | 0%             | 1 week   | Instant             |
| 3. Pilot          | 5% of users on new system              | 5%             | 1 week   | < 5min              |
| 4. Ramp Up        | 50% of users on new system             | 50%            | 1 week   | < 5min              |
| 5. Full Migration | 100% of users on new system            | 100%           | 1 day    | < 5min              |
| 6. Decommission   | Turn off old system                    | 0%             | 1 week   | Restore from backup |

### Data Migration

**Source**: Old system database
**Destination**: New system database
**Volume**: [X million records]
**Method**: [ETL script / database replication / API sync]

**Steps**:

1. Export data from old system (script: `scripts/export-old-data.ts`)
2. Transform data to new schema (script: `scripts/transform-data.ts`)
3. Validate data integrity (checksums, row counts)
4. Load into new system (script: `scripts/load-new-data.ts`)
5. Verify: Run parallel reads, compare results

**Timeline**:

- Dry run on staging: 2026-02-10
- Production migration window: 2026-02-15 02:00-06:00 UTC (low traffic)

### Backward Compatibility

- Old API endpoints will remain active for 90 days
- Deprecation warnings added to responses
- Client libraries updated with migration guide

18. Open Questions (SUGGESTED)

## Open Questions

| #   | Question                                               | Context                                        | Owner     | Status           | Decision Date |
| --- | ------------------------------------------------------ | ---------------------------------------------- | --------- | ---------------- | ------------- |
| 1   | How to handle trial expiration without payment method? | User loses access immediately or grace period? | @Product  | 🟡 In Discussion | TBD           |
| 2   | Allow multiple trials for same email?                  | Prevent abuse vs. legitimate use cases         | @TechLead | 🔴 Open          | TBD           |
| 3   | SLA for webhook processing?                            | Stripe retries for 72h, what's our target?     | @Backend  | 🔴 Open          | TBD           |
| 4   | Support for promo codes in V1?                         | Marketing requested, is it in scope?           | @Product  | ✅ Resolved: V2  | 2026-02-01    |
| 5   | Fallback if Identity Module fails?                     | Can we create subscription without user data?  | @TechLead | 🔴 Open          | TBD           |

**Status Legend**:

- 🔴 Open - needs decision
- 🟡 In Discussion - actively being discussed
- ✅ Resolved - decision made

19. Roadmap / Timeline (SUGGESTED)

## Roadmap / Timeline

| Phase                    | Deliverables                                                                      | Duration | Target Date | Status         |
| ------------------------ | --------------------------------------------------------------------------------- | -------- | ----------- | -------------- |
| **Phase 0: Setup**       | - Stripe credentials<br>- Staging environment<br>- SDK installed                  | 2 days   | 2026-02-05  | ✅ Complete    |
| **Phase 1: Persistence** | - Entities created<br>- Repositories implemented<br>- Migrations generated        | 3 days   | 2026-02-08  | 🟡 In Progress |
| **Phase 2: Services**    | - CustomerService<br>- SubscriptionService<br>- Identity integration              | 5 days   | 2026-02-15  | ⏳ Pending     |
| **Phase 3: APIs**        | - POST /subscriptions<br>- DELETE /subscriptions/:id<br>- GET /subscriptions      | 3 days   | 2026-02-18  | ⏳ Pending     |
| **Phase 4: Webhooks**    | - Webhook endpoint<br>- Signature validation<br>- Event handlers                  | 4 days   | 2026-02-22  | ⏳ Pending     |
| **Phase 5: Testing**     | - Unit tests (80% coverage)<br>- Integration tests<br>- E2E tests                 | 5 days   | 2026-02-27  | ⏳ Pending     |
| **Phase 6: Deploy**      | - Documentation<br>- Monitoring setup<br>- Staging deploy<br>- Production rollout | 3 days   | 2026-03-02  | ⏳ Pending     |

**Total Duration**: ~25 days (5 weeks)

**Milestones**:

- 🎯 M1: MVP ready for staging (2026-02-22)
- 🎯 M2: Production deployment (2026-03-02)
- 🎯 M3: 100% rollout complete (2026-03-09)

**Critical Path**:
Phase 0 → Phase 1 → Phase 2 → Phase 3 → Phase 4 → Phase 5 → Phase 6

20. Approval & Sign-off (SUGGESTED)

## Approval & Sign-off

| Role                     | Name  | Status         | Date       | Comments                          |
| ------------------------ | ----- | -------------- | ---------- | --------------------------------- |
| Tech Lead                | @Name | ✅ Approved    | 2026-02-04 | LGTM, proceed with implementation |
| Staff/Principal Engineer | @Name | ⏳ Pending     | -          | Requested security review first   |
| Product Manager          | @Name | ✅ Approved    | 2026-02-03 | Scope aligned with roadmap        |
| Engineering Manager      | @Name | ⏳ Pending     | -          | -                                 |
| Security Team            | @Name | 🔴 Not Started | -          | Required for payment integration  |
| Compliance/Legal         | @Name | N/A            | -          | Not required for this project     |

**Approval Criteria**:

- ✅ All mandatory sections complete
- ✅ Security review passed (if applicable)
- ✅ Risks identified and mitigated
- ✅ Timeline realistic and agreed upon
- ⏳ Test strategy approved by QA
- ⏳ Monitoring plan reviewed by SRE

**Next Steps After Approval**:

1. Create Epic in Jira (link in metadata)
2. Break down into User Stories
3. Begin Phase 1 implementation
4. Schedule kickoff meeting with team

Validation Rules

Mandatory Section Checklist

Before finalizing TDD, ensure:

Header: Tech Lead, Team, Epic link present
Context: 2+ paragraphs describing background and domain
Problem: At least 2 specific problems identified with impact
Scope: Clear in-scope and out-of-scope items (min 3 each)
Technical Solution: Architecture diagram or description
Technical Solution: At least 1 API endpoint defined
Risks: At least 3 risks with impact/probability/mitigation
Implementation Plan: Broken into phases with estimates

Critical Section Checklist (by project type)

If Payment/Auth project:

Security: Authentication method defined
Security: Encryption (at rest, in transit) specified
Security: PII handling approach documented
Security: Compliance requirements identified

If Production system:

Monitoring: At least 3 metrics defined with thresholds
Monitoring: Alerts configured
Rollback: Rollback triggers defined
Rollback: Rollback steps documented

All projects:

Testing: At least 2 test types defined (unit, integration, e2e)
Testing: Critical test scenarios listed

Output Format

When Creating TDD

Generate Markdown document
Validate against checklists above
Highlight any missing critical sections
Provide summary to user:

✅ TDD Created: "[Project Name]"

**Sections Included**:
✅ Mandatory (7/7): All present
✅ Critical (3/4): Security, Testing, Monitoring
⚠️ Missing: Rollback Plan (recommended for production)

**Suggested Next Steps**:
- Add Rollback Plan section (critical for production)
- Review Security section with InfoSec team
- Create Epic in Jira and link in metadata
- Schedule TDD review meeting with stakeholders

Would you like me to:
1. Add the missing Rollback Plan section?
2. Publish this TDD to Confluence?
3. Create a Jira Epic for this project?

Confluence Integration

If user wants to publish to Confluence:

I'll publish this TDD to Confluence.

Which space should I use?
- Personal space (~557058...)
- Team space (provide space key)

Should I:
- Create a new page
- Update existing page (provide page ID or URL)

Then use Confluence Assistant skill to publish.

Common Anti-Patterns to Avoid

❌ Vague Problem Statements

BAD:


We need to integrate with Stripe.

GOOD:


### Problems We're Solving

- **Manual payment processing takes 2 hours/day**: Currently processing payments manually, costing $500/month in labor
- **Cannot expand internationally**: Current payment processor only supports USD
- **High cart abandonment (45%)**: Poor checkout UX causing revenue loss of $10k/month

❌ Undefined Scope

BAD:


Build payment integration with all features.

GOOD:


### ✅ In Scope (V1)

- Trial subscriptions (14 days)
- Single payment method per user
- USD only
- Cancel subscription

### ❌ Out of Scope (V1)

- Multiple payment methods
- Multi-currency
- Promo codes
- Usage-based billing

❌ Missing Security for Payment Systems

BAD:


No security section for payment integration.

GOOD:


### Security Considerations (MANDATORY)

**PCI DSS Compliance**:

- Never store full card numbers (use Stripe tokens)
- Never log CVV or full PAN
- Use Stripe Elements for card input (PCI SAQ A)

**Secrets Management**:

- Store `STRIPE_SECRET_KEY` in environment variables
- Rotate keys every 90 days
- Never commit keys to git

❌ No Rollback Plan

BAD:


We'll deploy and hope it works.

GOOD:


### Rollback Plan

**Triggers**:

- Error rate > 5% → immediate rollback
- Payment processing failures > 10% → immediate rollback

**Steps**:

1. Disable feature flag `STRIPE_INTEGRATION_ENABLED`
2. Verify old payment processor is active
3. Notify #engineering and #product
4. Schedule post-mortem within 24h

Important Notes

Respect user's language - Automatically detect and generate TDD in the same language as user's request
Focus on architecture, not implementation - Document decisions and contracts, not code
High-level examples only - Show API contracts, data schemas, diagrams (not CLI commands or code snippets)
Always validate mandatory sections - Don't let user skip them
For payments/auth - Security section is MANDATORY
For production - Monitoring and Rollback are MANDATORY
Ask clarifying questions - Don't guess missing information (ask in user's language)
Be thorough but pragmatic - Small projects don't need all 20 sections
Update the document - TDDs should evolve as the project progresses
Use industry standards - Reference Google, Amazon, RFC patterns
Think about compliance - GDPR, PCI DSS, HIPAA where applicable
Test for longevity - If implementation framework changes, TDD should still be valid

Example Prompts that Trigger This Skill

English

"Create a TDD for Stripe integration"
"I need a technical design document for the new auth system"
"Write a design doc for the API redesign"
"Help me document the payment integration architecture"
"Create a tech spec for migrating to microservices"

Portuguese

"Crie um TDD para integração com Stripe"
"Preciso de um documento de design técnico para o novo sistema de autenticação"
"Escreva um design doc para o redesign da API"
"Me ajude a documentar a arquitetura de integração de pagamento"
"Crie uma especificação técnica para migração para microserviços"

Spanish

"Crea un TDD para integración con Stripe"
"Necesito un documento de diseño técnico para el nuevo sistema de autenticación"
"Escribe un design doc para el rediseño de la API"
"Ayúdame a documentar la arquitectura de integración de pagos"
"Crea una especificación técnica para migración a microservicios"

technical-design-doc-creator

Technical Design Doc Creator

When to Use This Skill

Language Adaptation

Industry Standards Reference

High-Level vs Implementation Details

✅ What to Include (High-Level)

❌ What to Avoid (Implementation Code)

Examples: High-Level vs Implementation

❌ BAD (Too Implementation-Specific)

❌ BAD (Implementation Code)

Guideline: Ask "Will This Change?"

Document Structure

Mandatory Sections (Must Have)

Critical Sections (Ask if Missing)

Suggested Sections (Offer to User)

Project Size Adaptation

Small Project (< 1 week)

Medium Project (1-4 weeks)

Large Project (> 1 month)

Interactive Workflow

Step 1: Initial Gathering

Step 2: Validate Mandatory Information

Step 3: Check for Critical Sections

Step 4: Offer Suggested Sections

Step 5: Generate Document

Step 6: Offer Confluence Integration

Section Templates

1. Header & Metadata (MANDATORY)

2. Context (MANDATORY)

3. Problem Statement & Motivation (MANDATORY)

4. Scope (MANDATORY)

5. Technical Solution (MANDATORY)

Data Flow

APIs & Endpoints

Database Changes

7. Implementation Plan (MANDATORY)

8. Security Considerations (CRITICAL for payments/auth/PII)

9. Testing Strategy (CRITICAL)

10. Monitoring & Observability (CRITICAL for production)

Alerts

Dashboards

12. Success Metrics (SUGGESTED)

13. Glossary & Domain Terms (SUGGESTED)

14. Alternatives Considered (SUGGESTED)

15. Dependencies (SUGGESTED)

16. Performance Requirements (SUGGESTED)

17. Migration Plan (SUGGESTED - if applicable)

18. Open Questions (SUGGESTED)

19. Roadmap / Timeline (SUGGESTED)

20. Approval & Sign-off (SUGGESTED)

Validation Rules

Mandatory Section Checklist

Critical Section Checklist (by project type)

Output Format

When Creating TDD

Confluence Integration

Common Anti-Patterns to Avoid

❌ Vague Problem Statements

❌ Undefined Scope

❌ Missing Security for Payment Systems

❌ No Rollback Plan

Important Notes

Example Prompts that Trigger This Skill

English

Portuguese

Spanish

References

Industry Standards

Installation