
Cluster GitOps Repository Structure Critique

Executive Summary

This document provides a comprehensive architectural critique of the cluster-gitops repository structure, identifying inconsistencies, areas of concern, and recommendations for improvement. The repository shows evidence of recent refactoring and evolution, with some transitional artifacts and documentation drift.

Overall Assessment: The GitOps architecture is fundamentally sound with ApplicationSets and multi-source Helm patterns, but suffers from:

  • Inconsistent documentation
  • Orphaned directories from refactoring
  • A complex nested structure that may confuse new contributors
  • Multiple competing patterns not fully consolidated

Priority: Address documentation drift and remove orphaned artifacts before onboarding additional team members.


1. Structural Issues

1.1. Duplicate ApplicationSet Directories ⚠️ HIGH PRIORITY

Issue: Two ApplicationSet directories exist with different implementations:

cluster-gitops/
├── argo/applicationsets/          # ← Orphaned? (4.6k syrf.yaml)
│   └── syrf.yaml                  # Uses merge generator pattern
└── argocd/applicationsets/        # ← ACTIVE (4.0k syrf.yaml)
    ├── syrf.yaml                  # Uses matrix generator pattern
    ├── custom-charts.yaml
    └── plugins.yaml

Analysis:

  • argocd/bootstrap/root.yaml (L32) references argocd/applicationsets/
  • The argo/ directory appears to be an orphaned experiment or transitional artifact
  • The two syrf.yaml files implement different generator strategies:
      - argo/: Merge generator (combines base + environment configs)
      - argocd/: Matrix generator (simpler, environment-first approach)
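The difference between the two strategies can be sketched roughly as follows. This is a hedged illustration reconstructed from the descriptions above, not the actual file contents; the repo URL follows the patterns quoted elsewhere in this document, and the discovery paths and second matrix dimension are assumptions:

```yaml
# Merge generator (argo/ style, assumed): base service configs are combined
# with environment overrides that share the same mergeKeys value.
generators:
  - merge:
      mergeKeys: [serviceName]
      generators:
        - git:
            repoURL: https://github.com/camaradesuk/cluster-gitops
            revision: main
            files:
              - path: "syrf/services/*/config.yaml"          # base configs
        - git:
            repoURL: https://github.com/camaradesuk/cluster-gitops
            revision: main
            files:
              - path: "syrf/environments/*/*/config.yaml"    # env overrides

# Matrix generator (argocd/ style, assumed): a cross-product of two child
# generators, with no base/override merge step.
generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/camaradesuk/cluster-gitops
            revision: main
            files:
              - path: "syrf/environments/*/*/config.yaml"
        - list:
            elements:
              - cluster: in-cluster   # illustrative second dimension
```

The practical distinction: the merge form requires matching keys across two file sets, while the matrix form treats each discovered environment config as self-contained.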

Git History Evidence:

a479d8b - refactor(argocd): implement merge generator pattern in ApplicationSet
12078a5 - refactor(argocd): reorganize ArgoCD configuration structure
d487860 - chore(gitops): remove obsolete configuration files

Recent commits show active refactoring but incomplete cleanup.

Impact:

  • Confusion: Which ApplicationSet is the "real" one?
  • Drift Risk: If both are maintained, they'll diverge
  • Documentation Mismatch: README doesn't mention the argo/ directory at all

Recommendation:

  1. Determine which implementation is canonical (likely argocd/)
  2. Delete the argo/ directory entirely if obsolete
  3. If argo/ contains experimental work, move it to a feature branch
  4. Update CLAUDE.md to document the decision

Priority: HIGH - This is confusing and error-prone


1.2. Documentation vs Reality Mismatch ⚠️ HIGH PRIORITY

Issue: README.md describes a repository structure that doesn't match the actual implementation.

README Claims (lines 38-106):

cluster-gitops/
├── applicationsets/              # ← Doesn't exist (should be argocd/applicationsets/)
│   ├── syrf.yaml
│   ├── infrastructure.yaml       # ← Doesn't exist
│   └── custom-charts.yaml
├── syrf/
│   ├── api/                      # ← WRONG: Actually syrf/services/api/
│   │   ├── config.yaml
│   │   ├── values.yaml
│   │   ├── values-staging.yaml
│   │   └── values-production.yaml
├── infrastructure/               # ← Doesn't exist
│   ├── cert-manager/
│   ├── ingress-nginx/
│   └── ...
├── environments/                 # ← WRONG: Actually syrf/environments/
│   ├── staging/
│   │   ├── namespace.yaml
│   │   ├── shared-values.yaml
│   │   └── syrf.yaml             # ← Doesn't exist

Actual Structure:

cluster-gitops/
├── argocd/applicationsets/       # ← Not documented in README
├── syrf/
│   ├── services/                 # ← Extra nesting level
│   │   ├── api/
│   │   │   ├── config.yaml
│   │   │   └── values.yaml
│   │   └── ...
│   └── environments/             # ← Under syrf/, not top-level
│       ├── staging/
│       │   ├── namespace.yaml
│       │   ├── shared-values.yaml
│       │   ├── api/              # ← Per-service env configs
│       │   │   ├── config.yaml
│       │   │   └── values.yaml
│       │   └── ...
├── plugins/                      # ← Infrastructure is here
│   ├── helm/                     # ← Called "plugins" not "infrastructure"
│   └── local/                    # ← Not mentioned in README

Impact:

  • New team members will be confused
  • Documentation can't be trusted
  • CI/CD automation may reference wrong paths
  • Tutorial examples won't work

Examples of Path Mismatches:

README Path                       Actual Path                         Status
applicationsets/syrf.yaml         argocd/applicationsets/syrf.yaml    ❌ Wrong
syrf/api/config.yaml              syrf/services/api/config.yaml       ❌ Wrong
environments/staging/syrf.yaml    (does not exist)                    ❌ Missing
infrastructure/cert-manager/      plugins/helm/cert-manager/          ❌ Wrong

Recommendation:

  1. Option A: Update the README to match reality
  2. Option B: Restructure the repository to match the README
  3. Recommended: Option A (less disruptive, README is newer)
  4. Add path validation tests to prevent future drift
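A path-drift check of this kind could look like the following sketch. The sample "repo" built here is illustrative so the script is self-contained; in CI, `root` would point at the real checkout and the loop would list the real documented paths:

```shell
#!/bin/sh
# Sketch: verify that every path the documentation claims actually exists.
# Build a throwaway sample tree with two of three paths present.
root=$(mktemp -d)
mkdir -p "$root/argocd/applicationsets" "$root/syrf/services/api"
touch "$root/argocd/applicationsets/syrf.yaml" "$root/syrf/services/api/config.yaml"

failures=0
for p in \
  "argocd/applicationsets/syrf.yaml" \
  "syrf/services/api/config.yaml" \
  "plugins/helm"
do
  if [ ! -e "$root/$p" ]; then
    failures=$((failures + 1))
    echo "missing: $p"
  fi
done
echo "$failures documented path(s) missing"   # here: plugins/helm was never created
rm -rf "$root"
```

Run from the repository root (with a real path list), a non-zero `failures` count would fail the CI job and catch README drift automatically.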

Priority: HIGH - Documented paths don't work


1.3. Missing Infrastructure ApplicationSet 🔍 MEDIUM PRIORITY

Issue: README mentions infrastructure.yaml ApplicationSet but it doesn't exist.

README Reference (line 44):

applicationsets/
├── syrf.yaml                # SyRF services (auto-generated Applications)
├── infrastructure.yaml      # Infrastructure components  ← MISSING
└── custom-charts.yaml       # Custom Helm charts

What Exists Instead:

  • argocd/applicationsets/plugins.yaml - Manages infrastructure via the plugins pattern
  • plugins/helm/* - Individual infrastructure components (cert-manager, ingress-nginx, etc.)

Analysis: The infrastructure components ARE being managed, but via the "plugins" pattern instead of a dedicated "infrastructure" ApplicationSet. This is semantically confusing:

  • Plugins typically means optional extensions
  • Infrastructure means required platform components

Infrastructure components like cert-manager, ingress-nginx, and external-dns are not optional - they're core platform requirements.

Recommendation:

  1. Option A: Rename plugins → infrastructure throughout
  2. Option B: Keep plugins but update documentation to clarify the terminology
  3. Option C: Create a separate infrastructure.yaml ApplicationSet

Recommended: Option A (clearer semantics)

Changes Required:

  • Rename plugins/ → infrastructure/
  • Rename applicationsets/plugins.yaml → applicationsets/infrastructure.yaml
  • Rename the AppProject from plugins → infrastructure
  • Update all documentation

Priority: MEDIUM - Confusing naming but functionally works


1.4. Inconsistent Service Configuration Model 🔍 MEDIUM PRIORITY

Issue: Service configuration is split across multiple locations with unclear precedence.

Current Model:

syrf/
├── services/{service}/
│   ├── config.yaml           # Base metadata (chartPath, chartRepo)
│   └── values.yaml           # Shared Helm values (all environments)
└── environments/{env}/{service}/
    ├── config.yaml           # Environment-specific (enabled, chartTag)
    └── values.yaml           # Environment-specific Helm values

Example - API Service:

  1. syrf/services/api/config.yaml:

    serviceName: api
    service:
      chartPath: src/services/api/.chart
      chartRepo: https://github.com/camaradesuk/syrf
    

  2. syrf/environments/staging/api/config.yaml:

    serviceName: api
    envName: staging
    service:
      enabled: true
      chartTag: api-v9.1.1  # ← THE CRITICAL FIELD (updated by CI/CD)
    

Problems:

  1. Duplication: serviceName appears in both files
  2. Merge Complexity: Requires merge generator to combine
  3. Unclear Ownership: Which file is the "source of truth"?
  4. CI/CD Confusion: Where should automation update fields?

Value File Hierarchy (seven layers: chart defaults plus six value files, lowest to highest precedence):

1. Chart defaults (in syrf repo)
2. global/values.yaml (all services, all envs)
3. syrf/services/{service}/config.yaml
4. syrf/environments/{env}/{service}/config.yaml
5. syrf/services/{service}/values.yaml
6. syrf/environments/{env}/shared-values.yaml (all services in env)
7. syrf/environments/{env}/{service}/values.yaml

From ApplicationSet (argo/applicationsets/syrf.yaml, lines 82-88):

valueFiles:
  - $values/global/values.yaml
  - $values/syrf/services/{{.serviceName}}/config.yaml
  - $values/syrf/environments/{{.envName}}/{{.serviceName}}/config.yaml
  - $values/syrf/services/{{.serviceName}}/values.yaml
  - $values/syrf/environments/{{.envName}}/shared-values.yaml
  - $values/syrf/environments/{{.envName}}/{{.serviceName}}/values.yaml

Impact:

  • High cognitive load for new developers
  • Debugging "where does this value come from?" is difficult
  • Potential for conflicting values across files
  • Merge conflicts more likely during concurrent updates

Alternative Model (Simpler):

syrf/environments/{env}/{service}.yaml
  ├── Service metadata (chartPath, chartRepo)
  ├── Deployment config (enabled, chartTag)
  └── Helm values (merged into values)

Single file per environment per service - much simpler!
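Under that model, a single consolidated file might look like the following. This is a hypothetical sketch: the field names are carried over from the current config files, but the `values:` nesting and the illustrative Helm value are assumptions:

```yaml
# Hypothetical consolidated file: syrf/environments/staging/api.yaml
serviceName: api
service:
  chartPath: src/services/api/.chart
  chartRepo: https://github.com/camaradesuk/syrf
  enabled: true
  chartTag: api-v9.1.1
values:              # Helm values formerly split across multiple values*.yaml files
  replicaCount: 2    # illustrative only
```

One file per service per environment means CI/CD has exactly one place to update `chartTag`, and "where does this value come from?" has at most two answers: chart defaults or this file.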

Recommendation:

  1. Evaluate whether the base service configs (syrf/services/*/) provide real value
  2. If values are always environment-specific, consolidate them into the environment files
  3. If base configs are needed, document clear guidelines for what goes where
  4. Consider flattening the hierarchy to 3-4 layers

Priority: MEDIUM - Works but complex


1.5. Plugins Directory Naming Inconsistency 🔍 LOW PRIORITY

Issue: ApplicationSet expects plugins/git/*/config.yaml but only plugins/local/* exists.

ApplicationSet (argocd/applicationsets/plugins.yaml, lines 21-26):

# Git-based chart plugins
- git:
    repoURL: https://github.com/camaradesuk/cluster-gitops
    revision: main
    files:
      - path: "plugins/git/*/config.yaml"  # ← Expects git/

Actual Directory:

$ ls -la /home/chris/workspace/syrf/cluster-gitops/plugins/
drwxr-xr-x helm/
drwxr-xr-x local/     # ← Should be git/ ?

Analysis:

  • The ApplicationSet is looking for Git-based charts in plugins/git/
  • Only plugins/local/ exists (contains rabbitmq-secrets)
  • This generator will match zero files and silently fail

Recommendation:

  1. Rename plugins/local/ → plugins/git/
  2. OR update the ApplicationSet to match plugins/local/*/config.yaml
  3. Document the naming convention and its purpose

Priority: LOW - Appears unused currently but will break if used


2. Design Concerns

2.1. Multi-Source Helm Pattern Complexity 🔍 INFO

Observation: The ApplicationSet uses ArgoCD's multi-source pattern to separate Helm charts from values.

Pattern:

sources:
  # Source 1: Helm chart from monorepo at specific tag
  - repoURL: 'https://github.com/camaradesuk/syrf'
    targetRevision: '{{.service.chartTag}}'  # e.g., api-v9.1.1
    path: 'src/services/api/.chart'
    helm:
      valueFiles:
        - $values/global/values.yaml
        - $values/syrf/services/api/values.yaml
        # ... more value files ...

  # Source 2: Values from cluster-gitops repo
  - repoURL: 'https://github.com/camaradesuk/cluster-gitops'
    targetRevision: main
    ref: values

Benefits:

  • ✅ Helm charts versioned with application code
  • ✅ Values decoupled in the GitOps repo
  • ✅ Single source of truth for versions (chartTag)
  • ✅ Enables independent chart versioning

Concerns:

  • ⚠️ Complex to understand for ArgoCD beginners
  • ⚠️ Requires two git repositories for every deployment
  • ⚠️ Value file resolution across repos is opaque
  • ⚠️ Difficult to test locally without ArgoCD

Recommendation:

  • Document this pattern clearly with diagrams
  • Provide a local testing workflow (e.g., a helm template command)
  • Consider adding a troubleshooting guide for common issues
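A local testing workflow could approximate what ArgoCD renders with something like the sketch below. The checkout locations, service/environment names, and exact file list are assumptions; the `-f` order mirrors the valueFiles list quoted earlier, since later `-f` files override earlier ones:

```shell
#!/bin/sh
# Sketch: compose the helm template invocation that mirrors the
# ApplicationSet's value-file precedence, for debugging without a cluster.
SERVICE="api"
ENV="staging"
CHART="../syrf/src/services/$SERVICE/.chart"   # syrf monorepo checkout (assumed path)
GITOPS="."                                      # cluster-gitops checkout root
cmd="helm template $SERVICE $CHART \
  -f $GITOPS/global/values.yaml \
  -f $GITOPS/syrf/services/$SERVICE/values.yaml \
  -f $GITOPS/syrf/environments/$ENV/shared-values.yaml \
  -f $GITOPS/syrf/environments/$ENV/$SERVICE/values.yaml"
echo "$cmd"
# eval "$cmd"   # uncomment once both repos are cloned side by side
```

Diffing the rendered output before and after a values change gives a fast feedback loop that today requires an ArgoCD sync.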

Priority: INFO - Document better, pattern is intentional


2.2. Environment Service Config Enabling Mechanism 🔍 MEDIUM PRIORITY

Issue: Multiple ways to enable/disable services creates ambiguity.

Method 1: Selector in ApplicationSet (argo/applicationsets/syrf.yaml, lines 44-52):

selector:
  matchExpressions:
    - key: service.enabled
      operator: In
      values: ['true']
    - key: service.chartTag
      operator: NotIn
      values: ['', null]

Method 2: Config file flag (syrf/environments/staging/api/config.yaml):

service:
  enabled: true  # ← Boolean flag
  chartTag: api-v9.1.1

Questions:

  1. What happens if enabled: false but chartTag exists?
  2. What happens if enabled: true but chartTag is empty?
  3. Which takes precedence?

Current Behavior (from the selector):

  • Both conditions must be true (AND logic)
  • A service is disabled if either enabled: false or chartTag is empty/null

Recommendation:

  1. Document the enabling logic clearly
  2. Consider simplifying to a single mechanism
  3. Add validation to prevent invalid states
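Such a validation could flag the contradictory states directly. A minimal sketch, using pure-shell parsing of a sample config (mirroring the staging/api example) so it is self-contained; real tooling would likely use yq against the repo files:

```shell
#!/bin/sh
# Sketch: detect the contradictory state "enabled: true but chartTag empty".
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
serviceName: api
envName: staging
service:
  enabled: true
  chartTag: ""
EOF

# Extract the two fields (naive line-based parse; fine for this flat layout).
enabled=$(sed -n 's/^ *enabled: *//p' "$cfg")
chartTag=$(sed -n 's/^ *chartTag: *//p' "$cfg" | tr -d '"')

msg="ok"
if [ "$enabled" = "true" ] && [ -z "$chartTag" ]; then
  msg="INVALID: enabled=true but chartTag is empty"
fi
echo "$msg"
rm -f "$cfg"
```

Run over every environment config in CI, this turns a silently-skipped service into an explicit failure.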

Priority: MEDIUM - Can cause confusion


2.3. Missing Test/Validation Tooling 🔍 MEDIUM PRIORITY

Issue: No automated validation for ApplicationSet templates or config files.

Risks Without Validation:

  • Invalid YAML syntax in config files
  • Missing required fields (serviceName, chartTag, etc.)
  • Incorrect value file paths
  • Template rendering errors

Recommendation: Create validation tooling:

#!/bin/bash
# scripts/validate-gitops.sh
set -euo pipefail

# 1. YAML syntax validation (requires yamllint and fd; swap fd for find if needed)
echo "Validating YAML syntax..."
fd -e yaml -e yml -x yamllint {}

# 2. Required fields validation (-e makes yq exit non-zero on missing/null keys,
#    which plain `yq eval` does not)
echo "Checking required fields..."
for config in syrf/environments/*/*/config.yaml; do
  yq eval -e '.serviceName' "$config" > /dev/null 2>&1 || echo "ERROR: Missing serviceName in $config"
  yq eval -e '.service.chartTag' "$config" > /dev/null 2>&1 || echo "ERROR: Missing chartTag in $config"
done

# 3. ApplicationSet template rendering (dry-run; requires a recent Argo CD CLI)
echo "Testing ApplicationSet template rendering..."
argocd appset generate argocd/applicationsets/plugins.yaml

# 4. Path existence validation
echo "Validating referenced paths..."
# Check that all chartPaths exist in the syrf monorepo
# Check that all valueFiles paths exist

echo "✅ Validation complete"

Integration:

  • Add to GitHub Actions CI
  • Run as a pre-commit hook
  • Include in PR preview environments

Priority: MEDIUM - Prevents deployment failures


3. Documentation Issues

3.1. Obsolete Deployment Guide 🔍 HIGH PRIORITY

Issue: docs/deploying-services.md references old structure and manual workflows.

Obsolete Content (lines 50-77):

# Example: Deploy API v1.2.3 to staging
nano environments/staging/api.values.yaml  # ← File doesn't exist

image:
  repository: ghcr.io/camaradesuk/syrf-api
  tag: "1.2.3"  # Update this line  # ← Wrong approach

Reality:

  • No environments/staging/api.values.yaml file exists
  • CI/CD automatically updates syrf/environments/staging/api/config.yaml
  • The field updated is service.chartTag, not image.tag
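The automated step can be pictured roughly as follows. This is a sketch of the assumed CI/CD behavior, operating on a generated sample file so it is self-contained; a real pipeline would edit the repo file (likely with yq rather than sed) and commit the change:

```shell
#!/bin/sh
# Sketch: bump service.chartTag in an environment config, as CI/CD is
# described as doing. Sample file mirrors syrf/environments/staging/api/config.yaml.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
serviceName: api
envName: staging
service:
  enabled: true
  chartTag: api-v9.1.1
EOF

NEW_TAG="api-v9.1.2"   # hypothetical next release tag
sed -i "s/^\( *chartTag:\).*/\1 $NEW_TAG/" "$cfg"

result=$(sed -n 's/^ *chartTag: *//p' "$cfg")
echo "chartTag is now: $result"
rm -f "$cfg"
```

This is the only field a deploy needs to touch, which is why the guide's `image.tag` instructions are misleading.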

Recommendation:

  1. Rewrite the deployment guide to match the current automated workflow
  2. Remove the manual deployment instructions (obsolete)
  3. Add a section on emergency manual rollbacks
  4. Include troubleshooting for automation failures

Priority: HIGH - Misleading documentation causes errors


3.2. Missing Architecture Diagrams 🔍 MEDIUM PRIORITY

Issue: No visual diagrams explaining:

  • ApplicationSet generator flow
  • Value file precedence hierarchy
  • Deployment workflow (CI/CD → GitOps → K8s)
  • Service discovery mechanism

Recommendation: Create diagrams using Mermaid (GitHub renders it):

flowchart TD
    A[CI/CD builds image] --> B[Creates git tag: api-v9.1.1]
    B --> C[Updates chartTag in cluster-gitops]
    C --> D[GitHub webhook → ArgoCD]
    D --> E[ApplicationSet regenerates Application]
    E --> F[ArgoCD syncs to K8s]
    F --> G[K8s pulls new image from GHCR]

Priority: MEDIUM - Improves onboarding


4. Best Practices Assessment

4.1. What's Working Well ✅

Strong Points:

  1. GitOps Principles
     • Single source of truth in git
     • Declarative configuration
     • Full audit trail via git history

  2. ApplicationSet Automation
     • Auto-discovery of services
     • DRY principle (one ApplicationSet → many Applications)
     • Environment-specific configurations

  3. AppProject RBAC
     • Clear environment boundaries
     • Separation of staging/production permissions
     • Sync windows for production safety

  4. Multi-Source Helm
     • Charts versioned with code
     • Values decoupled for GitOps
     • Flexible value precedence

  5. PR Preview Environments
     • Ephemeral testing environments
     • Auto-cleanup on PR close
     • Safe experimentation

4.2. Alignment with ArgoCD Best Practices

Checklist:

Best Practice                  Status       Notes
App-of-Apps pattern            ✅ Yes       argocd/bootstrap/root.yaml
AppProjects for RBAC           ✅ Yes       Staging, production, plugins
Automated sync for non-prod    ✅ Yes       Staging auto-syncs
Manual sync for production     ⚠️ Partial   Sync allowed but automated
Sync windows                   ⚠️ Partial   Defined but not enforced
Resource hooks                 ✅ Yes       PostSync for notifications
Multi-source apps              ✅ Yes       Charts + values separation
Secrets management             ❓ Unknown   External Secrets Operator?
Progressive delivery           ❌ No        No Argo Rollouts integration

Recommendations:

  1. Enforce production sync windows (currently set, but automation bypasses them)
  2. Investigate the secrets management setup
  3. Consider Argo Rollouts for canary deployments


5. Security Observations

5.1. Overly Permissive Plugins AppProject ⚠️ MEDIUM PRIORITY

Issue: The plugins AppProject uses wildcard permissions.

From README (lines 406-410):

The `plugins` AppProject uses wildcard permissions (`group: '*', kind: '*'`)
which may appear overly permissive. However, this is necessary and secured
through defense-in-depth...

Analysis:

  • ✅ Mitigations documented (manual sync, RBAC, namespace restrictions)
  • ⚠️ Still risky if ArgoCD is compromised
  • ⚠️ No resource type restrictions (can create any K8s resource)

Recommendation:

  1. Audit what resources plugins actually need
  2. Tighten to specific resource types if possible
  3. Add OPA policies to validate plugin manifests
  4. Run regular security audits of plugin configs
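If the audit shows plugins only create a known set of resource kinds, the AppProject could be tightened along these lines. A sketch using the standard AppProject whitelist fields; the listed kinds are illustrative examples, not an audited inventory of what this cluster's plugins need:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: plugins
  namespace: argocd
spec:
  # Replace group: '*', kind: '*' with the kinds plugins actually create.
  namespaceResourceWhitelist:
    - group: ""
      kind: ConfigMap
    - group: ""
      kind: Secret
    - group: apps
      kind: Deployment
    - group: networking.k8s.io
      kind: Ingress
  clusterResourceWhitelist:
    - group: apiextensions.k8s.io
      kind: CustomResourceDefinition
```

Anything a plugin tries to create outside these lists is then rejected at sync time, shrinking the blast radius of a compromised ArgoCD instance.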

Priority: MEDIUM - Defense-in-depth is good, but tightening is better


6. Recommendations Summary

6.1. Immediate Actions (High Priority)

  1. Remove Orphaned argo/ Directory
     • Decision: Is it obsolete or experimental?
     • Action: Delete it or move it to a feature branch
     • Document the decision

  2. Fix Documentation-Reality Mismatch
     • Update README.md to match the actual structure
     • Fix all path references
     • Add validation tests

  3. Rewrite Deployment Guide
     • Remove obsolete manual instructions
     • Document the current automated workflow
     • Include a troubleshooting section

  4. Consolidate ApplicationSet Implementations
     • Choose one generator pattern (merge or matrix)
     • Remove the other
     • Document the choice in an ADR

6.2. Short-Term Improvements (Medium Priority)

  1. Simplify Configuration Model
     • Evaluate the value of the base service configs
     • Consider consolidating to fewer files
     • Document clear guidelines

  2. Add Validation Tooling
     • Create scripts/validate-gitops.sh
     • Integrate it into CI/CD
     • Add pre-commit hooks

  3. Improve Documentation
     • Add architecture diagrams
     • Create a troubleshooting guide
     • Document the multi-source pattern

  4. Fix Plugins Naming
     • Rename plugins/ → infrastructure/
     • OR document why the "plugins" terminology was chosen
     • Fix the local/ vs git/ inconsistency

6.3. Long-Term Enhancements (Low Priority)

  1. Progressive Delivery
     • Evaluate Argo Rollouts integration
     • Implement canary deployments
     • Add automated rollback

  2. Secrets Management Audit
     • Document the External Secrets Operator setup
     • Validate secret rotation
     • Conduct a security review

  3. Monitoring Integration
     • Add deployment metrics
     • Build ArgoCD application health dashboards
     • Alert on sync failures

7. What to Do Next

7.1. Decision Points Requiring User Input

Question 1: ApplicationSet Strategy

  • Keep the merge generator (argo/) or the matrix generator (argocd/)?
  • Current: Both exist with different implementations
  • Recommendation: Choose one, document it in an ADR

Question 2: Directory Structure

  • Rename to match the README (e.g., infrastructure/) or update the README?
  • Recommendation: Update the README (less disruptive)

Question 3: Configuration Simplification

  • Keep split base/environment configs or consolidate?
  • Trade-off: DRY vs simplicity
  • Recommendation: Evaluate actual reuse, then decide

Question 4: Production Sync Policy

  • Enforce manual sync or keep the current automated approach?
  • Current: Automated, but sync windows are defined
  • Recommendation: Enforce manual sync for production safety

7.2. Suggested Next Steps

Phase 1: Clean Up (1-2 days)

  1. Review and delete the argo/ directory
  2. Update the README to match reality
  3. Fix the plugins naming inconsistency
  4. Commit all cleanup changes

Phase 2: Document (2-3 days)

  1. Create architecture diagrams
  2. Rewrite the deployment guide
  3. Add a troubleshooting guide
  4. Create ADRs for key decisions

Phase 3: Validate (3-5 days)

  1. Build the validation script
  2. Integrate it into CI/CD
  3. Test all ApplicationSets
  4. Fix any discovered issues

Phase 4: Harden (Ongoing)

  1. Review AppProject permissions
  2. Add OPA policies
  3. Conduct a security audit
  4. Monitor and iterate


8. Conclusion

The cluster-gitops repository demonstrates a solid understanding of GitOps principles and ArgoCD best practices, with ApplicationSets, multi-source Helm, and automated deployments working well in staging.

Key Strengths:

  • Automated staging deployments via CI/CD
  • Clear environment separation (staging/production)
  • DRY principle with ApplicationSets
  • Git-based audit trail

Key Concerns:

  • Documentation drift (README doesn't match reality)
  • Orphaned directories from refactoring
  • Complex nested configuration structure
  • Missing validation tooling

Priority Focus:

  1. Documentation accuracy (HIGH)
  2. Remove orphaned artifacts (HIGH)
  3. Simplify where possible (MEDIUM)
  4. Add validation (MEDIUM)

Overall: The repository is functional but needs cleanup and documentation. With the recommended improvements, it will be much easier for new team members to understand and contribute.


Document Status: Draft - Ready for team review
Review Requested: Architecture decisions, prioritization, next steps
Last Updated: 2025-11-15