GPU Usage Logging, Monitoring, and Documentation

Status: 🔄 In Progress (40% - ⅖ tasks complete) Priority: Medium Story Points: 2 Assignee: @chrissena Created: 2025-04-08 Latest Update: 2025-05-20 Epic: GPU Support Implementation

Problem Statement

As an operator running ML inference workloads on GKE with GPU support, I need comprehensive logging and monitoring capabilities to verify GPU usage, track performance, and troubleshoot issues.

Current State:

  • GPU support implemented but limited observability
  • Some device logging exists (CPU vs GPU)
  • Documentation for GPU deployment created
  • Missing: inference duration metrics, GPU utilization tracking, monitoring integration

Pain Points:

  • Difficult to verify if models are actually using GPUs
  • No visibility into inference performance metrics
  • Limited troubleshooting capabilities when issues occur
  • GPU resource utilization not tracked for capacity planning

Proposed Solution

Implement comprehensive logging, monitoring, and documentation for GPU usage to provide full observability of ML inference operations.

User Story:

As an operator, I need detailed logging, performance monitoring, and updated documentation so that I can verify GPU usage and troubleshoot any issues.

Technical Approach

1. Enhanced Device Logging ✅ COMPLETED:

  • Log device placement for each model (CPU vs GPU)
  • Record model initialization and device assignment
  • Capture device-specific errors and warnings

2. Performance Metrics Logging ⏳ IN PROGRESS:

  • Record inference duration for each prediction call
  • Track GPU utilization metrics during inference
  • Log batch processing performance
  • Add timing information to STDOUT responses

3. Monitoring Integration ⏳ PENDING:

  • Integrate with Prometheus for metric collection
  • Configure Stackdriver (Google Cloud Monitoring) integration
  • Create dashboards for GPU performance visualization
  • Set up alerts for performance degradation

4. Error Handling Improvements ⏳ PENDING:

  • Add structured error codes to STDOUT
  • Implement error categorization (GPU-specific vs general)
  • Enhance error messages with troubleshooting context

5. Documentation Updates ✅ COMPLETED:

  • Document GPU deployment configuration for GKE
  • Include pod specifications and environment setup
  • Provide troubleshooting guide
  • Add performance tuning recommendations

Acceptance Criteria

Completed ✅:

  • Logs display device used for inference (CPU vs GPU) and capture model placement
  • Documentation updated with instructions for deploying container with GPU support on GKE

Remaining ⏳:

  • Inference times and GPU utilization metrics are logged and accessible
  • Monitoring tools can access and display relevant GPU performance metrics
  • Timer added to report inference duration in STDOUT responses
  • Error codes added to STDOUT for structured error handling

Technical Notes

Architecture

ML Inference Service:

HTTP Request → Python API
        ↓
Model Inference (GPU/CPU)
        ↓
Performance Logging
        ↓
STDOUT Response + Metrics
        ↓
Prometheus Export

Monitoring Stack:

ML Service → Prometheus → Grafana/Cloud Monitoring
                 ↓
          Alerting Rules

Implementation Details

Logging Enhancements:

import time
import logging
from contextlib import contextmanager

@contextmanager
def time_inference(model_name: str):
    """Context manager to time inference operations.

    Yields a stats dict so the caller can include timing and device
    information in its response.
    """
    stats = {"device": get_model_device(model_name), "duration": 0.0}
    start_time = time.perf_counter()

    logging.info(f"Starting inference on {stats['device']} for model: {model_name}")

    try:
        yield stats
    finally:
        stats["duration"] = time.perf_counter() - start_time
        logging.info(
            f"Inference completed in {stats['duration']:.3f}s on {stats['device']}"
        )

        # Export metric to Prometheus
        inference_duration.labels(
            model=model_name,
            device=stats["device"]
        ).observe(stats["duration"])

# Usage in prediction endpoint
def predict(data):
    with time_inference("my-model") as stats:
        result = model.predict(data)

    return {
        "result": result,
        "inference_time_ms": stats["duration"] * 1000,
        "device": stats["device"],
        "error_code": 0  # Success
    }

Prometheus Metrics:

from prometheus_client import Summary, Counter, Gauge

# Inference duration summary (swap in Histogram if latency distribution
# buckets are needed for aggregation across instances)
inference_duration = Summary(
    'ml_inference_duration_seconds',
    'Time spent processing inference request',
    ['model', 'device']
)

# GPU utilization gauge
gpu_utilization = Gauge(
    'ml_gpu_utilization_percent',
    'GPU utilization percentage',
    ['gpu_id']
)

# Error counter
inference_errors = Counter(
    'ml_inference_errors_total',
    'Total inference errors',
    ['model', 'error_code']
)
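
The `gpu_utilization` gauge above needs a feeder; one likely shape is a background sampler polling NVML at a fixed interval. The sketch below keeps the reader injectable so it is testable without a GPU. The `pynvml` calls shown in the docstring are the real API; the class name and polling interval are assumptions:

```python
import threading

class GpuUtilizationSampler:
    """Background sampler that polls per-GPU utilization.

    read_fn(gpu_id) -> int percent. In production this would wrap pynvml:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
        pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    """

    def __init__(self, gpu_ids, read_fn, interval_s=5.0):
        self.gpu_ids = gpu_ids
        self.read_fn = read_fn
        self.interval_s = interval_s
        self.latest = {}
        self._stop = threading.Event()

    def sample_once(self):
        """Poll each GPU once and record the latest readings."""
        for gpu_id in self.gpu_ids:
            self.latest[gpu_id] = self.read_fn(gpu_id)
            # Here the reading would be exported, e.g.:
            # gpu_utilization.labels(gpu_id=str(gpu_id)).set(self.latest[gpu_id])
        return dict(self.latest)

    def run(self):
        """Loop until stop() is called, sampling every interval_s seconds."""
        while not self._stop.is_set():
            self.sample_once()
            self._stop.wait(self.interval_s)

    def stop(self):
        self._stop.set()
```

Running `run()` in a daemon thread keeps sampling off the inference hot path, which matters for the <5ms overhead target.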

Error Codes (to be implemented):

from enum import Enum

class InferenceErrorCode(Enum):
    SUCCESS = 0
    GPU_OOM = 1001  # GPU out of memory
    GPU_NOT_AVAILABLE = 1002  # GPU requested but not available
    MODEL_NOT_LOADED = 2001  # Model not properly loaded
    INVALID_INPUT = 3001  # Invalid input data
    INFERENCE_TIMEOUT = 4001  # Inference timeout
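
A categorization helper will be needed to map raised exceptions onto these codes. The heuristics below are an assumption, not a final design; frameworks differ in how GPU failures surface (PyTorch, for instance, raises `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass whose message contains "out of memory"):

```python
def categorize_error(exc: Exception) -> int:
    """Map an exception to a structured error code (values from the enum above).

    Heuristic sketch; the real taxonomy would be driven by the ML framework's
    actual exception types.
    """
    message = str(exc).lower()
    if "out of memory" in message:
        return 1001  # GPU_OOM
    if isinstance(exc, TimeoutError):
        return 4001  # INFERENCE_TIMEOUT
    if isinstance(exc, (ValueError, TypeError)):
        return 3001  # INVALID_INPUT
    if "cuda" in message or "gpu" in message:
        return 1002  # GPU_NOT_AVAILABLE
    # Fallback; a real implementation would likely add an UNKNOWN code
    return 2001  # MODEL_NOT_LOADED
```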

Deployment Configuration

GKE Pod Spec with GPU:

apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
spec:
  containers:
  - name: inference
    image: syrf-ml-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    - name: PROMETHEUS_PORT
      value: "9090"

Dependencies

Python Libraries:

  • prometheus_client - Metric export
  • torch or tensorflow - ML framework with GPU support
  • pynvml - NVIDIA GPU monitoring

Infrastructure:

  • GKE cluster with GPU node pool
  • NVIDIA device plugin for Kubernetes
  • Prometheus for metric collection
  • Grafana or Cloud Monitoring for visualization

Related Services:

  • ML inference API service
  • Monitoring infrastructure (Prometheus, Grafana)
  • Google Cloud Monitoring (Stackdriver)

Testing Strategy

Unit Tests

from unittest.mock import patch

def test_inference_timing():
    """Test that inference timing is recorded"""
    with patch('time.perf_counter') as mock_time:
        mock_time.side_effect = [0.0, 0.5]  # 500ms inference

        result = predict(test_data)

        assert result['inference_time_ms'] == 500
        assert result['error_code'] == 0

def test_device_logging():
    """Test that device placement is logged"""
    with patch('logging.info') as mock_log:
        predict(test_data)

        mock_log.assert_any_call(
            'Starting inference on cuda:0 for model: my-model'
        )

Integration Tests

  • Deploy to GKE test cluster with GPU
  • Verify Prometheus metrics are exported
  • Check Cloud Monitoring integration
  • Validate dashboard displays metrics correctly
  • Test alert firing for performance degradation

Manual Testing

  • Run inference requests with varying batch sizes
  • Monitor GPU utilization in real-time
  • Verify log output includes all required information
  • Test error scenarios (OOM, GPU unavailable, etc.)
  • Validate STDOUT error codes

Blockers and Risks

Current Blockers

  • None identified

Risks

  • 🔶 Prometheus Integration: May require infrastructure changes
  • 🔶 Performance Overhead: Metric collection could impact inference latency (target: <5ms overhead)
  • 🔶 GPU Availability: Testing requires GPU-enabled GKE nodes
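
The <5ms overhead target is cheap to verify with a micro-benchmark before Prometheus is wired in. A minimal sketch, timing a no-op block through a bare context manager so it measures wrapper cost only (not logging or metric export, which would add to the real figure):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed():
    """Bare timing wrapper, mirroring the shape of time_inference."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _ = time.perf_counter() - start

def overhead_ms(iterations: int = 1000) -> float:
    """Average per-call cost of the wrapper, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(iterations):
        with timed():
            pass
    return (time.perf_counter() - t0) / iterations * 1000
```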

Next Actions

Immediate (This Sprint)

  1. Implement Inference Timing:
     • Add timer to prediction endpoint
     • Include timing in STDOUT response
     • Test performance overhead

  2. Add Error Codes:
     • Define error code enum
     • Implement error categorization
     • Update STDOUT response format
     • Document error codes
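
Once the timer and error codes land, the STDOUT response could be serialized as below. The field names are assumptions carried over from the predict() example earlier in this brief, not a final schema:

```python
import json

def build_response(result=None, duration_s=0.0, device="cpu", error_code=0):
    """Serialize the structured STDOUT response (hypothetical schema)."""
    return json.dumps({
        "result": result,
        "inference_time_ms": round(duration_s * 1000, 3),
        "device": device,
        "error_code": error_code,  # 0 = success, see InferenceErrorCode
    })
```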

Next Sprint

  1. Prometheus Integration:
     • Implement metric export endpoint
     • Configure Prometheus scraping
     • Create Grafana dashboards
     • Set up basic alerts

  2. GPU Utilization Tracking:
     • Integrate pynvml for GPU monitoring
     • Export GPU metrics to Prometheus
     • Add GPU utilization to logs

Success Metrics

  • ✅ Device logging operational (CPU vs GPU)
  • ✅ GPU deployment documentation complete
  • ⏳ Inference timing < 5ms overhead from metric collection
  • ⏳ GPU utilization metrics available in Prometheus
  • ⏳ Error rate < 1% for GPU operations
  • ⏳ 100% of inference operations include timing data
  • ⏳ All error codes documented and implemented

Related

  • Issue #1804: Primary tracking issue (User Story 5)
  • GPU Support Epic: Parent epic for GPU implementation work
  • ML Inference Service: Related service requiring monitoring

Timeline

Created: 2025-04-08 Latest Update: 2025-05-20 Estimated Completion: TBD (pending Prometheus integration)


Source: GitHub Issue #1804 Last Synced: 2025-11-24

This feature brief was auto-generated from the GitHub issue. The work is 40% complete, with 2 of 5 technical-approach tasks done. Core device logging and documentation are complete; remaining work focuses on performance metrics and monitoring integration.

Next Steps:

  1. Implement inference timing and error codes (quick wins)
  2. Plan Prometheus integration approach
  3. Test on GPU-enabled GKE cluster
  4. Create monitoring dashboards