GPU Usage Logging, Monitoring, and Documentation

Status: 🔄 In Progress (40% - ⅖ tasks complete) Priority: Medium Story Points: 2 Assignee: @chrissena Created: 2025-04-08 Latest Update: 2025-05-20 Epic: GPU Support Implementation

Problem Statement

As an operator running ML inference workloads on GKE with GPU support, I need comprehensive logging and monitoring capabilities to verify GPU usage, track performance, and troubleshoot issues.

Current State:

  • GPU support implemented but limited observability
  • Some device logging exists (CPU vs GPU)
  • Documentation for GPU deployment created
  • Missing: inference duration metrics, GPU utilization tracking, monitoring integration

Pain Points:

  • Difficult to verify if models are actually using GPUs
  • No visibility into inference performance metrics
  • Limited troubleshooting capabilities when issues occur
  • GPU resource utilization not tracked for capacity planning

Proposed Solution

Implement comprehensive logging, monitoring, and documentation for GPU usage to provide full observability of ML inference operations.

User Story:

As an operator, I need detailed logging, performance monitoring, and updated documentation so that I can verify GPU usage and troubleshoot any issues.

Technical Approach

1. Enhanced Device Logging ✅ COMPLETED:

  • Log device placement for each model (CPU vs GPU)
  • Record model initialization and device assignment
  • Capture device-specific errors and warnings

2. Performance Metrics Logging ⏳ IN PROGRESS:

  • Record inference duration for each prediction call
  • Track GPU utilization metrics during inference
  • Log batch processing performance
  • Add timing information to STDOUT responses

3. Monitoring Integration ⏳ PENDING:

  • Integrate with Prometheus for metric collection
  • Configure Stackdriver (Google Cloud Monitoring) integration
  • Create dashboards for GPU performance visualization
  • Set up alerts for performance degradation

4. Error Handling Improvements ⏳ PENDING:

  • Add structured error codes to STDOUT
  • Implement error categorization (GPU-specific vs general)
  • Enhance error messages with troubleshooting context

5. Documentation Updates ✅ COMPLETED:

  • Document GPU deployment configuration for GKE
  • Include pod specifications and environment setup
  • Provide troubleshooting guide
  • Add performance tuning recommendations

Acceptance Criteria

Completed ✅:

  • Logs display device used for inference (CPU vs GPU) and capture model placement
  • Documentation updated with instructions for deploying container with GPU support on GKE

Remaining ⏳:

  • Inference times and GPU utilization metrics are logged and accessible
  • Monitoring tools can access and display relevant GPU performance metrics
  • Timer added to report inference duration in STDOUT responses
  • Error codes added to STDOUT for structured error handling

Technical Notes

Architecture

ML Inference Service:

HTTP Request → Python API
        ↓
Model Inference (GPU/CPU)
        ↓
Performance Logging
        ↓
STDOUT Response + Metrics
        ↓
Prometheus Export

Monitoring Stack:

ML Service → Prometheus → Grafana/Cloud Monitoring
                 ↓
          Alerting Rules

Implementation Details

Logging Enhancements:

import time
import logging
from contextlib import contextmanager

@contextmanager
def time_inference(model_name: str):
    """Context manager to time inference operations.

    Yields a stats dict so the caller can include timing and device
    information in its response.
    """
    stats = {"device": get_model_device(model_name), "duration": 0.0}
    start_time = time.perf_counter()

    logging.info(f"Starting inference on {stats['device']} for model: {model_name}")

    try:
        yield stats
    finally:
        stats["duration"] = time.perf_counter() - start_time
        logging.info(
            f"Inference completed in {stats['duration']:.3f}s on {stats['device']}"
        )

        # Export metric to Prometheus
        inference_duration.labels(
            model=model_name,
            device=stats["device"]
        ).observe(stats["duration"])

# Usage in prediction endpoint
def predict(data):
    with time_inference("my-model") as stats:
        result = model.predict(data)

    return {
        "result": result,
        "inference_time_ms": stats["duration"] * 1000,
        "device": stats["device"],
        "error_code": 0  # Success
    }

Prometheus Metrics:

from prometheus_client import Summary, Counter, Gauge

# Inference duration summary (swap in Histogram if latency distribution
# buckets are needed for aggregation across instances)
inference_duration = Summary(
    'ml_inference_duration_seconds',
    'Time spent processing inference request',
    ['model', 'device']
)

# GPU utilization gauge
gpu_utilization = Gauge(
    'ml_gpu_utilization_percent',
    'GPU utilization percentage',
    ['gpu_id']
)

# Error counter
inference_errors = Counter(
    'ml_inference_errors_total',
    'Total inference errors',
    ['model', 'error_code']
)
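
The `gpu_utilization` gauge above needs a feeder; one likely shape is a background sampler polling NVML at a fixed interval. The sketch below keeps the reader injectable so it is testable without a GPU. The `pynvml` calls shown in the docstring are the real API; the class name and polling interval are assumptions:

```python
import threading

class GpuUtilizationSampler:
    """Background sampler that polls per-GPU utilization.

    read_fn(gpu_id) -> int percent. In production this would wrap pynvml:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
        pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    """

    def __init__(self, gpu_ids, read_fn, interval_s=5.0):
        self.gpu_ids = gpu_ids
        self.read_fn = read_fn
        self.interval_s = interval_s
        self.latest = {}
        self._stop = threading.Event()

    def sample_once(self):
        """Poll each GPU once and record the latest readings."""
        for gpu_id in self.gpu_ids:
            self.latest[gpu_id] = self.read_fn(gpu_id)
            # Here the reading would be exported, e.g.:
            # gpu_utilization.labels(gpu_id=str(gpu_id)).set(self.latest[gpu_id])
        return dict(self.latest)

    def run(self):
        """Loop until stop() is called, sampling every interval_s seconds."""
        while not self._stop.is_set():
            self.sample_once()
            self._stop.wait(self.interval_s)

    def stop(self):
        self._stop.set()
```

Running `run()` in a daemon thread keeps sampling off the inference hot path, which matters for the <5ms overhead target.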

Error Codes (to be implemented):

from enum import Enum

class InferenceErrorCode(Enum):
    SUCCESS = 0
    GPU_OOM = 1001  # GPU out of memory
    GPU_NOT_AVAILABLE = 1002  # GPU requested but not available
    MODEL_NOT_LOADED = 2001  # Model not properly loaded
    INVALID_INPUT = 3001  # Invalid input data
    INFERENCE_TIMEOUT = 4001  # Inference timeout
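
A categorization helper will be needed to map raised exceptions onto these codes. The heuristics below are an assumption, not a final design; frameworks differ in how GPU failures surface (PyTorch, for instance, raises `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass whose message contains "out of memory"):

```python
def categorize_error(exc: Exception) -> int:
    """Map an exception to a structured error code (values from the enum above).

    Heuristic sketch; the real taxonomy would be driven by the ML framework's
    actual exception types.
    """
    message = str(exc).lower()
    if "out of memory" in message:
        return 1001  # GPU_OOM
    if isinstance(exc, TimeoutError):
        return 4001  # INFERENCE_TIMEOUT
    if isinstance(exc, (ValueError, TypeError)):
        return 3001  # INVALID_INPUT
    if "cuda" in message or "gpu" in message:
        return 1002  # GPU_NOT_AVAILABLE
    # Fallback; a real implementation would likely add an UNKNOWN code
    return 2001  # MODEL_NOT_LOADED
```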

Deployment Configuration

GKE Pod Spec with GPU:

apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
spec:
  containers:
  - name: inference
    image: syrf-ml-inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # Request 1 GPU
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    - name: PROMETHEUS_PORT
      value: "9090"

Dependencies

Python Libraries:

  • prometheus_client - Metric export
  • torch or tensorflow - ML framework with GPU support
  • pynvml - NVIDIA GPU monitoring

Infrastructure:

  • GKE cluster with GPU node pool
  • NVIDIA device plugin for Kubernetes
  • Prometheus for metric collection
  • Grafana or Cloud Monitoring for visualization

Related Services:

  • ML inference API service
  • Monitoring infrastructure (Prometheus, Grafana)
  • Google Cloud Monitoring (Stackdriver)

Testing Strategy

Unit Tests

from unittest.mock import patch

def test_inference_timing():
    """Test that inference timing is recorded"""
    with patch('time.perf_counter') as mock_time:
        mock_time.side_effect = [0.0, 0.5]  # 500ms inference

        result = predict(test_data)

        assert result['inference_time_ms'] == 500
        assert result['error_code'] == 0

def test_device_logging():
    """Test that device placement is logged"""
    with patch('logging.info') as mock_log:
        predict(test_data)

        mock_log.assert_any_call(
            'Starting inference on cuda:0 for model: my-model'
        )

Integration Tests

  • Deploy to GKE test cluster with GPU
  • Verify Prometheus metrics are exported
  • Check Cloud Monitoring integration
  • Validate dashboard displays metrics correctly
  • Test alert firing for performance degradation

Manual Testing

  • Run inference requests with varying batch sizes
  • Monitor GPU utilization in real-time
  • Verify log output includes all required information
  • Test error scenarios (OOM, GPU unavailable, etc.)
  • Validate STDOUT error codes

Blockers and Risks

Current Blockers

  • None identified

Risks

  • 🔶 Prometheus Integration: May require infrastructure changes
  • 🔶 Performance Overhead: Metric collection could impact inference latency (target: <5ms overhead)
  • 🔶 GPU Availability: Testing requires GPU-enabled GKE nodes
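
The <5ms overhead target is cheap to verify with a micro-benchmark before Prometheus is wired in. A minimal sketch, timing a no-op block through a bare context manager so it measures wrapper cost only (not logging or metric export, which would add to the real figure):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed():
    """Bare timing wrapper, mirroring the shape of time_inference."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _ = time.perf_counter() - start

def overhead_ms(iterations: int = 1000) -> float:
    """Average per-call cost of the wrapper, in milliseconds."""
    t0 = time.perf_counter()
    for _ in range(iterations):
        with timed():
            pass
    return (time.perf_counter() - t0) / iterations * 1000
```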

Next Actions

Immediate (This Sprint)

  1. Implement Inference Timing:
     • Add timer to prediction endpoint
     • Include timing in STDOUT response
     • Test performance overhead

  2. Add Error Codes:
     • Define error code enum
     • Implement error categorization
     • Update STDOUT response format
     • Document error codes
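
Once the timer and error codes land, the STDOUT response could be serialized as below. The field names are assumptions carried over from the predict() example earlier in this brief, not a final schema:

```python
import json

def build_response(result=None, duration_s=0.0, device="cpu", error_code=0):
    """Serialize the structured STDOUT response (hypothetical schema)."""
    return json.dumps({
        "result": result,
        "inference_time_ms": round(duration_s * 1000, 3),
        "device": device,
        "error_code": error_code,  # 0 = success, see InferenceErrorCode
    })
```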

Next Sprint

  1. Prometheus Integration:
     • Implement metric export endpoint
     • Configure Prometheus scraping
     • Create Grafana dashboards
     • Set up basic alerts

  2. GPU Utilization Tracking:
     • Integrate pynvml for GPU monitoring
     • Export GPU metrics to Prometheus
     • Add GPU utilization to logs

Success Metrics

  • ✅ Device logging operational (CPU vs GPU)
  • ✅ GPU deployment documentation complete
  • ⏳ Inference timing < 5ms overhead from metric collection
  • ⏳ GPU utilization metrics available in Prometheus
  • ⏳ Error rate < 1% for GPU operations
  • ⏳ 100% of inference operations include timing data
  • ⏳ All error codes documented and implemented

Related

  • Issue #1804: Primary tracking issue (User Story 5)
  • GPU Support Epic: Parent epic for GPU implementation work
  • ML Inference Service: Related service requiring monitoring

Timeline

Created: 2025-04-08 Latest Update: 2025-05-20 Estimated Completion: TBD (pending Prometheus integration)


Source: GitHub Issue #1804 Last Synced: 2025-11-24

This feature brief was auto-generated from the GitHub issue. The work is 40% complete, with 2 of 5 technical-approach tasks done. Core device logging and documentation are complete; remaining work focuses on performance metrics and monitoring integration.

Next Steps:

  1. Implement inference timing and error codes (quick wins)
  2. Plan Prometheus integration approach
  3. Test on GPU-enabled GKE cluster
  4. Create monitoring dashboards