GPU Usage Logging, Monitoring, and Documentation¶
Status: 🔄 In Progress (40% - ⅖ tasks complete) Priority: Medium Story Points: 2 Assignee: @chrissena Created: 2025-04-08 Latest Update: 2025-05-20 Epic: GPU Support Implementation
Problem Statement¶
As an operator running ML inference workloads on GKE with GPU support, I need comprehensive logging and monitoring capabilities to verify GPU usage, track performance, and troubleshoot issues.
Current State:
- GPU support implemented but limited observability
- Some device logging exists (CPU vs GPU)
- Documentation for GPU deployment created
- Missing: inference duration metrics, GPU utilization tracking, monitoring integration
Pain Points:
- Difficult to verify if models are actually using GPUs
- No visibility into inference performance metrics
- Limited troubleshooting capabilities when issues occur
- GPU resource utilization not tracked for capacity planning
Proposed Solution¶
Implement comprehensive logging, monitoring, and documentation for GPU usage to provide full observability of ML inference operations.
User Story:
As an operator, I need detailed logging, performance monitoring, and updated documentation so that I can verify GPU usage and troubleshoot any issues.
Technical Approach¶
1. Enhanced Device Logging ✅ COMPLETED:
- Log device placement for each model (CPU vs GPU)
- Record model initialization and device assignment
- Capture device-specific errors and warnings
2. Performance Metrics Logging ⏳ IN PROGRESS:
- Record inference duration for each prediction call
- Track GPU utilization metrics during inference
- Log batch processing performance
- Add timing information to STDOUT responses
3. Monitoring Integration ⏳ PENDING:
- Integrate with Prometheus for metric collection
- Configure Stackdriver (Google Cloud Monitoring) integration
- Create dashboards for GPU performance visualization
- Set up alerts for performance degradation
4. Error Handling Improvements ⏳ PENDING:
- Add structured error codes to STDOUT
- Implement error categorization (GPU-specific vs general)
- Enhance error messages with troubleshooting context
5. Documentation Updates ✅ COMPLETED:
- Document GPU deployment configuration for GKE
- Include pod specifications and environment setup
- Provide troubleshooting guide
- Add performance tuning recommendations
Acceptance Criteria¶
Completed ✅:
- Logs display device used for inference (CPU vs GPU) and capture model placement
- Documentation updated with instructions for deploying container with GPU support on GKE
Remaining ⏳:
- Inference times and GPU utilization metrics are logged and accessible
- Monitoring tools can access and display relevant GPU performance metrics
- Timer added to report inference duration in STDOUT responses
- Error codes added to STDOUT for structured error handling
Technical Notes¶
Architecture¶
ML Inference Service:
HTTP Request → Python API
↓
Model Inference (GPU/CPU)
↓
Performance Logging
↓
STDOUT Response + Metrics
↓
Prometheus Export
Implementation Details¶
Logging Enhancements:
import time
import logging
from contextlib import contextmanager

@contextmanager
def time_inference(model_name: str):
    """Context manager to time inference operations."""
    # get_model_device and inference_duration are defined elsewhere in the service
    timing = {"device": get_model_device(model_name)}
    start_time = time.perf_counter()
    logging.info(f"Starting inference on {timing['device']} for model: {model_name}")
    try:
        yield timing
    finally:
        timing["duration"] = time.perf_counter() - start_time
        logging.info(f"Inference completed in {timing['duration']:.3f}s on {timing['device']}")
        # Export metric to Prometheus
        inference_duration.labels(
            model=model_name,
            device=timing["device"]
        ).observe(timing["duration"])

# Usage in prediction endpoint: the yielded dict carries the duration and
# device back out of the context manager so the response can include them
def predict(data):
    with time_inference("my-model") as timing:
        result = model.predict(data)
    return {
        "result": result,
        "inference_time_ms": timing["duration"] * 1000,
        "device": timing["device"],
        "error_code": 0  # Success
    }
Prometheus Metrics:
from prometheus_client import Summary, Counter, Gauge

# Inference duration summary
inference_duration = Summary(
    'ml_inference_duration_seconds',
    'Time spent processing inference request',
    ['model', 'device']
)

# GPU utilization gauge
gpu_utilization = Gauge(
    'ml_gpu_utilization_percent',
    'GPU utilization percentage',
    ['gpu_id']
)

# Error counter
inference_errors = Counter(
    'ml_inference_errors_total',
    'Total inference errors',
    ['model', 'error_code']
)
Error Codes (to be implemented):
from enum import Enum

class InferenceErrorCode(Enum):
    SUCCESS = 0
    GPU_OOM = 1001             # GPU out of memory
    GPU_NOT_AVAILABLE = 1002   # GPU requested but not available
    MODEL_NOT_LOADED = 2001    # Model not properly loaded
    INVALID_INPUT = 3001       # Invalid input data
    INFERENCE_TIMEOUT = 4001   # Inference timeout
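A sketch of how these codes might be attached to STDOUT responses follows. The exception-to-code mapping is purely illustrative; the actual categorization rules are part of the pending error-handling work:

```python
from enum import Enum

class InferenceErrorCode(Enum):
    SUCCESS = 0
    GPU_OOM = 1001
    GPU_NOT_AVAILABLE = 1002
    MODEL_NOT_LOADED = 2001
    INVALID_INPUT = 3001
    INFERENCE_TIMEOUT = 4001

def classify_error(exc: Exception) -> InferenceErrorCode:
    """Map a raised exception to a structured error code (illustrative rules)."""
    message = str(exc).lower()
    if "out of memory" in message:
        return InferenceErrorCode.GPU_OOM
    if "cuda" in message and "not available" in message:
        return InferenceErrorCode.GPU_NOT_AVAILABLE
    if isinstance(exc, TimeoutError):
        return InferenceErrorCode.INFERENCE_TIMEOUT
    # catch-all for this sketch; a real categorization would be finer-grained
    return InferenceErrorCode.INVALID_INPUT

def error_response(exc: Exception) -> dict:
    """Build the structured STDOUT error payload."""
    code = classify_error(exc)
    return {"result": None, "error_code": code.value, "error": str(exc)}
```

Keeping the mapping in one function makes it easy to unit-test each error scenario listed under Manual Testing.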
Deployment Configuration¶
GKE Pod Spec with GPU:
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
spec:
  containers:
    - name: inference
      image: syrf-ml-inference:latest
      resources:
        limits:
          nvidia.com/gpu: 1  # Request 1 GPU
      env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        - name: PROMETHEUS_PORT
          value: "9090"
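Inside the container, the service can pick up the env vars set in the spec above. A small helper like the following keeps the lookup in one place; the variable names match the pod spec, while the CPU-safe defaults are assumptions:

```python
import os

def read_runtime_config() -> dict:
    """Read the env vars set in the pod spec, with CPU-safe defaults."""
    return {
        # empty string means no GPU is exposed to the process
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", ""),
        "prometheus_port": int(os.environ.get("PROMETHEUS_PORT", "9090")),
    }
```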
Dependencies¶
Python Libraries:
- prometheus_client - metric export
- torch or tensorflow - ML framework with GPU support
- pynvml - NVIDIA GPU monitoring
Infrastructure:
- GKE cluster with GPU node pool
- NVIDIA device plugin for Kubernetes
- Prometheus for metric collection
- Grafana or Cloud Monitoring for visualization
Related Services:
- ML inference API service
- Monitoring infrastructure (Prometheus, Grafana)
- Google Cloud Monitoring (Stackdriver)
Testing Strategy¶
Unit Tests¶
from unittest.mock import patch

def test_inference_timing():
    """Test that inference timing is recorded"""
    with patch('time.perf_counter') as mock_time:
        mock_time.side_effect = [0.0, 0.5]  # 500ms inference
        result = predict(test_data)
    assert result['inference_time_ms'] == 500
    assert result['error_code'] == 0

def test_device_logging():
    """Test that device placement is logged"""
    with patch('logging.info') as mock_log:
        predict(test_data)
    mock_log.assert_any_call(
        'Starting inference on cuda:0 for model: my-model'
    )
Integration Tests¶
- Deploy to GKE test cluster with GPU
- Verify Prometheus metrics are exported
- Check Cloud Monitoring integration
- Validate dashboard displays metrics correctly
- Test alert firing for performance degradation
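The "Prometheus metrics are exported" check above can be automated by scraping `/metrics` and diffing against the expected metric names. A sketch, using the metric names defined earlier in this brief (the localhost URL is an assumption):

```python
from urllib.request import urlopen

EXPECTED_METRICS = [
    "ml_inference_duration_seconds",
    "ml_gpu_utilization_percent",
    "ml_inference_errors_total",
]

def missing_metrics(body: str) -> list:
    """Return expected metric names absent from a /metrics payload."""
    return [name for name in EXPECTED_METRICS if name not in body]

def check_metrics_endpoint(url: str = "http://localhost:9090/metrics") -> list:
    """Scrape the exporter and report any missing metrics."""
    body = urlopen(url, timeout=5).read().decode()
    return missing_metrics(body)
```

Splitting the pure string check from the HTTP fetch lets the comparison logic run in unit tests without a live exporter.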
Manual Testing¶
- Run inference requests with varying batch sizes
- Monitor GPU utilization in real-time
- Verify log output includes all required information
- Test error scenarios (OOM, GPU unavailable, etc.)
- Validate STDOUT error codes
Blockers and Risks¶
Current Blockers¶
- None identified
Risks¶
- 🔶 Prometheus Integration: May require infrastructure changes
- 🔶 Performance Overhead: Metric collection could impact inference latency (target: <5ms overhead)
- 🔶 GPU Availability: Testing requires GPU-enabled GKE nodes
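The <5ms overhead target can be sanity-checked with a micro-benchmark before the timer is wired into the endpoint. A rough sketch (the iteration count is arbitrary, and this measures only the `perf_counter` bookkeeping, not Prometheus export):

```python
import time

def timer_overhead_ms(iterations: int = 10_000) -> float:
    """Estimate per-call cost of a perf_counter-based timing wrapper, in ms."""
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        _ = time.perf_counter() - t0  # the work the wrapper adds per request
    total = time.perf_counter() - start
    return (total / iterations) * 1000.0
```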
Next Actions¶
Immediate (This Sprint)¶
- Implement Inference Timing:
  - Add timer to prediction endpoint
  - Include timing in STDOUT response
  - Test performance overhead
- Add Error Codes:
  - Define error code enum
  - Implement error categorization
  - Update STDOUT response format
  - Document error codes
Next Sprint¶
- Prometheus Integration:
  - Implement metric export endpoint
  - Configure Prometheus scraping
  - Create Grafana dashboards
  - Set up basic alerts
- GPU Utilization Tracking:
  - Integrate pynvml for GPU monitoring
  - Export GPU metrics to Prometheus
  - Add GPU utilization to logs
Success Metrics¶
- ✅ Device logging operational (CPU vs GPU)
- ✅ GPU deployment documentation complete
- ⏳ Inference timing < 5ms overhead from metric collection
- ⏳ GPU utilization metrics available in Prometheus
- ⏳ Error rate < 1% for GPU operations
- ⏳ 100% of inference operations include timing data
- ⏳ All error codes documented and implemented
Related Issues¶
- Issue #1804: Primary tracking issue (User Story 5)
- GPU Support Epic: Parent epic for GPU implementation work
- ML Inference Service: Related service requiring monitoring
Timeline¶
Created: 2025-04-08 Latest Update: 2025-05-20 Estimated Completion: TBD (pending Prometheus integration)
Source: GitHub Issue #1804 Last Synced: 2025-11-24
This feature brief was auto-generated from the GitHub issue. The work is 40% complete with 2 of 5 acceptance criteria done. Core device logging and documentation are complete; remaining work focuses on performance metrics and monitoring integration.
Next Steps:
- Implement inference timing and error codes (quick wins)
- Plan Prometheus integration approach
- Test on GPU-enabled GKE cluster
- Create monitoring dashboards