feat(s3): improve observability for S3 operations and L0 retention #996

Summary
Follow-up observability improvements for #976 (R2 silent deletion failures).
New Prometheus metrics for detecting issues in aggregate:
- litestream_replica_operation_duration_seconds: Histogram to detect slow/throttled operations
- litestream_replica_operation_errors_total: Counter for errors by error code (S3 AccessDenied, etc.)
- litestream_l0_retention_files_total: Gauge showing why L0 files aren't being deleted (eligible, not_compacted, too_recent)

Enhanced logging for debugging specific incidents:
These changes would have helped debug the original #976 issue faster: users could have seen DELETE operations slowing down in the metrics and learned from the logs why files were not being deleted.
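
As an illustration of what the three values of litestream_l0_retention_files_total could represent, here is a minimal Go sketch that sorts L0 files into eligible, not_compacted, and too_recent and tallies them. It is not the code from this PR. The l0File type, the retention window, the order of the checks, and the "status" label name are all assumptions made for the example. Only the three status names come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// l0File is a hypothetical stand-in for the retention-relevant state of an L0 file.
type l0File struct {
	createdAt time.Time
	compacted bool // assumed meaning: the file has already been compacted into a higher level
}

// Status names as listed in the PR description for the gauge.
const (
	statusEligible     = "eligible"
	statusNotCompacted = "not_compacted"
	statusTooRecent    = "too_recent"
)

// classify returns the retention status for one file.
// The order of the checks is an assumption, not taken from the PR.
func classify(f l0File, now time.Time, retention time.Duration) string {
	if !f.compacted {
		return statusNotCompacted
	}
	if now.Sub(f.createdAt) < retention {
		return statusTooRecent
	}
	return statusEligible
}

// countByStatus tallies files per status; each entry would map to one gauge value.
func countByStatus(files []l0File, now time.Time, retention time.Duration) map[string]int {
	counts := map[string]int{statusEligible: 0, statusNotCompacted: 0, statusTooRecent: 0}
	for _, f := range files {
		counts[classify(f, now, retention)]++
	}
	return counts
}

// exampleCounts runs the tally over a small fixed data set with a 5-minute window.
func exampleCounts() map[string]int {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	retention := 5 * time.Minute
	files := []l0File{
		{createdAt: now.Add(-10 * time.Minute), compacted: true},  // eligible
		{createdAt: now.Add(-1 * time.Minute), compacted: true},   // too_recent
		{createdAt: now.Add(-30 * time.Minute), compacted: false}, // not_compacted
		{createdAt: now.Add(-20 * time.Minute), compacted: true},  // eligible
	}
	return countByStatus(files, now, retention)
}

func main() {
	counts := exampleCounts()
	// Print in a Prometheus-like text form; the "status" label name is hypothetical.
	for _, s := range []string{statusEligible, statusNotCompacted, statusTooRecent} {
		fmt.Printf("litestream_l0_retention_files_total{status=%q} %d\n", s, counts[s])
	}
}
```

In a scenario like #976, a growing not_compacted or too_recent count would point at retention rules, while a large eligible count that never drops would point at deletions that are not taking effect.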
Test plan
/metrics endpoint

🤖 Generated with Claude Code