@corylanou
Collaborator

Summary

Follow-up observability improvements for #976 (R2 silent deletion failures).

  • New Prometheus metrics for detecting issues in aggregate:

    • litestream_replica_operation_duration_seconds - Histogram to detect slow/throttled operations
    • litestream_replica_operation_errors_total - Counter for errors by error code (S3 AccessDenied, etc.)
    • litestream_l0_retention_files_total - Gauge showing why L0 files aren't being deleted (eligible, not_compacted, too_recent)
  • Enhanced logging for debugging specific incidents:

    • DeleteLTXFiles: Logs duration_ms per batch, individual error details (key, code, message)
    • EnforceL0RetentionByTime: Logs scan breakdown before deletion (total files, eligible, not_compacted_yet, too_recent)

These changes would have made the original #976 issue faster to debug: users could have seen DELETE operations slowing down in the metrics and learned from the logs why files weren't being deleted.

Test plan

  • Build passes
  • Unit tests pass for L0 retention and DeleteLTXFiles
  • Pre-commit checks pass
  • Manual verification with Prometheus /metrics endpoint

🤖 Generated with Claude Code

Add Prometheus metrics and enhanced logging to help debug issues like
silent deletion failures on Cloudflare R2 (#976).

New metrics:
- litestream_replica_operation_duration_seconds: Operation timing histogram
- litestream_replica_operation_errors_total: Error counter by error code
- litestream_l0_retention_files_total: L0 file status breakdown

Enhanced logging:
- DeleteLTXFiles: Add duration_ms to batch completion, log individual errors
- EnforceL0RetentionByTime: Log scan breakdown showing why files weren't deleted

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@alongwill

FYI: Tried out this build while doing another test for #976. Nothing particular to report, just mentioning so you can see it working 👍

1a. litestream_replica_operation_duration_seconds_bucket
image

1b. litestream_replica_operation_duration_seconds_count
image

1c. litestream_replica_operation_duration_seconds_sum
image

2. litestream_replica_operation_errors_total

  • no data

3. litestream_l0_retention_files_total

sum(rate(litestream_l0_retention_files_total{cluster="...",account="andy-sqlite-test"}[$__rate_interval])) by (status)
image
