These essential CockroachDB metrics let you monitor your CockroachDB Standard cluster. The metrics are available on graphs on the tabs listed in the Metrics tabs column. Where Custom is listed, the metric may be graphed in a Custom Metrics Chart. The Usage column explains why each metric is important to visualize and how to make both practical and actionable use of the metric in a production deployment.
SQL
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
sql.conn.latency | P90, P99 | Latency to establish and authenticate a SQL connection | Connection latency is calculated as the time in nanoseconds between when the cluster receives a connection request and establishes the connection to the client, including authentication. This graph shows the p90 and p99 latencies for SQL connections to the cluster. These metrics characterize the database connection latency, which can affect application performance, for example, through slow application startup times. | SQL |
sql.conns | Connections | Number of open SQL connections | This metric shows the total number of SQL client connections across the cluster. Refer to the Sessions page for more details on sessions. This metric also shows the distribution, or balancing, of connections across the cluster. Review Connection Pooling. | SQL |
sql.ddl.count | DDL Statements | Number of SQL DDL statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort by the Execution Count column. To find problematic sessions, on the Sessions page, sort by the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. | SQL |
sql.delete.count | Delete | Number of SQL DELETE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort by the Execution Count column. To find problematic sessions, on the Sessions page, sort by the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. | Overview, SQL |
sql.distsql.contended_queries.count | Contention | Number of SQL queries that experienced contention | This metric is incremented whenever there is a non-trivial amount of contention experienced by a statement, whether from read-write or write-write conflicts. Monitor this metric to correlate possible workload performance issues with contention conflicts. | SQL |
sql.failure.count | Errors | Number of statements resulting in a planning or runtime error | This metric is a high-level indicator of workload and application degradation due to query failures. Use the Insights page to find failed executions with their error code to troubleshoot, or use application-level logs, if instrumented, to determine the cause of error. | SQL |
sql.full.scan.count | Full scans | Number of full table or index scans | This metric is a high-level indicator of potentially suboptimal query plans in the workload that may require index tuning and maintenance. To identify the statements with a full table scan, use `SHOW FULL TABLE SCANS` or the SQL Activity Statements page with the corresponding metric time frame. The Statements page also includes explain plans and index recommendations. You can also disallow full table scans with the `disallow_full_table_scans` session setting (see the example following this table). Not all full scans are necessarily bad, especially over smaller tables. | SQL |
sql.insert.count | Insert | Number of SQL INSERT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort by the Execution Count column. To find problematic sessions, on the Sessions page, sort by the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. | Overview, SQL |
sql.new_conns | Connections Per Second | Number of SQL connections created | The rate of this metric shows how frequently new connections are being established. This can be useful in determining if a high rate of incoming new connections is causing additional load on the server due to a misconfigured application. | Overview, SQL |
sql.select.count | Select | Number of SQL SELECT statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort by the Execution Count column. To find problematic sessions, on the Sessions page, sort by the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. | Overview, SQL |
sql.service.latency | P90, P99, P99.9, P99.99 | Latency of SQL request execution | These high-level metrics reflect workload performance. Monitor these metrics to understand latency over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. The Statements page has P90 latency and P99 latency columns to enable correlation with this metric. | Overview, SQL |
sql.statements.active | Active Statements | Number of currently active user SQL statements | This high-level metric reflects workload volume. | SQL |
sql.txn.abort.count | Aborts | Number of SQL transaction abort errors | This high-level metric reflects workload performance. A persistently high number of SQL transaction abort errors may negatively impact workload performance and should be investigated. | SQL |
sql.txn.begin.count | Begin | Number of SQL transaction BEGIN statements successfully executed | This metric reflects workload volume by counting explicit transactions. Use this metric to determine whether explicit transactions can be refactored as implicit transactions (individual statements). | SQL |
sql.txn.commit.count | Commits | Number of SQL transaction COMMIT statements successfully executed | This metric shows the number of transactions that completed successfully. It can be used as a proxy to measure the number of successful explicit transactions. | SQL |
sql.txn.latency | P90, P99 | Latency of SQL transactions | These metrics show the time within which the cluster executed 90% or 99% of transactions over the last minute. This time does not include network latency between the cluster and the client. These metrics provide an overview of the current SQL workload. | SQL |
sql.txn.rollback.count | Rollbacks | Number of SQL transaction ROLLBACK statements successfully executed | This metric shows the number of orderly transaction rollbacks. A persistently high number of rollbacks may negatively impact workload performance and should be investigated. | SQL |
sql.txns.open | Open Transactions | Number of currently open user SQL transactions | This metric should roughly correspond to the number of cores multiplied by 4. If this metric is consistently larger, scale out the cluster. | SQL |
sql.update.count | Update | Number of SQL UPDATE statements successfully executed | This high-level metric reflects workload volume. Monitor this metric to identify abnormal application behavior or patterns over time. If abnormal patterns emerge, apply the metric's time range to the SQL Activity pages to investigate interesting outliers or patterns. For example, on the Transactions page and the Statements page, sort by the Execution Count column. To find problematic sessions, on the Sessions page, sort by the Transaction Count column. Find the sessions with high transaction counts and trace back to a user or application. | Overview, SQL |
txn.restarts.asyncwritefailure | Async Consensus Failure | Number of restarts due to async consensus writes that failed to leave intents | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.readwithinuncertainty | Read Within Uncertainty Interval | Number of restarts due to reading a new value within the uncertainty interval | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.serializable | Forwarded Timestamp | Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.txnaborted | Aborted | Number of restarts due to an abort by a concurrent transaction (usually due to deadlock) | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.txnpush | Push Failure | Number of restarts due to a transaction push failure | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.unknown | Unknown | Number of restarts due to unknown reasons | The errors tracked by this metric are generally due to deadlocks. Deadlocks can often be prevented with a considered transaction design. Identify the conflicting transactions involved in the deadlocks. Then, if possible, redesign the business logic implementation prone to deadlocks. | SQL |
txn.restarts.writetooold | Write Too Old | Number of restarts due to a concurrent writer committing first | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
txn.restarts.writetoooldmulti | Write Too Old (multiple) | Number of restarts due to multiple concurrent writers committing first | This metric is one measure of the impact of contention conflicts on workload performance. For guidance on contention conflicts, review transaction contention best practices and performance tuning recipes. Tens of restarts per minute may be a high value and signal an elevated degree of contention in the workload, which should be investigated. For the specific error, refer to the transaction retry error reference for more details. | SQL |
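
The full-scan guidance above can be acted on directly from a SQL shell. The following is a minimal sketch using the `SHOW FULL TABLE SCANS` statement and the `disallow_full_table_scans` session setting referenced in the table; output columns and exact behavior may vary by CockroachDB version.

```sql
-- List recent queries that performed a full table scan:
SHOW FULL TABLE SCANS;

-- Optionally reject full table scans for the current session;
-- statements that would require one then return an error instead:
SET disallow_full_table_scans = true;
```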
Request Units
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
tenant.consumption.cross_region_network_ru | Network traffic | Total number of RUs charged for cross-region network traffic | The number of RUs consumed by cross-region networking. Correlate this metric with Request Units (RUs). For more information about multi-region clusters and networking, refer to Multi-region clusters. | Request Units |
tenant.consumption.external_io_egress_bytes | Bulk I/O operations | Total number of bytes written to external services such as cloud storage providers | The number of RUs consumed by byte traffic for cluster bulk I/O operations (e.g., CDC). Egress bytes are converted to Request Units using this equivalency: 1 RU = 1 KiB network egress. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.pgwire_egress_bytes | Client traffic | Total number of bytes transferred from a SQL pod to the client | The number of RUs consumed by byte traffic to the client. Egress bytes are converted to Request Units using this equivalency: 1 RU = 1 KiB network egress. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.read_batches | Batches | Total number of KV read batches | The number of RUs consumed by KV storage layer reads, broken down by batches. SQL statements are translated into lower-level KV read requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 2 storage read batches. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.read_bytes | Bytes | Total number of bytes read from KV | The number of RUs consumed by KV storage layer reads, broken down by bytes. SQL statements are translated into lower-level KV read requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 64 KiB read request payload (prorated). Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.read_requests | Requests | Total number of KV read requests | The number of RUs consumed by KV storage layer reads, broken down by requests. SQL statements are translated into lower-level KV read requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 8 storage read requests. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.request_units | RU, Average RUs | Total RU consumption | The CPU and I/O resources being used by queries in the cluster. Simple queries consume few RUs, while complicated queries with many reads and writes consume more RUs. To learn more about how RUs are calculated, refer to Resource Usage. A worked example of the RU equivalencies follows this table. | Overview, Request Units |
tenant.consumption.sql_pods_cpu_seconds | Total amount of CPU used by SQL pods | Total amount of CPU used by SQL pods | The number of RUs consumed by CPU usage of SQL processes (not storage processes). CPU seconds are converted to Request Units using this equivalency: 1 RU = 3 milliseconds SQL CPU. Correlate this metric with Request Units (RUs) and determine if your workload is CPU-intensive. To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.write_batches | Batches | Total number of KV write batches | The number of RUs consumed by KV writes, broken down by batches. SQL statements are translated into lower-level KV write requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Write operations are replicated to multiple storage processes (3 by default), with each replica counted as a separate write operation. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 1 storage write batch. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.write_bytes | Bytes | Total number of bytes written to KV | The number of RUs consumed by KV writes, broken down by bytes. SQL statements are translated into lower-level KV write requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Write operations are replicated to multiple storage processes (3 by default), with each replica counted as a separate write operation. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 1 KiB write request payload (prorated). Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
tenant.consumption.write_requests | Requests | Total number of KV write requests | The number of RUs consumed by KV writes, broken down by requests. SQL statements are translated into lower-level KV write requests that are sent in batches. Batches may contain any number of requests. Requests can have a payload containing any number of bytes. Write operations are replicated to multiple storage processes (3 by default), with each replica counted as a separate write operation. Storage layer I/O is converted to Request Units using this equivalency: 1 RU = 1 storage write request. Correlate this metric with Request Units (RUs). To learn more about how RUs are calculated, refer to Resource Usage. | Request Units |
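
The equivalencies listed above can be combined to estimate the RU cost of a statement. The following is a hypothetical worked example expressed as simple SQL arithmetic; the quantities are made up for illustration, and the actual charge for a real statement combines components like these along with any cross-region networking.

```sql
-- Hypothetical illustration of the RU equivalencies in the table above.
-- Assume a statement used 6 ms of SQL CPU, issued 2 read batches containing
-- 8 read requests with 64 KiB of payload, and returned 2 KiB to the client.
SELECT
  6.0  / 3   AS cpu_rus,          -- 1 RU = 3 ms SQL CPU            -> 2 RUs
  2.0  / 2   AS read_batch_rus,   -- 1 RU = 2 storage read batches  -> 1 RU
  8.0  / 8   AS read_request_rus, -- 1 RU = 8 storage read requests -> 1 RU
  64.0 / 64  AS read_byte_rus,    -- 1 RU = 64 KiB read payload     -> 1 RU
  2.0  / 1   AS egress_rus;       -- 1 RU = 1 KiB network egress    -> 2 RUs
```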
Storage
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
livebytes | Storage usage | Number of bytes of live data (keys plus values) | The amount of data being stored in the cluster. In the Storage Used graph, this is the logical number of live bytes and does not account for compression or replication. | Overview |
Table Statistics
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
jobs.auto_create_stats.currently_paused | Auto Create Statistics Paused | Number of auto_create_stats jobs currently considered Paused | This metric is a high-level indicator that automatically generated statistics jobs are paused, which can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance (see the sketch following this table). | SQL, Custom |
jobs.auto_create_stats.currently_running | Auto Create Statistics Running | Number of auto_create_stats jobs currently running in Resume or OnFailOrCancel state | This metric tracks the number of active automatically generated statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics. | SQL, Custom |
jobs.auto_create_stats.resume_failed | Auto Create Statistics Failed | Number of auto_create_stats jobs which failed with a non-retriable error | This metric is a high-level indicator that automatic generation of table statistics is failing. Failed statistics creation can lead to the query optimizer running with stale statistics. Stale statistics can cause suboptimal query plans to be selected, leading to poor query performance. | SQL, Custom |
jobs.create_stats.currently_running | Create Statistics Running | Number of create_stats jobs currently running in Resume or OnFailOrCancel state | This metric tracks the number of active create statistics jobs that may be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating this metric with SQL latency and query volume metrics. | SQL, Custom |
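
A minimal sketch for investigating the statistics jobs tracked by these metrics. `SHOW AUTOMATIC JOBS` and the `sql.stats.automatic_collection.enabled` cluster setting are standard CockroachDB SQL, but treat the filter values as assumptions to verify against your version.

```sql
-- Confirm that automatic statistics collection is enabled:
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;

-- Look for paused or failed automatic statistics jobs
-- (job_type and status strings are assumptions; verify against your version):
SELECT job_id, status, description
FROM [SHOW AUTOMATIC JOBS]
WHERE job_type = 'AUTO CREATE STATS' AND status IN ('paused', 'failed');
```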
Changefeeds
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
changefeed.commit_latency | P99, P90 | Event commit latency: the difference between an event's MVCC timestamp and the time it is acknowledged by the downstream sink. If the sink batches events, the difference between the oldest event in the batch and the acknowledgement is recorded. Excludes latency during backfill. | This metric provides useful context when assessing the state of changefeeds. It characterizes the end-to-end lag between a committed change and that change being applied at the destination. | Changefeeds, Custom |
changefeed.emitted_bytes | Emitted bytes | Bytes emitted by all feeds | This metric provides useful context when assessing the state of changefeeds. It characterizes the byte throughput being streamed from the CockroachDB cluster. | Changefeeds, Custom |
changefeed.emitted_messages | Emitted messages | Messages emitted by all feeds | This metric provides useful context when assessing the state of changefeeds. It characterizes the rate of changes being streamed from the CockroachDB cluster. | Changefeeds, Custom |
changefeed.error_retries | Errors | Total retryable errors encountered by all changefeeds | This metric tracks transient changefeed errors. Alert on "too many" errors, such as 50 retries in 15 minutes. For example, during a rolling upgrade this counter will increase because the changefeed jobs will restart following node restarts. There is an exponential backoff, up to 10 minutes. If there is no rolling upgrade in progress or other cluster maintenance, and the error rate is high, investigate the changefeed job. | Changefeeds, Custom |
changefeed.failures | Failures | Total number of changefeed jobs which have failed | This metric tracks permanent changefeed job failures that the jobs system will not try to restart. Any increase in this counter should be investigated. An alert on this metric is recommended. | Changefeeds, Custom |
changefeed.running | Running | Number of currently running changefeeds, including sinkless | This metric tracks the total number of all running changefeeds. | Changefeeds, Custom |
jobs.changefeed.currently_paused | Paused | Number of changefeed jobs currently considered Paused | Monitor and alert on this metric to safeguard against an inadvertent operational error of leaving a changefeed job in a paused state for an extended period of time. Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection. | Changefeeds, Custom |
jobs.changefeed.protected_age_sec | Protected Timestamp Age | The age of the oldest PTS record protected by changefeed jobs | Changefeeds use protected timestamps to protect the data from being garbage collected. Ensure the protected timestamp age does not significantly exceed the GC TTL zone configuration. Alert on this metric if the protected timestamp age is greater than 3 times the GC TTL (see the sketch following this table). | Changefeeds, Custom |
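
A minimal sketch of the checks suggested above: comparing the protected timestamp age against the GC TTL (the alert threshold is 3 times the GC TTL) and finding changefeeds that were left paused. The statements are standard CockroachDB SQL; the database name is a placeholder, and availability of zone configuration output may vary by cluster type.

```sql
-- Inspect the GC TTL (gc.ttlseconds) to compare against the
-- protected timestamp age alert threshold (3x the GC TTL):
SHOW ZONE CONFIGURATION FROM DATABASE defaultdb;

-- Find changefeed jobs that were left paused:
SELECT job_id, status, error
FROM [SHOW CHANGEFEED JOBS]
WHERE status = 'paused';
```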
Row-Level TTL
CockroachDB Metric Name | Short Name | Description | Usage | Metrics tab |
--- | --- | --- | --- | --- |
jobs.row_level_ttl.currently_paused | Paused | Number of row_level_ttl jobs currently considered Paused | Monitor this metric to ensure the Row-Level TTL job does not remain paused inadvertently for an extended period (see the sketch following this table). | Row-Level TTL, Custom |
jobs.row_level_ttl.currently_running | Running | Number of row_level_ttl jobs currently running in Resume or OnFailOrCancel state | Monitor this metric to ensure there are not too many Row-Level TTL jobs running at the same time. Generally, this metric should be in the low single digits. | Row-Level TTL, Custom |
jobs.row_level_ttl.delete_duration | Delete Latency (P90), Delete Latency (P99) | Duration for delete requests during row level TTL. | See Description. | Row-Level TTL |
jobs.row_level_ttl.resume_completed | Resume Completed | Number of row_level_ttl jobs which successfully resumed to completion | If Row-Level TTL is enabled, this metric should be nonzero and correspond to the `ttl_cron` setting that was chosen. If this metric is zero, it means the job is not running. | Row-Level TTL, Custom |
jobs.row_level_ttl.resume_failed | Resume Failed | Number of row_level_ttl jobs which failed with a non-retriable error | This metric should remain at zero. Repeated errors mean the Row-Level TTL job is not deleting data. | Row-Level TTL, Custom |
jobs.row_level_ttl.rows_deleted | Rows deleted | Number of rows deleted by the row level TTL job. | Correlate this metric with the metric `jobs.row_level_ttl.rows_selected` to ensure all the rows that should be deleted are actually getting deleted. | Row-Level TTL, Custom |
jobs.row_level_ttl.rows_selected | Rows selected | Number of rows selected for deletion by the row level TTL job. | Correlate this metric with the metric `jobs.row_level_ttl.rows_deleted` to ensure all the rows that should be deleted are actually getting deleted. | Row-Level TTL, Custom |
jobs.row_level_ttl.select_duration | Select Latency (P90), Select Latency (P99) | Duration for select requests during row level TTL. | See Description. | Row-Level TTL |
jobs.row_level_ttl.total_expired_rows | Expired Rows | Approximate number of rows that have expired the TTL on the TTL table. | The approximate number of rows that have expired the TTL on all TTL tables. | Row-Level TTL |
jobs.row_level_ttl.total_rows | Total Rows | Approximate number of rows on the TTL table. | The approximate number of rows on all TTL tables. | Row-Level TTL |
schedules.scheduled-row-level-ttl-executor.failed | Failed Schedules | Number of scheduled-row-level-ttl-executor jobs that failed | Monitor this metric to ensure the Row-Level TTL job is running. If it is non-zero, it means the job could not be created. | Row-Level TTL, Custom |
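
A minimal sketch related to the Row-Level TTL metrics above: enabling TTL on a table and checking for paused or failed TTL jobs. The table name and interval are placeholders, and the job_type string is an assumption to verify against your version.

```sql
-- Enable row-level TTL on a table (placeholder table name and interval):
ALTER TABLE events SET (ttl_expire_after = '30 days');

-- Check for Row-Level TTL jobs that are paused or failed
-- (job_type value is an assumption; verify against your version):
SELECT job_id, status, description
FROM [SHOW JOBS]
WHERE job_type = 'ROW LEVEL TTL' AND status IN ('paused', 'failed');
```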