Monitor Replication Lag in OCI Using SQL and CLI

Replication lag is one of the most important metrics for Cloud Ops DBAs managing Oracle Data Guard or other replication mechanisms in Oracle Cloud Infrastructure (OCI). Lag occurs when redo logs are not shipped or applied to the standby database in time, leading to potential data loss during failover. Monitoring lag helps ensure your standby database is always in sync with the primary.

This blog explains how to monitor replication lag using both SQL queries inside the database and OCI CLI commands at the cloud infrastructure level.

Understanding Replication Lag

There are two main types of lag to monitor in Data Guard:

  1. Transport lag: Delay in shipping redo logs from the primary to the standby database.
  2. Apply lag: Delay in applying redo logs on the standby database after they are received.

Both lags must be monitored to ensure high availability and disaster recovery readiness.

Checking Replication Lag Using SQL on Primary

On the primary database, use the v$dataguard_stats view to check lag.

SET LINESIZE 200
COL name FORMAT A20
COL value FORMAT A15
COL unit FORMAT A15

SELECT name, value, unit
FROM   v$dataguard_stats
WHERE  name IN ('transport lag','apply lag');

Sample output

NAME            VALUE          UNIT
--------------- -------------- ----------------
transport lag   +00 00:00:05   day(2) to second(0) interval
apply lag       +00 00:00:09   day(2) to second(0) interval

Interpretation: Transport lag is 5 seconds and apply lag is 9 seconds, which is normal for asynchronous replication.
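For scripted checks, the INTERVAL DAY TO SECOND strings above usually need converting to plain seconds. A minimal shell sketch, assuming the `+DD HH:MI:SS` layout shown in the sample output:

```shell
#!/bin/sh
# Convert a Data Guard lag interval string such as '+00 00:00:05'
# (+DD HH:MI:SS) into total seconds for threshold comparisons.
lag_to_seconds() {
  # Turn the '+' and ':' separators into spaces, then let awk
  # do the base-10 arithmetic (awk treats '05' as 5, avoiding
  # any octal pitfalls with shell arithmetic).
  printf '%s\n' "$1" | tr '+:' '  ' |
    awk '{ print $1*86400 + $2*3600 + $3*60 + $4 }'
}

lag_to_seconds '+00 00:00:05'   # -> 5
lag_to_seconds '+00 00:01:05'   # -> 65
```

The resulting number can be compared directly against an alert threshold in seconds.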

Checking Replication Lag on Standby

On the standby database, check Managed Recovery Process (MRP) progress and apply status. (Note: v$managed_standby is deprecated from Oracle 12.2 onward in favor of v$dataguard_process; the query below still works on current releases.)

SET LINESIZE 200
COL process FORMAT A10
COL status FORMAT A12
COL thread# FORMAT 999
COL sequence# FORMAT 999999

SELECT process, status, thread#, sequence#
FROM   v$managed_standby
WHERE  process LIKE 'MRP%';

Sample output

PROCESS   STATUS       THREAD# SEQUENCE#
--------- ------------ ------- ---------
MRP0      APPLYING           1     14523

Interpretation: The MRP process is applying redo logs, confirming replication is active.
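For unattended monitoring it helps to reduce the MRP status to a pass/fail signal. A small shell sketch; the set of "healthy" statuses chosen here (APPLYING, WAIT_FOR_LOG) is an assumption you may want to widen for your environment:

```shell
#!/bin/sh
# Classify an MRP status string from v$managed_standby.
# APPLYING and WAIT_FOR_LOG indicate normal operation; anything
# else (including an empty string, i.e. no MRP row) is unhealthy.
mrp_healthy() {
  case "$1" in
    APPLYING|WAIT_FOR_LOG) return 0 ;;
    *)                     return 1 ;;
  esac
}

if mrp_healthy "APPLYING"; then
  echo "redo apply is running"
else
  echo "ALERT: managed recovery is not applying redo"
fi
```

In practice the status string would be captured from the sqlplus output of the query above before being classified.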

Monitoring Replication Lag with OCI CLI

OCI Monitoring provides metrics to track Data Guard lag. You can fetch transport and apply lag using the CLI:

oci monitoring metric-data summarize-metrics-data \
--compartment-id ocid1.compartment.oc1..aaaaexample \
--namespace oci_database \
--query-text "DataGuardTransportLag[1m].mean(), DataGuardApplyLag[1m].mean()" \
--start-time 2025-09-05T14:00:00Z \
--end-time 2025-09-05T14:10:00Z

Sample output

{
  "data": [
    {
      "aggregated-datapoints": [
        {
          "timestamp": "2025-09-05T14:05:00+00:00",
          "DataGuardTransportLag[1m].mean()": 3,
          "DataGuardApplyLag[1m].mean()": 7
        }
      ]
    }
  ]
}

Interpretation: Transport lag is 3 seconds and apply lag is 7 seconds, which is consistent with the SQL-based check.
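To feed these numbers into a script, you can pull the latest datapoint out of the CLI's JSON. A sketch using sed so there is no jq dependency (jq is cleaner if it is installed); the key names are taken from the sample output above:

```shell
#!/bin/sh
# Extract the apply-lag value from summarize-metrics-data JSON.
# Only the key/value layout shown in the sample output is assumed.
json='{
  "data": [
    {
      "aggregated-datapoints": [
        {
          "timestamp": "2025-09-05T14:05:00+00:00",
          "DataGuardTransportLag[1m].mean()": 3,
          "DataGuardApplyLag[1m].mean()": 7
        }
      ]
    }
  ]
}'

apply_lag=$(printf '%s\n' "$json" |
  sed -n 's/.*"DataGuardApplyLag\[1m\]\.mean()": *\([0-9][0-9]*\).*/\1/p')
echo "apply lag: ${apply_lag}s"   # -> apply lag: 7s
```

In a live script, `$json` would be the captured output of the `oci monitoring` command rather than a literal.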

Cross-Verification with Archive Logs

You can also verify by comparing the last archive log applied on standby with the last archive log generated on primary.

On primary:

SELECT MAX(sequence#) AS last_primary_seq
FROM   v$archived_log;

On standby:

SELECT MAX(sequence#) AS last_standby_seq
FROM   v$archived_log
WHERE  applied = 'YES';

Sample output on primary

LAST_PRIMARY_SEQ
----------------
14525

Sample output on standby

LAST_STANDBY_SEQ
----------------
14523

Interpretation: Standby has applied up to sequence 14523, while primary is at 14525, so standby is 2 logs behind.
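The same comparison can be automated: compute the gap and alert when the standby falls too far behind. A sketch with the sample sequence numbers hard-coded (in practice they come from the two queries above); the threshold is an arbitrary example:

```shell
#!/bin/sh
# Compare the latest sequence numbers from primary and standby
# and flag a gap above a chosen threshold.
last_primary_seq=14525
last_standby_seq=14523
max_gap=5   # example threshold: alert if more than 5 logs behind

gap=$(( last_primary_seq - last_standby_seq ))
echo "standby is ${gap} archive log(s) behind"   # -> standby is 2 archive log(s) behind

if [ "$gap" -gt "$max_gap" ]; then
  echo "ALERT: archive log gap exceeds ${max_gap}"
fi
```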

Best Practices for Monitoring Replication Lag

  1. Monitor both transport and apply lag continuously.
  2. Use OCI CLI metrics for integration with monitoring dashboards.
  3. Cross-check with SQL queries for real-time confirmation.
  4. Set alert thresholds (for example, lag > 60 seconds) to notify DBAs.
  5. Regularly review network bandwidth and standby performance to minimize lag.
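Practice 4 can be sketched as a simple threshold check; the 60-second threshold and the alert action are examples to adapt (wire the alert into mailx, Slack, or PagerDuty in a real deployment):

```shell
#!/bin/sh
# Threshold check for a lag value already converted to seconds.
# THRESHOLD and the alert action are illustrative placeholders.
THRESHOLD=60

check_lag() {
  name="$1"; seconds="$2"
  if [ "$seconds" -gt "$THRESHOLD" ]; then
    echo "ALERT: ${name} is ${seconds}s (threshold ${THRESHOLD}s)"
    return 1   # nonzero return signals an alert condition
  fi
  echo "OK: ${name} is ${seconds}s"
  return 0
}

check_lag "apply lag" 9              # -> OK: apply lag is 9s
check_lag "transport lag" 75 || true # -> ALERT: transport lag is 75s (threshold 60s)
```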

Replication lag monitoring in OCI is a key responsibility for Cloud Ops DBAs. By combining SQL queries (v$dataguard_stats, v$managed_standby, archive log comparisons) with OCI CLI metrics, DBAs can detect issues early and ensure their standby database is always in sync with the primary. Proactive monitoring guarantees readiness for switchover or failover events and safeguards against data loss.

Below is a Bash script that combines OCI CLI metric checks and in-database SQL queries to monitor Data Guard replication lag (transport lag and apply lag). The script logs results, creates a concise report, and emails it to a DBA distribution list. It is designed to be scheduled via cron and is production-ready once you replace the placeholders and apply secure credential management.

Save the script to a file (for example /opt/dbadmin/replication_lag_monitor.sh), edit configuration variables, secure it, test, then schedule via cron.

#!/bin/bash
#
# replication_lag_monitor.sh
# Purpose: Check Data Guard transport/apply lag using OCI CLI and SQL*Plus,
#          create a report, and email the DBA team.
# Requirements:
# - OCI CLI installed and configured for a profile with Monitoring read permissions
# - sqlplus available and reachable to both primary and standby DBs (use wallets/TNS)
# - mailx or sendmail configured for outgoing mail
# - coreutils timeout installed (optional but recommended)
#
# Example cron (run daily at 07:00 UTC):
# 0 7 * * * /opt/dbadmin/replication_lag_monitor.sh >/dev/null 2>&1

set -o errexit
set -o nounset
set -o pipefail

# -------------------------
# Configuration - EDIT THESE
# -------------------------
COMPARTMENT_OCID="ocid1.compartment.oc1..aaaaaaaaexample"
# primary database OCID used for OCI metric queries
PRIMARY_DB_OCID="ocid1.database.oc1.ap-mumbai-1.abcdexample"

# SQL*Plus connection strings - use wallet/TNS or OS-auth where possible
# Recommended: use wallet and TNS_ALIAS to avoid plaintext passwords
SQLPLUS_PRIMARY_CONN="dbadmin/YourSecurePassword@//primary-host:1521/PRODDB"
SQLPLUS_STANDBY_CONN="dbadmin/YourSecurePassword@//standby-host:1521/STANDBYDB"

# Email settings
MAIL_TO="dba-team@example.com"
MAIL_FROM="oci-monitor@example.com"
MAIL_SUBJECT="Data Guard Replication Lag Report - $(date -u '+%Y-%m-%d %H:%M UTC')"

# Working directory and files
WORKDIR="/var/tmp/replication_lag_monitor"
mkdir -p "${WORKDIR}"
REPORT_FILE="${WORKDIR}/replication_lag_report_$(date -u +%Y%m%d).txt"
OCI_ERR_LOG="${WORKDIR}/oci_errors.log"
SQL_ERR_LOG="${WORKDIR}/sql_errors.log"
> "${REPORT_FILE}"
> "${OCI_ERR_LOG}" || true
> "${SQL_ERR_LOG}" || true

# Time window for OCI metric queries (ISO8601 UTC)
# Note: 'date -d' is GNU coreutils syntax; on BSD/macOS use: date -u -v-10M
END_TIME="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
START_TIME="$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)"

# Timeouts
OCI_TIMEOUT=30   # seconds for each CLI call
SQLPLUS_TIMEOUT=60

# Helper: log with timestamp
timestamp() { date -u '+%Y-%m-%d %H:%M:%S UTC'; }
log() { echo "$(timestamp) - $*" | tee -a "${REPORT_FILE}"; }

# Wrapper for OCI CLI with timeout and error capture
run_oci() {
  local cmd="$*"
  echo "$(timestamp) - OCI CLI: ${cmd}" | tee -a "${REPORT_FILE}"
  if command -v timeout >/dev/null 2>&1; then
    timeout ${OCI_TIMEOUT} bash -c "${cmd}" >>"${REPORT_FILE}" 2>>"${OCI_ERR_LOG}" || echo "OCI CLI command failed or timed out; see ${OCI_ERR_LOG}" | tee -a "${REPORT_FILE}"
  else
    bash -c "${cmd}" >>"${REPORT_FILE}" 2>>"${OCI_ERR_LOG}" || echo "OCI CLI command failed; see ${OCI_ERR_LOG}" | tee -a "${REPORT_FILE}"
  fi
  echo >> "${REPORT_FILE}"
}

# Wrapper to run SQL via sqlplus safely; reads the SQL block from stdin
# (the caller supplies it as a heredoc), writes output to the report,
# and captures SQL errors
run_sqlplus() {
  local conn="$1"
  local sql_text
  sql_text="$(cat)"   # capture the caller's heredoc so it reaches sqlplus
  echo "$(timestamp) - SQL*Plus (${conn%%/*}@...):" | tee -a "${REPORT_FILE}"
  if command -v timeout >/dev/null 2>&1; then
    timeout ${SQLPLUS_TIMEOUT} sqlplus -s "${conn}" <<SQL >>"${REPORT_FILE}" 2>>"${SQL_ERR_LOG}" || echo "SQL*Plus command failed or timed out; see ${SQL_ERR_LOG}" | tee -a "${REPORT_FILE}"
WHENEVER SQLERROR CONTINUE
SET LINESIZE 200
SET PAGESIZE 100
SET TRIMOUT ON
SET TRIMSPOOL ON
${sql_text}
EXIT
SQL
  else
    sqlplus -s "${conn}" <<SQL >>"${REPORT_FILE}" 2>>"${SQL_ERR_LOG}" || echo "SQL*Plus command failed; see ${SQL_ERR_LOG}" | tee -a "${REPORT_FILE}"
WHENEVER SQLERROR CONTINUE
SET LINESIZE 200
SET PAGESIZE 100
SET TRIMOUT ON
SET TRIMSPOOL ON
${sql_text}
EXIT
SQL
  fi
  echo >> "${REPORT_FILE}"
}

# Start report
echo "Data Guard Replication Lag Report" > "${REPORT_FILE}"
echo "Generated: $(timestamp)" >> "${REPORT_FILE}"
echo >> "${REPORT_FILE}"

# Step 1: OCI - fetch Data Guard transport/apply lag metrics
log "Step 1: Fetch Data Guard Transport and Apply Lag metrics from OCI Monitoring (last 10 minutes)"
run_oci "oci monitoring metric-data summarize-metrics-data \
  --compartment-id ${COMPARTMENT_OCID} \
  --namespace oci_database \
  --query-text \"DataGuardTransportLag[1m].mean(),DataGuardApplyLag[1m].mean()\" \
  --start-time ${START_TIME} \
  --end-time ${END_TIME} \
  --output json"

# Step 2: SQL - check v$dataguard_stats on primary
log "Step 2: Query v\$dataguard_stats on PRIMARY to get transport/apply lag"
run_sqlplus "${SQLPLUS_PRIMARY_CONN}" <<'SQLBLOCK'
COL name FORMAT A20
COL value FORMAT A15
COL unit FORMAT A15
SELECT name, value, unit
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
SQLBLOCK

# Step 3: SQL - check v$managed_standby on standby for MRP status
log "Step 3: Check Managed Recovery Process on STANDBY (v\$managed_standby)"
run_sqlplus "${SQLPLUS_STANDBY_CONN}" <<'SQLBLOCK'
COL process FORMAT A10
COL status FORMAT A12
SELECT process, status, thread#, sequence#
FROM   v$managed_standby
WHERE  process LIKE 'MRP%';
SQLBLOCK

# Step 4: SQL - compare last archived sequence on primary vs standby
log "Step 4: Compare last archived sequence (primary vs standby)"
run_sqlplus "${SQLPLUS_PRIMARY_CONN}" <<'SQLBLOCK'
SET LINESIZE 200
SELECT 'PRIMARY' role, MAX(sequence#) AS last_primary_seq
FROM   v$archived_log
WHERE  completion_time IS NOT NULL;
SQLBLOCK

run_sqlplus "${SQLPLUS_STANDBY_CONN}" <<'SQLBLOCK'
SET LINESIZE 200
SELECT 'STANDBY' role, MAX(sequence#) AS last_standby_seq
FROM   v$archived_log
WHERE  applied = 'YES';
SQLBLOCK

# Step 5: SQL - show last few archive_dest_status rows for errors
log "Step 5: Review v\$archive_dest_status for errors/exceptions (dest_id > 1)"
run_sqlplus "${SQLPLUS_PRIMARY_CONN}" <<'SQLBLOCK'
COL dest_id FORMAT 999
COL status FORMAT A15
COL synchronization_status FORMAT A20
COL error FORMAT A80
SELECT dest_id, status, protection_mode, synchronization_status, error
FROM   v$archive_dest_status
WHERE  dest_id > 1
ORDER BY dest_id;
SQLBLOCK

# Step 6: Quick health checks - DB roles
log "Step 6: Database role and open mode on primary and standby"
run_sqlplus "${SQLPLUS_PRIMARY_CONN}" <<'SQLBLOCK'
SELECT 'PRIMARY' role_indicator, database_role, open_mode, protection_mode FROM v$database;
SQLBLOCK

run_sqlplus "${SQLPLUS_STANDBY_CONN}" <<'SQLBLOCK'
SELECT 'STANDBY' role_indicator, database_role, open_mode, protection_mode FROM v$database;
SQLBLOCK

# Step 7: Quick thresholds - basic pass/fail heuristic (optional)
# Parse small values from v$dataguard_stats via SQL to compute numeric lag seconds
# We attempt to extract apply lag in seconds using a SQL expression; output used for quick alerting
log "Step 7: Quick numeric lag extraction (apply lag seconds) from PRIMARY (if available)"
run_sqlplus "${SQLPLUS_PRIMARY_CONN}" <<'SQLBLOCK'
SET FEEDBACK OFF
SET HEADING OFF
-- Convert the INTERVAL DAY TO SECOND string to total seconds
SELECT EXTRACT(DAY    FROM TO_DSINTERVAL(value)) * 86400
     + EXTRACT(HOUR   FROM TO_DSINTERVAL(value)) * 3600
     + EXTRACT(MINUTE FROM TO_DSINTERVAL(value)) * 60
     + EXTRACT(SECOND FROM TO_DSINTERVAL(value)) AS apply_lag_seconds
FROM   v$dataguard_stats
WHERE  name = 'apply lag';
SET FEEDBACK ON
SQLBLOCK

# Step 8: Consolidate logs and errors, prepare email body
EMAIL_BODY="${WORKDIR}/replication_lag_email_$(date -u +%Y%m%d).txt"
{
  echo "Data Guard Replication Lag Report"
  echo "Generated: $(timestamp)"
  echo
  cat "${REPORT_FILE}"
  echo
  echo "OCI CLI errors (if any):"
  if [ -s "${OCI_ERR_LOG}" ]; then
    tail -n 200 "${OCI_ERR_LOG}"
  else
    echo "None"
  fi
  echo
  echo "SQL*Plus errors (if any):"
  if [ -s "${SQL_ERR_LOG}" ]; then
    tail -n 200 "${SQL_ERR_LOG}"
  else
    echo "None"
  fi
} > "${EMAIL_BODY}"

# Step 9: Send email
log "Step 9: Send report to ${MAIL_TO}"
if command -v mailx >/dev/null 2>&1; then
  mailx -s "${MAIL_SUBJECT}" -r "${MAIL_FROM}" "${MAIL_TO}" < "${EMAIL_BODY}" && log "Email sent (mailx)"
elif command -v sendmail >/dev/null 2>&1; then
  {
    echo "From: ${MAIL_FROM}"
    echo "To: ${MAIL_TO}"
    echo "Subject: ${MAIL_SUBJECT}"
    echo
    cat "${EMAIL_BODY}"
  } | sendmail -t && log "Email sent (sendmail)"
else
  log "No mailx/sendmail found. Report saved at ${EMAIL_BODY}"
fi

log "Replication lag monitoring complete. Report: ${REPORT_FILE}"
exit 0

Explanation and Deployment Notes

Purpose
This script provides a repeatable, auditable daily check that combines cloud-level metrics (OCI Monitoring) with database-level facts (v$ views and archived log comparisons). It enables DBAs to detect transport lag and apply lag quickly and to correlate cloud metrics with in-database evidence.

Credentials and security
Do not store plaintext production passwords in scripts. Prefer one of these secure approaches:
• Oracle Wallet + TNS aliases so sqlplus uses wallet authentication and not user/password in the script.
• Fetch credentials at runtime from a secrets manager (OCI Vault, HashiCorp Vault). Use a short-lived token where possible.
• Restrict file permissions: chmod 700 /opt/dbadmin/replication_lag_monitor.sh and chmod 600 for any secret files.
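As an illustration of the wallet approach, the client-side pieces might look like this (the wallet directory, TNS alias, and username are placeholders):

```
# sqlnet.ora on the monitoring host
WALLET_LOCATION =
  (SOURCE = (METHOD = FILE)
    (METHOD_DATA = (DIRECTORY = /opt/dbadmin/wallet)))
SQLNET.WALLET_OVERRIDE = TRUE
```

After storing a credential for the alias (for example, mkstore -wrl /opt/dbadmin/wallet -createCredential PRODDB dbadmin), the script can connect with SQLPLUS_PRIMARY_CONN="/@PRODDB" and no password ever appears in the file.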

OCI CLI permissions
The OCI user or dynamic group running the script needs read access to Monitoring and Database resources. Provide least-privilege policies, for example:
allow group DBAMonitors to read metrics in compartment <compartment>
allow group DBAMonitors to read databases in compartment <compartment>

Testing

  1. Run interactively and inspect the report: /opt/dbadmin/replication_lag_monitor.sh
  2. Check ${WORKDIR} for *.log files for troubleshooting.
  3. Validate email delivery by running the script manually and verifying mail logs.

Scheduling via cron
Edit the crontab for the monitoring user:
crontab -e
Add a line to run daily (adjust timezone as needed):
0 7 * * * /opt/dbadmin/replication_lag_monitor.sh >/dev/null 2>&1

Customizations and enhancements
• Threshold alerting: add parsing logic and send high-priority alerts when lag exceeds a threshold (e.g., > 60 seconds).
• Integration: push JSON outputs to a central observability platform, or send Slack/PagerDuty notifications.
• Parallel execution: run SQL checks concurrently to reduce run time in large estates.
• Multi-datacenter: extend to check cross-region standby replication in multi-region Data Guard setups.

Sample outputs you will see in the report (illustrative)
OCI Monitoring JSON excerpt:
{
  "data": [
    {
      "aggregated-datapoints": [
        {
          "timestamp": "2025-09-05T14:05:00+00:00",
          "DataGuardTransportLag[1m].mean()": 3,
          "DataGuardApplyLag[1m].mean()": 7
        }
      ]
    }
  ]
}

v$dataguard_stats sample:
NAME            VALUE          UNIT
transport lag   +00 00:00:03   day(2) to second(0) interval
apply lag       +00 00:00:07   day(2) to second(0) interval

v$managed_standby sample:
PROCESS   STATUS       THREAD# SEQUENCE#
MRP0      APPLYING           1     14523

Troubleshooting tips
• OCI CLI timeouts: increase OCI_TIMEOUT or verify network connectivity from the monitoring host to OCI endpoints.
• SQL*Plus connection errors: verify TNS, wallet, listener, and that the monitoring host has network access to DB hosts.
• Email issues: confirm mailx or sendmail configuration and relay permissions.

Operational playbook snippet
If apply lag > threshold (e.g., 300 seconds):

  1. Check v$managed_standby on standby for MRP errors.
  2. Check v$archive_dest_status on primary for transport errors.
  3. Verify network performance (packet loss, latency) between primary and standby.
  4. Check storage I/O on standby; high I/O wait may slow apply.
  5. Escalate to networking/storage teams if infrastructure issues are suspected.
  6. If required, plan temporary switch to synchronous mode or failover per DR process (with coordination).

Closing note
This script is a practical foundation to operationalize replication monitoring in Oracle Cloud. Secure credentials, tune timeouts and thresholds for your environment, and integrate it into your incident management flow to reduce mean time to detect and resolve replication issues.

Oracle DBA

Experienced OCM-certified Oracle Database Administrator with over 18 years of expertise in designing, implementing, and managing complex database solutions, spanning performance optimization, security, and high-stakes solution implementation. Adept at managing complex environments with precision.
