SAP HANA Replication Setup (HANA System Replication - HSR)
SAP HANA System Replication (HSR) is a robust, log-based replication mechanism that provides fault tolerance for SAP HANA systems by maintaining an identical copy of the primary (source) HANA system at a secondary (target) HANA system. It's designed for both HA within a data center and DR across data centers.
I. Overview and Core Concepts of HSR
- Purpose:
- High Availability (HA): For fast automatic failover within the same data center (typically via synchronous replication).
- Disaster Recovery (DR): For recovery after a site failure in a geographically separate data center (typically via asynchronous replication).
- Log-Based Replication: HSR works by shipping redo log entries from the primary system to the secondary system(s). This ensures data changes are continuously replicated.
- Full System Replication: HSR replicates the entire HANA system, including all tenants in a multi-tenant database container (MDC) scenario, configuration files, and persistency layers (data and log volumes).
- No Shared Storage: HSR works on a shared-nothing architecture, meaning the primary and secondary systems have their own independent storage. This makes it ideal for DR.
- Multi-tier Replication: HSR supports multi-tier replication, where a primary replicates to a secondary, and that secondary can then replicate to a tertiary system (e.g., Primary -> Secondary (HA) -> Tertiary (DR)).
II. HSR Replication Modes
The choice of replication mode dictates the RPO (Recovery Point Objective) and performance impact.
- Synchronous (SYNC), mode=sync (synchronous to disk): The primary system waits until the redo log has been written to the secondary's disk before committing the transaction on the primary.
  - RPO: Zero (no data loss).
  - Performance: Higher transaction latency on the primary, since it waits for the remote disk write.
  - Use Case: Primarily for High Availability (HA) within a single data center or metro cluster where network latency is extremely low (e.g., <0.5 ms round-trip time, RTT).
- Synchronous in Memory (SYNCMEM), mode=syncmem: The primary system waits until the redo log has arrived in the secondary's memory before committing the transaction on the primary; the secondary then writes it to its own disk asynchronously.
  - RPO: Near zero. In theory, if the secondary fails before writing to disk, data could be lost; for typical failovers this risk is minimal.
  - Performance: Lower latency than sync, as the primary waits only for the memory write, not for disk I/O on the secondary.
  - Use Case: Primarily for High Availability (HA) within a data center. Recommended over sync if the network supports it.
- Asynchronous (ASYNC), mode=async: The primary system commits the transaction without waiting for the redo log to reach the secondary; redo logs are sent to the secondary in the background.
  - RPO: Non-zero (potential for data loss). Data loss is limited to the replication lag (seconds to minutes), depending on network bandwidth, latency, and primary system activity.
  - Performance: Minimal performance impact on the primary system.
  - Use Case: Primarily for Disaster Recovery (DR) over long distances where latency is higher (e.g., >1 ms RTT). A short sketch for checking or changing the configured mode follows below.
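The following is a minimal sketch for checking the active mode and changing it on an existing replication pair. It assumes instance number 00, SYSTEMDB credentials, and that the commands run as the <sid>adm user; the exact hdbnsutil option syntax can vary slightly between HANA 2.0 revisions.

```bash
# Show the currently active replication mode and status per service
# (run against SYSTEMDB; column set of M_SERVICE_REPLICATION varies by revision).
hdbsql -i 00 -d SYSTEMDB -u SYSTEM \
  "SELECT HOST, PORT, REPLICATION_MODE, REPLICATION_STATUS FROM M_SERVICE_REPLICATION"

# Change the mode of an existing replication relationship on the primary,
# e.g. from syncmem to async before a planned network maintenance.
hdbnsutil -sr_changemode --mode=async
```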
III. HSR Operation Modes
The operation mode defines how the data is handled at the secondary system.
- logreplay (Log Replay):
  - Principle: The secondary system continuously receives and replays (applies) the redo log entries, keeping its data and log volumes as up to date as possible.
  - Advantages: Faster takeover time, as the secondary is almost ready.
  - Disadvantages: The secondary system consumes resources (CPU, memory) for the replay.
  - Use Case: Default and recommended for both HA and DR.
- delta_datashipping (Delta Data Shipping - legacy, rarely used):
  - Principle: The secondary system only receives redo logs and delta data backups but does not apply them continuously; it is effectively a "cold" standby.
  - Advantages: The secondary consumes minimal resources.
  - Disadvantages: Slower takeover time, because the secondary must apply all logs since the last delta.
  - Use Case: Specific scenarios where very low resource consumption on the secondary is critical and a longer RTO is acceptable (e.g., archival DR copies). Less common now; a sketch for checking or switching the operation mode follows below.
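As a hedged sketch (instance number and credentials are placeholders), the configured operation mode can be read from the configuration view; on newer HANA 2.0 revisions an existing secondary can also be switched between operation modes with hdbnsutil, though the exact option spelling should be verified with the tool's help output for your revision.

```bash
# Show the [system_replication] settings actually in effect, including operation_mode.
hdbsql -i 00 -d SYSTEMDB -u SYSTEM \
  "SELECT LAYER_NAME, KEY, VALUE
     FROM M_INIFILE_CONTENTS
    WHERE FILE_NAME = 'global.ini' AND SECTION = 'system_replication'"

# On the secondary (as <sid>adm): switch, for example, from delta_datashipping to
# logreplay. Availability and option spelling depend on the HANA 2.0 revision.
hdbnsutil -sr_changeoperationmode --operationmode=logreplay
```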
IV. HSR Landscape Components
- Primary System: The active HANA instance serving production traffic.
- Secondary System: The standby HANA instance, receiving replication from the primary. It is in a "replication" state and cannot be directly accessed.
- Tertiary System: (Optional) In multi-tier replication, a third system receiving replication from the secondary.
- Replication State: The status of the replication (e.g., ACTIVE, INITIALIZING, ERROR).
- Failover / Takeover: The process of switching from the primary to the secondary system in case of a primary failure.
V. High-Level Setup Steps
- Prerequisites:
- Install a clean secondary HANA system with the exact same SID, Instance Number, and System PKI SSFS password as the primary.
- Ensure identical hardware, OS version, and HANA version (including patch level).
- Configure network connectivity (dedicated replication network preferred).
- Ensure the required ports are open between the sites (e.g., 3xx13 and 3xx15 for SQL client access, plus the 4xxnn range, such as 4xx02, used by system replication).
- HANA SYSTEM user credentials for the replication setup.
- Initial Data Transfer:
- Option 1 (Snapshot/Backup Transfer - Recommended for large DBs): Take a full data backup of the primary and restore it on the secondary. Then register the secondary for replication.
- Option 2 (Full Data Shipping): Initiate replication, and HANA will automatically perform a full data copy over the network. Suitable for smaller databases or if network permits.
- Configure Replication:
  - Use hdbnsutil --sr_enable on the primary.
  - Use hdbnsutil --sr_register on the secondary, specifying the primary host and the replication mode (sync, syncmem, or async).
  - Configure the listeninterface parameter in global.ini if using a dedicated replication network.
- Verify Replication Status: Use hdbnsutil --sr_state or HANA Studio/Cockpit. (A command-level sketch of the whole flow follows below.)
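A minimal sketch of the command flow, assuming SID HDB, instance 00, hosts hanaprim/hanasec, and site names SITEA/SITEB (all placeholders). It is written with the single-dash function form used in the SAP documentation; in practice the initial data transfer follows one of the two options above.

```bash
# 1. On the primary (as <sid>adm): ensure a data backup exists, then enable HSR.
#    (In an MDC system, back up the SYSTEMDB and every tenant.)
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "BACKUP DATA USING FILE ('INITIAL_FULL')"
hdbnsutil -sr_enable --name=SITEA

# 2. On the secondary (as <sid>adm): HANA must be installed with the same SID
#    and instance number and must be stopped before registration.
HDB stop
hdbnsutil -sr_register --remoteHost=hanaprim --remoteInstance=00 \
  --replicationMode=syncmem --operationMode=logreplay --name=SITEB
HDB start   # triggers the initial data transfer unless a backup was restored first

# 3. Verify from either side.
hdbnsutil -sr_state
```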
VI. Monitoring and Administration
- hdbnsutil --sr_state: Command-line tool to check the replication status.
- M_SERVICE_REPLICATION view: In HANA Studio/Cockpit, provides detailed replication metrics (e.g., log shipping backlog, lag time); see the example below.
- SAP Solution Manager: Centralized monitoring for HSR.
- Alerting: Configure alerts for replication status changes, increasing log backlog, and network issues.
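A hedged example of the day-to-day checks (instance number and credentials are placeholders). The systemReplicationStatus.py script ships with HANA and gives a per-service overview including shipping/replay delay.

```bash
# As <sid>adm on the primary:
hdbnsutil -sr_state                        # roles, site names, replication tier
HDBSettings.sh systemReplicationStatus.py  # per-service status and delays

# SQL view with detailed metrics (exact columns vary by revision):
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "SELECT * FROM M_SERVICE_REPLICATION"
```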
VII. Failover and Takeover Procedures
- Manual Takeover: Initiated manually by an administrator using hdbnsutil --sr_takeover on the secondary. Requires the primary to be completely down or disconnected.
- Automatic Failover (HA Scenarios): Requires a cluster manager (e.g., Pacemaker on Linux, WSFC on Windows) to monitor the primary HANA system.
  - The cluster detects a primary failure.
  - It triggers the hdbnsutil --sr_takeover command on the secondary.
  - It manages the virtual IP address to switch to the new primary.
  - It manages fencing (STONITH) to prevent split-brain.
- Failback: After the original primary has been recovered, it is usually registered as the new secondary; a planned takeover is then performed to switch the roles back. (A manual takeover/failback sketch follows below.)
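A hedged sketch of a manual takeover and the subsequent failback (hosts, site names, and instance number are placeholders). In a cluster-managed landscape, put the cluster into maintenance mode first so it does not fight the manual actions.

```bash
# 1. On the secondary (as <sid>adm), after confirming the primary is really down:
hdbnsutil -sr_takeover

# 2. Failback, step 1: once the old primary is repaired, register it as the new
#    secondary of the current primary, then start it.
hdbnsutil -sr_register --remoteHost=hanasec --remoteInstance=00 \
  --replicationMode=syncmem --operationMode=logreplay --name=SITEA
HDB start

# 3. Failback, step 2: at a quiet time, perform a planned takeover in the other
#    direction to return to the original roles, and re-register the other side.
```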
VIII. Multi-tier Replication
- Primary -> Secondary -> Tertiary:
  - The Primary replicates to a local Secondary (e.g., syncmem for HA).
  - This Secondary then replicates to a remote Tertiary (e.g., async for DR).
  - This setup combines HA and DR: the Secondary serves as a fast failover target, and the Tertiary provides recovery from a site-wide disaster. (Registering the tertiary is sketched below.)
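As a sketch (hosts, site names, and instance number are assumptions), the tertiary is simply registered against the secondary rather than the primary; the supported multi-tier/multi-target combinations depend on the HANA 2.0 SPS level.

```bash
# On the tertiary DR system (as <sid>adm, with HANA stopped):
hdbnsutil -sr_register --remoteHost=hanasec --remoteInstance=00 \
  --replicationMode=async --operationMode=logreplay --name=SITEC
HDB start
```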
Important Configuration to Keep in Mind for HSR
- Network Configuration (Crucial):
- Dedicated Replication Network: Always recommended for optimal performance and isolation. This network should be independent of the public network.
- Bandwidth: Sufficient bandwidth for the chosen replication mode.
- SYNC/SYNCMEM: Very high bandwidth (Gbps) for local HA.
- ASYNC: Good bandwidth to minimize lag (Mbps to Gbps depending on churn).
- Latency (RTT):
- SYNC/SYNCMEM: Ultra-low latency (<0.5ms RTT). Higher latency severely impacts primary performance.
- ASYNC: Tolerates higher latency (up to tens of milliseconds RTT), but lower is always better for faster log shipping.
- listeninterface Parameter (global.ini): If using a dedicated replication network, set this parameter on both primary and secondary systems so that the HANA replication services bind to and listen on the dedicated interface (e.g., listeninterface = .internal, together with the matching hostname-resolution entries for the replication IPs). A configuration sketch follows below.
- Firewall: Ensure all necessary ports are open between primary and secondary (e.g., 3xx13, 3xx15, 4xx02, 5xx13, 5xx14) and for client access.
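One common way to pin HSR traffic to a dedicated network, shown as a hedged sketch (IP addresses and hostnames are placeholders). On current HANA 2.0 revisions the relevant section is [system_replication_communication], combined with a hostname-resolution map for the other site's replication IPs.

```bash
# On each site, bind the replication listener to the internal network ...
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "
  ALTER SYSTEM ALTER CONFIGURATION ('global.ini','SYSTEM')
    SET ('system_replication_communication','listeninterface') = '.internal'
    WITH RECONFIGURE"

# ... and tell each site how to reach the OTHER site's hosts over that network.
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "
  ALTER SYSTEM ALTER CONFIGURATION ('global.ini','SYSTEM')
    SET ('system_replication_hostname_resolution','192.168.100.12') = 'hanasec'
    WITH RECONFIGURE"
```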
- File System and Storage:
- Separate Storage: Primary and secondary must have independent data and log volumes. No shared storage is used for HANA data itself.
- Log Volume Sizing: Ensure log volumes on both primary and secondary are adequately sized to accommodate peak transaction volumes and potential replication backlogs during network issues.
- data and log Volume Layout: Maintain identical volume layouts and mount points on primary and secondary for consistency.
- HANA Parameters (global.ini):
  - [system_replication] Section:
    - mode: sync, syncmem, or async.
    - operation_mode: logreplay (default and recommended) or delta_datashipping.
    - enable_full_sync: Controls the full sync behavior on re-registration.
    - reconnect_timeout: How long the secondary waits to reconnect.
  - [persistence] Section: Parameters related to log management and persistence (e.g., log_buffer_size).
  - [logshipping] Section (for async):
    - logshipping_async_buffer_size: Buffer for async shipping on the primary.
    - logshipping_max_delay_time_seconds: Can be used to limit the acceptable lag (can impact primary performance if exceeded).
  - A read-only check of what is currently configured is sketched below.
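To see what is actually configured for the sections above, including the layer (DEFAULT/SYSTEM/HOST) each value comes from, a read-only query is usually the safest starting point; section and key names can differ slightly between HANA revisions.

```bash
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "
  SELECT LAYER_NAME, SECTION, KEY, VALUE
    FROM M_INIFILE_CONTENTS
   WHERE FILE_NAME = 'global.ini'
     AND SECTION IN ('system_replication', 'persistence', 'logshipping')
   ORDER BY SECTION, KEY, LAYER_NAME"
```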
- Identical System Configuration:
- Hardware: Identical CPU architecture, RAM, and disk configurations for performance consistency.
- OS: Identical OS version, patch level, and kernel parameters.
- HANA Version: Exact same HANA version, SPS, and Revision number. Mismatches can cause replication failures.
- System PKI SSFS Password: Must be identical for secure communication.
- HANA SIDs and Instance Numbers: Must be identical.
- Security and Users:
- sapadm User: Ensure the sapadm user and the sapsys group exist with correct permissions on both hosts.
- HANA SYSTEM User: The SYSTEM user password must initially be the same on both systems.
- Network Security: Implement network segmentation and firewalls.
- Monitoring and Alerting:
- Proactive Monitoring: Configure alerts for HSR status changes (e.g., ACTIVE to INITIALIZING or ERROR), increasing log replication backlog, network latency spikes, and system resource utilization (CPU, memory, disk I/O). A minimal status check suitable for alerting is sketched below.
- Tools: SAP HANA Cockpit, hdbnsutil, the M_SERVICE_REPLICATION view, SAP Solution Manager, and OS-level monitoring tools.
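A simple check that can be wired into an external monitor or a cron job, shown as a sketch: the instance number, credentials, threshold, and mail address are assumptions, and the hdbsql output flags can differ between client versions.

```bash
#!/bin/bash
# Alert if any service reports a replication status other than ACTIVE.
# -a suppresses column headings, -x suppresses result messages (check your client version).
BAD=$(hdbsql -i 00 -d SYSTEMDB -u SYSTEM -a -x \
  "SELECT COUNT(*) FROM M_SERVICE_REPLICATION WHERE REPLICATION_STATUS <> 'ACTIVE'")
if [ "${BAD}" -gt 0 ]; then
  echo "HSR WARNING: ${BAD} service(s) not in ACTIVE replication status" \
    | mail -s "HSR status alert" basis-team@example.com   # placeholder address
fi
```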
- Automation and Orchestration (for HA):
- Cluster Software Integration: For automatic failover, integrate HSR with a cluster manager (e.g., Pacemaker on Linux, WSFC on Windows). The HANA-specific cluster resource agents (e.g., SAPHanaTopology and SAPHana for Pacemaker) monitor the HSR status and trigger the takeover; a configuration sketch follows after this list.
- Fencing (STONITH): Essential in HA setups to prevent split-brain. The cluster must ensure a failed node is truly isolated before takeover.
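A heavily hedged sketch of the Pacemaker side for SLES with the SAPHanaSR package, using classic crmsh syntax. SID HDB, instance 00, and all resource names are assumptions; RHEL uses equivalent agents with slightly different packaging and defaults, and newer crmsh versions prefer promotable clones over the "ms" keyword.

```bash
# Topology agent: gathers the HSR roles on each node.
crm configure primitive rsc_SAPHanaTopology_HDB_HDB00 ocf:suse:SAPHanaTopology \
  params SID=HDB InstanceNumber=00 op monitor interval=10 timeout=600
crm configure clone cln_SAPHanaTopology_HDB_HDB00 rsc_SAPHanaTopology_HDB_HDB00 \
  meta interleave=true

# Database agent: promotes the HSR primary and triggers sr_takeover on failure.
crm configure primitive rsc_SAPHana_HDB_HDB00 ocf:suse:SAPHana \
  params SID=HDB InstanceNumber=00 \
         PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=false \
  op monitor interval=60 role=Master timeout=700 \
  op monitor interval=61 role=Slave timeout=700
crm configure ms msl_SAPHana_HDB_HDB00 rsc_SAPHana_HDB_HDB00 \
  meta clone-max=2 clone-node-max=1 interleave=true

# The virtual IP, colocation/order constraints, and a working STONITH device
# complete the setup; follow the SAPHanaSR setup guide for the full configuration.
```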
- Regular DR Drills (for DR):
- Crucial: Regularly perform DR drills to validate the HSR setup, the takeover process, RPO/RTO metrics, and post-recovery procedures.
- Documentation: Maintain a detailed runbook for DR.
30 Interview Questions and Answers (One-Liner) for SAP HANA Replication Setup
- Q: What is the primary purpose of HANA System Replication (HSR)?
- A: To provide high availability (HA) and disaster recovery (DR) for SAP HANA systems.
- Q: Does HSR require shared storage between primary and secondary?
- A: No, it works on a shared-nothing architecture.
- Q: Which HSR replication mode offers zero data loss (RPO=0)?
- A: sync (Synchronous Disk).
- Q: Which HSR replication mode is typically used for long-distance DR?
- A: async (Asynchronous).
- Q: What is the primary disadvantage of sync mode?
- A: Higher transaction latency on the primary due to waiting for the remote disk write.
- Q: In syncmem mode, where is the redo log written on the secondary before the commit on the primary?
- A: To the secondary's memory.
- Q: What is the default and recommended HSR operation mode?
- A: logreplay.
- Q: What does logreplay operation mode mean for the secondary system?
- A: The secondary continuously receives and applies redo logs.
- Q: What command is used to check the HSR status from the OS level?
- A: hdbnsutil --sr_state.
- Q: What view in HANA Studio/Cockpit provides HSR replication metrics?
- A: M_SERVICE_REPLICATION.
- Q: Can HSR replicate only a single tenant database in an MDC setup?
- A: No, HSR replicates the entire system, including all tenants.
- Q: What is a key prerequisite for the secondary HANA system installation for HSR?
- A: It must have the exact same SID, Instance Number, and System PKI SSFS password as the primary.
- Q: What network parameter in global.ini is crucial for dedicated HSR networks?
- A: listeninterface.
- Q: What is the sr_register command used for?
- A: To register the secondary system with the primary for replication.
- Q: What is the purpose of hdbnsutil --sr_takeover?
- A: To manually initiate a failover (takeover) to the secondary system.
- Q: What external component is needed for automatic HSR failover?
- A: A cluster manager (e.g., Pacemaker, WSFC).
- Q: What is a "tertiary" system in HSR?
- A: A third system receiving replication from the secondary in a multi-tier setup.
- Q: What is the RPO for async mode?
- A: Non-zero (potential for data loss, typically seconds to minutes).
- Q: What is the performance impact of async mode on the primary?
- A: Minimal performance impact.
- Q: Which component manages the virtual IP in an automatic HSR failover?
- A: The cluster manager.
- Q: What should be identical between primary and secondary for HSR, besides SID/Instance?
- A: Hardware, OS version, and HANA version (patch level).
- Q: What is "fencing" in the context of HSR and clustering?
- A: A mechanism to isolate a failed primary node to prevent split-brain.
- Q: What is the reconnect_timeout parameter in HSR?
- A: How long the secondary system waits to reconnect to the primary.
- Q: What happens if listeninterface is not configured for a dedicated network?
- A: HANA services might bind to the public network, causing replication issues or performance impact.
- Q: What does delta_datashipping operation mode imply for the secondary?
- A: The secondary receives data but does not continuously apply logs; it is a "cold" standby.
- Q: Which option for initial HSR data transfer is recommended for large databases?
- A: Backup and restore.
- Q: Can an active HANA system connect to a secondary HSR system?
- A: No, the secondary is in a replication state and not directly accessible for clients.
- Q: What is the purpose of the enable_full_sync parameter?
- A: It controls whether a full sync is performed upon re-registering a secondary.
- Q: What kind of network latency is generally acceptable for syncmem mode?
- A: Very low, typically less than 0.5 ms RTT.
- Q: What is a common pitfall if the sapsys group is not consistent during HSR setup?
- A: Permission issues preventing HANA from starting or accessing files.
5 Scenario-Based Hard Questions and Answers for SAP HANA Replication Setup
- Scenario: You have configured HANA System Replication (HSR) in syncmem mode for your production S/4HANA system (primary) to a secondary system within the same data center, managed by Pacemaker. Recently, you've observed intermittent, brief performance degradations on the primary S/4HANA system, characterized by high COMMIT times in HANA_SQL_CLIENT_CONNECT views and increased transaction latency. During these periods, monitoring shows that M_SERVICE_REPLICATION for syncmem has a non-zero REPLICATION_LOG_BUFFER_LAG_SIZE. The network team confirms no significant packet loss on the dedicated HSR interconnect, but occasionally detects micro-bursts of high latency (e.g., 2-5 ms RTT for a few seconds).
- Q: Explain how brief network latency spikes can cause performance degradation on the primary system even in syncmem mode. What specific HANA and cluster configurations would you investigate and adjust to mitigate this impact without switching to async mode?
- A:
  - Explanation of Performance Degradation:
    - In syncmem mode, the primary HANA system waits for the redo log buffer to be successfully written to the secondary's memory before it commits the transaction on the primary.
    - Even though it is memory-to-memory, the acknowledgment still travels over the network. During brief micro-bursts of high network latency (e.g., 2-5 ms RTT), the round trip for the acknowledgment from the secondary back to the primary is delayed.
    - During this delay, new transactions on the primary that require logging (i.e., most DML operations) queue up in the primary's log buffer, waiting for the remote acknowledgment. This backlog, reflected in REPLICATION_LOG_BUFFER_LAG_SIZE, translates directly into increased COMMIT times and higher transaction latency for end users on the primary, which is perceived as performance degradation. The primary is effectively throttled by the slowest link, in this case the occasional network latency spike.
  - Specific HANA and Cluster Configurations to Investigate and Adjust:
    - HANA Network Configuration (global.ini):
      - Investigate listeninterface: Ensure listeninterface is set so that replication traffic uses the dedicated, low-latency interconnect on both primary and secondary. If replication binds to a general-purpose network, contention can be higher.
      - Review logshipping_sync_buffer_size (if applicable in your version): While syncmem is memory-based, the internal buffers on the primary for sending logs may be tunable; ensure they are adequately sized.
      - Review tcp_keepalive_time / tcp_keepalive_intvl / tcp_keepalive_probes (OS level): These kernel parameters influence how quickly TCP connections are re-established or identified as broken. While not directly related to latency, they affect overall network stability for HSR.
    - Dedicated HSR Interconnect (Network Layer):
      - Verify QoS: Work with the network team to confirm that strict Quality of Service (QoS) is applied to the dedicated HSR interconnect, prioritizing this traffic above all else. This helps minimize the impact of micro-bursts.
      - Jumbo Frames: Ensure Jumbo Frames (MTU 9000) are configured end-to-end on the dedicated HSR network. This reduces packet overhead and can improve throughput, mitigating the impact of latency.
      - Redundant Interconnects: Implement multiple physical interconnects with bonding/teaming at the OS level and check whether HANA is configured to use all paths effectively.
    - HANA Log Volume Performance:
      - Disk I/O Latency: Ensure that the underlying disk I/O performance (especially for the log volume) on both primary and secondary is optimal. While syncmem waits only for the memory write, a slow secondary disk can still indirectly affect the overall replication flow, because the secondary eventually has to harden the logs to disk.
    - Pacemaker Configuration (optional, but good practice):
      - While Pacemaker primarily handles failover, ensure its own heartbeat and communication over the interconnect are not inadvertently contributing to network congestion or instability. This is unlikely to be the direct cause but is worth verifying.
    - Application Workload Analysis:
      - Identify Peak Loads: Use HANA Cockpit/Studio to identify specific times or transactions that correlate with the latency spikes. Are these I/O-intensive operations?
      - Optimization: Optimize long-running transactions or batch jobs to reduce pressure on the log volume and on replication. A quick way to baseline the interconnect during such spikes is shown below.
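As a quick, hedged illustration (hostnames, the interface name eth1, and the date in the log file name are assumptions), standard OS tools can correlate the reported micro-bursts with what the primary actually sees on the dedicated replication interface:

```bash
# Sample round-trip times at 200 ms intervals for ~60 s and keep a log;
# look for outliers in the 2-5 ms range that line up with the COMMIT spikes.
ping -i 0.2 -w 60 hanasec-repl | tee /tmp/hsr_rtt_$(date +%F_%H%M).log

# Verify the MTU actually in effect on the replication interface (Jumbo Frames).
ip link show dev eth1 | grep mtu

# Confirm a 8972-byte payload traverses the path unfragmented
# (8972 bytes payload + 28 bytes of headers = 9000-byte frame).
ping -M do -s 8972 -c 5 hanasec-repl
```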
- Scenario: Your SAP HANA System Replication (HSR) in asynchronous mode (mode=async, operation_mode=logreplay) to your DR site (1000 km away) is consistently showing a growing REPLICATION_LOG_SHIPPING_LAG_SIZE and REPLICATION_LOG_SHIPPING_LAG_TIME in M_SERVICE_REPLICATION, sometimes reaching several hours. This puts your RPO at unacceptable levels. The network team insists there is enough bandwidth allocated and no major outages.
- Q: Beyond just network bandwidth, what are the likely non-network technical reasons for this chronic replication lag in async HSR over a long distance? What specific actions would you take with the DBAs and network team to reduce the lag and consistently meet a shorter RPO?
- A:
  - Likely Non-Network Technical Reasons for Chronic Replication Lag:
    - Insufficient CPU/Memory on the Secondary for Log Replay: In logreplay mode, the secondary continuously applies redo logs. If the secondary system is undersized (CPU/memory) relative to the primary's transaction volume, it simply cannot process and apply the logs fast enough, leading to a build-up of the backlog.
    - Slow I/O on the Secondary Log/Data Volumes: Even if CPU and memory are sufficient, if the secondary's disk I/O (especially for writing log buffers and updating data files) is slower than the rate at which logs arrive, it becomes a bottleneck and causes lag. This is common when the DR hardware has a lower specification than the primary.
    - High Primary System Transaction Volume (Churn Rate): The primary is generating a very high volume of redo logs (e.g., due to frequent data changes, large loads, or inefficient DML operations). Even if the network has the advertised bandwidth, the sustained log generation rate can exceed the effective network throughput or the secondary's processing capability.
    - HANA Log Shipping Buffers (Async-Specific): Async mode buffers on the primary before sending; if logshipping_async_buffer_size is too small, or if there are internal processing delays, this can contribute to lag.
    - Network Latency (Still a Factor): Even for async, high latency (50 ms+ RTT for 1000 km) means each packet has significant travel time. Although the primary does not wait for acknowledgments, the sheer volume of packets can still be slowed by high latency if it is not compensated by ample bandwidth.
  - Specific Actions to Take:
    - DBA Team (HANA Level):
      - Analyze Primary Churn: Use M_VOLUME_IO_STATISTICS and M_SERVICE_REPLICATION on the primary to understand the average and peak redo log generation rate (GB/hour).
      - Secondary Resource Utilization: Monitor CPU, memory, and disk I/O on the secondary HANA system. Look for persistently high utilization, especially in the log replay activity of the indexserver. If resources are constantly maxed out, the secondary is the bottleneck.
      - HANA global.ini Tuning (Async-Specific):
        - Review logshipping_async_buffer_size in global.ini on the primary. Increasing it may allow more data to be buffered before sending, potentially improving throughput (at the cost of more memory).
        - Review log_buffer_size and log_segment_size_mb to ensure efficient log management on the primary.
    - Network Team:
      - Actual Throughput vs. Advertised: Do not rely only on advertised bandwidth. Measure the actually achievable throughput and latency over the WAN link under peak load using tools like iperf (see the sketch after this list).
      - QoS (Quality of Service): Re-verify and optimize QoS settings on all network devices along the path to prioritize HSR traffic.
      - Packet Loss/Errors: Check for intermittent packet loss or network errors on the WAN link, even if there is no full outage. Small error rates can drastically reduce TCP throughput.
      - Path Optimization: Ensure the most direct, lowest-latency network path is being used.
    - Basis/Infrastructure Team:
      - Scale Secondary Resources: If the secondary's resources are the bottleneck, plan to scale up its CPU, memory, and/or storage I/O performance to match or exceed the primary's churn rate. This may involve upgrading hardware or optimizing the underlying virtualization layer.
      - Storage Performance Review: Perform storage I/O tests on the secondary's log and data volumes to confirm they meet the required performance (e.g., high write IOPS for logs).
      - OS Tuning: Ensure OS-level network and I/O tuning (e.g., kernel parameters) is optimized for high throughput on both primary and secondary.
    - Procedural / Strategy Adjustments:
      - DR Drill Validation: During drills, specifically measure REPLICATION_LOG_SHIPPING_LAG_TIME at the point of the simulated disaster. This provides a clear RPO measurement.
      - Alerting: Set up proactive alerts for REPLICATION_LOG_SHIPPING_LAG_TIME exceeding acceptable thresholds, enabling early intervention.
      - Consider Multi-Tier (if feasible): If async replication directly to DR consistently misses the RPO, evaluate a multi-tier setup (Primary -> Secondary (HA, syncmem) -> Tertiary (DR, async)). This offloads the async replication from the primary.
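A hedged sketch of the throughput measurement and the lag check (hostnames, instance number, and credentials are assumptions; iperf3 must be installed on both ends and should only be run in an agreed test window because it loads the WAN link):

```bash
# On the DR-side host: start the measurement server.
iperf3 -s

# On the primary-side host: 4 parallel streams for 60 s, which reflects HSR log
# shipping over a high-latency link better than a single TCP stream.
iperf3 -c drsite-repl -P 4 -t 60

# On the primary: current per-service replication metrics; compare the log
# position/time columns over time to quantify the lag (columns vary by revision).
hdbsql -i 00 -d SYSTEMDB -u SYSTEM "SELECT * FROM M_SERVICE_REPLICATION"
```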
- Scenario: You need to implement a High Availability (HA) solution for a new critical SAP S/4HANA system on HANA 2.0. The business demands near-zero RPO and an RTO of less than 15 minutes. The system will run on two identical physical servers within the same data center.
- Q: Design the comprehensive HA solution, detailing the specific HSR mode and operation mode, the role of the cluster software, and how the RPO and RTO targets are met. Include important considerations for the network and shared storage within this specific HA design.
- A:
  - Comprehensive HA Solution Design:
    - HANA System Replication (HSR):
      - Mode: syncmem (Synchronous Memory).
      - Operation Mode: logreplay.
      - Reasoning for syncmem: Provides near-zero RPO (data is committed on the primary only after it has reached the secondary's memory) with better performance than sync (disk) mode, which is suitable within a data center. logreplay ensures the secondary continuously applies logs, minimizing the takeover time (RTO).
      - Setup: Primary HANA instance on Server A, secondary HANA instance on Server B, HSR configured between them.
    - Cluster Software (e.g., Pacemaker on SLES/RHEL, WSFC on Windows):
      - Role: The cluster manager is essential for automatic failover. It acts as the orchestration layer for HA.
      - Resource Agents: It uses the SAP HANA-specific resource agents (e.g., SAPHanaTopology and SAPHana for Pacemaker) to monitor:
        - The health of the primary HANA instance.
        - The HSR replication status (ACTIVE and logreplay running smoothly).
        - The virtual IP address for the HANA database.
      - Failover Logic: If the primary HANA instance or the primary server fails, or if the HSR replication status degrades to an unacceptable level, the cluster will:
        - Trigger hdbnsutil --sr_takeover on the secondary HANA system.
        - Move the virtual IP address to the secondary server (which is now the new primary).
        - Ensure proper fencing (STONITH) of the failed primary node to prevent split-brain.
    - SAP ASCS/ERS (or JCS/ERS) HA:
      - Setup: Deploy the ASCS instance on one server (e.g., Server A) and the Enqueue Replication Server (ERS) on the other (Server B).
      - Clustering: The ASCS/ERS resources (virtual IP, ASCS service, ERS service, shared sapmnt mount) are also managed by the same cluster software. The cluster ensures ASCS and ERS always run on different nodes.
      - Role: Provides HA for the central services, which are critical SPOFs for the SAP system.
    - Application Servers (Dialog Instances):
      - Deploy at least one application server instance on each physical host (Server A and Server B) to provide inherent redundancy and load balancing. These typically do not need to be clustered themselves.
      - Deploy SAP Web Dispatcher (for Fiori/Web GUI) in a redundant setup (e.g., two Web Dispatchers with a floating IP or behind a hardware load balancer).
  - Meeting the RPO and RTO Targets:
    - RPO (near zero): Achieved by syncmem HSR. Transactions on the primary are committed only after their redo logs are in the secondary's memory, so there is virtually no data loss.
    - RTO (less than 15 minutes):
      - HANA Takeover: logreplay mode ensures the secondary is ready, so the takeover is very fast (minutes). The cluster automates this.
      - ASCS/ERS Failover: Also very fast (minutes) with cluster automation.
      - Application Server Reconnection: Application servers automatically reconnect to the new primary HANA and ASCS instance (via virtual IPs). Their startup is generally independent of the failover.
      - The total automated failover time for the critical components (DB, ASCS) should comfortably fall within 15 minutes.
  - Important Considerations for Network and Shared Storage:
    - Network for HSR:
      - Dedicated Interconnect: A dedicated, high-bandwidth (e.g., 10 Gbps or higher), ultra-low-latency (<0.5 ms RTT) network interconnect between Server A and Server B is mandatory for syncmem HSR. This isolates HSR traffic from other network I/O.
      - listeninterface: Set the listeninterface parameter in HANA's global.ini on both servers so that replication binds to this dedicated interconnect.
      - Jumbo Frames: Configure Jumbo Frames (MTU 9000) on the dedicated HSR interconnect for optimal throughput.
    - Network for Cluster Heartbeat/Quorum:
      - A separate, highly reliable network for the cluster heartbeat is also crucial. This can be the public network with redundant NICs, or a separate private network.
    - Shared Storage for ASCS/ERS:
      - The /sapmnt/<SID> filesystem (containing profiles, global files, and kernel executables) must be on shared storage (e.g., SAN via Fibre Channel, or a clustered filesystem like GFS2) that is accessible by both Server A and Server B.
      - Storage HA: The shared storage solution itself must be highly available (redundant controllers, multi-pathing).
    - Virtual IPs: All SAP components (HANA, ASCS, Web Dispatcher) should use virtual IPs managed by the cluster, allowing seamless failover.
    - Fencing (STONITH): Absolutely critical in the cluster configuration to ensure a failed node is truly shut down and cannot cause split-brain scenarios when the other node takes over resources.
- Scenario: Your company is considering a multi-tier HANA System Replication setup: Primary (on-prem) -> Secondary (on-prem, for HA) -> Tertiary (cloud, for DR). The Primary-Secondary link needs to be syncmem, and the Secondary-Tertiary link will be async. You've been tasked with outlining the complexities and potential single points of failure in this multi-tier setup that are not present in a simple 2-tier HSR.
- Q: Detail the specific complexities and potential single points of failure (beyond standard network/hardware) introduced by this 3-tier HSR configuration, particularly focusing on how a failure at the secondary system might impact the entire replication chain and what mitigation strategies exist.
- A:
  - Specific Complexities Introduced by 3-Tier HSR:
    - Increased Network Demands: Requires two distinct replication links (or logical separations with QoS) - P-S (very low latency, high bandwidth) and S-T (high bandwidth for async). Managing and troubleshooting these multiple links adds complexity.
    - Resource Consumption at the Secondary: The secondary now has two roles: standby for the primary and replication source for the tertiary. It consumes resources not only for receiving and applying logs from the primary but also for sending logs to the tertiary, which increases its CPU, memory, and network requirements.
    - Chain of Dependency: The tertiary's RPO now depends directly on the secondary. If the secondary falls behind in replaying logs from the primary, or if its network to the tertiary is congested, the tertiary will also lag.
    - Complex Failover/Failback Scenarios: The failover process becomes more nuanced. For example:
      - Primary fails: The secondary takes over as the new primary, and the tertiary must then re-register with the new primary.
      - Secondary fails: The tertiary's replication breaks. The primary remains active, but rebuilding the secondary, re-establishing its replication from the primary, and then re-establishing the tertiary's replication from the secondary is a multi-step process.
    - Licensing in the Cloud (Tertiary): Cloud licensing models may differ, requiring careful planning for the tertiary system.
    - Monitoring Complexity: Monitoring must cover two distinct replication legs (P-S and S-T), their individual lags, and the overall chain health.
  - Potential Single Points of Failure (SPOFs) Introduced (beyond standard network/hardware):
    - The Secondary System Itself: This is the most critical SPOF introduced.
      - Failure of the Secondary: If the secondary fails (e.g., hardware crash, OS corruption, HANA crash), it breaks the replication chain to the tertiary. The tertiary stops receiving updates and its RPO continuously increases. While the primary continues to run, the DR capability to the tertiary is lost until the secondary is recovered.
      - Performance Bottleneck at the Secondary: If the secondary's resources (CPU, I/O, network) are insufficient to handle both receiving logs from the primary and sending logs to the tertiary, it becomes a bottleneck, increasing the lag to the tertiary and potentially impacting the primary because the P-S leg is syncmem.
    - Network Link between Secondary and Tertiary: If this link fails, the tertiary goes stale.
    - Resource Contention on the Secondary: Shared resources on the secondary (CPU, memory, log volume) must handle both incoming and outgoing replication, increasing the risk of contention and performance issues.
  - Mitigation Strategies:
    - Robust Secondary System:
      - Oversize the Secondary: Provision the secondary with more resources (CPU, memory, disk I/O, network) than a simple standby would need, so it can handle the dual role of receiving logs and acting as a source for the tertiary.
      - Local HA for the Secondary (optional but recommended): For critical tiers, implement HA for the secondary itself. This might mean the secondary is also part of an HSR pair (P -> S1 -> S2 -> T) or a local cluster, providing redundancy for the mid-tier. This adds complexity but removes the secondary as a single point of failure.
    - Dedicated and Optimized Network Segments:
      - Ensure the S-T replication link is fully optimized for async HSR (high bandwidth, QoS).
      - Isolate the P-S network from the S-T network where possible.
    - Advanced Monitoring and Alerting:
      - Implement granular monitoring for both replication legs, with alerts for any lag or status change.
      - Monitor resource utilization on the secondary (CPU, memory, log replay queue).
    - Automated Re-registration of the Tertiary:
      - If the secondary fails and is later recovered, have scripts/automation in place so that the tertiary re-registers with the recovered secondary once it is up and replicating from the primary again.
    - Cloud-Specific Considerations:
      - Factor in cloud network latency variability and the cost of outbound data transfer (which can be significant for HSR).
      - Use the cloud provider's networking capabilities (e.g., Direct Connect, ExpressRoute) for stable connectivity.
- Scenario: Your SAP S/4HANA system runs on HANA 2.0. You use HSR for HA (Primary-Secondary in syncmem mode with Pacemaker). During a planned maintenance, the primary server was shut down for a kernel update. After the shutdown, the automatic failover to the secondary failed, and the ASCS instance also went down. The cluster logs show "Resource agent for SAP HANA database failed to start secondary as primary" and "fencing failed". As a result, the entire SAP system became unavailable.
- Q: Based on "fencing failed," explain the specific cluster state that most likely occurred, why it prevented the HANA takeover and ASCS startup, and what immediate and long-term actions you would take to restore services and prevent recurrence in a real outage.
- A:
  - Specific Cluster State: Split-Brain Risk (more precisely, attempted split-brain prevention):
    - The "fencing failed" message is a critical indicator that the cluster's STONITH (Shoot The Other Node In The Head) mechanism did not successfully isolate the supposedly failed primary node.
    - When the primary server was shut down, the cluster detected it as offline. Its first action should be to fence the primary (power it off, revoke storage access) to ensure it is truly down and cannot come back online and cause data corruption (split-brain) after the secondary takes over.
    - Why it prevented the takeover: Because fencing failed, the cluster lost its quorate state or could not guarantee that the primary node was safely offline. Cluster software is designed to prioritize data integrity over immediate availability: if it cannot guarantee that the failed node is isolated, it refuses to bring up critical resources (like HANA and ASCS) on the secondary, to prevent a split-brain scenario in which both nodes control the same resources simultaneously and cause severe data corruption. This protective mechanism is exactly what happened.
    - ASCS Impact: The ASCS also went down because it was likely configured as a cluster resource that depends on the cluster being in a healthy, quorate state. If the cluster could not reach a safe state for HANA, it could not activate other critical resources like ASCS either.
  - Immediate Actions to Restore Services:
    - Manual Fencing Verification:
      - Action: Immediately verify the physical state of the "failed" primary server. Is it truly powered off? Is its network disconnected? Manually power it off if it is still running or stuck.
      - Rationale: Ensure the primary is unequivocally offline.
    - Verify Quorum Status:
      - Action: On the remaining secondary cluster node, check the cluster status (crm status for Pacemaker) and verify whether quorum is intact.
      - Rationale: Determine whether the secondary still believes it is part of a valid cluster.
    - Manual HANA Takeover (if safe):
      - Action: If you have definitively confirmed the primary is offline (and only then), manually trigger the HANA takeover on the secondary (hdbnsutil --sr_takeover).
      - Rationale: Bypass the cluster's refusal if you can manually guarantee safety.
    - Force Quorum (Last Resort, Dangerous):
      - Action: If quorum is lost and a manual takeover is not possible, as a last resort, and only if you are absolutely certain the other node is offline, you may need to force quorum on the surviving node (the exact procedure depends on the cluster stack, e.g., adjusting the expected votes with corosync-quorumtool on a Corosync stack, or the Force Quorum option in WSFC). This is extremely dangerous and should only be done under expert guidance.
      - Rationale: To bring the cluster back to an active state.
    - Start ASCS/SAP: Once HANA is up on the secondary, start the ASCS instance and then the application servers.
  - Long-Term Actions to Prevent Recurrence:
    - Thorough Fencing Mechanism Validation:
      - Action: Test the STONITH mechanism repeatedly outside of full DR drills (e.g., by simulating a power-off of a non-production server).
      - Action: Ensure the fencing agent (e.g., iLO, IPMI, SAN-level fencing) is correctly configured, has network connectivity, and has the appropriate credentials/permissions.
      - Action: Implement multiple fencing methods (e.g., power fencing and storage fencing) if possible, configured with a priority.
      - Rationale: Fencing is the cornerstone of cluster data integrity; it must work reliably.
    - Cluster Quorum Configuration Review:
      - Action: Review the quorum configuration. For a two-node cluster, a witness (e.g., disk witness, file-share witness, or a qdevice for Pacemaker) is often crucial to maintain quorum when one node goes down. Without a witness, a two-node cluster loses quorum if one node fails.
      - Rationale: A robust quorum setup prevents the loss of cluster services when a single node fails.
    - Cluster Software Log Analysis:
      - Action: Perform a deep dive into the cluster logs (journalctl -u pacemaker, journalctl -u corosync, or the Windows Failover Clustering event log) to understand why fencing failed. Was it a network issue to the fence device? Incorrect credentials? A device not responding?
      - Rationale: Pinpoint the exact reason for the fencing failure.
    - Network Redundancy for Fencing:
      - Action: Ensure the network path to the fencing device is redundant and highly available.
      - Rationale: A network problem that affects cluster heartbeats and simultaneously prevents fencing is a double failure that must be avoided.
    - Regular HA Drills:
      - Action: Conduct regular, documented HA drills that include testing the automatic failover (and, implicitly, fencing).
      - Rationale: Continuous validation of the entire HA stack. (A few cluster diagnostic commands for this kind of post-mortem are sketched below.)
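Hedged examples for a Pacemaker/Corosync stack; command availability, fence-device names, node names, and the time window are environment-specific assumptions.

```bash
# Current cluster, quorum, and resource state on the surviving node.
crm_mon -1r
corosync-quorumtool -s

# Why did fencing fail? Inspect the Pacemaker and Corosync journals around the
# incident (timestamps are placeholders).
journalctl -u pacemaker --since "2024-01-01 02:00" --until "2024-01-01 03:00"
journalctl -u corosync  --since "2024-01-01 02:00" --until "2024-01-01 03:00"

# List the registered STONITH devices and, only in an agreed test window,
# verify fencing end-to-end (this really power-cycles the target node).
stonith_admin --list-registered
stonith_admin --reboot hana-node1 --timeout 60
```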