Let's focus specifically on Disaster Recovery (DR) plans in SAP Basis. DR is about preparing for and recovering from catastrophic events that could wipe out your primary data center.
Detailed Notes for Disaster Recovery Plans in SAP BASIS
I. Understanding Disaster Recovery (DR)
Disaster Recovery (DR) is the process of recovering and resuming business-critical operations, specifically your SAP system, after a major disruptive event that renders the primary data center or region unusable.
- Goal: To ensure business continuity and data integrity following a widespread disaster.
- Scope: Involves a geographically separate DR site (secondary data center).
- Key Metrics:
- Recovery Point Objective (RPO): The maximum amount of data (measured in time) that can be lost from the point of disaster to the last consistent data copy. A lower RPO means less data loss.
- Recovery Time Objective (RTO): The maximum tolerable downtime for your SAP system. A lower RTO means faster recovery. (A minimal calculation sketch for both metrics follows this list.)
- Key Concept: Replication of data and infrastructure to a remote site for recovery.
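To make the two metrics concrete, here is a minimal Python sketch that computes an achieved RPO and RTO from three timestamps; the timestamps and variable names are illustrative assumptions, not output of any SAP tool.

```python
from datetime import datetime

# Hypothetical timestamps for a DR event (illustrative values only).
last_replicated_commit = datetime(2024, 5, 10, 14, 45)  # newest data safely at the DR site
disaster_time          = datetime(2024, 5, 10, 15, 0)   # moment the primary site is lost
service_restored_time  = datetime(2024, 5, 10, 19, 30)  # SAP available again at the DR site

# Achieved RPO: how much committed data (measured in time) was lost.
achieved_rpo = disaster_time - last_replicated_commit
# Achieved RTO: how long the business was without the SAP system.
achieved_rto = service_restored_time - disaster_time

print(f"Achieved RPO: {achieved_rpo}")  # 0:15:00 -> 15 minutes of data lost
print(f"Achieved RTO: {achieved_rto}")  # 4:30:00 -> 4.5 hours of downtime
```

Comparing these achieved values against the agreed targets after every drill is the simplest way to tell whether the chosen strategy still fits the business requirement.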
II. Components Covered by DR Plans in SAP Systems
A comprehensive SAP DR plan must cover all layers of the SAP system stack:
- Database Layer: The most critical component. All SAP business data resides here. DR for the database is paramount.
- Examples: SAP HANA, Oracle, SQL Server, IBM DB2, Sybase ASE.
- Central Services (ASCS/SCS): Contains the Enqueue Server and Message Server. While they hold no business data themselves, their configuration and startup are vital for the SAP system to function.
- Application Servers (Dialog Instances): Provide the processing power. In DR, these are often "built on demand" or recovered from templates at the DR site.
- Shared Storage: Holds global SAP files (`/sapmnt/<SID>`), profiles, and kernel executables. Replication of this content to the DR site is crucial.
- Interfaces/Connectivity: Network links, firewalls, load balancers, DNS, and application interfaces (e.g., PI/PO, BTP Integration Suite) need to be part of the DR scope.
III. Disaster Recovery Strategies and RPO/RTO Trade-offs
The choice of DR strategy is a trade-off between cost, RPO, and RTO.
- Backup & Restore:
- Principle: Regular full and incremental backups are taken at the primary site and securely transferred (often via network or physical media) to the DR site.
- RPO: Can be hours to a day (age of last backup + transaction logs).
- RTO: Can be very long (hours to days) due to restoration time, especially for large databases.
- Advantages: Lowest cost, simplest to implement initially.
- Disadvantages: Highest RPO/RTO, data loss.
- Log Shipping / Database Replication (Asynchronous):
- Principle: Transaction logs (database changes) are continuously shipped from the primary database to a standby database at the DR site and applied. The standby database is typically in a "recovery" or "standby" mode.
- RPO: Minutes to a few hours (depending on log shipping frequency and network latency). Some data loss is possible.
- RTO: Minutes to a few hours (shorter than backup/restore, as the standby database is already largely in place).
- Database-Specific Technologies:
- Oracle Data Guard: Physical Standby (asynchronous mode).
- SQL Server AlwaysOn Availability Groups (Asynchronous): Can span data centers.
- HANA System Replication (HSR) - Asynchronous (Mode=async, Operation Mode=logreplay): Log blocks are replicated to the secondary system's disk (a minimal status-check sketch follows this list).
- DB2 HADR: Asynchronous.
- Advantages: Improved RPO/RTO over backup/restore, less bandwidth than synchronous.
- Disadvantages: Potential for minor data loss.
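For asynchronous HSR in particular, it helps to check the replication mode and status routinely rather than only during drills. The sketch below is a minimal example using the hdbcli Python driver against HANA's M_SERVICE_REPLICATION monitoring view; the hostname, port, credentials, and the exact column set are assumptions to verify against your HANA revision.

```python
from hdbcli import dbapi  # SAP HANA Python client (pip install hdbcli)

# Connection details are placeholders -- adjust to your primary HANA system.
conn = dbapi.connect(address="hana-primary.example.com", port=30015,
                     user="MONITORING_USER", password="********")
cursor = conn.cursor()

# M_SERVICE_REPLICATION reports system replication per service; the columns
# selected here are assumed to be available on recent HANA revisions.
cursor.execute("""
    SELECT SITE_NAME, SECONDARY_SITE_NAME, REPLICATION_MODE, REPLICATION_STATUS
    FROM SYS.M_SERVICE_REPLICATION
""")

for site, secondary, mode, status in cursor.fetchall():
    print(f"{site} -> {secondary}: mode={mode}, status={status}")
    if status != "ACTIVE":
        print("  WARNING: replication is not ACTIVE -- the DR copy may be stale")

cursor.close()
conn.close()
```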
- Storage Replication (Asynchronous SAN-to-SAN):
- Principle: Entire storage volumes (LUNs) are replicated from the primary SAN to a secondary SAN at the DR site.
- RPO: Very low (seconds to minutes), as changes are committed locally first, then asynchronously replicated. Potential for minimal data loss.
- RTO: Minutes to a few hours (activation of storage and server boot-up).
- Advantages: Database-agnostic, replicates entire LUNs, good RPO/RTO for many applications.
- Disadvantages: High cost for enterprise-grade SANs, requires identical storage arrays.
- Database AlwaysOn / HANA System Replication (HSR) - Synchronous (Extended Distance):
- Principle: While synchronous replication is primarily for HA within a data center, some technologies (e.g., Oracle Data Guard Far Sync, SQL Server AlwaysOn with distributed AGs) can extend near-zero RPO over longer distances, often with intermediary nodes.
- HSR - Synchronous (Mode=sync, Operation Mode=logreplay) over long distance: Extremely sensitive to latency. Not commonly used for true long-distance DR unless network latency is exceptionally low and guaranteed.
- RPO: Near-zero.
- RTO: Minutes to a few hours.
- Advantages: Lowest RPO, fast failover potential.
- Disadvantages: Highest cost, extremely demanding on network bandwidth and latency, not suitable for all distances.
IV. Key Steps in a DR Plan Execution
A DR plan isn't just about technology; it's a documented, actionable process.
- Declaration of Disaster: Formal process to declare a disaster and initiate DR.
- Activation of DR Site:
- Power up DR infrastructure (servers, network).
- Activate replicated storage.
- Recover/Activate Database: Perform database recovery (e.g., activate standby, restore from backup).
- Start Central Services (ASCS/SCS) at the DR site.
- Start Application Servers at DR site (from templates/builds).
- Connectivity Switchover:
- Update DNS entries to point SAP virtual hostnames/IPs to the DR site.
- Adjust network routes, firewalls, and load balancers.
- Notify users and integrated systems of new access points.
- System Verification and Testing:
- Perform smoke tests (logon, critical transactions, background jobs); a minimal scripted smoke test is sketched after this list.
- Verify interfaces to other systems.
- Data consistency checks.
- Post-Recovery Activities:
- De-briefing, lessons learned.
- Keep the now-productive DR site protected (resume backups and, where possible, establish a new replication target) if the primary is down for an extended period.
- Planning for primary site rebuild and failback.
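A scripted check can complement the manual smoke tests above. The sketch below assumes the SAP NetWeaver RFC SDK and the pyrfc package are available and calls the standard test function module STFC_CONNECTION against the recovered system; all logon parameters are placeholders and presume the DR virtual hostname is already active in DNS.

```python
from pyrfc import Connection  # requires the SAP NW RFC SDK plus pip install pyrfc

# Placeholder logon data -- ashost should resolve to the DR virtual hostname after failover.
params = dict(ashost="sapdrvhost.example.com", sysnr="00",
              client="100", user="DR_CHECK", passwd="********")

try:
    conn = Connection(**params)
    try:
        # STFC_CONNECTION simply echoes text back -- a basic "is the system alive" probe.
        result = conn.call("STFC_CONNECTION", REQUTEXT="DR smoke test")
        print("Logon and RFC call OK:", result.get("ECHOTEXT"))
    finally:
        conn.close()
except Exception as exc:
    print("Smoke test FAILED:", exc)
```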
V. Important Configuration to Keep in Mind for DR
- DR Site Readiness:
- Identical or Compatible Hardware: Ensure the DR site has hardware that can support the SAP system (CPU, RAM, storage compatibility).
- Network Infrastructure: Adequate networking for client access and inter-DC replication.
- Licensing: Ensure all necessary software licenses (OS, DB, SAP) are valid for the DR site.
- Network Configuration:
- High-Bandwidth, Low-Latency Link: Crucial for efficient data replication, especially for log shipping or storage replication.
- Firewall Rules: All necessary ports must be open between primary and DR sites, and between DR systems/clients.
- DNS Management: A robust plan for updating DNS entries to point to the DR site's virtual IPs/hostnames during a failover. TTL (Time To Live) for DNS entries should be low to ensure quick propagation (a quick TTL check is sketched after this list).
- Virtual IPs/Hostnames: SAP systems should use virtual hostnames consistently, making the switch to DR seamless from a client perspective.
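Because a stale or long-lived DNS record can silently add hours to the effective RTO, it is worth scripting a check of the SAP virtual hostnames. The following sketch uses the dnspython package; the hostname and the 300-second threshold are illustrative assumptions, not values from any SAP standard.

```python
import dns.resolver  # pip install dnspython

VHOST = "sapprd-db.example.com"   # SAP virtual hostname (placeholder)
MAX_TTL = 300                     # example policy: TTL of 5 minutes or less

answer = dns.resolver.resolve(VHOST, "A")
addresses = [rr.to_text() for rr in answer]
ttl = answer.rrset.ttl

print(f"{VHOST} resolves to {addresses} with TTL {ttl}s")
if ttl > MAX_TTL:
    print(f"WARNING: TTL {ttl}s exceeds the {MAX_TTL}s DR policy -- failover propagation will be slow")
```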
- Data Replication Configuration:
- Database: Configure native database replication (e.g., Oracle Data Guard, SQL Server AlwaysOn AG, HANA HSR) or storage replication. Choose asynchronous mode for typical long-distance DR.
- `sapmnt` Directory: Implement replication for the shared `/sapmnt/<SID>` directory.
- Methods: Asynchronous storage replication (if `sapmnt` is on shared storage), file-level replication tools (e.g., rsync for Linux, DFSR for Windows), or simply transferring updated kernel/profiles regularly (a minimal rsync wrapper is sketched after this list).
- Important: Ensure consistency of profiles, kernel executables, and global files.
- Backups: Continue regular backups at the primary site and ensure their transfer to and availability at the DR site. These are often the fallback for any replication failure.
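If file-level replication is chosen for `/sapmnt/<SID>`, a scheduled wrapper along the lines of the sketch below can push the directory to the DR host with rsync over SSH. The paths, hostname, SID, and exclusion are assumptions to adapt; on Windows, DFSR or robocopy would take the place of rsync.

```python
import subprocess
import sys

SOURCE = "/sapmnt/PRD/"                     # trailing slash: sync directory contents (placeholder SID)
TARGET = "drhost.example.com:/sapmnt/PRD/"  # DR-side path over SSH (placeholder host)

# -a preserves permissions/ownership, -z compresses over the WAN, --delete mirrors removals.
cmd = ["rsync", "-az", "--delete",
       "--exclude", "work/",                # example: skip transient work files
       SOURCE, TARGET]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    print("sapmnt replication FAILED:", result.stderr, file=sys.stderr)
    sys.exit(result.returncode)
print("sapmnt replicated to the DR site")
```

Scheduling this hourly (or more often) and alerting on a non-zero exit code keeps the shared files within a predictable RPO of their own.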
- SAP Profile Parameters:
- `DEFAULT.PFL` / Instance Profiles: Keep up-to-date copies at the DR site.
- Database Parameters: Ensure database connection parameters in SAP profiles point to the DR database virtual hostname/listener after failover. This usually involves activating a DR-specific profile or changing entries post-recovery.
- Hardcoded Values: Avoid hardcoding primary site hostnames or IPs in custom programs or variants. Use logical names that can be mapped at DR.
- Application Server Recovery:
- Templates/VMs: Maintain up-to-date templates or virtual machine images of your SAP application servers at the DR site for rapid deployment.
- Automated Deployment: Develop scripts for quick deployment and configuration of application servers, ensuring they connect to the recovered database and central services at DR (a minimal start/verify sketch follows this list).
- Licensing: Be aware of licensing implications if building new VMs/instances at DR.
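A deployment script at the DR site typically finishes by starting the instances and confirming their processes are green. The sketch below shells out to the standard sapcontrol utility; the instance numbers and the very simple output parsing are assumptions, and in practice the script would run under the <sid>adm user.

```python
import subprocess

DR_INSTANCES = ["01", "00"]  # e.g., ASCS first, then the primary application server (placeholders)

def sapcontrol(instance_nr: str, function: str) -> str:
    """Run one sapcontrol function for an instance and return its output."""
    cmd = ["sapcontrol", "-nr", instance_nr, "-function", function]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

for nr in DR_INSTANCES:
    print(f"Starting instance {nr} ...")
    sapcontrol(nr, "Start")

# Simple verification: GetProcessList reports processes as GREEN once they are up.
for nr in DR_INSTANCES:
    output = sapcontrol(nr, "GetProcessList")
    state = "GREEN" if "GREEN" in output else "not (yet) GREEN"
    print(f"Instance {nr}: {state}")
```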
- Offline Data/Binaries:
- Ensure SAP installation media, kernel files, database binaries, and any necessary third-party software are accessible at the DR site.
- Documentation & Procedures:
- Comprehensive DR Runbook: A detailed, step-by-step guide for performing a DR failover and recovery. This must be a living document, updated regularly.
- Contact Lists: Up-to-date contact information for all relevant teams (Basis, DBAs, Network, Storage, Application, Business).
- Escalation Matrix: Clear escalation paths.
- DR Test Drills:
- Regularity: Conduct DR drills at least annually, preferably more often.
- Types: Full failover tests, partial failover tests, application-level verification.
- Lessons Learned: Document all issues encountered during drills and implement corrective actions. Update the runbook.
- Surprise Drills: Occasionally conduct unannounced drills to test true preparedness.
- Monitoring:
- Implement monitoring for replication status between primary and DR databases/storage.
- Monitor the health of the DR site infrastructure (servers, network, storage).
30 Interview Questions and Answers (One-Liner) for Disaster Recovery Plans in SAP BASIS
- Q: What is the primary goal of Disaster Recovery (DR)?
- A: To recover from catastrophic site-wide failures.
- Q: What does RPO stand for in DR?
- A: Recovery Point Objective.
- Q: What does RTO stand for in DR?
- A: Recovery Time Objective.
- Q: Which DR strategy typically has the highest RPO and RTO?
- A: Backup & Restore.
- Q: Which type of HSR (HANA System Replication) is generally used for long-distance DR?
- A: Asynchronous HSR.
- Q: What is the main disadvantage of asynchronous replication for DR?
- A: Potential for some data loss (non-zero RPO).
- Q: What is Oracle Data Guard used for in DR?
- A: To create and maintain standby databases for disaster recovery.
- Q: What is the purpose of a "DR Runbook"?
- A: A detailed, step-by-step guide for performing a DR failover and recovery.
- Q: Are SAP application servers typically replicated in a DR setup?
- A: No, they are often built clean at the DR site or recovered from templates.
- Q: How is the `/sapmnt` directory typically handled for DR?
- A: Replicated using storage replication or file-level replication tools.
- Q: Why is DNS management crucial during a DR failover?
- A: To redirect SAP virtual hostnames/IPs to the DR site.
- Q: What is the role of a "warm standby" in DR?
- A: A DR site where systems are partially running or quickly recoverable, reducing RTO.
- Q: What does "cold standby" imply in a DR context?
- A: A DR site where systems are not running and need a full restore/build process.
- Q: What is the key network requirement for efficient DR data replication?
- A: High-bandwidth, low-latency inter-data center link.
- Q: Why are regular DR test drills important?
- A: To validate the DR strategy and ensure it works as expected.
- Q: What is the primary concern with a Backup & Restore DR strategy regarding data?
- A: Data loss (high RPO).
- Q: What is the main advantage of storage replication for DR?
- A: Database-agnostic and provides low RPO/RTO.
- Q: Can a single SAP system have both HA and DR?
- A: Yes, HA provides local redundancy, DR provides remote recovery.
- Q: What is the first step in a DR plan execution?
- A: Declaration of a disaster.
- Q: What should be done with DNS TTL during DR planning?
- A: Set to a low value for faster propagation during failover.
- Q: What are two types of storage replication for DR?
- A: Synchronous and Asynchronous.
- Q: Which type of replication allows for greater geographical distance?
- A: Asynchronous replication.
- Q: What critical data must be available at the DR site for database recovery?
- A: Database backups and transaction logs.
- Q: What should be avoided in SAP profiles that could hinder DR?
- A: Hardcoding primary site physical hostnames or IPs.
- Q: How does a "mutual DR" setup work?
- A: Each data center acts as the primary for some systems and the DR for others.
- Q: What is the importance of "data consistency checks" after DR recovery?
- A: To ensure no data corruption or loss occurred during the failover.
- Q: What licensing aspect should be considered for a DR site?
- A: Ensuring all software licenses (OS, DB, SAP) are valid for the DR environment.
- Q: How does "log shipping" differ from "full backup restore" for DR?
- A: Log shipping continuously applies incremental changes, while full backup requires restoring an entire backup.
- Q: What is the purpose of "smoke tests" in DR recovery?
- A: To quickly verify basic system functionality (logon, key transactions) after recovery.
- Q: What is the key to improving RTO for application servers in DR?
- A: Having pre-built templates/VMs and automated deployment scripts.
5 Scenario-Based Hard Questions and Answers for Disaster Recovery Plans in SAP BASIS
- Scenario: Your company's critical SAP ERP system (Oracle database on Linux) uses an asynchronous log shipping DR strategy to a remote DR site. During the recent annual DR drill, the database was successfully activated at the DR site, and the ASCS instance started. However, when attempting to start the SAP application servers (dialog instances) at the DR site, they consistently failed with errors indicating they could not connect to the database listener. Further investigation shows the database listener on the DR database host is up and listening on the correct port.
- Q: What is the most likely reason the SAP application servers cannot connect to the DR database listener despite it being active, and what specific Basis-level configuration elements would you investigate and rectify to resolve this during a real disaster?
- A:
- Most Likely Reason: The most likely reason is an incorrect or outdated `tnsnames.ora` configuration file on the application servers at the DR site. After a DR failover, the database listener (and potentially the database service name or virtual IP) at the DR site will be active. If the `tnsnames.ora` file on the application server contains entries that still point to the primary site's database listener IP/hostname, or if the service name configuration for the DR database is different, the application servers will fail to connect.
- Specific Basis-Level Configuration Elements to Investigate and Rectify:
- `tnsnames.ora` File on Application Servers:
- Action: On each application server at the DR site, navigate to `$ORACLE_HOME/network/admin/` and inspect the `tnsnames.ora` file.
- Verify: Ensure that the `SERVICE_NAME` or `SID`, the `HOST` (IP address or hostname), and the `PORT` of the database entry point correctly to the DR database listener.
- Rectification: This file should ideally be a template at the DR site, pre-configured with the DR database details, or replicated from the primary with a post-recovery script to update it. If it is incorrect, edit it to reflect the DR database listener details.
- Rationale: This is the primary configuration file that tells SAP where to find the Oracle database.
- `DEFAULT.PFL` and Instance Profiles:
- Action: Review the `DEFAULT.PFL` and the instance profiles for the application servers at the DR site.
- Verify: Check parameters like `dbs/ora/schema`, `dbs/ora/tnsname`, and `SAPDBHOST`. While `dbs/ora/tnsname` refers to an entry in `tnsnames.ora`, `SAPDBHOST` explicitly defines the database host. Ensure these point to the DR database's hostname/virtual IP.
- Rectification: If these are hardcoded to primary values, they need to be updated. A robust DR plan includes profile adaptation.
- Rationale: These profiles guide the SAP kernel in connecting to the database.
- Network Connectivity (Firewall/Routing):
- Action: From the DR application server, attempt to `ping` the DR database host and `telnet` to the database listener port (e.g., `telnet <DR_DB_HOST> 1521`); a small scripted port check is sketched after this answer.
- Verify: Ensure there are no firewall rules blocking communication between the DR application servers and the DR database server, or routing issues within the DR network.
- Rationale: Even if `tnsnames.ora` is correct, network blockage will prevent connection.
- `sqlnet.ora` and `listener.ora` (on the DB server):
- Action: While the listener is up, check `listener.ora` on the DR DB server to confirm it is listening on the expected IP/hostname and port, and `sqlnet.ora` for any client-side configuration that might inadvertently block connections.
- Rationale: Ensures the listener is correctly configured to accept connections from all relevant sources.
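In place of manual ping/telnet checks (referenced in the answer above), a small script can probe the DR listener port from each DR application server. This is only a plain TCP reachability test under assumed values; the hostname and port are placeholders.

```python
import socket

DR_DB_HOST = "drdbhost.example.com"  # DR database virtual hostname (placeholder)
LISTENER_PORT = 1521                 # default Oracle listener port

def listener_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the listener port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if listener_reachable(DR_DB_HOST, LISTENER_PORT):
    print(f"TCP connection to {DR_DB_HOST}:{LISTENER_PORT} OK -- next, verify tnsnames.ora and the service name")
else:
    print(f"Cannot reach {DR_DB_HOST}:{LISTENER_PORT} -- check firewall rules and routing in the DR network")
```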
- Scenario: Your company's SAP ECC system is crucial for daily operations. You have implemented a storage-level asynchronous replication strategy from your primary data center to a DR site. Your RPO target is 15 minutes. During the recent quarterly DR drill, the team found that the replicated data on the DR storage was consistently 30-45 minutes behind the primary, failing to meet the RPO.
- Q: What are the primary technical reasons why asynchronous storage replication might fail to meet its RPO target, and what specific actions would you recommend to the storage, network, and Basis teams to ensure the RPO target is consistently met during future operations and drills?
- A:
- Primary Technical Reasons for Missing RPO (Asynchronous Storage Replication):
- Insufficient Network Bandwidth: The most common culprit. If the network link between the primary and DR site cannot handle the volume of data changes (churn rate) being generated at the primary, a backlog builds up, increasing the RPO.
- High Network Latency: Even with sufficient bandwidth, high latency (delay in data packets reaching the DR site) can slow down the replication process and reduce the effective throughput, causing delays.
- Storage Performance Bottlenecks at Primary or DR:
- Primary: If the primary storage struggles to write data fast enough to meet application demands and prepare it for replication, it can impact the source of replication.
- DR: If the DR storage cannot write the replicated data blocks fast enough, it creates a bottleneck at the target, causing the primary to slow down or build up a larger queue.
- Replication Engine Overload/Configuration: The storage array's replication engine itself might be overwhelmed, or its internal buffers/queues might not be optimally sized, leading to delays. Incorrect replication policies or filters could also cause issues.
- I/O Spike Periods: Sudden, large spikes in I/O on the primary system (e.g., during month-end closing or large data loads) can temporarily overwhelm the replication link and create a backlog that takes time to clear.
- Specific Actions to Recommend:
- Network Team:
- Monitor Link Utilization: Continuously monitor bandwidth utilization and latency on the inter-DC link.
- Upgrade Bandwidth: If consistently saturated, upgrade the network link capacity.
- QoS (Quality of Service): Implement QoS to prioritize storage replication traffic over other non-critical traffic on the WAN link.
- Optimize Routing: Ensure optimal routing paths with minimal hops.
- Storage Team:
- Monitor Replication Queues/Backlog: Utilize storage array management tools to monitor the replication queue size and backlog in real-time.
- Performance Tuning: Tune array caches, write policies, and replication engine settings for optimal asynchronous performance.
- Disk Performance: Ensure sufficient IOPS and throughput on both primary and DR storage arrays to handle the workload. Consider faster disk tiers if necessary.
- Replication Policies: Review the replication policies (e.g., consistency groups) to ensure they are configured efficiently.
- Dedicated LUNs: Ensure critical SAP LUNs have sufficient dedicated resources for replication.
- Basis Team:
- Identify I/O-Intensive Operations: Work with application teams to identify and analyze periods of unusually high database I/O (e.g., `ST04`/`DB02` reports, HANA Studio monitoring).
- Stagger Batch Jobs: If possible, stagger large batch jobs or data loads to spread out I/O spikes throughout the day, reducing sudden peaks that overwhelm replication.
- Database Log File Management: Optimize database log file sizing and switching to manage the rate of change.
- Collaboration: Work closely with the Network and Storage teams to correlate SAP workload patterns with replication performance.
- Document and Review: Document expected churn rates and monitor actual rates, adjusting infrastructure or expectations accordingly (a rough link-sizing sketch based on churn rate follows this answer).
- Regular Testing:
- Consistent Drills: Conduct regular (e.g., quarterly) DR drills to continuously validate RPO and identify bottlenecks.
- Detailed Metrics: During drills, meticulously measure the actual RPO achieved.
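To ground the churn-rate discussion above, the following back-of-the-envelope sketch estimates the replication link needed to keep the backlog (and therefore the RPO) from growing; the churn figure, peak factor, and overhead are purely illustrative assumptions.

```python
# Rough sizing: the link must sustain at least the average change rate, with headroom
# for peaks, or a replication backlog builds up and the effective RPO grows.
churn_gb_per_hour = 120   # illustrative: data changed on the replicated LUNs per hour
peak_factor = 2.0         # assume peak periods (month-end, data loads) at twice the average
overhead = 1.2            # ~20% protocol/replication overhead (assumption)

avg_mbit_per_s = churn_gb_per_hour * 8 * 1024 / 3600
required_mbit_per_s = avg_mbit_per_s * peak_factor * overhead

print(f"Average change rate : {avg_mbit_per_s:.0f} Mbit/s")
print(f"Suggested link size : {required_mbit_per_s:.0f} Mbit/s (with peak and overhead headroom)")
```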
- Scenario: You have successfully recovered your SAP S/4HANA system at the DR site after a simulated disaster. All technical checks (DB, ASCS, App Servers, interfaces) are green. However, the business users are reporting data inconsistencies in a few critical financial reports, specifically that some transactions from the last hour before the disaster are missing, while others are present. Your DR strategy for HANA was asynchronous HSR.
- Q: Why might this specific type of "partial" data inconsistency occur with asynchronous HSR during a DR event, and what steps would you take with the business and DBAs to reconcile the data and prevent such issues in future DR events (short of switching to synchronous replication)?
- A:
- Reason for Partial Data Inconsistency with Asynchronous HSR:
- Asynchronous Nature: In asynchronous HSR, transaction log buffers/segments are committed to the primary HANA system's disk before they are sent over the network to the secondary (DR) system.
- Network Latency/Buffer: There's an inherent delay between the transaction being committed on the primary and it being fully replicated and hardened on the secondary's disk.
- Disaster at Critical Moment: If the disaster strikes during this replication lag, any transactions that were committed on the primary but had not yet been fully replicated and hardened on the secondary will be lost. Transactions that had been replicated and hardened will be present. This leads to partial data loss and thus inconsistency, as some transactions from the "last hour" might have made it, while others didn't. This is the definition of a non-zero RPO.
- No Application-Level Consistency: Asynchronous replication operates at the database level. It doesn't guarantee application-level consistency across multiple related transactions if they involve external systems or complex processes, if the disaster occurs mid-process.
- Steps to Reconcile Data and Prevent Future Issues (without switching to synchronous):
- Reconciliation with Business and DBAs:
- Identify Missing Transactions: Work with the business and DBAs to precisely identify the missing transactions. This typically involves comparing transaction IDs or key data from external sources (e.g., interface logs, external systems) or primary site remnants (if any) against the recovered DR database; a minimal set-comparison sketch follows this answer.
- Manual Posting/Correction: For the identified missing transactions, the business will likely need to manually re-post them in the recovered SAP system.
- Financial Impact Assessment: Assess the financial impact of the missing data and potential need for re-runs of financial processes.
- Root Cause Analysis (DR Process): Analyze the HSR monitoring logs (e.g., in HANA Studio/Cockpit) to determine the exact replication lag at the time of the simulated disaster. This confirms the RPO.
- Preventive Measures for Future DR Events (Improving RPO with Asynchronous HSR):
- Optimize Network Link:
- Bandwidth: Ensure sufficient network bandwidth between primary and DR sites to handle peak transaction volumes for HSR. This is paramount for asynchronous to keep the lag low.
- Latency: Minimize network latency between sites. While impossible to eliminate, optimizing routing and using direct links helps.
- HANA `global.ini` Parameters:
- `log_buffer_size`: Ensure the log buffer size on the primary is adequate for bursty workloads, allowing logs to be efficiently written before being sent.
- `log_segment_size_mb`: Optimize the log segment size.
- `system_replication_max_delay_time_seconds` (if available/applicable): Some newer HANA versions allow setting a maximum acceptable replication delay. If this delay is exceeded, the primary might temporarily pause commits or switch to a different mode to prevent data divergence, impacting primary performance but improving RPO consistency.
- Continuous Monitoring:
- Implement robust, real-time monitoring of HSR replication status, including the log shipping backlog (queue size and time lag), via HANA Studio, `SAP_HANA_SR_OVERVIEW` views, or Solution Manager.
- Set up alerts if the backlog exceeds your RPO target, allowing proactive intervention (e.g., temporarily throttling non-critical batch jobs on the primary during high transaction volumes, or escalating to the network team).
- Regular DR Drills with RPO Validation:
- During drills, precisely measure the achieved RPO by comparing data from the primary just before the "disaster" to the recovered DR system. This provides a realistic understanding of the actual data loss.
- Use real-world transaction volumes during drills if possible.
- Business Process Adaptation (if RPO cannot be consistently met):
- If the agreed RPO is technically unachievable with asynchronous HSR given network constraints and transaction volume, discuss this with the business.
- Can critical data be manually extracted/uploaded more frequently?
- Can the business tolerate a slightly higher RPO if absolutely necessary?
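For the reconciliation step described above, comparing document numbers from an external reference (e.g., interface logs or a feeder system) against what survived in the recovered database reduces to a set difference. The sketch below is deliberately generic; the document numbers are placeholders, and in practice the two lists would be extracted from the interface logs and the recovered system.

```python
# Document numbers recorded by an external reference (interface logs, feeder system, ...).
expected_docs = {"4900001235", "4900001236", "4900001237", "4900001238"}

# Document numbers actually found in the recovered DR database (placeholder values).
recovered_docs = {"4900001235", "4900001236"}

# Anything expected but not recovered fell into the replication gap and must be re-posted.
missing_docs = sorted(expected_docs - recovered_docs)

print(f"{len(missing_docs)} transaction(s) lost in the failover window:")
for doc in missing_docs:
    print("  re-post document", doc)
```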
- Scenario: You are part of a team planning a new SAP S/4HANA implementation. The business demands an RPO of 1 hour and an RTO of 8 hours for the full SAP system recovery in a DR event. You have two data centers, 500 km apart. You are considering two primary options for database DR: A) Asynchronous HANA System Replication (HSR) or B) Asynchronous Storage-Level Replication (SAN-to-SAN).
- Q: Compare these two options specifically for this scenario, highlighting their pros and cons related to meeting the RPO/RTO, their architectural implications for SAP Basis, and which one you would generally recommend and why.
- A:
- Comparison of Asynchronous HSR vs. Asynchronous Storage Replication for S/4HANA DR (500 km, RPO 1hr, RTO 8hr):
Option A: Asynchronous HANA System Replication (HSR)
- Pros for this Scenario:
- Application-Aware (Database Level): HSR is native to HANA, understands database transactions, and manages consistency at the database level.
- Lower Bandwidth: Typically requires less raw network bandwidth than full storage replication, as it only sends log blocks/pages, not entire disk block changes (though still needs good bandwidth for rapid sync).
- Granular Monitoring: Excellent monitoring tools within HANA and Solution Manager for replication status, lag, and RPO.
- Flexible RPO: `logreplay` (async) mode can generally achieve an RPO of minutes to low hours, fitting the 1-hour target.
- Built-in Data Integrity: HANA manages the consistency of the replicated data.
- Cons for this Scenario:
- Database Specific: Only protects the HANA database. Other non-HANA components (e.g., shared `sapmnt`, application server binaries) need separate replication strategies.
- Resource Intensive (DR Side): The DR HANA instance needs to be running and consuming resources to apply the logs, even if not fully active.
- Manual Intervention: Still requires manual activation and follow-up steps at the DR site.
- Architectural Implications for Basis:
- Requires a separate HANA appliance/VM at the DR site.
- Needs a separate strategy for `sapmnt` (e.g., `rsync`, DFSR, or separate storage replication for that specific volume).
- Application servers at DR need to be configured to point to the DR HANA DB after failover.
Option B: Asynchronous Storage-Level Replication (SAN-to-SAN)
- Pros for this Scenario:
- Application-Agnostic: Replicates entire LUNs, meaning it protects the HANA database, the `sapmnt` filesystem, and any other data residing on the replicated LUNs in one go. Simplifies overall data replication.
- Simplified Management: Once configured, replication is managed at the storage level, reducing complexity at the OS/DB layer for what is included in the replication.
- Good RPO for Target: Can achieve an RPO of seconds to minutes, easily meeting the 1-hour target.
- Cons for this Scenario:
- Higher Bandwidth: Typically requires higher network bandwidth as it replicates all block changes, regardless of whether they are database changes or OS logs.
- Higher Cost: Enterprise-grade SANs with replication capabilities are expensive. Requires identical or compatible storage arrays at both sites.
- Consistency Issues (Potential): Storage replication is "crash consistent." It replicates the state of the disk. While good, it doesn't have database-level awareness. You might need to use SAN vendor tools that integrate with HANA (e.g., snapshot integration) to get "application consistent" snapshots for clean recovery, which adds complexity.
- Resource Usage (DR Side): The DR storage array will consume resources for replication. The DR HANA instance does not need to be running until activated.
- Architectural Implications for Basis:
- Requires significant coordination with Storage team.
- DR HANA instance is "cold" until storage is activated and HANA is started.
- Less complexity for `sapmnt` replication, as it is included with the LUNs.
- Recommendation:
For this specific S/4HANA scenario (RPO 1hr, RTO 8hr, 500km distance), I would generally recommend Asynchronous HANA System Replication (HSR) in logreplay mode for the database, combined with a file-level replication (like rsync or DFSR) for the /sapmnt directory.
- Reasoning:
- Optimized for HANA: HSR is the native, deeply integrated, and most robust DR solution for HANA. It understands HANA's logging and persistence layers, ensuring maximum data integrity and efficient replication. Given S/4HANA is heavily tied to HANA, using the native solution is almost always preferred.
- Bandwidth Efficiency: While storage replication replicates everything, HSR is more intelligent about what it sends, often making it more efficient over distances.
- Meeting RPO/RTO: Asynchronous HSR (`logreplay`) can reliably achieve a 1-hour RPO over 500 km, and its RTO for database activation is typically very fast (minutes), allowing ample time for ASCS and application server startup within the 8-hour RTO.
- Manageable Complexity: While it requires managing two replication mechanisms (HSR for the DB, file-level replication for `sapmnt`), these are standard and well-documented. Storage replication for the entire system introduces a dependency on specific array models and can be more complex for database-application consistent recovery.
- Cost: Often more cost-effective than investing in high-end, replicated SANs if the primary SAN isn't already designed for it.
While SAN replication has its merits, especially for heterogeneous environments or for simplifying replication for all workloads on a given LUN, for a core S/4HANA/HANA system, leveraging the native HANA capabilities is usually the more robust and higher-performing choice for DB DR.
- Scenario: Your company performs its annual SAP DR drill for its production ERP system (SQL Server on Windows). The drill proceeds mostly as planned until the post-recovery data consistency checks. Several business users report that newly created custom batch jobs from the week before the disaster are missing in `SM37` at the DR site, while all standard SAP jobs are present. Additionally, newly created `SM69` external commands are also missing. The Basis team confirmed that the SQL Server AlwaysOn Availability Group (AG) replication (asynchronous) was healthy and caught up at the time of the simulated disaster.
- Q: Why would newly created custom batch jobs and external commands be missing at the DR site, even with healthy database replication, and what specific Basis-level configuration or procedural gaps does this indicate in your DR plan? How would you update the DR plan to prevent this in the future?
- A:
- Why Missing Data Despite Healthy DB Replication:
- Batch jobs (`SM37`) and external commands (`SM69`) are not fully covered by the database replication handled by the SQL Server AlwaysOn Availability Group: their definitions live in the database, but the programs and scripts they call often do not.
- Batch Jobs: Batch job definitions (job steps, variants, schedules) are stored in the SAP database (tables like the `BTCH0000` series, `BTCJ*`, etc.). However, the issue here is likely about job definitions that rely on the shared `sapmnt` file system because they involve external programs or specific scripts not embedded in the DB. More critically, if these are custom jobs or `SM69` commands, they might rely on files or scripts located outside the database, which are not replicated by the DB AG.
- External Commands (`SM69`): While the definition of the external command itself is stored in the SAP database (table `SXPGCOSTAB`), the actual script or executable that the command calls resides on the operating system file system of the application server(s) or global host. If these new scripts were created on the primary system's filesystem but not replicated to the DR site, the `SM69` command, even if its definition exists in the DB, would fail or appear "missing" functionally.
- Root Cause: This indicates a gap in the DR plan for the replication of SAP global host directories (`/sapmnt/<SID>`) or specific application server local directories where custom scripts/executables reside. SQL Server AG only replicates database files, not OS files.
- Specific Basis-Level Configuration/Procedural Gaps:
- `/sapmnt/<SID>` Replication Strategy: The DR plan likely lacks a robust strategy for replicating changes to the shared `sapmnt` file system to the DR site, or the existing strategy failed to capture recent changes. Custom batch jobs may call kernel-level executables (in the `exe` directory) or external programs (in `global` or other custom directories under `sapmnt`), and `SM69` commands directly reference OS paths.
- Local Application Server Files: If custom scripts or executables for batch jobs/external commands are placed on local drives of primary application servers (e.g., `D:\usr\sap\jobs`), and these drives are not part of `sapmnt` or replicated, they will be missing at DR.
- DR Drill Scope/Validation: The drill validation likely focused purely on database and core SAP startup, but missed verifying the completeness of the `sapmnt` content and related OS-level executables/scripts.
- Change Management: New custom jobs and external commands represent changes. The change management process might not adequately enforce the replication of associated OS-level files to the DR site.
- Updating the DR Plan to Prevent This in the Future:
- Implement Robust `/sapmnt` Replication:
- Asynchronous Storage Replication: If `/sapmnt` is on shared storage at the primary, configure asynchronous storage replication for the LUNs containing `/sapmnt` to the DR site. This is the most comprehensive solution.
- File-Level Replication Tools: If `sapmnt` is a local mount or cannot be storage-replicated, implement regular file-level replication (e.g., Windows DFSR, `rsync` from a central repository) to sync the `/sapmnt` directory to the DR site. Schedule this frequently (e.g., hourly, or during low-activity periods).
- Kernel/Profile Synchronization: Ensure the critical `exe` and `profile` directories within `sapmnt` are consistently updated.
- Standardize Custom Script Locations:
- Policy: Enforce a policy that all custom scripts or executables called by `SM37` jobs or `SM69` commands must reside in a designated, centrally replicated location within `/sapmnt/<SID>/global` (or a similar shared and replicated directory).
- Avoid Local Drives: Strictly prohibit placing such files on local drives of individual application servers.
- Update DR Runbook:
- Verification Steps: Add specific steps to the DR runbook to verify the integrity and completeness of the `/sapmnt` directory at the DR site post-recovery, including checking sizes, timestamps, and the presence of recently added custom files (a minimal comparison sketch follows this answer).
- Script for `sapmnt` Sync: If file-level replication is used, include a step in the runbook to perform a final `rsync`/DFSR sync after the primary disaster is declared but before activating services at DR, to minimize RPO for these files.
- Integrate with Change Management:
- Checklist: Ensure that any change request involving new custom batch jobs, external commands, or modifications to existing ones includes a checklist item to verify that the associated OS-level files are replicated to the DR environment.
- DR Drill Scope Expansion:
- Test Cases: Include specific test cases in DR drills that exercise newly created custom jobs and external commands to validate their functionality and presence at the DR site.
- Application-Level Validation: Work closely with application owners to ensure their validation checklists cover all aspects of recent changes.
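One way to script the runbook's `/sapmnt` completeness check mentioned above is to compare relative paths and file sizes between a reference listing (taken from the replicated copy of the primary) and the DR share. The paths below are placeholders, and a checksum comparison could be added for stricter validation.

```python
import os

def snapshot(root: str) -> dict:
    """Map each file's path relative to root to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            files[os.path.relpath(full, root)] = os.path.getsize(full)
    return files

# Placeholder paths: a replicated reference copy of the primary share and the DR share.
primary_ref = snapshot("/replica/sapmnt/PRD")
dr_copy = snapshot("/sapmnt/PRD")

missing = sorted(set(primary_ref) - set(dr_copy))
size_mismatch = sorted(p for p in primary_ref.keys() & dr_copy.keys()
                       if primary_ref[p] != dr_copy[p])

print(f"Missing at DR   : {len(missing)} file(s)")
print(f"Size mismatches : {len(size_mismatch)} file(s)")
for path in missing[:20]:
    print("  missing:", path)
```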