Job Summary (in one sentence):
You will be responsible for designing, maintaining, testing, and governing the Disaster Recovery (DR) program for IT systems and manufacturing-related systems (OT/ICS), ensuring that critical services can be recovered within agreed-upon objectives (RTO/RPO) and in compliance with regulations and standards.
Daily/Recurring Tasks:
• Maintain recovery documentation and playbooks.
• Monitor the health of backups/replications and the status of SLAs (successful backups, replication lag).
• Coordinate and execute exercises (tabletop, partial simulation, controlled failover).
• Conduct Business Impact Analyses (BIAs) and reprioritize projects.
• Manage vendors, budgets, and resources for DR projects.
• Support response to real incidents and act as the DR point of contact.
• Maintain compliance with ISO 22301, NIST, SEMI, or other applicable standards.
Quick Key Definitions (for internalization):
• Recovery Time Objective (RTO): Maximum acceptable time to recover a service.
• Recovery Point Objective (RPO): Maximum amount of data that can be lost (e.g., 15 min).
• Business Impact Analysis (BIA): Assessment of how an outage of each service affects the business, used to rank services by criticality.
• Playbook/Runbook: Detailed steps to execute during recovery.
• Tabletop/Full failover: Levels of DR testing, from a discussion-only exercise to an actual recovery at the DR site.
• Cold/Warm/Hot site: Types of recovery sites (from empty to production-ready).
• Continuous Data Protection (CDP): Protection with continuous, real-time replication.
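As an illustration of how RTO and RPO are applied in practice, the following is a minimal Python sketch, not a prescribed tool. The `ServiceObjective` class, the `mes-db` service name, the 4-hour RTO, and the timestamps are hypothetical; only the 15-minute RPO echoes the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ServiceObjective:
    """Recovery targets for one service (all values here are hypothetical)."""
    name: str
    rto: timedelta  # maximum acceptable time to recover the service
    rpo: timedelta  # maximum acceptable window of lost data


def rpo_met(obj, last_recovery_point, failure_time):
    """Data written after the last recovery point is lost; that gap must fit in the RPO."""
    return failure_time - last_recovery_point <= obj.rpo


def rto_met(obj, failure_time, restored_time):
    """The outage runs from the failure until the service is restored."""
    return restored_time - failure_time <= obj.rto


mes = ServiceObjective("mes-db", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
failure = datetime(2024, 1, 1, 12, 0)

# Last replica was taken 10 minutes before the failure: within a 15-minute RPO.
print(rpo_met(mes, datetime(2024, 1, 1, 11, 50), failure))  # True
# Service came back 5 hours after the failure: exceeds a 4-hour RTO.
print(rto_met(mes, failure, datetime(2024, 1, 1, 17, 0)))   # False
```

The same comparison can be run against backup or replication timestamps during routine monitoring, so that an RPO breach is flagged before an incident rather than discovered during one.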
Hardware / Physical Infrastructure (common in typical companies):
• On-premises data centers: racks, SAN/NAS, production switches, and redundant topologies (leaf/spine).
• Enterprise storage: NetApp, EMC/Dell PowerMax/PowerStore, Pure Storage, HPE 3PAR.
• Storage arrays with async and sync replication.
• x86 servers (Dell/HP/Lenovo), blades, and virtualization (VMware ESXi, vSphere).
• Tape libraries/LTO drives (IBM, HPE) for long-term archiving (if used).
• Electrical redundancies: UPS, generators, transfer switches, PDUs.
• Critical facilities equipment: redundant HVAC for cleanrooms and humidity/particle control (important in semiconductors).
• OT infrastructure: PLCs, DCS, control systems, industrial gateways, physical or logical segmentation from IT.
Software / Commercial Platforms (frequently used):
• Backup/replication/orchestration: Veeam, Rubrik, Commvault, Veritas NetBackup, Dell EMC Avamar/NetWorker, Zerto (replication/orchestration).
• Virtualization/DR orchestration: VMware Site Recovery Manager (SRM), vSphere Replication, Microsoft Azure Site Recovery, AWS Elastic Disaster Recovery.
• Cloud providers (DR target/backup): AWS, Azure, Google Cloud — native solutions and recovery services.
• Storage replication & snapshot management: storage provider tools (NetApp SnapMirror, Pure Cloud Snapshots).
• Monitoring/observability/SIEM: Splunk, Elastic Stack (ELK), Datadog, Prometheus + Grafana, Nagios, Zabbix.
• ITSM/runbook/automation: ServiceNow, BMC Remedy, Ansible, Rundeck, SaltStack, Terraform (for infrastructure as code and reproducibility).
• Database replication / HA: Oracle Data Guard, Microsoft SQL Server Always On, MySQL Group Replication / Percona XtraDB, PostgreSQL streaming replication, GoldenGate.
• Storage-level replication & CDP vendors: Zerto, Actifio (or native snapshot/replica solutions).
• Audit / compliance: GRC tools or modules in ServiceNow / Archer.
Useful or common open-source software/tools:
• Backup/scripting: Bacula, Restic, Borg, rsync, Duplicity (more common in non-critical or support infrastructure).
• Monitoring/metrics: Prometheus + Grafana, Zabbix, ELK stack (Elasticsearch, Logstash, Kibana).
• Orchestration and automation: Ansible, Terraform, Salt, Jenkins/GitLab CI for test pipelines.
• Infrastructure as code and testing: containers/Kubernetes (for cloud-native apps), Velero (backup/restore for Kubernetes).
• ChatOps/collaborative runbooks: scripts versioned in Git, Notion/Confluence for documentation.
OT/ICS platforms and considerations:
• SCADA/PLC tools from Siemens, Rockwell/Allen-Bradley, Schneider; Modbus/PROFINET/OPC UA protocols.
• Important: keep separate DR plans for OT and IT. OT recovery often relies on manual, safe recovery steps and on procedures approved by plant engineering.
Types of DR tests you should know:
• Tabletop exercise (discussion of roles and steps).
• Walkthrough (step-by-step review of runbooks with the people who execute them).
• Full failover test (actual recovery to the DR site or cloud).
• Partial failover/application-level tests.
• Parallel run (running the recovery system in parallel with production to validate integrity).
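For the parallel run in particular, "validating integrity" usually comes down to comparing data on the recovery system against production. The Python sketch below shows one possible approach using order-independent SHA-256 fingerprints. The dataset names (`lots`, `recipes`) and the in-memory rows are hypothetical stand-ins for data that a real check would read from the two environments.

```python
import hashlib


def digest(rows):
    """Order-independent fingerprint: hash each row, then hash the sorted row hashes."""
    row_hashes = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()


def compare(production, recovery):
    """Return the names of datasets that are missing or different on the recovery side."""
    return sorted(
        name
        for name, rows in production.items()
        if name not in recovery or digest(rows) != digest(recovery[name])
    )


# Hypothetical sample data: "lots" matches (row order differs), "recipes" does not.
production = {"lots": [(1, "A"), (2, "B")], "recipes": [(10, "etch")]}
recovery = {"lots": [(2, "B"), (1, "A")], "recipes": [(10, "etch-v2")]}

print(compare(production, recovery))  # ['recipes']
```

Comparing fingerprints rather than full rows keeps the check cheap to ship between sites. A mismatch only says that a dataset differs, so a row-level diff would still be needed to find the cause.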

