IT Disaster Recovery Program Manager. What the role entails — definitions, responsibilities, and common hardware and software platforms. What you need to know for a job interview.


Job Summary (in one sentence):

You will be responsible for designing, maintaining, testing, and governing the Disaster Recovery (DR) program for IT systems and manufacturing-related systems (OT/ICS), ensuring that critical services can be recovered within agreed-upon objectives (RTO/RPO) and in compliance with regulations and standards.

Daily/Recurring Tasks:

• Maintain recovery documentation and playbooks.

• Monitor the health of backups/replications and the status of SLAs (successful backups, replication lag).

• Coordinate and execute exercises (tabletop, partial simulation, controlled failover).

• Conduct Business Impact Analyses (BIAs) and reprioritize projects.

• Manage vendors, budgets, and resources for DR projects.

• Support response to real incidents and act as the DR point of contact.

• Maintain compliance with ISO 22301, NIST, SEMI, or other applicable standards.

Quick Key Definitions (for internalization):

• Recovery Time Objective (RTO): Maximum acceptable time to recover a service.

• Recovery Point Objective (RPO): Maximum amount of data that can be lost (e.g., 15 min).

• Business Impact Analysis (BIA): Assessment of how an outage of each process or system affects the business, used to rank criticality and set RTO/RPO targets.

• Playbook/Runbook: Detailed steps to execute during recovery.

• Tabletop/Full failover: The two ends of the DR testing scale, from a discussion-only exercise to an actual switchover to the recovery site.

• Cold/Warm/Hot site: Types of recovery sites (from empty to production-ready).

• Continuous Data Protection (CDP): Continuous capture of every change, so data can be restored to almost any point in time and the RPO can be a matter of seconds.
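To make RTO and RPO concrete for an interview answer, the definitions above can be turned into a small calculation. This is a minimal sketch: the timestamps, the 15-minute RPO, and the 1-hour RTO are illustrative assumptions, and the function names are hypothetical rather than part of any DR product.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Return how much data would be lost if the service failed at failure_time."""
    if last_recovery_point > failure_time:
        raise ValueError("recovery point cannot be later than the failure")
    return failure_time - last_recovery_point


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """An objective is met when the measured value does not exceed it."""
    return actual <= objective


# Hypothetical incident: last replica at 10:00, outage at 10:12, service back at 11:30.
last_replica = datetime(2024, 1, 15, 10, 0)
outage_start = datetime(2024, 1, 15, 10, 12)
service_restored = datetime(2024, 1, 15, 11, 30)

rpo = timedelta(minutes=15)  # assumed target, matching the 15 min example
rto = timedelta(hours=1)     # assumed target

loss = data_loss_window(last_replica, outage_start)  # data written since the last replica
downtime = service_restored - outage_start           # time the service was unavailable

print(f"Data loss {loss} vs RPO {rpo}: {'OK' if meets_objective(loss, rpo) else 'MISSED'}")
print(f"Downtime {downtime} vs RTO {rto}: {'OK' if meets_objective(downtime, rto) else 'MISSED'}")
```

In this made-up incident the RPO is met (12 minutes of data lost against a 15-minute target) while the RTO is missed (78 minutes of downtime against a 60-minute target), which shows that the two objectives are measured independently.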

Hardware / Physical Infrastructure (common in typical companies):

• On-premises data centers: racks, SAN/NAS, production switches, and redundant topologies (leaf/spine).

• Enterprise storage: NetApp, Dell EMC PowerMax/PowerStore, Pure Storage, HPE 3PAR.

• Storage arrays with asynchronous and synchronous replication.

• x86 servers (Dell/HPE/Lenovo), blades, and virtualization (VMware ESXi, vSphere).

• Tape libraries/LTO drives (IBM, HPE) for long-term archiving (if used).

• Electrical redundancies: UPS, generators, transfer switches, PDUs.

• Critical facilities equipment: redundant HVAC for cleanrooms and humidity/particle control (important in semiconductors).

• OT infrastructure: PLCs, DCS, control systems, industrial gateways, physical or logical segmentation from IT.

Software / Commercial Platforms (frequently used):

• Backup/replication/orchestration: Veeam, Rubrik, Commvault, Veritas NetBackup, Dell EMC Avamar/NetWorker, Zerto (replication/orchestration).

• Virtualization/DR orchestration: VMware Site Recovery Manager (SRM), vSphere Replication, Microsoft Azure Site Recovery, AWS Elastic Disaster Recovery.

• Cloud providers (DR target/backup): AWS, Azure, Google Cloud — native solutions and recovery services.

• Storage replication & snapshot management: storage vendor tools (NetApp SnapMirror, Pure Storage CloudSnap).

• Monitoring/observability/SIEM: Splunk, Elastic Stack (ELK), Datadog, Prometheus + Grafana, Nagios, Zabbix.

• ITSM/runbook/automation: ServiceNow, BMC Remedy, Ansible, Rundeck, SaltStack, Terraform (for infrastructure as code and reproducibility).

• Database replication / HA: Oracle Data Guard, Microsoft SQL Server Always On, MySQL Group Replication / Percona XtraDB, PostgreSQL streaming replication, GoldenGate.

• Storage-level replication & CDP vendors: Zerto, Actifio (or native snapshot/replica solutions).

• Audit / compliance: GRC tools or modules in ServiceNow / Archer.

Useful or common open-source software/tools:

• Backup/scripting: Bacula, Restic, Borg, rsync, Duplicity (more common in non-critical or support infrastructure).

• Monitoring/metrics: Prometheus + Grafana, Zabbix, ELK stack (Elasticsearch, Logstash, Kibana).

• Orchestration and automation: Ansible, Terraform, Salt, Jenkins/GitLab CI for test pipelines.

• Infrastructure as code and testing: containers/Kubernetes (for cloud-native apps), Velero (backup/restore for Kubernetes).

• ChatOps/collaborative runbooks: scripts versioned in Git, Notion/Confluence for documentation.
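As an illustration of the kind of script these tools are often glued together with, here is a minimal sketch of the recurring backup-health check (successful backups, replication lag). The job records, system names, and thresholds are invented for the example; in practice the data would come from the backup platform's own reports or API, which are not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BackupJob:
    system: str
    finished_at: datetime
    succeeded: bool


def success_rate(jobs: list[BackupJob]) -> float:
    """Fraction of jobs that succeeded (0.0 when there are no jobs)."""
    if not jobs:
        return 0.0
    return sum(job.succeeded for job in jobs) / len(jobs)


def stale_systems(jobs: list[BackupJob], now: datetime, max_age: timedelta) -> list[str]:
    """Systems whose latest successful backup is older than max_age, or that have none."""
    latest: dict[str, datetime | None] = {}
    for job in jobs:
        latest.setdefault(job.system, None)
        current = latest[job.system]
        if job.succeeded and (current is None or job.finished_at > current):
            latest[job.system] = job.finished_at
    return sorted(s for s, t in latest.items() if t is None or now - t > max_age)


def lagging_replicas(lag_seconds: dict[str, float], threshold: float) -> list[str]:
    """Replicated systems whose lag exceeds the agreed threshold (in seconds)."""
    return sorted(s for s, lag in lag_seconds.items() if lag > threshold)


# Hypothetical data for one morning's review.
now = datetime(2024, 1, 15, 8, 0)
jobs = [
    BackupJob("erp-db", datetime(2024, 1, 15, 2, 0), True),
    BackupJob("mes-app", datetime(2024, 1, 15, 2, 30), False),
    BackupJob("mes-app", datetime(2024, 1, 13, 2, 30), True),
    BackupJob("file-share", datetime(2024, 1, 15, 3, 0), True),
]

print("Success rate:", success_rate(jobs))
print("Stale (>24 h):", stale_systems(jobs, now, timedelta(hours=24)))
print("Lagging (>300 s):", lagging_replicas({"erp-db": 45.0, "mes-app": 900.0}, 300.0))
```

Tracking the age of the last successful backup per system, rather than only the overall success rate, is a deliberate choice here: a single failed job on a critical system can hide behind a healthy-looking percentage.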

OT/ICS platforms and considerations:

• SCADA/PLC tools from Siemens, Rockwell/Allen-Bradley, Schneider; Modbus/PROFINET/OPC UA protocols.

• Important: maintain separate DR plans for OT and IT. OT recovery is often manual and safety-driven, and it follows procedures approved by plant engineering.

Types of DR tests you should know:

• Tabletop exercise (discussion of roles and steps).

• Walkthrough (step-by-step review of the runbooks with the people who would execute them).

• Full failover test (actual recovery to the DR site or cloud).

• Partial failover/application-level tests.

• Parallel run (running the recovery system in parallel with production to validate integrity).
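For the parallel run in particular, "validating integrity" usually comes down to comparing what the recovery system holds against production. The following is a minimal sketch of one way to do that with per-record SHA-256 digests; the record keys and contents are made up, and a real comparison would read from the actual databases or file systems instead of in-memory dictionaries.

```python
import hashlib


def site_digests(records: dict[str, str]) -> dict[str, str]:
    """Map each record key to the SHA-256 digest of its content.

    Each site computes this locally, so only digests need to be exchanged.
    """
    return {key: hashlib.sha256(value.encode("utf-8")).hexdigest()
            for key, value in records.items()}


def compare_sites(prod: dict[str, str], recovery: dict[str, str]) -> dict[str, list[str]]:
    """Compare two digest maps and report the differences by category."""
    prod_keys, rec_keys = set(prod), set(recovery)
    return {
        "missing_in_recovery": sorted(prod_keys - rec_keys),
        "unexpected_in_recovery": sorted(rec_keys - prod_keys),
        "mismatched": sorted(k for k in prod_keys & rec_keys if prod[k] != recovery[k]),
    }


# Hypothetical records on each side of the parallel run.
production = {
    "order-1001": "lot=A;qty=25",
    "order-1002": "lot=B;qty=50",
    "order-1003": "lot=C;qty=10",
}
recovery_site = {
    "order-0999": "lot=Z;qty=1",
    "order-1001": "lot=A;qty=25",
    "order-1002": "lot=B;qty=5",
}

report = compare_sites(site_digests(production), site_digests(recovery_site))
for category, keys in report.items():
    print(f"{category}: {keys}")
```

An empty report would suggest the recovery copy matches production for the records checked; any entry is a finding to explain before the test is signed off.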
