Site Reliability Engineer (SRE)
Description:
To begin with: 40% projects, 60% operations (ServiceNow ticket resolution)
• For a candidate with strong skills, the split will shift to 60% projects and 40% operations
• Automation – Someone who can review and tweak existing scripts and create new ones as needed, turning manual tasks into automation and looking for ways to streamline
years of progressively responsible experience in enterprise technology operations, multi-cloud platform management
Customer Support
Provide daily support to customers, focusing on server operations, troubleshooting, as well as backup and restore operations. Ensure high levels of customer satisfaction and maintain strong customer relationships.
Strategy and Support
Collaborate on delivering the outcome of strategic roadmaps that evolve the infrastructure portfolio, providing subject matter expertise as these roadmaps are planned and executed. Act as a technical SME involved in major incidents affecting hybrid cloud platforms, server and storage workloads, or related automation systems. Assist in the identification of underlying root causes and remediation activities.
Pillars:
• Operational Excellence
• Automation and Modernization
• Customer Support
• Documentation and Compliance
• Strategy and Support
MUST HAVE:
• Exchange Online, Archiving Technologies – Barracuda or Veritas, Network File Shares, Backups, Patching, M365 Stack, Cloud – Azure and AWS, Windows Server OS and Windows Server troubleshooting
• Modern email infrastructure such as M365, SendGrid, etc.
• Microsoft 365 – Teams including Channels, OneDrive
• Full life-cycle testing and deployment of Microsoft-released patches across all environments.
• Responsible for supporting EDR for enterprise storage and cloud repositories.
• Microsoft Windows Server ecosystem
• Microsoft Entra ID user and access management and third-party integrations.
• Backup solutions (Rubrik On-Prem and RSC, Veritas)
• Azure MFA/SSO/SSPR
• PingFederate and PingDirectory – SSO and SSPR
• Monitor and maintain server, storage, and network systems to ensure high availability and performance.
• Manage incident, problem, and change management processes for infrastructure components.
• Perform root cause analysis (RCA) on recurring or high-impact infrastructure incidents.
• Execute infrastructure health checks, capacity planning, and performance tuning.
• Maintain system documentation, operational procedures, and configuration records.
• Respond to monitoring alerts and perform first- and second-level troubleshooting.
• Collaborate with other IT teams (e.g., application support, security, DevOps) to resolve cross-functional issues.
• Participate in on-call rotation and provide after-hours support as needed.
• Identify and implement automation opportunities to improve operational efficiency.
• Excellent communication and documentation skills
• Strong sense of ownership and operational awareness
• Able to work cross-functionally across dev, infrastructure, and legacy support teams
• Comfortable maintaining reliability for systems you didn’t originally build
Nice to Have:
• Healthcare experience
Hybrid Work Environment
Job Features
Job Category: Site Reliability Engineer (SRE)