Responsibilities:
• Lead the deployment, configuration, and optimization of monitoring tools (e.g., LogicMonitor, Dynatrace).
• Develop and maintain automation scripts and workflows to streamline routine IT operations.
• Ensure real-time visibility into system health, performance, and security.
• Drive proactive maintenance, capacity planning, and continuous improvement initiatives.
• Integrate monitoring and automation with incident management, reporting, and alerting systems.
• Oversee alerting, reporting, and escalation processes, ensuring SLA compliance.
• Lead incident response and troubleshooting for monitoring and automation issues.
• Maintain comprehensive documentation of monitoring architectures and automation procedures.
• Mentor and guide L2 Support Analysts, fostering knowledge sharing and skills development.
• Collaborate with infrastructure, application, and security teams to ensure comprehensive coverage across Infrastructure Monitoring and Application Performance Monitoring (APM).
• Conduct regular reviews of monitoring data and automation effectiveness.
• Participate in governance, reporting, and service review meetings.
• Support audit and compliance activities, including timely remediation of audit findings.
• Stay current with emerging monitoring and automation technologies and best practices.
• Manage vendor relationships and coordinate with third-party tool providers.
• Promote adoption of automation to reduce manual effort and improve operational efficiency.
Requirements:
• Monitoring Tools & Coverage: Expert in LogicMonitor and Dynatrace, including coverage modeling and KPI/dashboard creation for both Infrastructure Monitoring and Application Performance Monitoring (APM).
• Alerting & Event Management: Skilled in alert threshold tuning, event enrichment, and noise reduction to ensure actionable monitoring.
• Automation & Scripting: Proficient in PowerShell, Bash, and Python for automating IT operations and incident response workflows.
• Monitoring Operations: Experience managing monitoring pipelines, tool upgrades, and rollback procedures.
• Integration & ITSM: Ability to integrate monitoring systems with ITSM, SIEM, and escalation processes.
• Capacity & Performance Reporting: Strong skills in capacity planning, SLO reporting, and synthetic monitoring.
• Disaster Recovery & Observability: Experience with disaster recovery validation and establishing observability standards for new services and locations.
• Knowledge Management: Experience maintaining alert playbooks, tuning documentation, and a knowledge base of monitoring best practices.
• Mentorship & Team Development: Proven ability to mentor L2 analysts in monitoring, automation, and operational excellence.