Site Reliability Engineer

Location: Cambridge, Massachusetts, United States
Date Posted: 06-06-2018
Job Description:
 
We are looking for an SRE to help us evolve our platform. We are seeking the rare individual who can politely but firmly interface with customers at any level, research highly complex, multi-dimensional problems and relentlessly seek the root cause, and recommend and defend system and network architectures that perform optimally and are absolutely bullet-proof. You're right for the job if you are comfortable contributing to major incident response in a technical team of engineers laser-focused on restoring service across complex distributed architectures. You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization. You will work as a key part of our SRE team and interface with Customers, Engineering, Operations and other teams to support our next-generation "always up" cloud-based Content Distribution and Edge Compute platform.
 
RESPONSIBILITIES WILL INCLUDE:
 
·    Building software and systems to provision and manage infrastructure through automation
·    Deployment, support and monitoring of new platforms and application stacks
·    Measurement and optimization of system performance
·    Capacity planning and management
·    Exploring and evaluating new technologies and solutions to push our capabilities forward, staying ahead of our customers’ needs and encouraging teams to transform, innovate and continually improve
 
Minimum Qualifications
  • Understanding of incident management processes and procedures.
  • Calm under pressure when participating in major incident response.
  • Technical understanding of core infrastructure, cloud services, platforms and micro-services.
  • Ability to understand and capture key data from logs.
  • Ability to understand traffic flows and key dependencies between services.
  • Ability to triage effectively - be able to distinguish symptom from cause.
  • Detect and quantify impact.
  • Analyze trends to proactively prevent incidents.
  • Focus on immediate restoration vs root cause.
  • Research and recommend alternative actions for incident resolution - Develop procedures and documentation to support this.
  • Create and maintain procedural documentation.
  • Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
  • Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
  • Build tools to improve visibility, proactively detect issues and restore system availability.
  • Develop automation and self-healing with DevOps, Engineering and SRE partners.
  • Strong focus on collecting and inferring metrics.
  • Clear communication skills.
  • Ability to contribute to multiple incidents at any given time.
  • Analyze systems and make recommendations to prevent possible problems. Take the lead on issue resolution activities using knowledge of complex and company-wide systems.
  • Scripting and software development to automate and help enhance existing solutions.
Additional responsibilities may include:
  • Actively provide data for and participate in root cause analysis.
  • Share knowledge globally between SRE teams.
  • Analyze systems and make recommendations to prevent possible incidents.
  • Strive for continuous improvement and make recommendations based on SRE process.
  • Other duties and responsibilities as assigned.
 
About Ericsson UDN:
 
Ericsson UDN is developing the first distributed edge compute platform that enables services such as content delivery to be deployed on service provider access networks. Our goal is to enable the next generation of applications such as VR/AR, gaming, IoT, and big data to be deployed in a tiered compute infrastructure that enables massive scale with ultra-low latency. Our first service, a massively scalable Content Delivery service, is already in production. Successful candidates will have the opportunity to work with the largest names in the business.