Site Reliability Engineering

The Site Reliability Engineering (SRE) team at our client’s company is a highly motivated technical group that triages and restores mission-critical services and is passionate about improving and maintaining uptime.

The primary responsibility of the SRE team is to monitor critical infrastructure and applications, manage fault tolerance across an enterprise cloud business, and provide the coverage needed to protect Ariba's Business Commerce 24x7 within the Cloud Operations organization. The successful candidate will have strong knowledge of Unix systems and networking protocols, and the desire to build the tools needed to accomplish the task at hand. The candidate must understand incident management methodologies, possess excellent verbal and written communication skills, and be able to interact effectively with engineering and other operations teams.

Primary Responsibilities:

  • Proactively monitor availability and performance of the Ariba Cloud using key performance tools.
  • Respond quickly and effectively to monitoring alerts and incident tickets, and provide overall technical support for the Ariba product suite.
  • Perform extensive application and web site troubleshooting to quickly resolve issues.
  • Work closely with subject matter experts within various Engineering teams.
  • Ensure user tickets and monitoring alerts are handled according to pre-defined SLAs for response time, updates and closure.
  • Automate manual tasks to improve day-to-day monitoring and the scalability of time-critical operations.
  • Handle communication and notification on major site issues to executive management teams.
  • Document standard operating procedures in line with ITIL best practices.
  • Ensure effective shift turnovers for continuous 24/7 support.

Minimum Qualifications:

  • 5-7 years of experience working in a Unix environment.
  • Experience working in a 24x7 enterprise environment.
  • Experience triaging and supporting system applications including but not limited to Apache, DNS, Sendmail, SSH, TCP/IP, NFS and common Internet protocols.
  • Excellent knowledge of operating system internals, file system structures and machine architectures in a Linux operating environment.
  • Basic knowledge of Oracle database administration.
  • Ability to write and maintain Perl and Shell scripts to automate processes and enhance productivity.
  • Experience working in a dynamic, fast-paced environment with well-developed practices and procedures.
  • Outstanding interpersonal, analytical, and communication skills.
  • Must be reliable and dependable, with the ability to multi-task across various functions.
  • BA/BS degree in MIS/CS or equivalent experience.

Location: Palo Alto, CA