Senior Site Reliability Engineer

Reston, VA

Cloudpermit is seeking a Senior Site Reliability Engineer with in-depth knowledge of modern cloud and CI/CD architecture and meaningful experience applying it to multi-tenant SaaS products. In this role, you will live and breathe cloud-native architectural practices while engineering scalable solutions that value simplicity over complexity. Scalability is the key objective in all architecture, design, and code, and you will aim for long-term success rather than short-term gains.

About Cloudpermit

Cloudpermit is the fastest-growing community development SaaS (software-as-a-service) company in North America. We provide local governments and their citizens with cloud-based software products for all land management processes, including permitting, code/zone enforcement, business licensing, city planning, and more. Headquartered in Reston, VA, Cloudpermit was founded over ten years ago and is growing rapidly across North America. Cloudpermit is committed to delivering efficient, accessible, and smart land management software for agencies and citizens nationwide.

Responsibilities and Duties

Implement cloud-native design principles to ensure the reliability, scalability, and performance of our large-scale, cloud-based, multi-tenant SaaS solution and infrastructure, including automation, monitoring, and incident response.
Develop and maintain automation tools and infrastructure as code to simplify operations and enhance efficiency using cloud-based tools and technologies, including automated continuous integration and delivery pipelines authored with Jenkins, Terraform, Ansible, and various Git tools.
Ensure the availability, reliability, and performance of critical systems and applications, maintaining the current 99.99% availability.
Implement and maintain robust monitoring systems to track system health and performance, and configure alerts for critical issues.
Respond to incidents and outages, diagnose problems, and implement solutions to restore service. Analyze incidents to identify root causes, implement preventive measures to avoid future issues, and communicate the plan of action effectively to all stakeholders.
Plan for future capacity needs and ensure that systems can handle anticipated workloads while keeping cloud costs at 1% of SaaS revenue.
Conduct disaster recovery exercises to determine and document the recovery time objective (RTO) and recovery point objective (RPO), and write recovery procedures that minimize mean time to recover.
Identify areas for improvement in systems, processes, and tools, and implement changes that enhance reliability and performance and improve the developer experience.

Qualifications, Skills, & Past Experience

Bachelor’s Degree.
5+ years of experience in cloud-native SaaS design, automation, and deployment of large-scale infrastructure on Google Cloud, AWS, or Azure.
5+ years of experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, or Rust.
5+ years of systems design, software development, operations, automation, and process improvement experience, including CI/CD pipelines and build processes.
3+ years of experience with Agile practices.

Company Culture

Focus - We are focused on our passion, work ethic, and goals, so we can continue to push forward and create innovative software.

Collaboration - We work and succeed as a team at Cloudpermit because we're stronger together. We accept each other’s strengths so we can learn from one another and become better.

Respect - Respecting our customers, and respecting each other, is of the utmost importance. We trust our team to do great work and stay open to new ideas.

The Cloudpermit team is made up of motivated, positive, and tech-savvy people. We enjoy working as a team to solve problems and thrive on collective and personal success. At Cloudpermit, you will be assigned engaging and challenging projects and will have opportunities to give input and direction. The Cloudpermit work environment is inclusive, challenging, and rewarding.