Lancium is seeking a DevOps Engineer to help scale our High-Throughput Compute Grid as we bring our service offering to market. We are on track to grow our infrastructure to over 15,000 CPU cores and 1,000 GPUs by the end of the year. If you are experienced in large-scale infrastructure and want to be involved in the early stages of a new cloud offering focused on High-Throughput Computing, this might be the right place for you.
- Work with data center personnel to provision and install new compute clusters.
- Manage configuration of compute clusters and networks using Ansible.
- Monitor status of the Lancium Compute Grid and remotely troubleshoot error conditions and performance issues.
- Coordinate with data center personnel to troubleshoot, remove and replace failed equipment.
- Work with software developers to test and migrate new code into production.
- Maintain existing CI workflows and help build processes and infrastructure to implement CD workflows.
- Maintain and extend reach of APM metrics and exception reporting in the Lancium Compute codebase.
- Manage Lancium Compute’s monitoring, alerting, and logging infrastructure.
- Manage Lancium Compute’s DNS and DHCP infrastructure.
- Extensive experience administering Linux servers
- Experience with Configuration Management using tools such as Ansible, Salt, Puppet, or Chef
- Server hardware troubleshooting
- Flexibility to readily respond to changing circumstances and expectations; open to new ideas and procedures
- Excellent verbal communication skills
- Ability to handle multiple simultaneous tasks under pressure
- High motivation to work with minimal supervision in a collaborative environment
- Strong organization and time management skills, with the ability to prioritize and triage workflow
- Experience with Scientific, High-Throughput, or High-Performance Computing environments
- Familiarity with Singularity containers
- Familiarity with programming in Ruby, Python, or Java
- Experience using Git for version control
- Familiarity with Prometheus, Grafana, and/or Sentry
- Familiarity with ISC BIND and DHCP servers
- Familiarity with Data Center networking
The DevOps Engineer will primarily work in Lancium’s Charlottesville office, but the position requires travel to Lancium’s Houston Operations Center and to remote data center locations. Work locations may include data centers with both climate-controlled and non-climate-controlled conditions. There may be exposure to extreme temperatures, noise and vibration, and mechanical or electrical hazards.
Full-time with approximately 15% travel
- Health Insurance
- Dental Insurance
- Vision Insurance
- Life Insurance
- Voluntary Short- and Long-Term Disability Insurance
- Paid Holidays and Time Off
Lancium is a technology company creating software and technical solutions that enable the faster growth of renewable energy. Our products include Lancium Smart Response™ for server power management, the Lancium Compute Platform for high-throughput computing applications, and Lancium Clean Compute Centers™ that absorb excess renewable energy. These solutions help ensure that renewable energy can power our future.
Lancium’s technical headquarters are located in Northwest Houston. Successful candidates will join a dynamic company with considerable room for advancement as the company expands into new markets and geographies.