Evgenii Seliavka

Senior SRE | AI/ML Infrastructure

Career Summary

Site Reliability Engineer (ICT‑5) with 15+ years of experience building and scaling mission‑critical infrastructure at Apple, Twitter, and Capital One. Expert in cloud platforms (AWS, GCP), object storage, and distributed systems, with a proven record of driving reliability, automation, and cost efficiency across large‑scale environments. Skilled in leading cross‑functional teams and delivering resilient infrastructure solution that support high‑volume AI/ML workloads and core business platforms.

My expertise spans cloud platforms such as Google Cloud and AWS, infrastructure as code with Terraform, and containerization technologies like Docker and Kubernetes. I am proficient in CI/CD practices and tools, and well-versed in DevOps methodologies.

I have a strong foundation in Linux/Unix systems and distributed architectures including Hadoop. I work extensively with scripting languages such as Python and Bash, and general-purpose languages like Go, Java, and C. This allows me to efficiently automate tasks and reduce manual intervention.

I’m known for strong problem-solving and communication skills, and I collaborate effectively with development teams to ensure seamless deployment pipelines. My contributions help deliver high-quality software faster, reduce downtime, and enhance customer experience.

Work Experience

SRE

Apple

May 2023 - current

As a Site Reliability Engineer at Apple, I design, build, and support large-scale object storage systems that provide persistent storage layers for AI/ML workloads across both public and private cloud environments. My work ensures high availability, reliability, and performance of the storage infrastructure that underpins Apple’s compute-intensive AI/ML platforms.

Achievements:

Implemented tool for distributed copy of objects between various object stores(S3, GCS etc), made it available across the company.

Designed and implemented a distributed object copy tool supporting cross‑cloud transfers (S3, GCS, internal object stores), reducing large dataset migration time by 40%+.
Enabled company‑wide adoption of the tool, with 100+ engineering teams leveraging it for AI/ML workloads and storage workflows.
Scaled to handle petabyte‑level transfers and billions of objects with high reliability and fault tolerance.
Improved developer productivity by eliminating manual migration workflows, saving an estimated thousands of engineer hours annually.

Responsibilities:

Administration and management of the AI/ML infrastructure in AWS public cloud, GCP and on-premise.
Administration, support and management of various open-source, distributed storage and processing systems such as Alluxio / Apache Spark.
Refactoring and continuously improving existing code base through TDD approach.
Monitoring, SLA/SLO/SLI. Mentoring other engineers in the team. Incident management.

Technologies used:

AWS
GCP
Spark
Dask
Python
Bash
Fluxcd
IaC
Terraform
k8s
Linux

Lead DevOps/SRE

Capital One

Jan 2023 - May 2023

As an AI/ML infrastructure engineer I am responsible to designing, developing, and maintaining the technical infrastructure that supports artificial intelligence and machine learning projects. This includes building and managing large-scale computing systems, data storage and retrieval systems, and high-performance networking systems.

Achievements:

Led a major Terraform version upgrade project, resulting in notable improvements in infrastructure management processes and efficiency.

Upgraded the infrastructure codebase from Terraform version 0.11.5 to version 1.5.2, demonstrating a commitment to staying up-to-date with the latest technologies and best practices.
Implemented enhanced performance optimizations, resulting in a 30% reduction in infrastructure provisioning time and improved overall operational efficiency.
Leveraged new features and functionality introduced in the upgraded Terraform version, such as improved resource management and expanded module features, to streamline infrastructure deployments and enhance scalability.
Addressed potential vulnerabilities by incorporating the latest security enhancements available in Terraform version 1.5.2, ensuring a more secure infrastructure environment.

Responsibilities:

Administration and management of the AI/ML infrastructure in AWS public cloud.
Enhanced existing AI/ML pipelines through migration Jupyter Notebooks to Kubeflow.
Refactored internal microservice codebase in order to comply with best SDLC practices.
Upgraded the Terraform infrastructure as code (IaC) tool to leverage new features, enhancements, and performance optimizations introduced in the latest version.

Technologies used:

AWS
PostgreSQL
Kubeflow
Jupyter Notebooks
Python
Bash
Terraform
IaC
Docker
k8s
Linux

Staff SRE

Twitter

Feb 2020 - Jun 2023

As a Staff Hadoop Site Reliability Engineer (SRE), I was responsible for leading the design, building, and maintenance of highly available and scalable Hadoop clusters. I was working closely with cross-functional teams to identify opportunities for automation, improving reliability and performance, and ensuring the security of the Hadoop cluster. I provided technical guidance and mentorship to junior team members, participated in on-call rotations, and collaborate with other senior engineers to establish best practices and drive technical innovation.

Achievements:

Implementation and integration Kerberos security on top of the cloud / on-prem clusters, without HDFS service interruption to enable GDPR

100% HDFS service availability during rollout.
100% of clusters running in secure setup.
No impact on Twitter users.
Streamlined process of provisioning Kerberos security on cloud and on-prem clusters.

E2E automation for scaling and collocation Apache Flume apps on physical boxes

Increase Apache Flume fleet by 75% across 3 DCs.
Save company $30M in CapEx for 2022-2023

Responsibilities:

Administration and management of the hadoop fleet in the company, tens of thousands of hosts.
Troubleshooting and debugging technical issues for my team and across the company.
Continuous automation of all processes within the team and company.
Monitoring, SLA/SLO/SLI.
Working with the open source community.
Development of the different applications and scripts.
Refactoring and continuous improving existing code base.
Cloud migration process, implementation, documentation, review and feedback.
Mentoring peers within the team and company, interviewing new candidates.
Capacity planning and cost reduction.
Incident management.
Technical leadership, knowledge transfer, team roadmap planning, project management.
Docs creation and review, such as Technical Design Docs (TDD), Production Readiness Docs (PRD), etc.
Creating / Improving existing educational/training documentation (wiki/run books).

Technologies used:

Vertica
MySQL
PostgreSQL
Kerberos
Scala
Apache Hadoop
Apache Hbase
Apache Flume
Apache Spark
MapReduce
Java
Go
Python
Bash
Terraform
IaC
Puppet
Docker
k8s
GCP
Linux

Senior SRE

Twitter

Nov 2018 - Feb 2020

As a Senior Site Reliability Engineer (SRE), I was responsible for designing, building, and maintaining highly available and scalable Hadoop clusters. I was working closely with cross-functional teams to identify opportunities for automation, improving reliability and performance, and ensuring the security of the Hadoop cluster. I participated in on-call rotations to provide 24/7 support and troubleshoot production issues.

Achievements:

Bring test driven development methodology to the team

80% of our code base became covered with unit tests.
Significant reduction in bugs across new and existing codebase.
Overall 25% improvement in team velocity.

Technologies used:

Vertica
MySQL
PostgreSQL
Kerberos
Scala
Apache Hadoop
Apache Hbase
Apache Flume
Apache Spark
MapReduce
Java
Go
Python
Bash
Terraform
IaC
Puppet
Docker
k8s
GCP
Linux

Senior Big Data Engineer

IAS

Jan 2018 - Nov 2018

As a Senior Big Data Engineer I was responsible for designing, implementing, and maintaining the data infrastructure for large-scale data processing systems. My primary focus was on troubleshooting, developing and optimizing the systems that store, process, and analyze massive volumes of data in real-time.

Achievements:

Developed and opens source liquibase plugin for Hive and Impala database migration

Significantly reduced e2e testing time from hours to minutes.
Overall 10-15% improvement in team velocity.

Responsibilities:

Development and enhancing various data pipelines, troubleshooting, integration with existing CI/CD pipelines using tools such as Fabric, Ansible and Jenkins.
MySQL/Impala database schema migration using Maven and Liquibase.
YARN jobs troubleshooting and optimization, such as filtering, limiting, reducing skew.
Propose CDH cluster configuration changes.
Integration testing existing data pipelines with Docker.
Migration of the various Bash script to Python.
Unit test coverage of the existing pig scripts with Pig Unit as an enhancement of the existing CI/CD pipelines.
Hands on experience with Hive and Impala for interactive data analyzing as well as building permanent DS pipelines utilizing these ecosystem components.
Data impact analysis in case of any code changes.
Experience in writing custom UDF's for extending Pig core functionality.
Develop Impala/Hive Liquibase plugin for schema migration in CI/CD pipelines.

Technologies used:

NoSQL
MySQL
ETL
Apache Spark
Apache Airflow
Hadoop
MapReduce
Hbase
Cloudera Impala
Hive
Chef
Ansible
Docker
Vagrant
AWS
Linux
Python
Bash
Java

Senior Hadoop Administrator

Cigna

Aug 2017 - Jan 2018

As a Senior Hadoop Administrator I was responsible for maintaining and optimizing the Hadoop infrastructure that supports big data processing. My primary focus was on ensuring the high availability, scalability, and performance of Hadoop clusters to support the organization's big data needs.

Achievements:

Migrate legacy Bash provisioners to Ansible configuration management tool

Significantly increase code stability, reusability

Responsibilities:

Installation, administration and management kerberized CDH cluster.
Automation node provisioning with Ansible.
CDH upgrades, patching and installation of ecosystem products through CM and CLI along with CM upgrade.
MR/Spark jobs troubleshooting.
Involved in backup and disaster recovery planning and implementation procedures.
Troubleshooting different ecosystem products such as Hue, Hive and Oozie.
Performance bottleneck illumination for CDH and ecosystem components especially Apache Hbase.

Technologies used:

Kerberos
Apache Spark
MapReduce
Hadoop
Hbase
Linux
Ansible
Python
Bash

Senior DevOps Engineer

Grid Dynamics

Feb 2015 - Aug 2017

As Senior DevOps Engineer I was responsible for overseeing the development, deployment, and maintenance of a customer software systems. My role involved collaboration with development teams to ensure software code is released in a reliable, efficient, and automated manner.

Achievements:

Adopt BATS testing framework in organization

Increase unit test coverage for organization's bash code base from 0 to 35%
Improve code stability in production
Reduce by 5% number of critical production bugs

Responsibilities:

Construction, adjustment and maintenance of the Hadoop cluster.
Operation experience with MR1/MR2.
Develop new features and supporting existing ETL processes.
Operation experience with Hive, Impala, Pig, Oozie.
Implement deployment automation procedures with Ansible, Puppet, Vagrant, Docker.
Building CI/CD pipelines for various projects.
Operation experience with column oriented database Infobrigth.
Development service scripts and programs.
Development and maintenance of working documents

Technologies used:

NoSQL
MySQL
ETL
Apache Spark
Apache Airflow
Hadoop
MapReduce
Hbase
Cloudera Impala
Hive
Chef
Ansible
Docker
Vagrant
AWS
Linux
Python
Bash
Java

Hadoop Administrator/PostgreSQL DBA

Okko

Jan 2012 - Feb 2015

As Hadoop Administrator / PostgreSQL DBA I was responsible for managing and maintaining Hadoop clusters that store and process large amounts of data as well as design, implementation, and maintenance of PostgreSQL databases. The role involved configuring and optimizing the Hadoop environment to ensure that it meets the performance and security requirements of the organization as well as ensure that the PostgreSQL databases was optimized for performance, highly available, and secure.

Achievements:

Build a HA PostgreSQL cluster using OSS CRM

Improved DB availability by 200%
Increased scalability by 200%
Significantly improved DB performance, fault tolerance and simplified maintenance

Develop and open source tool to manage Citrix Netscaler app

Enhance user experience through graceful disabling of the Citrix Netscaler management services
Streamlined deployment pipeline
Significantly increase development velocity in the team, code deploys from ones a week to multiple per day

Responsibilities:

Construction, adjustment and maintenance of the Postgresql database cluster.
Construction, adjustment and maintenance of the Hadoop cluster.
Construction, adjustment and maintenance of the Hbase cluster.
Setup and operation experience with Hadoop Hive and Hadoop Impala.
Performance optimization with Hadoop,Hbase,Postgresql.
Construction, adjustment and maintenance of the Redis servers cluster.
Setup and maintenance Citrix Netscaler.
Configuring the Web Servers Apache, Tomcat6, Nginx.
Development service scripts and programs.
Carrying configurations of hardware and software in the event of unforeseen situations that could lead to malfunction.
Decision emergencies.
Development and maintenance of working documents.

Technologies used:

NoSQL
PostreSQL
Hadoop
MapReduce
Hbase
Cloudera Impala
Hive
Puppet
Ansible
AWS
Linux
Python
Bash
Java

Systems engineer / PostgreSQL DBA

Megafon

Aug 2008 - Jan 2012

As Systems Engineer I was responsible for designing, implementing, and maintaining complex computer systems and networks as well as design, implementation, and maintenance of PostgreSQL databases. The role involved collaborating with other technical teams to develop solutions that meet the business requirements of the organization as well as ensure that the PostgreSQL databases was optimized for performance, highly available, and secure.

Responsibilities:

Install new hardware.
Studying and preparing to install the new version of the software on entrusted equipment.
Configuration and maintenance of servers.
Configuration and maintenance Oracle RDBMS.
Configuration and maintenance Postgresql RDBMS.
Carrying configurations of hardware and software in the event of unforeseen situations that could lead to malfunction.
Decision emergencies.
Development and maintenance of working documents.
Project management for the implementation of managing and monitoring systems.
Protection management and monitoring systems.

Technologies used:

Oracle
PostreSQL
Solaris
RDBMS
HP-UX
Unix
Unix Shell Scripting
Linux
Bash
Perl

Skills & Tools

Languages

Python
Go
Bash
Java
C

Development

Git
SVN
Github

Virtualization

KVM
LXC
Docker
Kubernates

Cloud

Provisioning

Ansible
Chef
Puppet
Terraform

CI/CD

Jenkins
Github actions

RDBMS

PostgreSQL
MySQL

Distributed Systems

Apache Hadoop
Apache Spark
Cloudera Impala

OS

Linux
Unix

Others

DevOps
Code Review
HDFS
YARN
MapReduce
Flume
Kafka
Hive
Unit Testing
Vertica
Oracle
Liquibase
Fabric
Vagrant
VirtualBox
MVN
JDK
awk
sed

Education

Bachelor's Degree in Information Systems and Technology

Saint Petersburg ITMO State University

2002 - 2008

Publications

Kerberizing Hadoop Clusters at Twitter

Source link
How we scaled Reads On the Twitter Users Database

Source link

OSS

Impala/Hive Liquibase plugin

Author of an open source Liquibase plugin for Cloudera Impala and Apache Hive
Contributed to dnsmasq open source project

Enhance --conf-dir parameter to load files in a deterministic order in version 2.81

Language

Interests

Hiking
Fitness
Travelling