Career Summary

As an SRE/DevOps Engineer, my primary responsibility is to keep software applications running smoothly in production. I create, maintain, and optimize the systems and processes that deliver reliable software to customers. My key skills include automation, monitoring, and troubleshooting of complex systems, with a deep understanding of operating systems such as Linux and Unix and of distributed systems such as Hadoop. I have experience with cloud platforms such as Google Cloud and AWS, am proficient in infrastructure-as-code tools such as Terraform, and am adept at deploying and managing applications with containerization technologies such as Docker and Kubernetes. I have experience with continuous integration and delivery (CI/CD) pipelines and am well versed in DevOps methodologies and practices. I communicate well and work collaboratively with the rest of the development team to ensure smooth deployment and operation of software applications. I am proficient in scripting languages such as Python and Bash and in general-purpose languages such as Go, Java, and C; the ability to write and maintain code helps me automate tasks and reduce manual intervention. Strong problem-solving and critical-thinking skills allow me to quickly diagnose and resolve issues as they arise. Overall, I play a crucial role in ensuring the reliability and scalability of software applications, helping development teams deliver high-quality software faster while minimizing the risk of downtime and other issues that can impact the customer experience.

Work Experience

Lead DevOps/SRE

Capital One
Apr 2023 - current

As an AI/ML infrastructure engineer, I am responsible for designing, developing, and maintaining the technical infrastructure that supports artificial intelligence and machine learning projects. This includes building and managing large-scale computing systems, data storage and retrieval systems, and high-performance networking systems.

Responsibilities:

  • Administration and management of the AI/ML infrastructure in the AWS public cloud.
  • Enhanced existing AI/ML pipelines by migrating Jupyter notebooks to Kubeflow (a minimal sketch follows this list).
  • Refactored an internal microservice codebase to comply with SDLC best practices.
  • Upgraded the Terraform infrastructure-as-code (IaC) tooling to leverage new features, enhancements, and performance optimizations introduced in the latest version.
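
A minimal sketch of the notebook-to-Kubeflow migration pattern, assuming the kfp v2 SDK; the component, pipeline, and file names are hypothetical stand-ins for the real notebook logic:

    from kfp import compiler, dsl

    @dsl.component
    def extract_features(rows: int) -> int:
        # Hypothetical stand-in for logic that previously lived in a notebook cell.
        return rows * 2

    @dsl.pipeline(name="notebook-migration-demo")
    def demo_pipeline(rows: int = 100):
        # Each former notebook step becomes a tracked, retryable pipeline task.
        extract_features(rows=rows)

    # Compile to a package that can be uploaded to a Kubeflow Pipelines instance.
    compiler.Compiler().compile(demo_pipeline, package_path="demo_pipeline.yaml")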

Achievements:

Led a major Terraform version upgrade project, resulting in notable improvements in infrastructure management processes and efficiency.

  • Upgraded the infrastructure codebase from Terraform version 0.11.5 to version 1.5.2, demonstrating a commitment to staying up-to-date with the latest technologies and best practices.
  • Implemented enhanced performance optimizations, resulting in a 30% reduction in infrastructure provisioning time and improved overall operational efficiency.
  • Leveraged new features and functionality introduced in the upgraded Terraform version, such as improved resource management and expanded module features, to streamline infrastructure deployments and enhance scalability.
  • Addressed potential vulnerabilities by incorporating the latest security enhancements available in Terraform version 1.5.2, ensuring a more secure infrastructure environment.
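
For illustration, a minimal sketch of the kind of post-upgrade verification driver used in this work; the module layout and paths are hypothetical, while the exit-code semantics are Terraform's documented ones:

    import subprocess
    import sys
    from pathlib import Path

    # Hypothetical root of the Terraform codebase; adjust to the real layout.
    MODULES_ROOT = Path("infrastructure/terraform")

    def verify_module(module_dir: Path) -> bool:
        """Re-initialize a module and confirm the upgraded binary plans cleanly."""
        for cmd in (["terraform", "init", "-upgrade", "-input=false"],
                    ["terraform", "validate"]):
            if subprocess.run(cmd, cwd=module_dir).returncode != 0:
                return False
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = pending changes.
        plan = subprocess.run(
            ["terraform", "plan", "-detailed-exitcode", "-input=false"],
            cwd=module_dir)
        return plan.returncode == 0

    failing = [d.name for d in sorted(MODULES_ROOT.iterdir())
               if d.is_dir() and not verify_module(d)]
    print("modules with diffs or errors:", failing)
    sys.exit(1 if failing else 0)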

Technologies used:

  • AWS
  • PostgreSQL
  • Kubeflow
  • Jupyter Notebooks
  • Python
  • Bash
  • Terraform
  • IaC
  • Docker
  • k8s
  • Linux

Staff SRE

Twitter
Feb 2020 - Jan 2023

As a Staff Hadoop Site Reliability Engineer (SRE), I led the design, building, and maintenance of highly available and scalable Hadoop clusters. I worked closely with cross-functional teams to identify opportunities for automation, improve reliability and performance, and ensure the security of the Hadoop clusters. I provided technical guidance and mentorship to junior team members, participated in on-call rotations, and collaborated with other senior engineers to establish best practices and drive technical innovation.

Responsibilities:

  • Administration and management of the company's Hadoop fleet, spanning tens of thousands of hosts.
  • Troubleshooting and debugging technical issues for my team and across the company.
  • Continuous automation of all processes within the team and company.
  • Monitoring and defining SLAs/SLOs/SLIs.
  • Working with the open source community.
  • Development of various applications and scripts.
  • Refactoring and continuously improving the existing codebase.
  • Driving the cloud migration process: implementation, documentation, review, and feedback.
  • Mentoring peers within the team and company, interviewing new candidates.
  • Capacity planning and cost reduction.
  • Incident management.
  • Technical leadership, knowledge transfer, team roadmap planning, project management.
  • Docs creation and review, such as Technical Design Docs (TDD), Production Readiness Docs (PRD), etc.
  • Creating and improving educational/training documentation (wikis, runbooks).

Achievements:

Implemented and integrated Kerberos security on cloud and on-prem clusters, without HDFS service interruption, to enable GDPR compliance

  • 100% HDFS service availability during rollout.
  • 100% of clusters running in secure setup.
  • No impact on Twitter users.
  • Streamlined the process of provisioning Kerberos security on cloud and on-prem clusters.
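
For illustration, a minimal sketch of the kind of availability probe used while rolling the secure configuration across the fleet; the host, port (Hadoop 2 commonly serves JMX on 50070), and polling policy are assumptions:

    import json
    import time
    import urllib.request

    # Hypothetical NameNode address; Hadoop 2 typically exposes JMX on port 50070.
    JMX_URL = ("http://namenode.example.com:50070/jmx"
               "?qry=Hadoop:service=NameNode,name=NameNodeStatus")

    def namenode_state() -> str:
        """Return the NameNode HA state (active/standby) from its JMX endpoint."""
        with urllib.request.urlopen(JMX_URL, timeout=5) as resp:
            beans = json.load(resp)["beans"]
        return beans[0]["State"]

    # Poll during the rollout; in practice a failed probe would page the on-call.
    while True:
        print(time.strftime("%H:%M:%S"), namenode_state())
        time.sleep(30)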

End-to-end (E2E) automation for scaling and collocating Apache Flume apps on physical hosts

  • Increased the Apache Flume fleet by 75% across 3 data centers.
  • Saved the company $30M in CapEx for 2022-2023.

Technologies used:

  • Vertica
  • MySQL
  • PostgreSQL
  • Kerberos
  • Scala
  • Apache Hadoop
  • Apache HBase
  • Apache Flume
  • Apache Spark
  • MapReduce
  • Java
  • Go
  • Python
  • Bash
  • Terraform
  • IaC
  • Puppet
  • Docker
  • k8s
  • GCP
  • Linux

Senior SRE

Twitter
Nov 2018 - Feb 2020

As a Senior Site Reliability Engineer (SRE), I was responsible for designing, building, and maintaining highly available and scalable Hadoop clusters. I worked closely with cross-functional teams to identify opportunities for automation, improve reliability and performance, and ensure the security of the Hadoop clusters. I participated in on-call rotations to provide 24/7 support and troubleshoot production issues.

Achievements:

Brought test-driven development (TDD) methodology to the team

  • 80% of our codebase became covered by unit tests.
  • Significant reduction in bugs across new and existing code.
  • Overall 25% improvement in team velocity.
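
To illustrate the test-first style the team adopted, a minimal sketch; the helper function and its quota semantics are hypothetical:

    import unittest

    def quota_remaining(limit_bytes: int, used_bytes: int) -> int:
        # Hypothetical helper: bytes left before an HDFS space quota is exhausted.
        return max(limit_bytes - used_bytes, 0)

    class QuotaRemainingTest(unittest.TestCase):
        def test_under_quota(self):
            self.assertEqual(quota_remaining(100, 40), 60)

        def test_over_quota_clamps_to_zero(self):
            self.assertEqual(quota_remaining(100, 150), 0)

    if __name__ == "__main__":
        unittest.main()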

Technologies used:

  • Vertica
  • MySQL
  • PostgreSQL
  • Kerberos
  • Scala
  • Apache Hadoop
  • Apache HBase
  • Apache Flume
  • Apache Spark
  • MapReduce
  • Java
  • Go
  • Python
  • Bash
  • Terraform
  • IaC
  • Puppet
  • Docker
  • k8s
  • GCP
  • Linux

Senior Big Data Engineer

IAS
Jan 2018 - Nov 2018

As a Senior Big Data Engineer, I was responsible for designing, implementing, and maintaining the data infrastructure for large-scale data processing systems. My primary focus was on troubleshooting, developing, and optimizing the systems that store, process, and analyze massive volumes of data in real time.

Responsibilities:

  • Development and enhancement of various data pipelines; troubleshooting and integration with existing CI/CD pipelines using tools such as Fabric, Ansible, and Jenkins.
  • MySQL/Impala database schema migration using Maven and Liquibase.
  • YARN job troubleshooting and optimization, such as filtering, limiting, and reducing skew.
  • Proposing CDH cluster configuration changes.
  • Integration testing of existing data pipelines with Docker.
  • Migration of various Bash scripts to Python.
  • Unit-test coverage of existing Pig scripts with PigUnit as an enhancement of the existing CI/CD pipelines.
  • Hands-on experience with Hive and Impala for interactive data analysis, as well as building permanent DS pipelines utilizing these ecosystem components.
  • Data impact analysis in case of any code changes.
  • Writing custom UDFs to extend Pig core functionality.
  • Development of an Impala/Hive Liquibase plugin for schema migration in CI/CD pipelines.

Achievements:

Developed and open-sourced a Liquibase plugin for Hive and Impala database migration

  • Significantly reduced e2e testing time from hours to minutes.
  • Overall 10-15% improvement in team velocity.
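
For illustration, a minimal sketch of how such a plugin might be driven from a CI step; the connection details are hypothetical, and the flags follow the classic (pre-4.x) Liquibase CLI:

    import subprocess
    import sys

    # Hypothetical connection details; the JDBC driver is supplied by the plugin setup.
    LIQUIBASE_CMD = [
        "liquibase",
        "--changeLogFile=changelog.xml",
        "--url=jdbc:hive2://hive.example.com:10000/analytics",
        "--driver=org.apache.hive.jdbc.HiveDriver",
        "update",
    ]

    # Run the schema migration as a CI step; a non-zero exit fails the pipeline.
    sys.exit(subprocess.run(LIQUIBASE_CMD).returncode)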

Technologies used:

  • NoSQL
  • MySQL
  • ETL
  • Apache Spark
  • Apache Airflow
  • Hadoop
  • MapReduce
  • HBase
  • Cloudera Impala
  • Hive
  • Chef
  • Ansible
  • Docker
  • Vagrant
  • AWS
  • Linux
  • Python
  • Bash
  • Java

Senior Hadoop Administrator

Cigna
Aug 2017 - Jan 2018

As a Senior Hadoop Administrator, I was responsible for maintaining and optimizing the Hadoop infrastructure that supports big data processing. My primary focus was on ensuring the high availability, scalability, and performance of Hadoop clusters to support the organization's big data needs.

Responsibilities:

  • Installation, administration, and management of a kerberized CDH cluster.
  • Automation of node provisioning with Ansible.
  • CDH upgrades, patching, and installation of ecosystem products through CM and the CLI, along with CM upgrades.
  • MR/Spark job troubleshooting.
  • Involvement in backup and disaster recovery planning and implementation procedures.
  • Troubleshooting different ecosystem products such as Hue, Hive, and Oozie.
  • Performance bottleneck identification and elimination for CDH and ecosystem components, especially Apache HBase.

Achievements:

Migrated legacy Bash provisioners to the Ansible configuration management tool

  • Significantly increased code stability and reusability.

Technologies used:

  • Kerberos
  • Apache Spark
  • MapReduce
  • Hadoop
  • HBase
  • Linux
  • Ansible
  • Python
  • Bash

Senior DevOps Engineer

Grid Dynamics
Feb 2015 - Aug 2017

As a Senior DevOps Engineer, I was responsible for overseeing the development, deployment, and maintenance of customer software systems. My role involved collaborating with development teams to ensure software was released in a reliable, efficient, and automated manner.

Responsibilities:

  • Construction, adjustment, and maintenance of the Hadoop cluster.
  • Operational experience with MR1/MR2.
  • Developing new features and supporting existing ETL processes.
  • Operational experience with Hive, Impala, Pig, and Oozie.
  • Implementation of deployment automation procedures with Ansible, Puppet, Vagrant, and Docker.
  • Building CI/CD pipelines for various projects.
  • Operational experience with the column-oriented database Infobright.
  • Development of service scripts and programs.
  • Development and maintenance of working documents.

Achievements:

Adopted the BATS testing framework in the organization

  • Increased unit-test coverage of the organization's Bash codebase from 0% to 35%.
  • Improved code stability in production.
  • Reduced the number of critical production bugs by 5%.

Technologies used:

  • NoSQL
  • MySQL
  • ETL
  • Apache Spark
  • Apache Airflow
  • Hadoop
  • MapReduce
  • HBase
  • Cloudera Impala
  • Hive
  • Chef
  • Ansible
  • Docker
  • Vagrant
  • AWS
  • Linux
  • Python
  • Bash
  • Java

Hadoop Administrator/PostgreSQL DBA

Okko
Jan 2012 - Feb 2015

As a Hadoop Administrator / PostgreSQL DBA, I was responsible for managing and maintaining Hadoop clusters that stored and processed large amounts of data, as well as for the design, implementation, and maintenance of PostgreSQL databases. The role involved configuring and optimizing the Hadoop environment to meet the organization's performance and security requirements, and ensuring that the PostgreSQL databases were optimized for performance, highly available, and secure.

Responsibilities:

  • Construction, adjustment, and maintenance of the PostgreSQL database cluster.
  • Construction, adjustment, and maintenance of the Hadoop cluster.
  • Construction, adjustment, and maintenance of the HBase cluster.
  • Setup and operational experience with Hive and Cloudera Impala.
  • Performance optimization of Hadoop, HBase, and PostgreSQL.
  • Construction, adjustment, and maintenance of the Redis server cluster.
  • Setup and maintenance of Citrix NetScaler.
  • Configuration of the web servers Apache, Tomcat 6, and Nginx.
  • Development of service scripts and programs.
  • Reconfiguring hardware and software in the event of unforeseen situations that could lead to malfunction.
  • Resolving emergencies.
  • Development and maintenance of working documents.

Achievements:

Built an HA PostgreSQL cluster using an open-source cluster resource manager (CRM)

  • Improved DB availability by 200%.
  • Increased scalability by 200%.
  • Significantly improved DB performance and fault tolerance and simplified maintenance.
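
For illustration, a minimal sketch of the kind of replication-lag check used to watch such a cluster; the DSN and monitoring role are hypothetical, and the catalog names are the PostgreSQL 9.x-era ones:

    import psycopg2

    # Hypothetical DSN for the primary; a read-only monitoring role is assumed.
    conn = psycopg2.connect("host=pg-primary.example.com dbname=postgres user=monitor")

    # Per-standby replication lag, using the PostgreSQL 9.x-era catalog names.
    QUERY = """
        SELECT client_addr, state,
               pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes
        FROM pg_stat_replication
    """

    with conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for addr, state, lag_bytes in cur.fetchall():
            print(addr, state, "lag_bytes =", lag_bytes)
    conn.close()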

Developed and open-sourced a tool to manage Citrix NetScaler apps

  • Enhanced user experience through graceful disabling of Citrix NetScaler management services.
  • Streamlined the deployment pipeline.
  • Significantly increased team development velocity: code deploys went from once a week to multiple times per day.

Technologies used:

  • NoSQL
  • PostgreSQL
  • Hadoop
  • MapReduce
  • HBase
  • Cloudera Impala
  • Hive
  • Puppet
  • Ansible
  • AWS
  • Linux
  • Python
  • Bash
  • Java

Systems Engineer / PostgreSQL DBA

Megafon
Aug 2008 - Jan 2012

As a Systems Engineer, I was responsible for designing, implementing, and maintaining complex computer systems and networks, as well as for the design, implementation, and maintenance of PostgreSQL databases. The role involved collaborating with other technical teams to develop solutions that met the organization's business requirements, and ensuring that the PostgreSQL databases were optimized for performance, highly available, and secure.

Responsibilities:

  • Installation of new hardware.
  • Studying and preparing new software versions for installation on assigned equipment.
  • Configuration and maintenance of servers.
  • Configuration and maintenance of Oracle RDBMS.
  • Configuration and maintenance of PostgreSQL RDBMS.
  • Reconfiguring hardware and software in the event of unforeseen situations that could lead to malfunction.
  • Resolving emergencies.
  • Development and maintenance of working documents.
  • Project management for the implementation of management and monitoring systems.
  • Protection of management and monitoring systems.

Technologies used:

  • Oracle
  • PostgreSQL
  • Solaris
  • RDBMS
  • HP-UX
  • Unix
  • Unix Shell Scripting
  • Linux
  • Bash
  • Perl

Skills & Tools

Languages

  • Python
  • Go
  • Bash
  • Java
  • C

Development

  • Git
  • SVN
  • GitHub

Virtualization

  • KVM
  • LXC
  • Docker
  • Kubernetes

Cloud

  • GCP
  • AWS

Provisioning

  • Ansible
  • Chef
  • Puppet
  • Terraform

CI/CD

  • Jenkins
  • GitHub Actions

RDBMS

  • PostgreSQL
  • MySQL

Distributed Systems

  • Apache Hadoop
  • Apache Spark
  • Cloudera Impala

OS

  • Linux
  • Unix

Others

  • DevOps
  • Code Review
  • HDFS
  • YARN
  • MapReduce
  • Flume
  • Kafka
  • Hive
  • Unit Testing
  • Vertica
  • Oracle
  • Liquibase
  • Fabric
  • Vagrant
  • VirtualBox
  • MVN
  • JDK
  • awk
  • sed

Education

  • Bachelor's Degree in Information Systems and Technology
ITMO University, Saint Petersburg
    2002 - 2008

Publications

  • Kerberizing Hadoop Clusters at Twitter
  • How We Scaled Reads on the Twitter Users Database

OSS

  • Impala/Hive Liquibase plugin
    Author of an open source Liquibase plugin for Cloudera Impala and Apache Hive
  • Contributed to the dnsmasq open source project
    Enhanced the --conf-dir parameter to load files in a deterministic order, released in version 2.81

Languages

  • Russian (Native)
  • English (Fluent)

Interests

  • Hiking
  • Fitness
  • Travelling