Vitold Radkevich

Senior SRE / Platform Engineer


Distributed systems | Reliability | Cloud platforms | AI-driven operations

Scroll down to find my CV

About me

Hi, I'm Vitold. I build and operate distributed systems in production environments.

With 7+ years of experience, I focus on system reliability, observability, and performance under real-world load. I take ownership of backend and platform systems and keep them stable, scalable, and observable in production.

My work sits at the intersection of backend engineering and SRE: building systems that not only work but also stay resilient under failures, high load, and real operational pressure.

I'm particularly interested in combining SRE practices with AI, using LLM-based systems to automate incident analysis, reduce alert fatigue, and improve on-call efficiency.

What I do

I design and operate backend and platform systems in production, covering architecture decisions, scalability, observability, and incident response.

My focus includes:

Reliability engineering and system stability under production load

Incident response and on-call (SRE practices)

Observability design (metrics, logs, tracing, alerting)

Building internal engineering platforms

Automating operational workflows and reducing MTTR

Technical focus

Reliability & SRE:
Incident Commander experience, SLO / SLI design, Chaos Engineering (Chaos Mesh), production incident response.

Platform Engineering:
Internal engineering platforms for incident management, load testing (Gatling-based systems), and system reliability tooling.

Backend & Distributed Systems:
Java / Kotlin, Spring ecosystem, microservices, event-driven systems (Kafka, RabbitMQ).

Cloud & DevOps:
AWS / Azure / GCP, Kubernetes, Docker, CI/CD pipelines, production monitoring systems.

AI for Engineering:
LLM-based automation for incident analysis, AI-assisted SRE workflows, reducing operational load through automation.

Selected impact

I have acted as Incident Commander during production outages, worked to reduce MTTR and improve system observability, and built internal SRE tools for incident management and chaos testing.

I have contributed to systems that improve reliability, automate operational workflows, and support engineering teams in high-pressure production environments.

How I work

I take full ownership of systems in production and focus on reliability, clarity under pressure, and reducing operational risk through automation and observability.

I'm used to working in international Agile teams and collaborating closely with engineers, product managers, and stakeholders in high-responsibility environments.

Certifications

AWS Certified Cloud Practitioner

Microsoft Certified - Career Essentials in Generative AI

AI Devs Certification (LLM systems & engineering applications)

Languages

English - B2

Polish - B2

Russian - Native

Outside of work

I enjoy an active lifestyle: skiing, hiking, cycling, swimming, and traveling. I'm also passionate about cars and modern technology.

Commercial projects

Project 13
Project 12
Project 11
Project 10
Project 9
Project 8
Project 7
Project 6
Project 5
Project 4
Project 3
Project 2
Project 1

Contact Me