Eight years in infrastructure: what I actually did and what I learned from it
Published: 2026-06-06
I've been doing infrastructure work for eight years. The short version: I started in IT support in 2018 and worked my way up to owning multi-cluster Kubernetes platforms. Along the way I worked at a Russian GDS (the domestic counterpart to Amadeus), a ticketing startup, a telecom's crypto/fintech division, and two products in a travel-tech holding at the same time. This post is the longer version: what each job was actually like, what I built, and what I'd do differently.
Sirena-Travel — where it started (2022–2025)
Sirena-Travel is a Russian GDS: the system that processes airline seat inventory and booking transactions across the entire Russian air travel market. Think Amadeus or Sabre, but built and operated domestically. When I joined in August 2022, the infrastructure team was maintaining 500+ CentOS VMs running a polyglot stack — Java, Python, PHP, .NET, C++. Everything mission-critical, 24/7, millions of bookings.
The biggest project I worked on there was the VM-to-container migration. We went through three generations: XEN/libvirt → Docker Swarm → Kubernetes. The migration took the better part of two years and ended with compute utilisation up 40% and the same workloads running on a fraction of the hardware.
CI/CD was the other major investment. When I arrived, deploying anything took hours: manual SSH, custom deploy scripts written by different teams, no consistency. I rebuilt pipelines from scratch across all codebases in GitLab CI — unit and integration tests, SonarQube SAST, Trivy container scanning, Docker build, Helm deploy. Deploy time dropped from hours to under 10 minutes. We added promotion gates (dev → staging → prod) and auto-rollback on failed health checks. Production incidents caused by bad deploys dropped 65%.
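The promotion-gate logic is easy to sketch outside of GitLab CI. The following is a minimal illustration in Python, not the actual pipeline: `deploy`, `is_healthy`, and `rollback` are hypothetical callables standing in for the Helm deploy, the health check, and the rollback step.

```python
from typing import Callable, List

STAGES = ["dev", "staging", "prod"]  # promotion order described in the text


def promote(
    version: str,
    deploy: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: List[str] = STAGES,
) -> List[str]:
    """Deploy `version` stage by stage and stop at the first failed health check.

    Returns the stages where the version was deployed and left in place.
    """
    promoted: List[str] = []
    for stage in stages:
        deploy(stage, version)
        if not is_healthy(stage):
            rollback(stage)  # auto-rollback: restore the previous release here
            break            # gate closed: later stages are never touched
        promoted.append(stage)
    return promoted
```

The point of the structure is that a bad build can only ever reach the stage where it first fails; prod is never deployed unless staging passed its health check.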
Observability was Prometheus across 500+ nodes, 30+ Grafana dashboards, plus a Zabbix bridge for legacy systems that predated our monitoring stack. On-call rotation with runbooks and mandatory post-mortems. MTTR went from 120 minutes to 20 over two years.
The thing about enterprise infrastructure at that scale: you can't move fast, and that's mostly fine. A single bad deploy has serious downstream consequences. The discipline around promotion gates, runbooks, and post-mortems that I built at Sirena is something I carried forward to every job after.
Flowerave — building from zero (2024–2025)
In parallel with Sirena, I took a contract at Flowerave — a ticketing startup, think Ticketmaster or Eventim for the Russian market. I was the only DevOps engineer. There was nothing. No CI, no Kubernetes, no monitoring, no IaC.
I built the full stack on Yandex Cloud with Terraform: VPCs, managed PostgreSQL, Redis, S3, DNS. Production Kubernetes cluster with Nginx Ingress, HPA, network policies separating the microservices — catalogue, orders, payments, notifications. Self-hosted GitLab CE for source control and CI. Prometheus + Grafana + ELK + Sentry.
One thing I prioritised early: PostgreSQL backups. Daily full backup to S3 plus continuous WAL archiving — RPO under 1 hour, RTO under 30 minutes. Boring to set up. In 15 months there was one incident where we needed to restore from backup. It worked.
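A backup is only as good as the check that tells you it has stopped running. Below is a minimal sketch of a freshness check against the targets above. The function name is hypothetical, and how the two timestamps are obtained (for example from an S3 object listing) is left out as an assumption.

```python
from datetime import datetime, timedelta
from typing import List

RPO = timedelta(hours=1)                  # from the text: RPO under 1 hour
FULL_BACKUP_INTERVAL = timedelta(days=1)  # from the text: daily full backup


def backup_problems(
    last_full_backup_at: datetime,
    last_wal_archived_at: datetime,
    now: datetime,
) -> List[str]:
    """Return human-readable problems; an empty list means the targets are met."""
    problems: List[str] = []
    # If the newest archived WAL segment is older than the RPO, a failure
    # right now would lose more data than the target allows.
    if now - last_wal_archived_at > RPO:
        problems.append("WAL archive is older than the RPO")
    # A missed daily full backup lengthens replay time and therefore the RTO.
    if now - last_full_backup_at > FULL_BACKUP_INTERVAL:
        problems.append("full backup is older than one day")
    return problems
```

Wired into alerting, a non-empty result is what should page someone, well before a restore is ever needed.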
The metric I'm proudest of: infrastructure provisioning went from 3 days (doing things manually) to 30 minutes (Terraform). Zero production data loss in 15 months.
The thing you learn as a solo DevOps in a startup: scope management. Every developer wants observability, zero-downtime deploys, feature flags, canary releases. You can build all of it, but you'll spend your time building infrastructure instead of keeping it running. I said no to a lot of things and focused on reliability fundamentals: backups, monitoring, rollback capability.
MixVel — designing a real platform (2025–present)
MixVel is a private division within the Sirena-Travel holding — agent-facing flight search with its own dedicated GDS code. The scale is different from Flowerave: high-throughput real-time search, multiple teams, 6 Kubernetes clusters (dev, test, sre, loadgds, demo, plus cloud environments).
The most significant thing I built here is the FluxCD v2 hub-and-spoke GitOps platform. One hub cluster manages all spokes via encrypted kubeconfigs stored in SealedSecrets. No direct kubectl apply in any environment — everything goes through FluxCD reconciliation. The Kustomize structure has three layers: base (shared across everything), custom (per-product customisation), and per-env patches (environment-specific overrides). Over 90% of configuration is shared; per-environment patches are minimal diffs.
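The layering is easier to see with a toy model. The sketch below mimics base → custom → per-env patch as a recursive dictionary merge in Python. It is a simplification: real Kustomize patches have their own rules for lists and strategic merges, and the keys and values here are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overlay` applied.

    Nested dicts are merged key by key; any other value is replaced outright.
    """
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def render(base: Dict[str, Any], custom: Dict[str, Any], env_patch: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the three layers in order: base, then custom, then the env patch."""
    return deep_merge(deep_merge(base, custom), env_patch)


# Hypothetical values, for illustration only.
base = {"replicas": 2, "resources": {"cpu": "500m", "memory": "512Mi"}}
custom = {"resources": {"memory": "1Gi"}}  # per-product customisation
env_patch = {"replicas": 1}                # environment-specific override

rendered = render(base, custom, env_patch)
# rendered == {"replicas": 1, "resources": {"cpu": "500m", "memory": "1Gi"}}
```

Each layer only states what differs from the one beneath it, which is why the per-environment patches stay small.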
The data platform runs entirely via FluxCD HelmReleases: Kafka, RabbitMQ, Cassandra, ClickHouse, MongoDB, Redis, MinIO. Cluster provisioning is automated with Ansible — k3s install, Flux bootstrap, Cilium, node tuning. New cluster ready in under 30 minutes.
I also wrote several tools in this role:
env-view — a FastAPI dashboard deployed in every cluster that shows live state of ingress, services, and pods with HTTP/TCP health checks. The on-call team used to SSH into clusters to check things. Now there's a URL. Deployed as a Helm chart.
abot — a custom Alertmanager webhook receiver with per-team routing. Alertmanager's built-in routing is powerful but hard to extend when you need different escalation logic per team. abot is ~300 lines of Python, deployed as a Helm chart across all environments.
trivy-to-sonarqube — a Python CLI that converts Trivy container scan output to SonarQube's external issues format. Made it possible to surface security findings in the same place developers already looked at code quality issues.
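To give a sense of what the trivy-to-sonarqube conversion involves, here is a minimal sketch, not the tool itself. The field names follow Trivy's JSON report and SonarQube's generic issue import format as I understand them, so check both against the versions you run (newer SonarQube releases prefer a separate rules array). The severity mapping and the default file path are assumptions: container findings have no source line, so this sketch pins them to the Dockerfile.

```python
from typing import Any, Dict, List

# Trivy severity -> SonarQube generic-issue severity (this mapping is an assumption).
SEVERITY_MAP = {
    "CRITICAL": "BLOCKER",
    "HIGH": "CRITICAL",
    "MEDIUM": "MAJOR",
    "LOW": "MINOR",
    "UNKNOWN": "INFO",
}


def trivy_to_sonar(report: Dict[str, Any], file_path: str = "Dockerfile") -> Dict[str, List[Dict[str, Any]]]:
    """Convert a parsed Trivy JSON report into a generic-issue document."""
    issues: List[Dict[str, Any]] = []
    # "Results" and "Vulnerabilities" may be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            vuln_id = vuln["VulnerabilityID"]
            package = f'{vuln.get("PkgName", "?")} {vuln.get("InstalledVersion", "?")}'
            issues.append({
                "engineId": "trivy",
                "ruleId": vuln_id,
                "severity": SEVERITY_MAP.get(vuln.get("Severity", "UNKNOWN"), "INFO"),
                "type": "VULNERABILITY",
                "primaryLocation": {
                    "message": f'{package}: {vuln.get("Title") or vuln_id}',
                    "filePath": file_path,
                },
            })
    return {"issues": issues}
```

The result can be written out with `json.dump` and handed to the scanner as an external issues report.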
The platform work taught me something that's hard to get from smaller setups: the cost of inconsistency compounds. When you have 6 clusters and they drift from each other, debugging cross-environment issues becomes genuinely hard. The 3-layer Kustomize structure and GitOps enforcement aren't overhead — they're the thing that lets you maintain 6 clusters without 6 times the toil.
MTS — a side project in fintech (early 2026)
From January to April 2026 I took a part-time contract at MTS (the Russian telecom). Their crypto/fintech division was building a blockchain settlement platform (VED) and two B2C exchange products. Different from anything else I'd worked on.
The infra ran on MWS (MTS Web Services), which uses a custom Terraform provider — not the standard ones you'd find in the Terraform registry. I built kubeadm Kubernetes clusters per environment: dev / stage / test / prod for both VED and B2C, 8 environments total, with node taints to isolate blockchain workloads from the B2C stack.
GitOps was ArgoCD here instead of FluxCD. Deploys ran argocd app sync followed by argocd app wait from GitLab CI, with a 15-minute timeout on sync operations. The shared pipeline library was pulled into each project with GitLab CI's include: keyword and covered Kaniko builds, ArgoCD Helm deploys, Trivy with a custom OPA policy enforcing a non-root USER, SonarQube, and Semgrep. Pipeline definitions for multi-environment deploys were generated with Jsonnet.
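The sync-then-wait step can be sketched as a small Python wrapper. This is an illustration under assumptions, not the actual CI job: the application name is hypothetical, and the flags (`--timeout` in seconds, `--health`) are as I understand the argocd CLI, so verify them with `argocd app sync --help` and `argocd app wait --help`.

```python
import subprocess
from typing import Callable, List

SYNC_TIMEOUT_SECONDS = 15 * 60  # the 15-minute timeout mentioned in the text


def argocd_deploy_commands(app: str, timeout: int = SYNC_TIMEOUT_SECONDS) -> List[List[str]]:
    """Build the argv lists for a sync followed by a wait, in execution order."""
    return [
        ["argocd", "app", "sync", app, "--timeout", str(timeout)],
        ["argocd", "app", "wait", app, "--health", "--timeout", str(timeout)],
    ]


def deploy(app: str, run: Callable[..., object] = subprocess.run) -> None:
    """Run both commands; with check=True a non-zero exit raises and fails the job."""
    for argv in argocd_deploy_commands(app):
        run(argv, check=True)
```

Splitting sync from wait means the job only goes green once the application reports healthy, not merely once the manifests have been applied.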
The interesting part was the Ansible side: I wrote roles covering the full cluster lifecycle — LVM volume setup, iptables configuration, log rotation, kubeadm install and config, WireGuard VPN for secure cluster access, and user provisioning per environment.
Short engagement, but dense. The part-time constraint meant I had to be deliberate about what I worked on and in what order. Everything that blocked the dev teams got done first.
Red Rose Traveltech — the current job (2026–present)
Red Rose is MixVel's subsidiary for the European market — B2B corporate travel. I cover infrastructure for both Red Rose and MixVel simultaneously, which means I'm running two separate product stacks from the same position.
The Red Rose stack was cleaner to set up than MixVel's, because I already had the playbook from MixVel. Full IaC on Yandex Cloud via Terraform, CI/CD in GitLab with deploy cycles under 5 minutes, Prometheus + Grafana aligned to explicit SLO targets (99.9%+ on core booking APIs).
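It helps to translate an SLO into the downtime it actually permits. A minimal sketch of that arithmetic follows; the 30-day window is an assumption, since the text does not say which window the targets use.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


budget = error_budget_minutes(0.999)  # roughly 43 minutes over 30 days
```

At 99.9%, a single incident close to three quarters of an hour long uses up the whole month, which is why the alerting is tied to the SLO rather than to raw resource metrics.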
The thing that's different here is the operational automation via n8n. I've been running n8n across both MixVel and Red Rose for event-driven ops pipelines: Grafana alerts → Jira tickets, FluxCD reconciliation status digests, Slack deployment notifications, partner onboarding notifications on the Red Rose side, Terraform drift alerts. The pattern replaces ad-hoc scripts that accumulate in repos and then break silently when someone changes an API. n8n workflows are visible, restartable, and testable.
What I'd do differently
A few things I got wrong and learned from:
Starting with monitoring instead of ending with it. At Flowerave I set up full monitoring relatively early, which meant I had data when things broke instead of guessing. At Sirena, monitoring was retrofitted onto an existing system, which is a much harder problem. If I were starting a new project today, Prometheus and alerting would go in before the first microservice.
GitOps from day one. The discipline of "no direct kubectl apply" feels like overhead until you've debugged a cluster whose actual state has diverged from the repo for three weeks. At MixVel I enforced this from the start. It makes on-call significantly less unpleasant.
Documentation as a first-class deliverable. I started keeping runbooks seriously at Sirena after a few post-mortems where the recovery took longer than it should have because the institutional knowledge was in one person's head. The confluence-publisher tool I wrote at MixVel is the result of taking this seriously — Markdown in git, auto-synced to Confluence, treated the same as code.
Summary
- 2018–2022: IT support → Linux sysadmin, where it all started
- Sirena-Travel: 3 years, enterprise scale, VM-to-K8s migration, CI/CD from scratch, on-call, 500+ nodes
- Flowerave: 15 months, solo DevOps, startup from zero, reliability fundamentals
- MixVel: ongoing, FluxCD hub-and-spoke platform, 6 clusters, custom tooling, data platform
- MTS: 4-month contract, kubeadm + ArgoCD + Jsonnet, blockchain/fintech stack
- Red Rose: ongoing in parallel with MixVel, Yandex Cloud, n8n automation
Eight years compressed: the tools change but the problems don't. Bad deploys, missing backups, unwritten runbooks, infrastructure that only one person understands. The work is making those problems boring to solve.