[Talk::Overflow #16] DevOpsDays - Tel Aviv 2025
From Shiny Tools to Sharp Edges: The State of Modern DevOps
DevOpsDays Tel Aviv 2025 wasn’t just another conference: it surfaced some of the hard edges teams are hitting as cloud-native adoption matures into real-world complexity. Between GitOps quests, cloud bill puzzles, AI/ML infrastructure, and control-plane design limits, what stood out wasn’t tooling hype but the engineering trade-offs we have to wrestle with today: how to build reliable platforms without drowning in overhead, how automation and agents actually change ops workflows, and how to confront brittleness in distributed systems at scale.
The agenda blended classical “DevOps best practices” with emerging pressures like agentic AI, cost debugging, and control-plane bottlenecks — a reflection of where the ecosystem actually is: pragmatic, messy, and tool-agnostic at the core.
Featured Talks
1. Building Docker: Behind the Scenes of the Container Revolution — Solomon Hykes
Solomon revisits where containerization actually came from and why those original design decisions still matter to platform engineers. This talk stood out because it grounded today’s orchestration tooling in historical design constraints, giving practical intuition for why containers behave the way they do — critical when debugging at scale.
2. Orchestrating Autonomous Agents in DevOps: Comparing Strands Agents, Deep Agents & AutoGen — Engin Diri
A rare practitioner discussion of how autonomous agents fit into DevOps workflows. Rather than hype, Engin focused on the trade-offs between autonomy and control, observability implications, and failure modes, which are essential if you’re experimenting with AI helpers in production pipelines.
3. Would You Drive a Car Without a Dashboard? — Gabriella Nir
A crisp metaphor with real engineering consequences: if you can’t see the state of your system clearly, you can’t run it safely. Gabriella unpacks how teams often overlook basic observability hygiene and what it costs — fire-fighting time, outages, and trust.
4. Building a Production-Grade AI/ML Inference Platform on Kubernetes — Liad Drori
Moving beyond notebooks, this talk dug into the practical details of deploying and scaling ML inference on Kubernetes. The engineering insights here (resource isolation, rollout strategies, observability for models) transfer directly to real AI workloads, not just demo pipelines.
5. Launch Party’s Over. Now What? A Guide to Real-World Ops with Crossplane & ACK — Guy Menahem
A grounded look at Crossplane + AWS ACK in production — not the sales pitch. Guy walked through how to wrestle with API drift, reconciliation loops, and environment fragmentation. If you’re contemplating GitOps beyond Kubernetes manifests, this is one of the few talks that shows you what actually breaks and how to fix it.
All Other Talks:
Behind the Cloud @Twitter 1.0 — Bobby Dorlus
Operational lessons from scaling and running massive distributed systems at Twitter’s early scale.
Newsflash: There is no Quality as a Service — Niv Yungelson
Why quality can’t be outsourced — it must be designed into pipelines and checks.
We need to talk about limits — Avishai Ish-Shalom
A conceptual talk on system limits and when abstractions leak — great framing for platform edge cases.
From Duct Tape to Declarative: Playtika’s Platform Overhaul at Unicorn Scale — S. Rosenberg & S. Mashiach
How a large platform team moves from brittle scripting to declarative definitions at scale.
Chasing Ghost Traffic: Catching the Cross-AZ Culprit Killing Your Cloud Bill — Aviv Zohari
Concrete techniques to detect and fix cross-zone network cost anomalies in cloud bills.
How to Build Quality-Driven Agentic AI in Noisy Big Data Environments — Itiel Shwartz
Data hygiene and feedback loops for AI tooling — what actually moves the needle in “agentic” workflows.
Beyond Argo Events: Leveraging NATS for Scalable Webhook Management — Or Navon
Using NATS to build scalable event delivery where webhook semantics break down.
5 Serverless Patterns You Should Stop Using (And What to Do Instead) — Ran Isenberg
Opinionated yet practical look at common anti-patterns in serverless applications.
Orchestrating Autonomous Agents in DevOps: Comparing Strands Agents, Deep Agents & AutoGen — Engin Diri
How different autonomous agent models compare and where they fit in DevOps flows.
Build a Self-Service Hub in Slack — Shaked Braimok-Yosef
Step-by-step on turning Slack into a lightweight self-service automation interface for teams.
Thinking Outside the Compositions: When Control Plane’s Logic Becomes the Bottleneck — Elhay Efrat
When control plane complexity itself throttles workflows — useful patterns for platform engineering.
Disaster Recovery in the Serverless Realm — Orel Bello
Challenges and strategies around recovering serverless applications under failure.
Building AI Agents with Serverless, Strands, and MCP — Hila Fish
Integrating serverless and agent frameworks — trade-offs in latency, cost, and observability.
Dancing with Failure — The Art of Timeouts & Retries — Alon Nativ
Concrete timeout and retry patterns that improve resiliency without triggering retry storms or cascading failures.
Oops-Driven Development — Shahar Shporer
Learning fast via controlled failure feedback — tactical tips for building healthier debug loops.
Truly Cloud Native AI Agents with Kagent and Khook — Anton Weiss
Exploring cloud native agent tooling — architectural considerations and integration patterns.
Would You Drive a Car Without a Dashboard? — Gabriela Nir
Metaphor-rich discussion on why clear observability surfaces are critical to team effectiveness.
The Hidden Complexity of Time in Serverless: A 5-Minute Reality Check
How time semantics and cold starts introduce hidden complexity in serverless environments.
Adjusting Your Mirrors: Finding and Fixing Blind Spots in Your Configurations — Dor Meiri
Techniques for spotting configuration anti-patterns and blind spots before they bite you in production.
The Anatomy of a Patch: Backporting CVEs Without Breaking Things — Benji Kalman
How to backport security fixes safely, with minimal risk to running systems.
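Alon Nativ’s specific patterns aren’t reproduced here. As a minimal sketch of the general technique his talk title points at, the snippet below combines a per-try timeout, an overall deadline, and capped exponential backoff with full jitter. It assumes the wrapped operation accepts a timeout in seconds and signals a transient failure by raising `TimeoutError` or `ConnectionError`; `call_with_retries` and all of its parameters and defaults are hypothetical and illustrative, not values recommended in the talk.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    op: Callable[[float], T],
    max_attempts: int = 4,
    per_try_timeout: float = 1.0,
    deadline: float = 5.0,
    base: float = 0.1,
    cap: float = 2.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Hypothetical helper: call op(timeout) with a per-try timeout, an overall
    deadline, and capped exponential backoff with full jitter between tries."""
    rng = rng or random.Random()
    start = clock()
    last_error: Optional[Exception] = None
    attempts_made = 0
    for attempt in range(max_attempts):
        remaining = deadline - (clock() - start)
        if remaining <= 0:
            break  # overall budget spent: stop instead of adding more load
        attempts_made += 1
        try:
            # A single try never gets more time than the whole call has left.
            return op(min(per_try_timeout, remaining))
        except (TimeoutError, ConnectionError) as err:  # assumed transient
            last_error = err
        if attempt + 1 < max_attempts:
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)], so many
            # clients failing at the same moment do not retry in lockstep.
            delay = rng.uniform(0, min(cap, base * 2 ** attempt))
            # Never sleep past the overall deadline.
            sleep(min(delay, max(0.0, deadline - (clock() - start))))
    raise TimeoutError(f"gave up after {attempts_made} attempt(s)") from last_error
```

Injecting `sleep` and `clock` keeps the sketch testable without real waiting; in production the defaults apply. The deadline is what prevents amplification: once the caller’s budget is gone, no further tries are issued. In practice you would also retry only idempotent operations.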
DevOpsDays Tel Aviv made it clear that the industry is done with hype and deeply focused on failure modes, trade-offs, and operational pain — from ghost traffic and control-plane bottlenecks to real disaster recovery in serverless systems. The common thread was survival: making existing platforms reliable under real load, not chasing the next shiny tool. Observability came through as core engineering work, not an add-on, reinforcing that you can’t design resilient systems without visibility baked into architecture and workflows. AI agents were everywhere, but the message was sober: autonomy without guardrails, reliability, and human oversight just creates new failure domains. Finally, platform engineering is evolving into complexity management, where success depends less on tool choice and more on building observable, debuggable abstractions teams can actually operate.
💬 Like what you read?
You’re reading Talk::Overflow #16 DevOpsDays — the weekly digest for developers who want to stay sharp and skip the noise.
→ Browse past issues
→ Suggest a talk
→ Share it with your friends
Stay curious. Stay kind.
— Talk::Overflow

