Rebuilding by Removing: How I Made My Sports Analytics Pipeline 20x Faster
From Overengineered Kafka Pipeline to a Simple, Lightning-Fast System: What I Learned by Ruthlessly Removing Complexity

Athletic performance data contains patterns. Some patterns reflect natural improvement through training. Others represent statistical outliers that warrant investigation. My thesis explored a fundamental question in sports analytics: how do you systematically identify performance anomalies across millions of competition results, and which detection methods prove most effective when validated against known cases?
The research involved analyzing over 1 million race results across track events, comparing 9 different detection approaches from statistical methods to machine learning models, and evaluating their performance against the Athletics Integrity Unit's database of sanctioned athletes. The goal was to understand which techniques actually work for anomaly detection in athletics and build tools that could support investigation workflows. I expected the system to handle frequent analysis requests, long-running computations, and potential scaling across multiple users, so I designed it as an event-driven pipeline.
Figure 1: Individual athlete profile showing typical performance patterns across career progression and statistical distributions.
To support this research, I built what seemed like the obvious architecture. Apache Kafka for event streaming. PostgreSQL for storage. Flask for the API. Separate worker processes for heavy computation. Four Docker containers working together. Message queues, producers, consumers, the full setup.
Three months in, the system worked. It also took 20 minutes to compare detection methods on a single event, crashed under load, and had become impossible to iterate on. The infrastructure I built to enable the research was actively blocking it.
This is the story of how I rebuilt that system by removing complexity, cutting containers while improving performance 20x. More importantly, it is about learning when architectural sophistication becomes the problem you are solving instead of the problem you actually have.
The Original Architecture: Building for Scale
The initial design followed a classic pattern. Analysis requests came in through a Flask REST API, which sent messages to Kafka topics. Worker processes listened to these topics, grabbed messages from the queue, ran the requested detection algorithms, and wrote results back to PostgreSQL. A separate Kafka consumer handled long-running computations in the background to avoid blocking the API.
This separation made sense in theory. The API layer could scale independently from the workers. Kafka provided guaranteed message delivery and replay capabilities. If a worker crashed mid-analysis, the message stayed in the queue for retry. Different detection methods could be routed to specialized workers based on how much computing power they needed.
The technology stack included:
Apache Kafka for message brokering and coordination
PostgreSQL for storing competition results, athlete records, and detected anomalies
Flask with worker processes for API and background processing
Docker Compose managing four containers
Python with pandas and scikit-learn for data processing and detection algorithms
Figure 2: Architecture Diagram — Original System
For the first couple of months, this architecture delivered. Analysis requests flowed through the system, detection algorithms identified suspicious performances, and results accumulated in the database. The separation between API, messaging, and computation felt clean. The system felt professional.
Where the Cracks Appeared
The problems emerged gradually, then all at once:
PostgreSQL was the first bottleneck. The database handled normal operations beautifully but struggled with analytical queries. Computing percentile distributions across an athlete's entire career meant scanning hundreds of rows. Calculating z-scores for an event required aggregating statistics across thousands of performances. Building temporal patterns for anomaly detection involved complex calculations over years of competition data. These queries consistently took seconds, often stretching into minutes for popular events like the 100-meter dash.
I optimized aggressively. Added indexes everywhere. Rewrote queries to be more efficient. Created pre-computed views for common aggregations. Each optimization provided small improvements, but the fundamental mismatch remained: PostgreSQL is optimized for transaction processing, not the columnar aggregations that dominated my analysis.
The second issue was operational complexity. Every new feature affected multiple parts of the system. Adding a detection method meant updating the Flask endpoint to accept new parameters, modifying the Kafka message format to include those parameters, updating the worker code to handle the new method, and ensuring everything stayed compatible. Debugging became detective work. Was the failure in message creation? Worker processing? Database writes? The logs were spread across four containers.
The breaking point came when I examined the code itself. Despite the clean architecture, the core analytics logic had grown into a 3,000-line monster. A single file contained implementations for all 9 detection methods. Each method followed a slightly different pattern. Each handled edge cases in its own way. Adding a new detection approach required copying 200 lines of template code, adjusting the core logic, and hoping the modifications did not introduce bugs in existing methods.
The system had the appearance of good architecture with the reality of unmaintainable code.
The Question That Changed Everything
Late one night, while debugging yet another Kafka timeout, I stepped back and asked a fundamental question: what problem was this architecture actually solving?
Was I processing thousands of analysis requests per second? No. Requests came in occasionally during batch evaluations, maybe a handful per minute at peak.
Did I need guaranteed message delivery across distributed system failures? Not really. This was research infrastructure, not a payment processing system. If an analysis failed, rerunning it was trivial.
Was I building for multiple teams with independent deployment cycles? No. I was the only developer. The API, workers, and detection algorithms all evolved together.
The honest answer was uncomfortable: I had copied production architecture patterns without understanding whether they solved problems I actually had. The infrastructure complexity was not enabling the research. It was preventing it.
Rebuilding from First Principles
I threw out the entire codebase and started over with three rules:
Analytical queries should feel instant. Not "optimized for the workload," but genuinely fast.
Detection methods should be trivial to add. Twenty lines of code, not two hundred.
If a component can be removed, remove it.
These rules led to different technology choices:
FastAPI's Background Tasks replaced Flask and Kafka. For occasional asynchronous operations, FastAPI's built-in Background Tasks provided everything I needed. No message brokers, no serialization overhead, no distributed coordination. Just simple async Python functions.
DuckDB joined PostgreSQL for analytics: DuckDB is purpose-built for analytical queries over large datasets. It connects directly to PostgreSQL, allowing queries to work with both systems. The same percentile calculation that took 5 seconds in PostgreSQL completed in 100 milliseconds in DuckDB through columnar storage and optimized execution. Aggregations across millions of rows became instantaneous.
Redis provided caching: Analysis results rarely change. Once you have computed anomaly scores for the men's 100-meter using z-score detection, those results remain valid until new competition data arrives. Redis caching with a 24-hour expiration meant most requests hit the cache and returned in milliseconds.
The new architecture consisted of three services: PostgreSQL + DuckDB for storage and analytics, Redis for caching, and a single FastAPI application handling both the API and background computation. No Kafka. No worker orchestration.
Figure 3: Architecture Diagram — New System
The Code Transformation
Architecture changes enabled code improvements, but the real transformation came from rethinking how detection methods should work.
I created a simple template that every detection method follows:
class BaseDetector:
def detect(self, data: DataFrame) -> List[Anomaly]:
raise NotImplementedError
def explain(self, anomaly: Anomaly) -> str:
raise NotImplementedError
Each detection method becomes a small class following this template. Z-score detection is 30 lines. Isolation forest is 40. Even complex Bayesian hierarchical models fit in under 100 lines. The system discovers and registers these detectors automatically on startup.
Adding a new detection method now means:
Create a file in the appropriate directory (statistical/, ml/, or advanced/)
Implement the template
Done. The system handles registration, API integration, and evaluation automatically.
The 3,000-line monster became 35 modular files across 12 directories. Each file has a single responsibility. Each detector is independently testable. The code base went from impossible to understand to something I could hold in my head.
The Performance Impact
The results exceeded expectations:
Query & Analysis Performance:
Loading 100m competition data + preprocessing: ~3 seconds → under 1 second (3x+ faster)
Single detection method execution: ~ 500ms → ~ 150ms (3x+ faster)
Full analysis run (Isolation Forest on 100m event): 129 seconds → 33 seconds (~4x faster)
Development Velocity: Testing an algorithm modification went from 20+ minutes (run the full analysis, wait for completion, check results) down to ~30–35 seconds. That shift changed everything. Instead of testing 3 ideas per hour, I could test 100+. Research that required careful planning of each experiment became an interactive exploration.
Code Maintainability: Adding copula-based anomaly detection took 45 minutes, including research time to understand the mathematics. The old system would have required hours of template code, schema updates, and integration testing.
Performance comparison chart showing old vs new system metrics
Figure 4a: Before vs After — User Interface and Timing Comparison:
Left: Original system with its cluttered interface and 129.055s total analysis time.
Right: Rebuilt system with a clean, focused UI delivering the same analysis in just 33.065s (≈4x faster overall).
Figure 4b: Detailed execution time breakdown for a full Isolation Forest analysis on 100m data. The new pipeline completes in ~33 seconds compared to over 2 minutes in the original system.
What This Taught Me About Architecture
The most important lesson was about appropriate complexity. Technology should match the problem you have, not the problem you imagine at scale.
Kafka is a brilliant technology. For systems processing thousands of events per second with complex routing requirements across multiple teams, it is often the right choice. I was not building that system. DuckDB is perfect for analytical workloads over large datasets. That matched my actual requirements exactly. Redis provides simple, effective caching. That was all I needed.
The second lesson concerned code organization versus architectural patterns. The original system had impressive separation at the architectural level but terrible separation within the code. The new system has a simpler architecture but much better modularity where it matters: in the actual implementation.
The third lesson surprised me: constraints breed clarity. By removing options (no message queue, no distributed workers), I had to think harder about what the system actually did. That forced clarity made every decision easier. Should this be async? Does this need caching? Is this pattern reusable? The answers became obvious once the unnecessary complexity disappeared.
The Trade-offs
The new architecture is not universally superior. It trades certain capabilities for simplicity:
Message delivery guarantees: Without Kafka, if the server crashes during analysis, that request is lost. For research infrastructure where rerunning analysis is cheap, this is acceptable. For systems where every request matters (payment processing, critical monitoring), it would not be.
Horizontal scalability: The application runs as a single process with async tasks. This limits scaling to vertical improvements (bigger server) rather than horizontal (more servers). For my workload, this is fine. For systems with unpredictable traffic spikes, it might not be.
Resource isolation: DuckDB runs in the same process as the FastAPI application, sharing memory. Heavy analytical queries can impact API responsiveness. Monitoring shows this has not been an issue in practice, but it is a real coupling that would matter at higher scale.
These are conscious trade-offs. Understanding what you gain and lose is more important than choosing the "correct" architecture.
When Each Approach Makes Sense
The event-driven architecture with Kafka and distributed workers makes sense when:
Request volume measures in thousands per second, not per minute
Different components genuinely need independent scaling and deployment
Multiple teams own different parts of the system
Message ordering and delivery guarantees are business-critical requirements
The simpler architecture with a single application and background tasks makes sense when:
Request volume is moderate and predictable
One team owns the entire system
Development velocity matters more than theoretical scalability
The problem does not require distributed coordination
Most projects fall into the second category. Most teams build the first architecture anyway, often because that is what they know or what looks impressive in system design discussions.
Where It Stands Now
The rebuilt system powers my research, analyzing millions of athletic performances across track and field events. By comparing 9 detection methods from statistical approaches to Bayesian inference, it evaluates precision and recall against known doping violations. It supports investigation workflows where analysts explore athlete performance trajectories, compare different detection approaches, and understand why specific performances were flagged as anomalous.
Figure 5: System evaluation view comparing the performance of the detection methods. The table shows precision, recall, and F1 scores for each approach, with consensus outliers listed below where multiple methods independently flagged the same performances.
Figure 6: Athlete profile view showing anomaly detection in action. Multiple methods flagged this 2016 performance as statistically unusual, which was later validated by a 4-year doping sanction.
The research findings are being prepared for conference submission (as of this writing), comparing statistical versus machine learning approaches for anti-doping detection and understanding which methods work best for different event characteristics.
None of this would have been possible with the original architecture. The infrastructure complexity was genuinely blocking the scientific work. Simplifying the system unlocked the research.
Would I have learned these lessons without building the complex version first? Probably not. Sometimes you need to experience the pain of overengineering to develop intuition for appropriate engineering. The key is recognizing when complexity serves the problem versus when it serves your assumptions about what problems you might have in the future.
The best architecture is not the one with the most sophisticated components or the one that looks best in system design diagrams. It is the one you can understand completely when debugging at 2 AM. More importantly for research, it is the one that gets out of your way and lets you focus on questions that actually matter.
For me, that turned out to be three services, clear interfaces, and code I can hold in my head.
Sometimes the path to better performance is removing what you do not need.
Building a data-intensive system? Happy to chat about architecture decisions. Drop a comment or DM🚀


