
EduPlatform: When Every Feature Touches 14 Services

How an edtech company discovered that splitting a monolith by code file instead of business domain created a distributed monolith worse than the original

EdTech / Online Learning · 110 Engineers · 18-Month Transformation

Company Profile

EduPlatform Global

EduPlatform Global is an online learning platform serving 2.8 million students and 45,000 instructors across 60 countries. With 110 engineers organized into 8 product teams, the company delivers live classes, recorded courses, assessments, and certification programs for universities and corporate training departments.

Their platform originally ran as a Ruby on Rails monolith. Between 2021 and 2023, the team extracted it into what they called a "microservices" architecture -- 67 services in total. But the migration happened without architectural governance, and the result was something far worse than what they started with.

  • 2.8M Students
  • 110 Engineers
  • 67 Services
  • 8 Product Teams

The Situation

A Distributed Monolith

EduPlatform spent two years migrating from a Rails monolith to microservices. But nobody defined service boundaries based on business domains -- they simply split the monolith by code file. The User model became its own service. The Enrollment model became another. The Payment module became a third. Every business operation that previously called methods in the same process now made network calls across services that were never designed to be independent. The result was a distributed monolith with all the complexity of microservices and none of the benefits.

Tightly Coupled Services

Most features required changes to 8-14 services simultaneously. The student enrollment flow alone touched 14 services -- if any single one failed, enrollment broke completely. Services were not independent; they were a monolith connected by HTTP calls instead of function calls.

Circular Dependencies

23 pairs of services had circular dependencies -- Service A called Service B, which called Service A right back. Debugging a single request meant tracing calls across 6-8 services, often in circles. No API contracts existed; services reached directly into each other's data through a shared database.

The Deploy Train

Average feature deployment required coordinated releases across 6+ services. The team ran a "deploy train" every two weeks involving 30+ engineers to orchestrate the release order. Independent deployment -- the primary benefit of microservices -- was completely impossible.

Feature Delivery Collapsed

Feature delivery time increased from 2 weeks to 8 weeks after the microservices migration. What was supposed to make teams faster made them four times slower. Every pull request required approval from at least 3 other teams because changes rippled across service boundaries.

Cascading Failures

The service mesh complexity caused cascading failures that took down the entire platform. Three full outages in a single semester, each triggered by one service failure that propagated through the dependency chain. Students could not access courses during midterm week.

Warning Signs

The symptoms were clear for months, but the team kept attributing them to growing pains rather than fundamental architectural problems.

Velocity

4x Slower Delivery

Feature delivery time quadrupled from 2 weeks to 8 weeks after the microservices migration. The team was shipping fewer features with more engineers than before the migration started.

Process

Biweekly Deploy Train

Coordinated releases every two weeks involving 30+ engineers. Half a day lost to deployment coordination every sprint. Services could not deploy independently because of shared state and circular calls.

Reliability

3 Full Outages Per Semester

Service mesh complexity caused cascading failures. One service going down pulled others with it because of synchronous dependency chains and no circuit breakers. Students lost access during critical exam periods.

Impact

Enrollment Failures

Students unable to enroll during peak registration periods. The enrollment flow touched 14 services, and peak traffic exposed every weakness in the dependency chain simultaneously.

Collaboration

Cross-Team PR Bottleneck

8 teams, but every pull request required approval from at least 3 other teams. Service boundaries did not align with team boundaries, so every change crossed organizational lines.

The Breaking Point

Fall Semester Enrollment Outage

A 12-hour outage during fall semester enrollment affected 340,000 students. The enrollment service triggered a cascade through 14 dependent services, each failing in sequence. The on-call team spent 8 hours just identifying which service had failed first because the distributed tracing was incomplete and the circular dependencies made the failure path impossible to follow.

University Partner Ultimatum

University partners representing 40% of revenue demanded SLA guarantees. Three major universities began evaluating competing platforms. The enrollment outage happened during their busiest week, and their students were the ones affected. The message was clear: fix this or lose the contracts.

Board Mandate

The board mandated: "Fix the architecture or find a buyer." The revenue at risk from university partner churn exceeded the total cost of remediation. For the first time, architecture debt was framed not as an engineering problem but as a business survival question.

The Playbook: 18 Months to Recovery

EduPlatform structured their remediation as four phases, starting with emergency stabilization and ending with governance that prevents the same mistakes from recurring.

Phase 1 · Months 1-3

Emergency Stabilization

  • Implemented circuit breakers on all inter-service calls to prevent cascading failures
  • Created the "enrollment critical path" -- identified and hardened the 14 services that enrollment depended on
  • Added comprehensive distributed tracing with Jaeger across all service boundaries

Result: Zero enrollment outages for spring semester
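The case study does not publish EduPlatform's implementation, so here is a minimal circuit-breaker sketch in Python showing the pattern Phase 1 relied on: after repeated failures the breaker "opens" and fails fast instead of letting a dead downstream service stall every caller. The class name and thresholds are illustrative, not from the source.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    fails fast while open, then allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Failing fast is what stops a single slow service from cascading: callers get an immediate error they can handle, instead of queuing up behind timeouts.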

Phase 2 · Months 4-8

Domain Redesign

  • Hired a domain-driven design consultant to facilitate bounded context workshops with every team
  • Redefined service boundaries around business domains (enrollment, content delivery, assessment, payments) instead of code structure
  • Identified that 67 services could be consolidated to 12 bounded contexts; broke all circular dependencies with event-driven communication

Result: Feature deployment reduced from 6+ services to 1-2 services per change
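Breaking a circular dependency with events works roughly like this sketch (a tiny in-process pub/sub; the service names and the Enrollment/Payments cycle are hypothetical, and in production the bus would be Kafka, SNS, or similar): instead of Payments calling Enrollment back, it publishes an event that Enrollment subscribes to, so the dependency points one way only.

```python
from collections import defaultdict

class EventBus:
    """Tiny in-process pub/sub stand-in for a real message broker."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

# Before: Enrollment called Payments, which called Enrollment back (a cycle).
# After: Payments publishes "payment.confirmed"; Enrollment reacts to it.
bus = EventBus()
activated = []

def on_payment_confirmed(event):
    activated.append(event["enrollment_id"])  # Enrollment's own logic

bus.subscribe("payment.confirmed", on_payment_confirmed)
bus.publish("payment.confirmed", {"enrollment_id": "enr-42"})
```

The cycle disappears because Payments no longer knows Enrollment exists; it only knows the event contract.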

Phase 3 · Months 9-14

Service Consolidation

  • Merged 67 services into 18 well-bounded services -- not the 12 originally planned, but a pragmatic compromise based on team structure and deployment needs
  • Implemented proper API contracts using OpenAPI specs and consumer-driven contract testing between all services
  • Eliminated shared database access -- each service owns its data, communicating through well-defined APIs and events

Result: Feature delivery from 8 weeks to 2 weeks; deploy train eliminated
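Consumer-driven contract testing can be sketched without a framework like Pact: the consumer declares the fields and types it depends on, and the provider's CI verifies every response shape against that declaration. The contract contents below are hypothetical, not EduPlatform's actual API.

```python
# The consumer's declared expectations of the (hypothetical) enrollment API.
ENROLLMENT_CONSUMER_CONTRACT = {
    "enrollment_id": str,
    "student_id": str,
    "status": str,
}

def verify_contract(response: dict, contract: dict) -> list:
    """Return a list of violations; an empty list means the response
    satisfies the contract. Extra fields are allowed, so providers can
    add fields without breaking consumers."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations
```

Because extra fields pass, the provider can evolve freely; only removing or retyping a field a consumer declared will fail the provider's build -- which is exactly what replaces coordinated releases.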

Phase 4 · Months 15-18

Architecture Governance

  • Created an Architecture Decision Record (ADR) process requiring a documented rationale for every new service or major API change
  • Implemented automated architecture fitness functions that detect coupling violations, circular dependencies, and shared database access in CI
  • Quarterly "Architecture Health" review correlating architecture metrics with business outcomes; team topology reorganized around domains, not layers

Result: Architecture health score tracked and improving quarter over quarter
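An architecture fitness function for circular dependencies can be as small as a cycle check over the service dependency graph, run in CI. In practice the graph would come from static analysis or service-mesh telemetry; the graph below, including the enrollment/payments pair, is a hypothetical reconstruction of one of the 23 circular pairs.

```python
def find_cycle(graph):
    """Depth-first search for a cycle in a {service: [dependencies]} graph.
    Returns one cycle as a list of service names, or None if acyclic."""
    visiting, visited = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in visiting:  # back edge: we've found a cycle
                return path[path.index(dep):] + [dep]
            if dep not in visited:
                cycle = dfs(dep, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        visited.add(node)
        path.pop()
        return None

    for service in graph:
        if service not in visited:
            cycle = dfs(service, [])
            if cycle:
                return cycle
    return None

# Hypothetical fragment of the dependency graph:
services = {
    "enrollment": ["payments", "catalog"],
    "payments": ["enrollment"],  # calls enrollment back -- a cycle
    "catalog": [],
}
```

Wiring this into CI means a pull request that introduces a new cycle fails before it merges, which is how the governance phase keeps the old coupling from creeping back.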

Results: Before vs After

Comparison of key metrics before and after the 18-month architecture remediation

Key Metrics

  Metric                 Before          After      Change
  Feature Delivery       8 weeks         2 weeks    75% reduction
  Services per Feature   14              1.5 avg    89% reduction
  Service Count          67              18         Well-bounded
  Coordinated Deploys    Every 2 weeks   None       Eliminated

Lessons Learned

1. Distributed Monolith Is Worse

Microservices without bounded contexts is just a distributed monolith with network latency added. You get all the operational complexity of microservices -- deployment coordination, distributed tracing, network failures -- with none of the independence benefits.

2. Code File Is Not a Domain

Splitting by code file instead of business domain creates services that are coupled by definition. The User model, Enrollment model, and Payment model all serve the same business process. Separating them into services did not create independence -- it created network overhead between tightly coupled components.

3. Architecture Debt Compounds Fastest

Architecture debt compounds faster than code debt. Each new service added more coupling to the existing mesh. Every feature built on top of the wrong boundaries made the boundaries harder to fix. The cost of remediation grew exponentially while the team thought they were making progress.

4. 67 Services Were Not Microservices

Having 67 services did not mean having microservices. It meant having a monolith with extra steps -- network hops, serialization overhead, deployment coordination, and distributed debugging. The number of services is irrelevant; what matters is whether each service can be developed, deployed, and scaled independently.

5. DDD Before Migration

Domain-Driven Design should happen before the migration, not after. EduPlatform spent two years splitting a monolith without understanding their domains, then spent 18 months undoing the damage. The DDD workshops that eventually fixed the architecture would have cost a fraction of the remediation if done upfront.

6. The Deploy Train Tells the Truth

The "deploy train" was the clearest symptom that service boundaries were wrong. If you need coordinated releases, your services are not independent. The deploy train is not a process problem to be optimized -- it is an architecture problem to be fixed at the root.

"If adding a feature requires coordinated deployments across more than 2 services, you don't have microservices. You have a distributed monolith. Fix the boundaries before adding more services."

-- Lesson from EduPlatform Global's architecture team, shared at a platform engineering conference

Frequently Asked Questions

What is a distributed monolith, and how do I know if I have one?

A distributed monolith is a system that has the deployment topology of microservices but the coupling characteristics of a monolith. The clearest indicators: you cannot deploy one service without coordinating with others, a change in one service requires changes in multiple other services, services share a database or call each other's internal APIs, and you have circular dependencies between services. If your "microservices" require a deployment train, you almost certainly have a distributed monolith.

How should service boundaries be defined?

Service boundaries should follow business domains, not code structure. Use Domain-Driven Design techniques like event storming and bounded context mapping to identify natural boundaries where business processes are relatively independent. A well-bounded service owns a complete business capability: its data, its logic, and its API. The test is simple -- can this service be developed, tested, deployed, and scaled by a single team without coordinating with others? If not, the boundary is in the wrong place.

When should Domain-Driven Design happen in a migration?

DDD should happen before the first service is extracted from the monolith. The entire point of DDD is to identify the right boundaries -- splitting a monolith without understanding your domains guarantees you will draw boundaries in the wrong places. Invest 4-8 weeks in event storming workshops, context mapping, and domain modeling before writing any migration code. This upfront investment prevents the far more expensive mistake of building 67 tightly coupled services that need to be consolidated later.

How is architecture debt different from code-level debt?

Code-level debt -- duplicate functions, missing tests, outdated libraries -- is localized and can be fixed incrementally. Architecture debt affects the entire system structure: wrong service boundaries, missing API contracts, circular dependencies, shared databases. Architecture debt compounds faster because every feature built on wrong boundaries reinforces those boundaries. Fixing architecture debt requires coordinated, system-wide changes that are orders of magnitude more expensive than fixing code debt. Prevent it with upfront design; code debt can be managed with ongoing cleanup.

Can over-split microservices be consolidated back into fewer services?

Yes, and EduPlatform proved it. They used a strangler fig approach to merge services incrementally. Start by identifying which services belong to the same bounded context. Route traffic through a new consolidated service while keeping the old services running. Migrate data ownership one entity at a time. Replace synchronous inter-service calls with internal method calls within the consolidated service. The key is doing it incrementally -- merging two services at a time rather than attempting a big-bang consolidation.
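The routing step of that strangler-fig migration can be sketched as a facade that sends each request to either the legacy services or the new consolidated one, based on which entities have already been migrated. The entity names, handlers, and migration set below are all hypothetical.

```python
# Entities whose ownership has moved to the consolidated service.
# This set grows one entity at a time as the migration proceeds.
MIGRATED_ENTITIES = {"enrollment", "waitlist"}

def legacy_handler(entity, request):
    """Stand-in for a call to the old, fine-grained services."""
    return f"legacy:{entity}"

def consolidated_handler(entity, request):
    """Stand-in for a call to the new consolidated service."""
    return f"consolidated:{entity}"

def route(entity, request):
    """Send migrated entities to the consolidated service; everything
    else keeps hitting the old services, untouched."""
    if entity in MIGRATED_ENTITIES:
        return consolidated_handler(entity, request)
    return legacy_handler(entity, request)
```

Because the facade is the only component that knows the migration state, rolling an entity back is a one-line change to the set rather than a redeploy of either service.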

How do you eliminate coordinated deployments?

Deploy coordination is a symptom of coupling, not a process problem. Fix it at the root: implement proper API contracts with versioning so services can evolve independently. Replace synchronous calls with event-driven communication where possible. Eliminate shared database access so each service owns its data. Use consumer-driven contract testing to verify compatibility without coordinated releases. Once services are truly independent -- they own their data, expose stable APIs, and communicate through events -- each team can deploy on their own schedule.

Is Your Architecture Creating Hidden Debt?

Architecture debt compounds faster than code debt. Learn the techniques to identify wrong boundaries, fix coupling, and prevent distributed monolith patterns.