Feature Flags for Safe Refactoring

Deploy refactored code to production with far less risk. Use feature flags to run old and new paths side by side, roll out gradually, and roll back within seconds if anything goes wrong.

What Are Feature Flags?

More Than Feature Releases

Most teams think of feature flags as a way to hide unfinished features from users. That is one use case, but it is the least interesting one. The real power of feature flags is enabling safe, incremental refactoring of production systems. You deploy the new code alongside the old code, route a small percentage of traffic through the new path, verify it works, and gradually increase until the old path is retired.

This turns a risky big-bang deployment into a controlled, reversible experiment. If anything goes wrong at 5% rollout, you flip the flag back. No rollback deployment, no hotfix, no incident. The old code is still right there, serving the other 95%.

Decouple Deploy from Release

Deploy refactored code to production without exposing it. The code is live but dormant behind a flag. You choose when to activate it, independently of the deployment schedule.

Zero-Risk Rollback

If the refactored code causes issues, flip the flag off. Rollback takes seconds, not minutes. No deployment pipeline, no approval process, no downtime. The old code path immediately takes over.

Gradual Rollout

Start at 1% of traffic. Monitor error rates, latency, and business metrics. Increase to 5%, then 25%, then 50%, then 100%. Each step validates the refactoring works under real production load.

Types of Feature Flags

Not all flags are created equal. Each type has a different lifespan, ownership, and cleanup strategy. Understanding these distinctions prevents the "flag sprawl" problem that turns feature flags into their own form of tech debt.

Release Flags

Hide unfinished features until they are ready. These are the most common type and should be short-lived - days to weeks. Once the feature launches, the flag must be removed. A release flag that has been "on" for everyone for three months is dead code waiting to confuse someone.

Lifespan: Days to weeks  |  Owner: Product team  |  Cleanup trigger: Feature GA

Experiment Flags

A/B tests and multivariate experiments. These route different user segments to different variants to measure which performs better. They need careful statistical analysis before removal and often involve analytics integration.

Lifespan: Weeks to months  |  Owner: Data/product team  |  Cleanup trigger: Experiment concluded

Ops Flags (Kill Switches)

Manually operated circuit breakers for production systems. These let you disable expensive features during high load or outages - like turning off recommendation engines during a traffic spike. Unlike other flags, ops flags may be long-lived and that is acceptable. They are operational controls, not temporary toggles.

Lifespan: Long-lived (permanent)  |  Owner: SRE/Ops team  |  Cleanup trigger: Feature decommissioned

Permission Flags

Gate access to features based on user attributes - subscription tier, account age, geographic region, or entitlements. These are long-lived by design and often evolve into your authorization system. They are the one flag type where permanence is expected.

Lifespan: Long-lived (permanent)  |  Owner: Product/engineering  |  Cleanup trigger: Business rule change

Refactoring Flags

The focus of this article. These flags wrap refactored code paths, allowing old and new implementations to coexist in production. They should be among the shortest-lived flags you create. A refactoring flag that stays active for more than two sprints is a red flag in itself - it means the migration stalled.

Lifespan: 1-4 weeks  |  Owner: Engineering team  |  Cleanup trigger: 100% rollout confirmed stable

The Refactoring Flag Pattern

The core pattern is simple: keep the old code path, add the new code path beside it, and use a flag to decide which one executes. This gives you a safe migration path with instant rollback at every stage.

Phase 1: Parallel Paths

Deploy the new code path alongside the old one. Flag defaults to OFF (old path). Both paths exist in production but only the old one executes. This is your safety net.

flag: OFF = 100% old path

Phase 2: Gradual Rollout

Enable the flag for 1% of traffic. Monitor metrics. Increase to 5%, 25%, 50%, and then 100%. At each step, compare error rates, latency, and business metrics between old and new paths.

flag: 1% -> 5% -> 25% -> 50% -> 100%

Phase 3: Cleanup

Once at 100% for a stability period (typically one to two weeks), remove the flag, delete the old code path, and simplify the code. This step is critical - skipping it creates flag debt.

flag: REMOVED, old path: DELETED

Instant Rollback at Every Stage

The key advantage over traditional deployments: at any point during the rollout, you can flip the flag back to 0% and the old code path takes over immediately, provided the new path has not made irreversible changes such as dropping data. No redeployment required. No downtime. No panicked hotfix. This is what makes feature flags one of the safest approaches to refactoring production systems.
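The gradual rollout in Phase 2 depends on each user consistently landing on the same path as the percentage grows. A minimal sketch of one way to do that, assuming a string user ID and using hypothetical helper names (`hashString`, `bucketFor`, `isInRollout`) that do not come from any particular flag library: hash the flag key and user ID into a stable bucket from 0 to 99 and compare it to the current percentage. The hash is a simple 32-bit string hash chosen for brevity; a production system would likely want a better-distributed one.

```typescript
// Hypothetical helpers for illustration; not the API of any specific flag library.

// Simple 32-bit string hash (hash * 31 + charCode, wrapped to 32 bits).
function hashString(input: string): number {
  let hash = 0;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash << 5) - hash + input.charCodeAt(i)) | 0;
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Maps a user to a stable bucket in [0, 100) for a given flag.
// Including the flag key means different flags bucket users differently.
function bucketFor(flagKey: string, userId: string): number {
  return hashString(flagKey + ":" + userId) % 100;
}

// A user is in the rollout when their bucket is below the current percentage.
function isInRollout(flagKey: string, userId: string, percentage: number): boolean {
  return bucketFor(flagKey, userId) < percentage;
}
```

Because a user's bucket never changes, everyone enabled at 5% stays enabled at 25%, and setting the percentage to 0 sends every user back to the old path.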

Implementation Patterns

There are three progressively more sophisticated ways to implement feature flags for refactoring. Choose based on the size and complexity of the code being refactored.

Pattern 1: Simple If/Else

The simplest approach. Works well for small, isolated refactoring where you are replacing one function or method with another. Quick to implement, easy to understand, and easy to clean up.

// Before: Direct call to old implementation
function processOrder(order) {
  return legacyOrderProcessor.process(order);
}

// After: Flag-controlled routing
function processOrder(order) {
  if (featureFlags.isEnabled('use-new-order-processor')) {
    return newOrderProcessor.process(order);
  }
  return legacyOrderProcessor.process(order);
}

Best for: Single function replacements, small isolated changes. Cleanup: Delete the if/else, keep only the new path, remove the flag check.

Pattern 2: Strategy Pattern

When the refactoring involves replacing an entire class or module with a new implementation. The flag selects which strategy (implementation) to use at construction time, keeping the branching logic out of the business code.

// Define the interface both implementations share
interface PaymentProcessor {
  charge(amount: number, customer: Customer): Result;
  refund(transactionId: string): Result;
}

// Factory selects implementation based on flag
function createPaymentProcessor(): PaymentProcessor {
  if (featureFlags.isEnabled('new-payment-processor')) {
    return new StripeV2Processor(config);
  }
  return new LegacyStripeProcessor(config);
}

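// Business code is completely unaware of the flag.
// Call the factory per request (or per unit of work), not once at startup,
// so that flipping the flag takes effect without a restart.
const processor = createPaymentProcessor();
const result = processor.charge(99.99, customer);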

Best for: Class or module replacement where both share an interface. Cleanup: Delete the old class, remove the factory's flag check, simplify to direct instantiation.

Pattern 3: Branch by Abstraction

For large-scale refactoring that touches many call sites. Instead of flagging every call site, you introduce an abstraction layer that routes traffic. This is especially powerful when combined with the Strangler Fig pattern.

// Step 1: Create abstraction layer
class OrderService {
  constructor(featureFlags) {
    this.flags = featureFlags;
    this.legacy = new LegacyOrderService();
    this.modern = new ModernOrderService();
  }

  async createOrder(data) {
    const useNew = this.flags.isEnabled('modern-orders', {
      userId: data.userId,
      percentage: this.flags.getRolloutPercentage('modern-orders')
    });

    if (useNew) {
      return this.modern.createOrder(data);
    }
    return this.legacy.createOrder(data);
  }

  // Each method can be migrated independently
  async cancelOrder(orderId) {
    // This method already migrated - no flag needed
    return this.modern.cancelOrder(orderId);
  }
}

// Step 2: All call sites use the abstraction
// No flag logic scattered through the codebase
const orderService = new OrderService(featureFlags);
await orderService.createOrder(orderData);

Best for: Large systems with many call sites, multi-sprint migrations. Cleanup: Remove abstraction layer once all methods are migrated, replace with direct calls to the new implementation.

Flag Lifecycle Management

Every flag has a lifecycle: creation, enablement, monitoring, and cleanup. The teams that succeed with feature flags are obsessive about the last step. The teams that fail leave a graveyard of stale flags that nobody dares to touch.

1. Create with Metadata

Every flag needs an owner, a creation date, a planned expiration date, and a description of what it controls. If your flag system does not support metadata, track it in a spreadsheet or wiki. A flag without an owner is a flag that never gets cleaned up.

2. Enable Gradually

Start with internal users or a canary group. Then 1% of production traffic. Then ramp up in stages. At each stage, compare key metrics between the old and new paths. Do not jump from 0% to 100% - the whole point of flags is to validate incrementally.

3. Monitor Actively

Set up dashboards that compare old vs new path metrics: error rates, p50/p95/p99 latency, business metrics (conversion, revenue). Set alerts for regression thresholds. If the new path's error rate exceeds the old path's by more than 0.1 percentage points, automatically roll back or alert on-call.

4. Clean Up Relentlessly

Once at 100% for a stability period (one to two weeks), schedule the cleanup: remove the flag check from code, delete the old code path, remove the flag from the flag management system, and close the tracking ticket. This is not optional. A flag at 100% that is not cleaned up is flag debt.
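The regression check from the monitoring step can be sketched as a small decision function. This is an illustration under stated assumptions: the 0.1% threshold is read as an absolute difference in error rate (0.001), the `minRequests` guard against judging on too little traffic is an addition not found in the text above, and all names (`PathStats`, `evaluateRollout`) are hypothetical.

```typescript
// Hypothetical types and names for illustration.
interface PathStats {
  requests: number;
  errors: number;
}

type RolloutAction = "continue" | "rollback" | "insufficient-data";

function errorRate(stats: PathStats): number {
  return stats.requests === 0 ? 0 : stats.errors / stats.requests;
}

// Compares the new path against the old one for the same time window.
// maxDelta: allowed absolute increase in error rate (0.001 = 0.1 percentage points).
// minRequests: assumed lower bound on new-path traffic before deciding.
function evaluateRollout(
  oldPath: PathStats,
  newPath: PathStats,
  maxDelta: number = 0.001,
  minRequests: number = 1000
): RolloutAction {
  if (newPath.requests < minRequests) {
    return "insufficient-data";
  }
  const delta = errorRate(newPath) - errorRate(oldPath);
  return delta > maxDelta ? "rollback" : "continue";
}
```

A scheduler or alerting rule could run this check at each rollout stage and set the flag back to 0% when it returns "rollback".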

Flag Debt Is Real Tech Debt

Every active flag adds a decision point to your code. Every decision point adds testing complexity, cognitive load, and potential for bugs. A codebase with 200 active flags has 200 places where behavior can unexpectedly change. Feature flags that are not cleaned up become the very tech debt you were trying to eliminate.

Tools Landscape

You can implement feature flags with a simple config file, or you can use a dedicated platform with targeting, analytics, and audit trails. The right choice depends on your team size, compliance requirements, and how many flags you expect to manage.

LaunchDarkly

The market leader for feature management. Sophisticated targeting rules, percentage rollouts, audit logs, and integrations with monitoring tools. Best for enterprise teams with compliance requirements. The pricing scales with monthly active users.

Best for: Enterprise teams, regulated industries, large-scale rollouts

Unleash

Open-source feature flag platform you can self-host. Full control over your data, no per-seat pricing, and a solid feature set. The self-hosted version is free; the managed version has additional enterprise features like change requests and audit logs.

Best for: Teams wanting self-hosted, data sovereignty, budget-conscious orgs

Flagsmith

Open-source with both cloud and self-hosted options. Clean UI, segment-based targeting, and remote configuration alongside feature flags in one tool. Good middle ground between DIY and a full enterprise platform.

Best for: Mid-size teams, combined flags + remote config needs

Custom Implementation

A database table, a config file, or environment variables. Simple, free, and under your full control. Works fine for small teams with fewer than 20 flags. Breaks down when you need targeting rules, audit trails, or percentage-based rollouts across distributed systems.

Best for: Small teams, simple on/off flags, getting started quickly

Common Mistakes

Feature flags solve one problem (risky deployments) but can create new problems if not managed carefully. These are the traps teams fall into most often.

1. Flag Sprawl

Creating flags is easy. Removing them is work nobody wants to do. Over time, you accumulate hundreds of flags and nobody knows which ones are still relevant. Set a hard rule: every flag has an expiration date. If the flag is not cleaned up by that date, it triggers an alert or blocks the next sprint planning until resolved.

2. Nested Flags

Flag A enables the new checkout flow. Flag B enables the new payment processor. Flag B only works inside Flag A's new path. Now you have four possible states (A off/B off, A on/B off, A off/B on, A on/B on), and one of them, A off/B on, is invalid because B has no new checkout path to run in. Every nested flag adds invalid states like this, and the number of combinations to reason about doubles with each flag. Avoid nesting by ensuring flags are independent or by combining related flags into a single flag with explicit states.

3. Testing Complexity

Every flag doubles the number of possible code paths. With 5 independent flags, that is 2^5 = 32 combinations. Do not try to test all of them. Instead, test each flag in the two states that matter, fully on and fully off, with the other flags at their defaults: 10 scenarios for 5 flags instead of 32. Test the rollout percentages only in integration tests with synthetic traffic. Keep flags independent so you do not need to test combinations.

4. Stale Flags

A flag that has been at 100% for six months is not a feature flag - it is dead code with a conditional wrapper. The old code path it protects is never going to be needed again, but it still exists, still gets loaded, and still confuses new developers reading the code. Stale flags are one of the most common forms of self-inflicted tech debt.

5. Flags in the Wrong Layer

A common mistake is putting flag checks in the UI when the refactoring is in the data layer, or vice versa. Flag checks should be as close to the code being toggled as possible. A flag check in a React component that decides which API endpoint to call is fragile - put the flag check in the API router or service layer instead, where the actual branching happens.
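For the nested-flags mistake, one way to combine the checkout and payment flags into a single flag with explicit states is sketched below. The variant names and the `CheckoutPlan` shape are hypothetical. The point is that the type only lists combinations in which the new payment processor runs inside the new checkout flow, so the invalid state cannot be expressed.

```typescript
// Hypothetical variant names for a single multi-state flag.
type CheckoutVariant = "legacy" | "new-checkout" | "new-checkout-new-payments";

interface CheckoutPlan {
  checkout: "old" | "new";
  payments: "old" | "new";
}

// Every variant maps to one known-good combination of implementations.
const CHECKOUT_PLANS: Record<CheckoutVariant, CheckoutPlan> = {
  "legacy": { checkout: "old", payments: "old" },
  "new-checkout": { checkout: "new", payments: "old" },
  "new-checkout-new-payments": { checkout: "new", payments: "new" },
};

// Unknown or missing values from the flag system fall back to the old path.
function parseVariant(raw: string): CheckoutVariant {
  if (raw === "new-checkout" || raw === "new-checkout-new-payments") {
    return raw;
  }
  return "legacy";
}

function planFor(raw: string): CheckoutPlan {
  return CHECKOUT_PLANS[parseVariant(raw)];
}
```

Testing then covers one scenario per variant, and rollback means setting the flag value back to "legacy".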

Flag Hygiene: Preventing Flag Debt

The irony of using feature flags to manage tech debt is that poorly managed flags become their own form of tech debt. These practices keep your flag system healthy.

Expiration Dates

Every non-permanent flag gets an expiration date at creation time. Refactoring flags: 4 weeks. Release flags: 2 weeks after GA. Experiment flags: the experiment end date plus 1 week. When a flag expires, it shows up in a report that blocks sprint planning until addressed.
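The expiration rules above can be encoded so that the date is computed, not remembered. A minimal sketch, assuming a hypothetical `FlagRecord` shape and an "anchor" date whose meaning depends on the flag type: the creation date for refactoring flags, the GA date for release flags, and the experiment end date for experiment flags.

```typescript
// Hypothetical types and names for illustration.
type TemporaryFlagType = "refactor" | "release" | "experiment";

interface FlagRecord {
  key: string;
  type: TemporaryFlagType;
  owner: string;
  description: string;
  expiresAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Days added to the anchor date, following the rules in the text:
// refactor: 4 weeks, release: 2 weeks after GA, experiment: end date plus 1 week.
const GRACE_DAYS: Record<TemporaryFlagType, number> = {
  refactor: 28,
  release: 14,
  experiment: 7,
};

function expirationFor(type: TemporaryFlagType, anchor: Date): Date {
  return new Date(anchor.getTime() + GRACE_DAYS[type] * DAY_MS);
}

// Flags past their expiration date, for the report that blocks sprint planning.
function expiredFlags(flags: FlagRecord[], now: Date): FlagRecord[] {
  return flags.filter((flag) => flag.expiresAt.getTime() < now.getTime());
}
```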

Flag Count Limits

Set a maximum number of active flags per team or service. A good starting limit is 10-15 active flags per service. When you hit the limit, you must clean up an existing flag before creating a new one. This creates natural pressure to complete the cleanup step that teams otherwise skip.

Weekly Flag Audit

A 5-minute weekly review of all active flags: Which flags have been at 100% for more than a week? Those need cleanup tickets. Which flags have not been touched in two weeks? Those need investigation. Which flags have no owner? Those need adoption or removal. Automate this report if possible.
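The three audit questions translate directly into a report function. The sketch below assumes a hypothetical `AuditedFlag` record that your flag system may or may not expose in this shape; the thresholds (one week at 100%, two weeks untouched) come from the text above.

```typescript
// Hypothetical record shape for illustration.
interface AuditedFlag {
  key: string;
  owner: string | null;
  fullyRolledOutAt: Date | null; // when the flag reached 100%, if it has
  lastModifiedAt: Date;
}

interface AuditReport {
  needsCleanup: string[];       // at 100% for more than a week
  needsInvestigation: string[]; // not touched in two weeks
  needsOwner: string[];         // no owner assigned
}

const DAY_MS = 24 * 60 * 60 * 1000;

function auditFlags(flags: AuditedFlag[], now: Date): AuditReport {
  const report: AuditReport = { needsCleanup: [], needsInvestigation: [], needsOwner: [] };
  for (const flag of flags) {
    if (flag.fullyRolledOutAt !== null &&
        now.getTime() - flag.fullyRolledOutAt.getTime() > 7 * DAY_MS) {
      report.needsCleanup.push(flag.key);
    }
    if (now.getTime() - flag.lastModifiedAt.getTime() > 14 * DAY_MS) {
      report.needsInvestigation.push(flag.key);
    }
    if (flag.owner === null) {
      report.needsOwner.push(flag.key);
    }
  }
  return report;
}
```

The three checks are independent, so a single flag can appear in more than one list.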

Linting Rules

Add custom lint rules that flag (pun intended) common problems: flag checks more than one level deep (nested flags), flag checks without corresponding test cases for both states, and flag names that do not follow your naming convention. Some teams add a TODO comment requirement that links each flag check to the cleanup ticket.

Cleanup as Definition of Done

A refactoring task is not done when the new code is at 100%. It is done when the flag is removed, the old code is deleted, and the flag is deregistered. Include flag cleanup in the acceptance criteria of every refactoring story. If the cleanup is deferred to a separate ticket, that ticket gets created in the same sprint.

Naming Conventions

Use a consistent naming pattern that encodes the flag type and team: refactor.orders.new-processor, release.checkout.apple-pay, ops.search.kill-switch. This makes auditing trivial and lets you filter by type or team in your flag management tool.
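A naming convention like this is easy to enforce mechanically. The sketch below assumes a three-part `type.area.description` shape inferred from the examples above; the `experiment` and `permission` prefixes are assumptions added to cover the other flag types, and the function names are hypothetical.

```typescript
// Assumed convention: <type>.<team-or-area>.<description>, lowercase with hyphens.
const FLAG_NAME_PATTERN =
  /^(refactor|release|experiment|ops|permission)\.[a-z0-9-]+\.[a-z0-9-]+$/;

function isValidFlagName(name: string): boolean {
  return FLAG_NAME_PATTERN.test(name);
}

// Returns the type prefix of a valid name, or null if the name is invalid.
function flagTypeOf(name: string): string | null {
  return isValidFlagName(name) ? name.substring(0, name.indexOf(".")) : null;
}
```

A check like this can run in a lint rule or at flag creation time, and `flagTypeOf` gives the audit report a way to group flags by type.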

Frequently Asked Questions

There is no universal limit on how many feature flags is too many, but a good rule of thumb is 10-15 active flags per service. If you have more than that, you likely have stale flags that should have been cleaned up. Some large organizations run hundreds of flags, but they have dedicated flag management teams and tooling. For most teams, the problem is not having too many flags - it is having too many stale flags. A team with 8 flags that are all actively being rolled out is healthier than a team with 3 flags that have been sitting at 100% for six months.

A well-implemented flag check adds microseconds, not milliseconds. The flag evaluation itself is just a conditional check - the same as any other if statement in your code. The potential latency comes from how flags are loaded: if every request fetches flag state from a remote service, that adds network latency. Use local caching with periodic sync (most flag libraries do this by default) and the overhead becomes negligible. The latency from a remote flag evaluation service is typically 1-5ms, but with local caching it drops to near zero.
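The caching idea can be sketched as a client that evaluates flags from an in-memory snapshot and contacts the remote service only when a sync is due. This is an illustration, not the design of any particular flag library: `CachedFlagClient` and its methods are hypothetical, and the remote call is modeled as a synchronous `fetchAll` callback to keep the example short (a real client would fetch asynchronously in the background).

```typescript
// Hypothetical client for illustration.
type FlagSnapshot = Record<string, boolean>;

class CachedFlagClient {
  private snapshot: FlagSnapshot = {};
  private lastSyncMs: number | null = null;
  private fetchAll: () => FlagSnapshot;
  private syncIntervalMs: number;

  constructor(fetchAll: () => FlagSnapshot, syncIntervalMs: number) {
    this.fetchAll = fetchAll;
    this.syncIntervalMs = syncIntervalMs;
  }

  // Intended to be called from a timer or background task, not per request.
  // Returns true when the snapshot was refreshed.
  syncIfDue(nowMs: number): boolean {
    if (this.lastSyncMs !== null && nowMs - this.lastSyncMs < this.syncIntervalMs) {
      return false;
    }
    this.snapshot = this.fetchAll();
    this.lastSyncMs = nowMs;
    return true;
  }

  // Request-path evaluation: a lookup in local memory, no network access.
  isEnabled(flagKey: string): boolean {
    return this.snapshot[flagKey] === true;
  }
}
```

The trade-off is staleness: a flag flip reaches each process only at its next sync, so the sync interval bounds how quickly a rollback takes effect.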

Feature flags can also guard database schema migrations, but with care. The pattern is called "expand and contract" or "parallel change." First, add the new column or table alongside the old one (expand). Dual-write to both the old and new schema during the migration period, and use a flag to control which one your application reads from. Once the flag is at 100% and the new schema is validated, stop writing to the old column or table and drop it (contract). This is especially important for databases because you cannot easily roll back a schema change that has already dropped data.
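A minimal sketch of the dual-write phase, using in-memory stores in place of the old and new schema. All names here (`UserStore`, `MigratingUserRepository`, the `read-users-from-new-schema` flag) are hypothetical. Writes always go to both stores; the flag only decides which store serves reads, so flipping it back is safe.

```typescript
// Hypothetical interfaces for illustration.
interface FlagProvider {
  isEnabled(flagKey: string): boolean;
}

interface UserStore {
  save(id: string, email: string): void;
  find(id: string): string | undefined;
}

// Stand-in for a table or column in the old or new schema.
class InMemoryUserStore implements UserStore {
  private rows: Record<string, string> = {};
  save(id: string, email: string): void {
    this.rows[id] = email;
  }
  find(id: string): string | undefined {
    return this.rows[id];
  }
}

class MigratingUserRepository {
  private oldStore: UserStore;
  private newStore: UserStore;
  private flags: FlagProvider;

  constructor(oldStore: UserStore, newStore: UserStore, flags: FlagProvider) {
    this.oldStore = oldStore;
    this.newStore = newStore;
    this.flags = flags;
  }

  // Dual-write: both schemas stay current during the migration period.
  save(id: string, email: string): void {
    this.oldStore.save(id, email);
    this.newStore.save(id, email);
  }

  // The flag selects the read path only.
  find(id: string): string | undefined {
    const store = this.flags.isEnabled("read-users-from-new-schema")
      ? this.newStore
      : this.oldStore;
    return store.find(id);
  }
}
```

Rows written before dual-writing began exist only in the old store, so a backfill is needed before reads move to the new schema.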

Test each flag state independently, not all combinations. For every flag, you need at minimum two test scenarios: flag on and flag off. Mock or stub your flag provider in unit tests so you can control the flag state deterministically. In integration tests, test with the flag at 0% and 100%. Do not try to test percentage-based rollouts in unit tests - that is a property of the flag system, not your code. The most important test is verifying that flipping the flag does not break the application, which means testing the transition, not just the steady states.
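As a sketch of the stubbing approach, the example below passes a flag provider into the code under test so a test can pin the flag to either state. `FlagProvider`, `stubFlags`, and this string-returning `processOrder` are hypothetical stand-ins, not the API of a real flag library.

```typescript
// Hypothetical provider interface for illustration.
interface FlagProvider {
  isEnabled(flagKey: string): boolean;
}

// Test stub: returns fixed values and treats unknown flags as off.
function stubFlags(states: Record<string, boolean>): FlagProvider {
  return {
    isEnabled: (flagKey: string) => states[flagKey] === true,
  };
}

// Code under test: the provider is injected, so tests control the flag state.
function processOrder(orderId: string, flags: FlagProvider): string {
  if (flags.isEnabled("use-new-order-processor")) {
    return "new:" + orderId;
  }
  return "legacy:" + orderId;
}

// One scenario per flag state.
const resultWithFlagOn = processOrder("42", stubFlags({ "use-new-order-processor": true }));
const resultWithFlagOff = processOrder("42", stubFlags({ "use-new-order-processor": false }));
```

Injecting the provider keeps the tests deterministic and lets the same test body run once per flag state.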

Environment variables can serve as feature flags for simple use cases. They work when you have a small number of on/off flags and do not need percentage rollouts, user targeting, or runtime changes without redeployment. The main limitation is that changing an environment variable typically requires a redeployment or at minimum a process restart, which defeats one of the key benefits of feature flags: instant, no-deploy toggling. Start with environment variables if you are just getting started, and graduate to a flag management tool when you need runtime control or have more than 10 flags.
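A minimal sketch of environment-variable flags, assuming a `FLAG_` prefix and a small set of accepted truthy values; both are conventions invented for this example. The environment is passed in as a plain object so the function stays testable; in Node.js you would pass `process.env`.

```typescript
// Hypothetical naming convention: "use-new-order-processor" -> "FLAG_USE_NEW_ORDER_PROCESSOR".
type Env = Record<string, string | undefined>;

function envFlagName(flagKey: string): string {
  return "FLAG_" + flagKey.toUpperCase().replace(/[^A-Z0-9]+/g, "_");
}

// Missing or unrecognized values mean "off", so the old path is the default.
function isEnabledFromEnv(flagKey: string, env: Env): boolean {
  const raw = env[envFlagName(flagKey)];
  if (raw === undefined) {
    return false;
  }
  const value = raw.trim().toLowerCase();
  return value === "1" || value === "true" || value === "on";
}
```

Defaulting to off means a missing or mistyped variable falls back to the old code path.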

Feature flags and the Strangler Fig pattern are complementary techniques. The Strangler Fig pattern provides the overall strategy for incrementally replacing a legacy system - you build new functionality around the old system and gradually migrate traffic. Feature flags provide the mechanism for controlling that traffic migration. The Strangler Fig tells you what to replace and when. Feature flags give you the how - the ability to route traffic between old and new systems with percentage-based rollouts and instant rollback. Used together, they are among the safest approaches to large-scale system modernization.

Start Refactoring Without Fear

Feature flags give you the safety net to refactor production systems incrementally.