Staying Sane on a 6,000-Line Refactor with Claude Code

Horses and Grooms Crossing a River, Ming dynasty. Freer Gallery of Art, Smithsonian Institution. Public Domain.

How a simple checklist file helped me manage complexity and complete a Python-to-TypeScript migration with the Claude Code CLI and Claude Sonnet 4.

Introduction

As large language models become more capable, it's tempting to rely on them for increasingly complex engineering tasks. Tools like Anthropic's Claude Code CLI paired with Claude Sonnet 4 are explicitly designed to support multi-step code refactors, porting, and feature work. Yet despite the impressive capability of these models, I've found that their performance begins to degrade in long-running or large-scope tasks—especially those that span multiple sessions.

This degradation isn't a flaw in the technology, but rather a fundamental challenge that emerges when we push any stateless system beyond its designed boundaries. Just as a web server needs external storage to remember user sessions, language models need external scaffolding to maintain coherence across complex, multi-session workflows.

This post describes a lightweight but powerful pattern I use for keeping Claude coherent across multi-thousand-line refactors: the external progress file. By treating the model as a stateless executor and myself as the high-level planner, I was able to manage a 6,000-line Python-to-TypeScript migration without losing clarity or introducing regressions. More importantly, this approach implements a form of persistent memory that transforms how you can collaborate with AI on substantial engineering projects.


Why Big Refactors Break LLMs

Claude Sonnet 4 has a massive context window and strong planning abilities, but it's not immune to cognitive drift. Understanding why this happens helps us design better solutions. Here are three fundamental ways long refactors tend to go off the rails, along with the underlying mechanisms that cause these failures.

1. Lost-in-the-Middle Effect

Studies on long-context transformers reveal that LLMs often ignore information positioned in the center of large prompts, a phenomenon researchers call the "lost-in-the-middle" effect. Attention does not distribute evenly across very long sequences: models show a pronounced primacy and recency bias, attending most reliably to the beginning and end of the context window while under-weighting whatever sits in between.

When you're feeding 40,000 tokens of code into a session, critical details relevant to your current task—like specific architectural constraints mentioned 15,000 tokens earlier—can effectively become invisible to the model. Imagine trying to hold a conversation in a crowded room where voices in the middle distance fade into background noise, even though you know important information is being spoken there.

2. Plan/Execution Drift

Even if you begin with a strong plan, language models exhibit what we might call "contextual amnesia" as sessions progress. Unlike human developers who can refer back to written specifications or remember earlier design decisions, the model's "memory" is limited to what fits in its current context window.

This creates a subtle but devastating effect: each step the model takes is optimized for immediate success, but gradually diverges from the initial goal. Picture a ship navigator who can see perfectly in a 10-mile radius but forgets the original destination. Each navigation decision makes perfect sense locally, but the ship slowly drifts off course. Without conscious re-injection of the original plan, the model's choices will optimize for local coherence rather than global objectives.

3. Session Volatility

Claude Code CLI provides a temporary, in-memory plan that exists only for the duration of your session. If your CLI crashes, your network connection drops, or you simply close your terminal, that plan evaporates entirely. The model has no long-term memory system to fall back on, and resuming work becomes an exercise in archaeological reconstruction—trying to infer the original intent from partially completed code.

This is fundamentally a problem of state management in distributed systems. Your development workflow spans multiple sessions, but the AI system treats each session as independent. Without an external persistence layer, you're essentially asking the model to solve a jigsaw puzzle while forgetting which pieces you've already placed.


The Progress File Pattern: External Memory for Stateless Systems

To avoid these pitfalls, I created a file called ConversionProgress.md at the root of my project. This file served a dual purpose: it acted as both a TODO list and a persistent memory system that bridges the gap between human planning and AI execution.

The core insight here is that we're solving a classic computer science problem: how to maintain state across stateless interactions. Web developers solve this with session storage and databases. Distributed systems use event logs and checkpoints. Our progress file implements the same principle for human-AI collaboration.

Core Properties That Make This Work:

Persistent: The file survives CLI restarts, network interruptions, and context window resets. Unlike in-memory plans, this external artifact provides continuity across sessions, acting as a form of "save game" functionality for complex refactors.

Scoped: Each task represents at most 1,000-1,200 tokens of work. This isn't arbitrary—it aligns with both cognitive load management principles and the model's optimal performance window. Tasks of this size can be completed in a single focused session without overwhelming the model's planning capabilities.

Canonical: The file is always re-injected before every Claude CLI invocation, ensuring the model starts each interaction with fresh context about the overall project state. This creates a forcing function that prevents drift and maintains alignment with original objectives.

Auditable: Each completed step includes rationale and test results, creating a breadcrumb trail that helps both human and AI understand the evolution of the codebase. This audit trail becomes invaluable when debugging issues or explaining decisions to team members.

This pattern essentially transforms the interaction model from "conversation with an AI" to "collaboration with a persistent agent that happens to forget everything between sessions." The progress file becomes the agent's external memory system, much like how databases provide persistent storage for stateless web applications.

Connection to Agent Framework Principles

What we're implementing here mirrors the architecture of sophisticated AI agent frameworks like AutoGen and BabyAGI, but without the complexity of orchestration libraries or external memory servers. These frameworks typically implement three core components: a planner that breaks down tasks, an executor that performs individual steps, and a memory system that maintains state across interactions.

Our progress file serves as the memory system, we act as the high-level planner, and Claude Code CLI functions as the executor. By explicitly designing these roles and interfaces, we create a simple but robust agent architecture that can handle complex, multi-session workflows without the overhead of more sophisticated frameworks.
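
To make these roles concrete, here is a minimal TypeScript sketch of the loop the pattern implements. Treat it as an illustration under assumptions: the runClaude helper is a hypothetical stand-in for however you shell out to the CLI, and ticking the checkbox is deliberately left to the human.

import { readFileSync } from "node:fs";

// Hypothetical stand-in: swap in your actual Claude Code CLI invocation
// (for example, a child_process call).
async function runClaude(prompt: string): Promise<string> {
  throw new Error("wire this up to your CLI of choice");
}

const PROGRESS_FILE = "ConversionProgress.md";

// One iteration of the loop: memory (the file), planner (the human-authored
// checklist), executor (the model working on exactly one unchecked item).
async function runNextTask(): Promise<void> {
  const progress = readFileSync(PROGRESS_FILE, "utf8");

  // Planner: the next unit of work is always the first unchecked item.
  const nextTask = progress
    .split("\n")
    .find((line) => line.trimStart().startsWith("- [ ]"));

  if (!nextTask) {
    console.log("All tasks complete.");
    return;
  }

  // Executor: hand over the full external memory plus one scoped task.
  const output = await runClaude(
    `${progress}\n\nComplete ONLY this task, run the tests, then report:\n${nextTask}`
  );
  console.log(output);
  // The checkbox is ticked by a human (or a follow-up script) once tests pass.
}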


Setting Up the Migration: Designing for Maintainable Complexity

I was tasked with converting a relatively large Flask API (approximately 6,000 lines of Python) into a modern TypeScript/Node.js project. The scope included database layer migration, API endpoint conversion, middleware translation, and comprehensive test coverage. I chose Node 20, native ESM modules, and Jest as my test runner. The entire process was managed through the Claude Code CLI with Claude Sonnet 4 as the execution engine.

The key insight for setup was recognizing that the initial planning phase would determine the success of the entire migration. Rather than diving into code immediately, I invested time in creating a well-structured breakdown that would serve as the project's "constitution" throughout the refactor.

The first prompt was carefully designed to establish boundaries and expectations:

You are tasked with porting a 6,000-line Python service (Flask API, PostgreSQL, etc.) 
to TypeScript using Node 20, native ESM modules, and Jest for testing. 
Begin by scanning the repo and producing a markdown checklist called ConversionProgress.md 
with at most 12 high-level tasks. Each task should involve <1,200 tokens of work.

The specific constraints in this prompt address several cognitive and technical limitations. The "at most 12 high-level tasks" limit comes from research on human working memory, which typically maxes out around 7±2 items for complex information. By keeping the high-level task count manageable, we ensure that both human and AI can maintain a clear mental model of the overall project structure.

The "1,200 tokens of work" constraint aligns with the model's optimal performance characteristics. Tasks of this size can be completed in a single focused session without approaching context window limits, while still representing meaningful progress toward the overall goal. This token limit also forces proper decomposition—if a task requires more than 1,200 tokens to describe or implement, it's likely too complex and should be broken down further.

Claude generated this initial breakdown:

# ConversionProgress.md

- [ ] Set up initial TypeScript project scaffold with tsconfig.json, Jest, and ESLint
- [ ] Port models from models.py → src/models/*.ts
- [ ] Convert routes in api/*.py → src/routes/*.ts  
- [ ] Replace SQLAlchemy ORM logic with Prisma or equivalent
- [ ] Translate middleware logic (auth, logging)
- [ ] Implement TypeScript-based dependency injection
- [ ] Convert utility functions and helpers
- [ ] Set up database migration scripts
- [ ] Port test suites from Python unittest to Jest
- [ ] Implement error handling and logging systems
- [ ] Performance optimization and bundling setup
- [ ] Final integration testing and deployment prep

Each of these items became its own scoped session, with clear entry and exit criteria that prevented scope creep and maintained forward momentum.
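
As an illustration of what such criteria can look like for one item (wording illustrative, not copied from my actual file), a short annotation under the checklist entry is enough:

- [ ] Port models from models.py → src/models/*.ts
  ↳ Entry: models.py reviewed, Prisma schema initialized.
  ↳ Exit: all model types compile and the model unit tests pass under Jest.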


The Loop in Action: Implementing Persistent Collaboration

Every Claude CLI session followed the same carefully designed workflow, which implements a form of "checkpointed execution" that combats both model drift and human cognitive overload:

Step 1: Context Injection and State Restoration

Before issuing any commands, I injected the complete ConversionProgress.md file into the session. This serves the same function as loading a saved game—it restores the model's understanding of project state, completed work, and next priorities. This step is crucial because it ensures the model starts each interaction with full context rather than trying to infer project state from partially completed code.
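
As an illustration (the wording is mine, not a required template), a session under this pattern opens with something like:

Here is ConversionProgress.md, the canonical plan for this migration:

<full contents of ConversionProgress.md pasted here>

Work only on the topmost unchecked task. Restate it before writing any code,
run the tests after each change, and stop once its exit criteria are met.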

Step 2: Task Clarification and Scope Confirmation

Claude reads the top unchecked item and restates it for clarity, often breaking it down into specific sub-steps. This restatement serves as a form of "contract negotiation" where both human and AI confirm understanding of the work to be done. For example, when tackling "Port models from models.py → src/models/*.ts", Claude might respond:

I'll work on porting the Python models to TypeScript. Based on the existing models.py, 
I need to:
1. Convert SQLAlchemy model definitions to Prisma schema
2. Generate TypeScript types for each model
3. Implement basic CRUD operations
4. Ensure proper type safety and validation

This should require approximately 800-1000 tokens and will result in 3-4 new TypeScript files.
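
For a sense of what one of these sub-steps produced, here is a hedged sketch of the TypeScript side of a ported model. The User fields, file layout, and Prisma usage are illustrative of my project rather than a template:

// src/models/user.ts (hypothetical path)
import { PrismaClient } from "@prisma/client";

// Ported counterpart of the SQLAlchemy User model from models.py.
// Field names are illustrative; the real schema lives in prisma/schema.prisma.
export interface User {
  id: number;
  email: string;
  displayName: string;
  createdAt: Date;
}

const prisma = new PrismaClient();

// Basic CRUD operations mirroring what the Flask service exposed.
export async function createUser(email: string, displayName: string): Promise<User> {
  return prisma.user.create({ data: { email, displayName } });
}

export async function getUserById(id: number): Promise<User | null> {
  return prisma.user.findUnique({ where: { id } });
}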

Step 3: Focused Execution with Continuous Testing

The model generates the necessary code changes, runs tests continuously, and validates that changes don't break existing functionality. This step implements the principle of "fail fast, fail cheap"—by testing immediately after each change, we catch regressions when they're easy to fix rather than after multiple layers of additional changes.
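
Concretely, "testing immediately" meant each checkpoint ended with a Jest run. A minimal test against the hypothetical User model sketched above might look like this, assuming a test database is configured:

import { createUser, getUserById } from "../src/models/user";

// Checkpoint smoke test: run after every change; a red test blocks the checkbox.
describe("User model port", () => {
  it("round-trips a created user", async () => {
    const created = await createUser("ada@example.com", "Ada");
    const fetched = await getUserById(created.id);
    expect(fetched?.email).toBe("ada@example.com");
  });
});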

Step 4: Checkpoint Update and Progress Recording

If the implementation succeeds, Claude marks the checkbox as complete and appends a one-line rationale explaining the approach taken and test outcomes. This creates an audit trail that serves multiple purposes: it helps debug issues later, provides context for future development, and creates a "commit message" that explains the reasoning behind each change.

Here's a real example from the migration showing the progression from task to completion:

Before:

- [ ] Port models from models.py → src/models/*.ts

After:

- [x] Port models from models.py → src/models/*.ts ✅
  ↳ Used Prisma with inferred types. Generated User, Product, Order models. All unit tests passed.

This simple pattern creates a feedback loop that maintains project coherence across multiple sessions while providing clear indicators of progress and potential issues.

The Psychology of Checkpointed Progress

This workflow leverages several psychological principles that enhance both human and AI performance. For humans, checking off completed tasks provides a sense of accomplishment and forward momentum that combats the fatigue associated with large refactors. For the AI, the explicit task boundaries prevent the model from "wandering" into scope creep or optimization rabbit holes.

The pattern also implements what software engineers call "bulkheading"—if one task fails or goes off-track, the failure is contained within that specific checkpoint rather than contaminating the entire project. This isolation makes debugging and recovery much more manageable.


Results: Measuring the Impact of Structured Collaboration

The quantitative results demonstrate the effectiveness of external memory patterns for complex AI-assisted development:

| Metric | Manual Baseline | With Progress File |
| --- | --- | --- |
| Sessions to completion | 18 | 7 |
| Regression bugs | 10 | 3 |
| Token usage | ~220K | ~140K |
| Resumption after crash | Manual reset | Instant reload via file |

Let me explain why each metric improved, as these improvements reveal the underlying efficiencies gained through structured collaboration:

Fewer Sessions (18 → 7): The dramatic reduction in required sessions resulted from eliminating context switching overhead and reducing rework. Without the progress file, each session required significant time to re-establish context, understand what had been completed, and determine next steps. The external memory system eliminated this "warm-up" time and allowed each session to be immediately productive.

Fewer Regression Bugs (10 → 3): The checkpoint-based approach naturally enforced better testing discipline. Since each task required test validation before marking as complete, issues were caught and fixed immediately rather than accumulating across multiple changes. This implements a form of "continuous integration" at the development task level.

Lower Token Usage (~220K → ~140K): More focused prompts and reduced context switching led to significant efficiency gains. Without the progress file, many tokens were wasted on context reconstruction, clarification of previous work, and repeated explanations of project structure. The external memory system eliminated this redundancy.

Instant Resumption: Perhaps most importantly, the ability to instantly resume work after interruptions transformed the development experience from fragile and frustrating to robust and reliable. This reliability enabled longer, more complex refactors that would have been impractical with traditional session-based approaches.

The total migration took about a weekend of focused work, with significantly less mental fatigue and far fewer mid-prompt derailments than previous attempts using traditional conversational approaches.


Best Practices: Lessons from Implementation

Through multiple projects using this pattern, I've identified several key practices that determine success or failure:

Keep Each Task Genuinely Small: If the model needs more than 1,500 tokens to complete a step, split it further. This isn't just about token limits—it's about maintaining cognitive clarity for both human and AI. Large tasks lead to scope creep, where the model starts solving adjacent problems that weren't part of the original intent. Small tasks force clear thinking about dependencies and interfaces.

Enforce Test-Driven Checkpoints: Don't mark any task as complete until tests pass. Claude Sonnet 4 is remarkably capable, but it shares a common AI tendency to overstate confidence in untested code. The forced testing requirement creates a reality check that prevents optimistic assessments from accumulating into major issues down the line.

Archive Completed Work Regularly: Every 5-10 completed tasks, move finished items to an "Archive" section of your progress file. This prevents the file from becoming unwieldy while maintaining the historical record. A bloated progress file becomes harder for both human and AI to parse, reducing its effectiveness as a coordination tool.
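
For reference, after a few rounds of archiving the file looks roughly like this; the exact headings are unimportant as long as finished items stay out of the active list:

# ConversionProgress.md

- [ ] Implement error handling and logging systems
- [ ] Performance optimization and bundling setup

## Archive

- [x] Set up initial TypeScript project scaffold with tsconfig.json, Jest, and ESLint ✅
  ↳ Scaffold builds cleanly; sample Jest test passes.
- [x] Port models from models.py → src/models/*.ts ✅
  ↳ Used Prisma with inferred types. Generated User, Product, Order models. All unit tests passed.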

Make Context Injection Automatic: Always paste the complete ConversionProgress.md into each prompt, or create shell aliases that inject it automatically. Manual injection is error-prone, and forgetting even once can lead to context drift that's difficult to recover from. The slight overhead of automatic injection pays enormous dividends in consistency.

Force Task Restatement: Ask the model to restate its understanding of the current task before writing any code. This simple step catches scope creep early and ensures alignment between human intent and AI execution. It also provides an opportunity to course-correct if the model has misunderstood the requirements.

Common Failure Modes and Troubleshooting

Even with careful planning, several failure modes can emerge. Being aware of these helps you course-correct quickly:

Task Size Misjudgment: Sometimes a task that seemed appropriately scoped proves larger than expected during implementation. When this happens, don't force completion—instead, mark the task as partially complete, document what was accomplished, and break the remaining work into smaller chunks. This maintains the integrity of your checkpoint system.
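
In the file itself, one way to record a split like this (wording illustrative) is to annotate the original item and queue the remainder as new, smaller tasks:

- [~] Convert routes in api/*.py → src/routes/*.ts (partially complete)
  ↳ Auth and user routes ported, tests passing; admin routes blocked on middleware task.
- [ ] Convert remaining admin routes → src/routes/admin.ts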

Dependency Confusion: Occasionally, tasks that seemed independent prove to have hidden dependencies. When you encounter this, update the progress file to reflect the actual dependency order and consider whether the original breakdown missed important architectural considerations.

Test Environment Issues: If tests consistently fail due to environment setup rather than code issues, create a separate task for fixing the test infrastructure before proceeding with feature work. Don't let test environment problems derail your main development workflow.

Context Overload: If you find yourself having to inject increasingly large amounts of context to maintain coherence, it may indicate that your task breakdown is too granular or that you need to refactor your progress file structure. The goal is to maintain simplicity while providing adequate context.


Automation Tips: Scaling the Pattern

To make this pattern truly effective for regular use, automation becomes essential. Here's how to streamline the workflow:

# Add this to your shell profile to make prompt injection automatic.
# Flag names and model aliases vary between Claude Code releases; treat this as
# a sketch and check `claude --help` for the invocation your install supports.
ccode() {
  claude -p "$(cat ConversionProgress.md 2>/dev/null)

$*

Remember to complete only the current task and mark ✅ only after tests pass."
}

This function implements automatic context injection with every Claude Code invocation: the progress file is prepended to whatever instruction you pass in (for example, ccode "Work on the next unchecked task"), and the closing reminder reinforces the discipline of focused execution and test validation.

For more complex projects, consider creating a simple script that validates your progress file structure:

#!/bin/bash
# validate_progress.sh - Ensures progress file follows best practices

progress_file="ConversionProgress.md"

if [ ! -f "$progress_file" ]; then
    echo "Error: ConversionProgress.md not found"
    exit 1
fi

# Count unchecked tasks
unchecked=$(grep -c "^- \[ \]" "$progress_file")

if [ "$unchecked" -gt 15 ]; then
    echo "Warning: $unchecked unchecked tasks. Consider breaking down large tasks."
fi

# Check for tasks without descriptions
if grep -qE "^- \[ \][[:space:]]*$" "$progress_file"; then
    echo "Error: Found tasks without descriptions"
    exit 1
fi

echo "Progress file validated successfully"

This type of validation helps maintain the discipline necessary for the pattern to work effectively across team environments or longer projects.


When Not to Use This Pattern

Understanding the boundaries of any technique is as important as knowing how to apply it. The progress file pattern adds overhead that may not be justified for simpler scenarios:

Small, One-Off Edits: If you're making changes to fewer than 200 lines of code or working on a single file, the overhead of creating and maintaining a progress file likely exceeds its benefits. For quick fixes or small features, traditional conversational approaches work well.

Abundant Context Window Headroom: If your entire project fits comfortably within the model's context window with room to spare, you may not need external memory management. However, be cautious here—projects have a tendency to grow, and establishing good patterns early often pays dividends later.

Strong CI/CD Safety Nets: If you have comprehensive automated testing and deployment pipelines that catch regressions quickly, the additional validation provided by the checkpoint system may be redundant. However, consider that catching issues during development is almost always more efficient than catching them in CI.

Highly Experimental Work: When doing exploratory programming or proof-of-concept development where the requirements are rapidly evolving, the structure imposed by a progress file might feel constraining. In these cases, prioritize speed of iteration over formal process.

The key insight is that this pattern shines when the cost of losing coherence or introducing regressions is high, and when the work spans multiple sessions or involves complex interdependencies.


Broader Applications: Prompt Engineering at Scale

The progress file pattern represents a specific implementation of a broader principle: prompt engineering at scale. As AI systems become more capable and we use them for increasingly complex tasks, we need to evolve beyond single-turn prompts toward structured, multi-session collaboration patterns.

This approach connects to several emerging patterns in AI-assisted development:

Persistent Context Management: Just as databases solved the statelessness problem for web applications, external memory systems like progress files solve the statelessness problem for AI collaboration. This pattern will likely become more important as AI systems become more capable but remain fundamentally stateless.

Human-AI Cognitive Load Distribution: By taking responsibility for high-level planning and task decomposition while delegating execution to AI, we create a division of labor that leverages the strengths of both human and artificial intelligence. Humans excel at strategic thinking and context awareness; AI excels at consistent execution and detail management.

Structured Agent Interactions: The progress file creates a formal interface between human and AI that could be extended to multi-agent systems. Imagine progress files that coordinate work between multiple AI specialists—a planning agent, a coding agent, and a testing agent—all working toward shared objectives defined in the external memory system.

As the AI ecosystem evolves, patterns like this will likely be embedded directly into development tools and IDEs, creating more sophisticated human-AI collaboration environments while maintaining the simplicity and transparency that make them effective.


Conclusion

Agentic coding workflows represent a fundamental shift in how we approach complex software development tasks. However, LLMs need structured guidance to maintain coherence across extended projects. By anchoring your plan in a persistent, minimal checklist, you build a system that mirrors what sophisticated agent frameworks do behind the scenes—without needing orchestration libraries or external memory servers.

The external progress file pattern works because it solves the fundamental computer science problem of maintaining state across stateless interactions. It transforms an inherently fragile conversation into a robust, resumable collaboration that can survive interruptions, context resets, and the natural drift that occurs in any complex system.

In my experience, the ConversionProgress.md pattern is the lightest, most effective way to make Claude Code feel like a dependable pair programmer across multi-session work. It requires minimal tooling, imposes just enough structure to maintain coherence, and scales naturally from small refactors to major architectural changes.

If you're pushing LLMs beyond single-turn commands and finding that complexity leads to confusion and rework, this external memory approach may be the simplest productivity multiplier you haven't tried yet. The pattern transforms AI from a powerful but inconsistent assistant into a reliable collaborator that can maintain focus across the long arcs of meaningful software development.

More broadly, this approach points toward a future where human-AI collaboration becomes as structured and reliable as any other engineering discipline. By understanding and implementing these patterns now, we prepare ourselves for the more sophisticated AI-assisted development environments that are rapidly approaching.


Got questions about implementing this pattern in your own projects, or want to discuss variations for different types of development work? The core principles adapt well to documentation projects, data analysis workflows, and even non-technical planning processes where maintaining coherence across multiple sessions is important.