Joe Mosby's blog

Notes on a multi-repo codebase intelligence system

Most coding agents still struggle when repositories hit a certain level of complexity. (Granted, many human agents struggle at this point, too.) The CLAUDE.md pattern - a feature of my favorite coding tool - tends to assume it's operating within a single git repository. But most codebases complex enough to be meaningful stretch across multiple repos and use different languages, all while expecting data to flow seamlessly across them.

I don't think "refactor until the agent can operate well" is a usable pattern for most engineering teams. Part of my codebase is a Salesforce managed package, and I don't have the option to convince Salesforce to change its structure. At my last role, I regularly had to play by Shopify's rules in certain parts of the codebase and Google's rules in others. And I don't want to just shove things into a structure that makes my biggest third party happy. When I took the reins at Altvia, I started sniffing around for ways to get familiar with our codebase as quickly as possible, and I looked to AI to help me do it.

First attempt: put everything into a single folder, and init Claude from there. Honestly, this wasn't very good at all. Claude could do some inference of structure, but it tended to treat everything identically, and operated at too high a level of abstraction to do anything useful.

Second attempt: do some separation of "backend" and "frontend," and re-init Claude in each context. This was okay, but Claude was still completely blind to intended API payloads. Plus, after 10+ years of existence, we have some 400+ API routes to figure out. I got more granularity, but still no better understanding.

Third attempt: explicitly prompt Claude to call out API contracts. This was better. Not great, though. Claude had no sense of what was actually important, and treated every contract equally. Not everything is important in my business context.

The final attempt, and the one that yielded meaningful results, was not to start at the repo level. It was to start at the product level.

Enter Jira and GitLab

Everything we actually want to do is tracked in Jira, and has a number. It's got a title and a description. It specifically tells me the product goals in mind when we made changes to the repositories. And when we create branches and merge requests, we use that Jira ticket number as a tag, tying each merge request back to its Jira ticket. Originally, this simply gave us wee humans a key for reviewing; as I got going, I realized it was the missing piece for my codebase intelligence system.

My new system:

  1. Pick out a Jira ticket that's been worked and closed out, with associated merge requests.
  2. Use the Atlassian API to pull the title and description off the Jira ticket.
  3. Use the GitLab API to pull the title, description, and diff off the merge requests. (Note: in our system, all repos will use the same ticket ID. If you do an Epic with multiple tickets for each repo, you'll need to update accordingly.)
  4. Comb through the git repo for the associated changes. (and, as a hygiene thing, you'll need to make sure you keep your git repos synced with the remote)
  5. Based on the changes, use Claude to create/update a codebase intelligence file using the data gleaned in steps 2-4.
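The steps above can be sketched as a small orchestration loop. Each integration (Jira, GitLab, Claude, the intel writer) is injected as a callable so the flow stands on its own; the function names and the repo_path key are illustrative, not the actual implementation:

```python
def run_pipeline(ticket_id, fetch_ticket, find_mrs, analyze, update_intel):
    """Steps 1-5: ticket -> MRs -> analysis -> per-package intel updates."""
    ticket = fetch_ticket(ticket_id)      # step 2: Jira title + description
    mrs = find_mrs(ticket_id)             # step 3: MR titles, descriptions, diffs
    analysis = analyze(ticket, mrs)       # step 5: Claude digests the changes
    # One intel file per repo touched by the ticket's MRs
    packages = {mr["repo_path"].rstrip("/").split("/")[-1] for mr in mrs}
    return {pkg: update_intel(pkg, analysis) for pkg in sorted(packages)}
```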

This has been the move. Ultimately, I have something that looks like CLAUDE.md files, just carrying extra data about how different codebases interact with each other. An agent reading one file will know it probably needs to go talk to another to complete a task.

Next Steps

The system has worked well for small, individual tasks. For my next trick, I'm building out a system that can take in more complex sets of feature requirements and make appropriate changes in different repos. I envision something that can take a high-level Epic idea, create a human-readable artifact to confirm product requirements, sniff through the intel files and codebase to glean what it needs to do, and ultimately parcel out work to specific agents. Baby steps, though.

A prompt you can feed to your Claude Code to implement this

Codebase Intel System — Build Prompt

A self-contained prompt for a Claude Code agent to build this system from scratch.


What You Are Building

A codebase intelligence pipeline that a CTO (or any engineering leader) can run against a ticket ID to:

  1. Fetch the ticket from Jira
  2. Find associated merge requests in GitLab
  3. Analyze the ticket + MR diffs + review comments with Claude Opus
  4. Write a per-ticket analysis file to a vault directory
  5. Upsert a living per-package intelligence file — the primary output — that accumulates knowledge across many tickets over time

The per-package intelligence files are the main artifact. They are designed to be consumed by Claude Code agents before writing code against unfamiliar packages. The goal: a new engineer (or AI agent) can read a package's intelligence file and understand its architecture, patterns, gotchas, and recent changes without having to reverse-engineer from source.


What Was Already Built

jira_client.py

Fetches a Jira ticket by ID using the Jira REST API v3. Returns: id, summary, status, description (recursively extracted from Atlassian Document Format), comments (author + body), and url. Credentials come from .env.
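A sketch of what that fetch might look like, assuming Jira Cloud basic auth with email + API token. The base URL is a placeholder, and the exact fields pulled are an approximation of what jira_client.py returns:

```python
def extract_adf_text(node):
    """Recursively pull plain text out of an Atlassian Document Format tree."""
    if isinstance(node, dict):
        if node.get("type") == "text":
            return node.get("text", "")
        return " ".join(filter(None, (extract_adf_text(c) for c in node.get("content", []))))
    if isinstance(node, list):
        return " ".join(filter(None, (extract_adf_text(c) for c in node)))
    return ""


def fetch_ticket(ticket_id, base_url="https://your-company.atlassian.net"):
    """GET /rest/api/3/issue/{key} with basic auth (Jira Cloud email + API token)."""
    import os
    import requests  # imported here so extract_adf_text stays dependency-free
    resp = requests.get(
        f"{base_url}/rest/api/3/issue/{ticket_id}",
        auth=(os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]),
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    issue = resp.json()
    return {
        "id": issue["key"],
        "summary": issue["fields"]["summary"],
        "status": issue["fields"]["status"]["name"],
        "description": extract_adf_text(issue["fields"].get("description")),
        "url": f"{base_url}/browse/{issue['key']}",
    }
```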

gitlab_client.py

Searches a GitLab group for merge requests matching a ticket ID. Key design: the search is scoped to the group, so it finds MRs across all repos under it, and matching is fuzzy against MR titles and source branch names.
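One way that search could look, using GitLab's group-level merge requests endpoint with a fuzzy post-filter. The group slug and matching logic are illustrative stand-ins for the real client:

```python
import os

GITLAB_GROUP = "product-engineering"  # assumption: your group's URL slug


def mr_matches(mr, ticket_id):
    """Fuzzy check: ticket ID in the MR title or source branch (dash, space, or slash)."""
    haystack = f"{mr.get('title', '')} {mr.get('source_branch', '')}".upper()
    tid = ticket_id.upper()
    return any(v in haystack for v in (tid, tid.replace("-", " "), tid.replace("-", "/")))


def find_merge_requests(ticket_id):
    """Group-level search, so MRs are found across every repo under the group."""
    import requests  # imported here so mr_matches stays dependency-free
    resp = requests.get(
        f"https://gitlab.com/api/v4/groups/{GITLAB_GROUP}/merge_requests",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
        params={"search": ticket_id, "scope": "all", "state": "merged"},
        timeout=30,
    )
    resp.raise_for_status()
    return [mr for mr in resp.json() if mr_matches(mr, ticket_id)]
```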

analyzer.py

Two Claude Opus API calls:

analyze(ticket, mrs) — First call. Sends the full ticket context + all MR diffs/comments and produces a structured ticket analysis with sections: Intent, What Changed, Codebase Areas Touched, Intent vs. Implementation, Key Insights, Confidence.
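A sketch of that first call, with the prompt assembly pulled out as a pure function. The model name and prompt wording are assumptions; substitute whichever Opus model you have access to:

```python
ANALYSIS_SECTIONS = [
    "Intent", "What Changed", "Codebase Areas Touched",
    "Intent vs. Implementation", "Key Insights", "Confidence",
]


def build_analysis_prompt(ticket, mrs):
    """Pure prompt assembly: full ticket context plus every MR's diff and comments."""
    mr_text = "\n\n".join(
        f"MR: {mr['title']}\n{mr['description']}\n{mr['diff']}" for mr in mrs
    )
    sections = "\n".join(f"## {s}" for s in ANALYSIS_SECTIONS)
    return (
        f"Ticket {ticket['id']}: {ticket['summary']}\n{ticket['description']}\n\n"
        f"{mr_text}\n\n"
        f"Produce a structured analysis with exactly these sections:\n{sections}"
    )


def analyze(ticket, mrs):
    import anthropic  # imported here so the prompt builder is testable without the SDK
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-opus-4-1",  # assumption: substitute your available Opus model
        max_tokens=4000,
        messages=[{"role": "user", "content": build_analysis_prompt(ticket, mrs)}],
    )
    return msg.content[0].text
```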

build_repo_update(package_name, ticket_id, analysis, existing_content, today) — Second call. Takes the current state of the package intelligence file (or a blank template if new) plus the ticket analysis, and produces the complete updated intelligence file. Rules enforced in the prompt: preserve all existing content, never rewrite from scratch, and record the exact ticket ID in the Intelligence Sources table.

REPO_INTEL_TEMPLATE — The blank template for a new package file. Sections: What This Package Is, Stack & Key Dependencies, Architecture, Key Files & Their Roles, Established Patterns, Known Gotchas, Tech Debt, Integration Points, Testing Conventions, Active Areas, Intelligence Sources (table with Ticket / Date / Contribution columns).
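The template might look like this as a Python constant. The section list comes straight from the description above; the exact markdown layout is an assumption:

```python
REPO_INTEL_TEMPLATE = """# {package_name}

## What This Package Is

## Stack & Key Dependencies

## Architecture

## Key Files & Their Roles

## Established Patterns

## Known Gotchas

## Tech Debt

## Integration Points

## Testing Conventions

## Active Areas

## Intelligence Sources

| Ticket | Date | Contribution |
|--------|------|--------------|
"""
```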

vault_writer.py

Writes output to two directories under the vault: tickets/ for per-ticket analysis files and repos/ for the per-package intelligence files.

read_repo_intel(package_name) reads an existing package file before the upsert. Returns empty string if the file doesn't exist yet (triggers template-based initialization in analyzer.py).
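A sketch of that reader, assuming the ~/codebase vault described below; write_repo_intel is a hypothetical companion shown only to illustrate the upsert round-trip:

```python
from pathlib import Path

VAULT_PATH = Path.home() / "codebase"  # assumption: mirrors VAULT_PATH in vault_writer.py


def read_repo_intel(package_name):
    """Current intel file text, or "" so the analyzer falls back to the blank template."""
    path = VAULT_PATH / "repos" / f"{package_name}.md"
    return path.read_text() if path.exists() else ""


def write_repo_intel(package_name, content):
    """Upsert target: the whole updated file is written back each run."""
    repo_dir = VAULT_PATH / "repos"
    repo_dir.mkdir(parents=True, exist_ok=True)
    (repo_dir / f"{package_name}.md").write_text(content)
```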

main.py

Entry point. Flow:

  1. Accepts a ticket ID as a positional argument (--no-ticket flag skips writing the ticket file)
  2. Uppercases the ticket ID (normalizes input like abc-480 to ABC-480)
  3. Fetches ticket from Jira
  4. Searches GitLab for MRs
  5. Calls analyze() for the ticket analysis
  6. Writes ticket file (unless --no-ticket)
  7. Calls extract_packages(mrs) — derives package names from MR repo paths (takes the last path segment: product-engineering/ABC-be becomes ABC-be)
  8. For each unique package: reads current intel file, calls build_repo_update(), writes result
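extract_packages reduces to a few lines; the repo_path key is an assumption about the MR dict shape:

```python
def extract_packages(mrs):
    """Package name = last path segment of the MR's repo path, deduped in first-seen order."""
    packages = []
    for mr in mrs:
        name = mr["repo_path"].rstrip("/").split("/")[-1]  # product-engineering/ABC-be -> ABC-be
        if name not in packages:
            packages.append(name)
    return packages
```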

codebase-intel.sh

Prompt the user for a specific location to save this. An example is given below.

Wrapper script at ~/codebase-intel.sh:

#!/bin/bash
cd "$(dirname "$0")/codebase-intel"
source venv/bin/activate
python main.py "$@"

Invoked as: ~/codebase-intel.sh TICKET-ID


What the User Must Provide

1. API Credentials (.env file in the codebase-intel/ directory)

ANTHROPIC_API_KEY=sk-ant-...
GITLAB_TOKEN=glpat-...
JIRA_EMAIL=your@email.com
JIRA_API_TOKEN=...

2. GitLab Group Identifier

The GITLAB_GROUP constant in gitlab_client.py must be set to your GitLab group's path (e.g. product-engineering). This is the slug that appears in GitLab URLs: gitlab.com/YOUR-GROUP/repo-name.

The MR search is scoped to this group, so it finds MRs across all repos under the group.

3. Jira Base URL

The JIRA_BASE_URL in jira_client.py must point to your Jira instance (e.g. https://your-company.atlassian.net).

4. Vault Path

The VAULT_PATH in vault_writer.py must point to the directory where you want intelligence files written. Currently set to ~/codebase. The script auto-creates tickets/ and repos/ subdirectories.

5. Your Ticket ID Format(s)

The system handles any ticket ID format by normalizing on both ends (Jira ID as-is; GitLab search with dash and space variants). You do not need to pre-configure ticket prefixes — just pass whatever format your team uses (ABC-480, XYZ-296, CR-1432, etc.).
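The normalization can be as simple as generating the separator variants up front — a sketch; the actual client may build these inside the search:

```python
def ticket_variants(ticket_id):
    """The separator variants the GitLab search tolerates: dash, space, and slash."""
    tid = ticket_id.strip().upper()
    return [tid, tid.replace("-", " "), tid.replace("-", "/")]
```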

6. Team Convention: Ticket IDs in MRs

The MR search works well when engineers include the ticket ID in either the MR title or the source branch name. If your team doesn't follow this convention, MR matching will produce poor results. The search is fuzzy (handles ABC-560 vs. ABC 560 vs. ABC/560) but it needs the ticket ID to appear somewhere.

7. Package Naming Convention

Package names are derived automatically from the last path segment of the GitLab repo URL. No configuration needed — but you should be aware:

The decision to key by package name (not repo URL) was intentional: it allows a human or agent to read ABC-be.md and know exactly what they're looking at without decoding a path.


Design Decisions Worth Preserving

Repo files are the primary artifact. The ticket files are useful for historical archaeology, but the repo intelligence files are what agents should load before writing code. Every ticket run updates the repo files — the ticket file is a side effect.

Upsert, never overwrite from scratch. The build_repo_update() prompt explicitly instructs Claude to preserve all existing content. Running 50 tickets against a repo should produce one rich file, not 50 thin ones. The "confirmed: TICKET-ID" pattern lets you trace when a pattern was first observed vs. subsequently validated.

Ticket ID must be passed explicitly. Early versions of the prompt let Claude infer the ticket ID from context, which caused it to hallucinate IDs in the Sources table. The fix: pass ticket_id as a parameter to build_repo_update() and include it twice in the prompt — once as "Ticket ID for this analysis: {ticket_id}" and once in the Sources table instruction as "The ticket ID is exactly: {ticket_id}".
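A hypothetical sketch of that prompt construction — the two ticket-ID placements match the description above; the surrounding wording is illustrative:

```python
def build_repo_update_prompt(package_name, ticket_id, analysis, existing_content, today):
    """The ticket ID appears exactly twice, so Claude never has to infer (and hallucinate) it."""
    return (
        f"You are updating the intelligence file for {package_name}.\n"
        f"Ticket ID for this analysis: {ticket_id}\n"
        f"Date: {today}\n\n"
        f"Current file contents (preserve all existing content):\n{existing_content}\n\n"
        f"New ticket analysis:\n{analysis}\n\n"
        f"Add a row to the Intelligence Sources table. "
        f"The ticket ID is exactly: {ticket_id}"
    )
```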

Diff truncation at 500 lines. Large diffs exceed context and add noise. 500 lines captures the shape of a change. If a diff is truncated, the analysis prompt still works — it just covers fewer files.
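The truncation itself is a one-liner worth of logic — a sketch; the marker text is an assumption:

```python
def truncate_diff(diff, max_lines=500):
    """Cap a diff at max_lines: enough to capture the shape of a change without the noise."""
    lines = diff.splitlines()
    if len(lines) <= max_lines:
        return diff
    return "\n".join(lines[:max_lines]) + f"\n... [diff truncated at {max_lines} lines]"
```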

Service class duplication in Lambda directories is a known issue, not a bug in the tool. The tool correctly reports it as tech debt in the package file. Don't try to deduplicate the Lambda directories as part of this build.


How to Run

# Single ticket
~/codebase-intel.sh ABC-480

# Skip writing the ticket file (only update repo intel)
~/codebase-intel.sh ABC-480 --no-ticket

# From within a Claude Code session
Bash: ~/codebase-intel.sh ABC-480

To backfill a codebase from scratch: run it against 20–50 recent tickets across different repos. After ~5 tickets per package, the intelligence files become meaningfully dense.


Dependencies

anthropic
python-dotenv
requests

Python 3.9+. Set up a venv at codebase-intel/venv/ and install from requirements.txt (or pip install the above three packages directly).