Data center repair technician working between server racks
Meta · Internal Tools · 2022–2023

Redesigning how Meta repairs its global data centers

Meta's data centers keep Facebook, Instagram, and WhatsApp running for billions of people. I helped design Repair Console, a new internal platform to make server repair operations faster, more consistent, and scalable across the world.

Company
Meta
My Role
Design Lead, Research Support
Team
1 PM, 1 researcher, 11 engineers, 3 designers
Duration
Nov 2022 – Apr 2023 (6 months)
22%
Reduction in average repair time
31%
Increase in repair accuracy
7→1
Repair ticketing systems consolidated

The repair process wasn't keeping pace with the infrastructure

My team was brought in to redesign how repair operations worked across Meta's global data centers. Every hour a failed server stayed offline meant compute capacity unavailable to Meta's products, and with AI workloads growing rapidly, that cost was rising. The process was fragmented, knowledge was siloed, and the system was increasingly unable to scale. I led design for the repair authoring and execution workstream: the part of the platform that determined how technicians actually diagnosed and fixed problems.

Seven separate repair systems. Zero shared knowledge.

Meta had seven separate repair ticketing systems, maintained by seven separate engineering teams spread across the world. Each was built in isolation, with no visibility into the others.

Diagram: seven team-specific tools (Diagnostics: network + hardware; Networking: switches + routing; ProdOps: production operations; Repair: on-site technicians; SiteOps: facilities + power; Smarthands: physical execution; Hardware Ops: components + supply) converging into Repair Console, one unified platform.
Seven separate teams, each with their own repair tooling, distributed across Meta's global data centers. Repair Console unified them into a single platform for the first time.

"We realized there was a manufacturing problem with chips and identified almost 11,000 bad DIMMs around the world. It took a year to identify the trend, and another year to replace them."

— Repair triage engineer, user interview

40+ interviews. 3 continents. The same problems, everywhere.

I supported our lead researcher with interviews, observations, and ticket analysis across Dublin, Berlin, the Bay Area, Singapore, and Austin: over 40 people in total, spanning network engineers, repair technicians, team leads, and on-site operators. Despite different regions and roles, every person we talked to was struggling with the same core issues.

To make sense of what we'd heard, the researcher and I synthesized the interviews and identified three distinct behavioral patterns in how people approached repair work. Rather than mapping to job titles, these patterns cut across teams — a network engineer and a data center lead might both behave like a Pattern Finder, depending on the day. Defining these personas gave the whole team a shared language for who we were designing for, and made it much easier to pressure-test design decisions against real user needs.

Persona 01 · Diagnostic Tech
Pattern Finder
"Help me understand how what I'm working on relates to a bigger issue"
Goals
  • Find patterns across multiple failures
  • Diagnose in bulk, not one ticket at a time
  • Prevent duplicate work across teams
Pain points
  • No visibility into what other teams are working on
  • Can't escalate or group tickets across the system
Persona 02 · Team Lead
Monitor
"Help me understand what my team has on for the week"
Goals
  • Track team workload and open issues at a glance
  • Catch SLA breaches before they escalate
  • Understand fleet health across their region
Pain points
  • Has to dig into individual tickets to get any overview
  • No single place to see status across all teams
Persona 03 · Repair Tech
Actioner
"Tell me everything I need to work on, in the order I should do it"
Goals
  • Know exactly what to do and in what order
  • Start a repair without having to research first
  • Trust that the steps have been validated by an expert
Pain points
  • Spends most of their time on triage, not repair
  • Runbooks are generic, outdated, and hard to find

What all three had in common

The personas helped us understand how different people worked. But regardless of which persona we were talking to, the same four systemic problems kept surfacing — not as isolated complaints, but as symptoms of the same broken environment.

01

Duplicate effort

A failure fixed in Dublin was solved again from scratch in Singapore the following week. There was no way to share solutions across teams.

02

Triage was the biggest time sink

Technicians spent most of their time not repairing, but figuring out where to start. Runbooks were generic, outdated, and scattered.

03

No single source of truth

Repair history lived across tickets, wikis, messages, and people's heads. When someone left the team, their knowledge left with them.

04

Processes changed faster than docs could keep up

"We've changed our process 10 times since you last interviewed us." Documentation was increasingly ignored.

Three principles to guide every decision

The four pain points pointed to a common root cause: teams were working in isolation, knowledge wasn't being captured, and nothing was built to scale. Before moving into ideation, I ran a working session with the broader design and product team to translate those findings into design principles: filters we'd run every decision through.

01

Visibility

Break down siloed work. Give teams a view of the entire repair ecosystem, not just their corner of it.

02

Codify knowledge

Turn expertise in people's heads into structured, reusable processes the whole system can learn from.

03

Self-service

Let repair experts evolve workflows themselves, without needing engineering support every time.

Two ideas worth testing

I facilitated a series of workshops with engineers, repair technicians, PMs, and subject matter experts, deliberately drawing on each discipline's perspective. Engineers helped us understand what was technically feasible without a full rebuild. Repair technicians told us which pain points cost them the most time. SMEs helped us understand the depth of knowledge we'd need to capture before any of it could be systematized.

We used a mix of whiteboarding, sketching, and dot-voting to generate and narrow down concepts, giving everyone a voice before converging on what to explore further. Two ideas kept rising to the top.

Cross-functional workshop session at Meta Dublin office
One of several cross-functional workshops held across Dublin and the Bay Area, bringing together engineers, repair technicians, and subject matter experts to align on the problem and explore solutions together.
Concept mockup: an issue card for "Power Rail Failures" showing 3 affected servers and a single "Start repair" action.

Grouping repairs into "Issues"

Failures with the same root cause grouped together, enabling bulk diagnosis and eliminating duplicate work.

Concept mockup: a "T16/T20 Power Failure" repair guide with four numbered steps.

Standardized "Action Plans"

Expert-authored, step-by-step repair plans replacing generic runbooks with validated, structured instructions.
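
To make the first concept a bit more concrete, here is a minimal sketch of how grouping failures into Issues might be modeled. The names here (RepairTicket, Issue, failureSignature, groupIntoIssues) are illustrative assumptions for this case study, not Repair Console's actual schema.

```typescript
// Hypothetical data model for the "Issues" concept: tickets that share a
// root-cause signature are grouped so they can be diagnosed once and repaired in bulk.

interface RepairTicket {
  id: string;
  serverId: string;
  site: string;             // e.g. "Dublin", "Singapore"
  failureSignature: string; // normalized symptom, e.g. "zion-t16-t20-power-failure"
  openedAt: Date;
}

interface Issue {
  id: string;
  title: string;            // e.g. "Zion T16/T20 Power Failures"
  failureSignature: string; // the shared root cause that groups the tickets
  tickets: RepairTicket[];
  actionPlanId?: string;    // the validated plan used to repair the whole group
}

// Group open tickets by failure signature, so a diagnosis made in Dublin is
// immediately attached to the same failure appearing in Singapore.
function groupIntoIssues(tickets: RepairTicket[]): Issue[] {
  const bySignature = new Map<string, RepairTicket[]>();
  for (const ticket of tickets) {
    const group = bySignature.get(ticket.failureSignature) ?? [];
    group.push(ticket);
    bySignature.set(ticket.failureSignature, group);
  }
  return Array.from(bySignature, ([signature, grouped], index) => ({
    id: `issue-${index + 1}`,
    title: signature,
    failureSignature: signature,
    tickets: grouped,
  }));
}
```

The key design choice is that the grouping key is a normalized failure signature rather than free-text ticket titles, which is what lets a fix found at one site be reused everywhere the same failure appears.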

On-site testing changed our direction

We prototyped early and tested on-site at Meta's data centers. Our first global dashboard was too broad; users were overwhelmed.

"I don't need to see what's happening everywhere. I need to know what I'm working on today."

— Repair technician, on-site testing session

We pivoted to a hierarchical view (global, regional, and site-level), letting each user type zoom to the right scope. Meanwhile, the Action Plans format went through four rounds of iteration, each one surfacing a new constraint we hadn't anticipated. The biggest tension throughout was flexibility versus structure: plans needed to handle genuinely complex, branching repairs while still being consistent enough to track and eventually automate. The format we landed on is sketched after the iteration overview below.

Mockup: a free-form "T16/T20 Power Failure" repair guide written as plain-text steps (check logs, AC cycle, and so on).
Iteration 01
Free-form wiki editor
  • Experts write repair steps as plain text
  • No consistent format across plans
  • Not machine-readable or trackable
Mockup: a structured "New Action Plan" form with fields for failure type, ordered steps, and expected outcome.
Iteration 02
Structured template
  • Required fields: failure type, steps, outcome
  • More consistent, but strictly linear
  • Couldn't handle branching repair logic
Mockup: a tree-based plan builder in which "Check logs on host" branches by log error into a power-board path (AC cycle chassis, replace power board) and a GPU path (GPU shuffle, replace GPU).
Iteration 03
Tree-based action builder
  • Each node is an action, branches are outcomes
  • Handles complex, multi-path repairs
  • Structured enough to track and automate
Mockup: the final two-panel repair plan view for "Zion T16/T20 - Power Failures", with step instructions and navigation on the left and Plan Properties, History, and Entity Info panels on the right.
Final concept
Two-panel guided execution
  • Step instructions on the left, server context on the right
  • Radio options route to the correct next step
  • Full action history tracked throughout
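
To show how the final tree-based format hangs together, here is a minimal sketch of how a branching Action Plan could be represented and executed. The types, node names, and example plan are illustrative assumptions loosely based on the mockups above, not the production data model.

```typescript
// Hypothetical representation of a branching Action Plan: each node pairs an
// action with an observation question, and each answer routes to the next node.

interface PlanNode {
  id: string;
  instruction: string;                      // what the technician should do
  question: string;                         // what they should check afterwards
  outcomes: Record<string, string | null>;  // answer -> next node id (null = repair complete)
}

interface ActionPlan {
  title: string;
  startNodeId: string;
  nodes: Record<string, PlanNode>;
}

// Illustrative plan loosely based on the power-failure mockups above.
const powerFailurePlan: ActionPlan = {
  title: "Zion T16/T20 - Power Failures",
  startNodeId: "check-logs",
  nodes: {
    "check-logs": {
      id: "check-logs",
      instruction: "Check logs on the host",
      question: "What was the error in the log?",
      outcomes: { "Power Rail Failure": "ac-cycle", "GPU Failure": "gpu-shuffle" },
    },
    "ac-cycle": {
      id: "ac-cycle",
      instruction: "AC cycle the chassis; replace the power board if the fault persists",
      question: "Is the server back online?",
      outcomes: { "Yes": null, "No": "escalate" },
    },
    "gpu-shuffle": {
      id: "gpu-shuffle",
      instruction: "Run a GPU shuffle; replace the GPU if the error recurs",
      question: "Did the GPU pass diagnostics?",
      outcomes: { "Yes": null, "No": "escalate" },
    },
    "escalate": {
      id: "escalate",
      instruction: "Escalate to the diagnostics team with the recorded action history",
      question: "Escalation filed?",
      outcomes: { "Yes": null },
    },
  },
};

// Executing a plan is just walking the tree: record the technician's answer,
// then follow the matching branch (null means the repair is complete).
function nextStep(plan: ActionPlan, currentNodeId: string, answer: string): PlanNode | null {
  const nextId = plan.nodes[currentNodeId].outcomes[answer];
  return nextId ? plan.nodes[nextId] : null;
}
```

Because every step and every answer is structured data rather than prose, the same representation can support guided execution, a complete action history, and, eventually, automation of well-understood branches.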

Key usability findings

Finding 01
8 of 10 preferred Repair Console
Users found it easier to author and execute repairs
  • Clearer than existing runbooks and ticketing tools
  • Faster to get started on a repair
  • Reduced time spent searching for the right steps
Finding 02
5 of 10 had scalability concerns
Plan maintenance at scale was an open question
  • Worried plans would go stale as hardware evolved
  • Unclear how to discover the right plan among hundreds
  • Concerned about who owns outdated or duplicate plans
Finding 03
7 of 10 wanted expert sign-off
Technicians wanted credibility signals before trusting a plan
  • Success rate and usage history were the most requested signals
  • Plans authored by recognized experts carried significantly more trust
  • Asked for publishing rules so that not just anyone could author a plan

Three surfaces, one unified platform

Repair Console brought three surfaces together for the first time, each designed for a specific user persona, all drawing from the same underlying data.

Surface 1: Global monitoring, for the Monitor

Designed for team leads who needed situational awareness across their fleet. Three progressive zoom levels let each user type see exactly the right scope: global trends, regional patterns, or a single campus.

Surface 2: Issue view, for the Pattern Finder

Designed for diagnostic techs who needed to see the bigger picture. Similar failures are automatically grouped into a single Issue, letting teams diagnose once and repair in bulk rather than working ticket by ticket.

Issue Details screen showing grouped tickets for Zion T16/T20 Power Failures, with status charts, properties panel, and team conversations
Issue Details. Four tickets are grouped under a single Zion T16/T20 Power Failures issue. The technician sees all affected servers, their priorities and statuses, and how long each has been unresolved, all in one view. The "Start Repair" button launches the Action Plan across the entire group. The Conversations panel (bottom right) was added after user research revealed that teams were coordinating via messaging threads that disappeared. Repair Console needed to be the record of communication, not just of actions.

Surface 3: Action Plans, for the Actioner

Designed for on-site repair techs who needed clear, unambiguous instructions. A branching tree format routes technicians to the right next step based on what they find, with full action history tracked in the right panel throughout.

Measurable impact across the fleet

We compared performance between teams using Repair Console and teams still on legacy tools. These were the key outcomes.

22%
Reduction in average repair time, from initial triage to server back online
31%
Increase in repair accuracy, measured by reduction in tickets reopened for the same issue
7→1
Repair ticketing systems consolidated into a single unified platform

The numbers tell part of the story. The bigger shift was less visible: for the first time, teams across different regions and functions were working from the same data, using the same processes, and building on each other's knowledge rather than starting from scratch. The DIMM defect that once took years to surface would now be detectable in weeks. That kind of compounding improvement is what the platform was really designed for.

What I'd do differently

Test in context earlier

We designed the global dashboard for an idealized user, not the real one. On-site testing surfaced this quickly, but we got there too late. Earlier field research would have saved weeks.

Distributed teams need deliberate design rituals

Keeping alignment across Dublin, Berlin, and San Francisco required short weekly syncs I hadn't planned for. Once established, they became one of the most valuable parts of the process.
