Data center repair technician working between server racks
Meta · Internal Tools · 2022–2023

Redesigning how Meta repairs its global data centers

Meta's data centers keep Facebook, Instagram, and WhatsApp running for billions of people. I helped design Repair Console, a new internal platform to make server repair operations faster, more consistent, and scalable across the world.

Company
Meta
My Role
Design Lead, Research Support
Team
1 PM, 1 researcher, 11 engineers, 3 designers
Duration
Nov 2022 – Apr 2023 (6 months)
22%
Reduction in average repair time
31%
Increase in repair accuracy
7→1
Repair ticketing systems consolidated

The repair process wasn't keeping pace with the infrastructure

My team was brought in to redesign how repair operations worked across Meta's global data centers. Every hour a failed server stayed offline meant compute capacity unavailable to Meta's products, and with AI workloads growing rapidly, that cost was rising. The process was fragmented, knowledge was siloed, and the system was increasingly unable to scale. I led design for the repair authoring and execution workstream: the part of the platform that determined how technicians actually diagnosed and fixed problems.

Seven separate repair systems. Zero shared knowledge.

Meta had seven separate repair ticketing systems, maintained by seven separate engineering teams spread across the world. Each was built in isolation, with no visibility into the others.

Diagram: seven team-specific tools (Diagnostics: network + hardware; Networking: switches + routing; ProdOps: production operations; Repair: on-site technicians; SiteOps: facilities + power; Smarthands: physical execution; Hardware Ops: components + supply) converging into Repair Console, one unified platform.
Seven separate teams, each with their own repair tooling, distributed across Meta's global data centers. Repair Console unified them into a single platform for the first time.

"We realized there was a manufacturing problem with chips and identified almost 11,000 bad DIMMs around the world. It took a year to identify the trend, and another year to replace them."

— Repair triage engineer, user interview

40+ interviews. 3 continents. The same problems, everywhere.

I supported our lead researcher with interviews, observations, and ticket analysis across Dublin, Berlin, the Bay Area, Singapore, and Austin: over 40 people in total, spanning network engineers, repair technicians, team leads, and on-site operators. Despite different regions and roles, every person we talked to was struggling with the same core issues.

To make sense of what we'd heard, the researcher and I synthesized the interviews and identified three distinct behavioral patterns in how people approached repair work. Rather than mapping to job titles, these patterns cut across teams — a network engineer and a data center lead might both behave like a Pattern Finder, depending on the day. Defining these personas gave the whole team a shared language for who we were designing for, and made it much easier to pressure-test design decisions against real user needs.

Persona 01 · Diagnostic Tech
Pattern Finder
"Help me understand how what I'm working on relates to a bigger issue"
Goals
  • Find patterns across multiple failures
  • Diagnose in bulk, not one ticket at a time
  • Prevent duplicate work across teams
Pain points
  • No visibility into what other teams are working on
  • Can't escalate or group tickets across the system
Persona 02 · Team Lead
Monitor
"Help me understand what my team has on for the week"
Goals
  • Track team workload and open issues at a glance
  • Catch SLA breaches before they escalate
  • Understand fleet health across their region
Pain points
  • Has to dig into individual tickets to get any overview
  • No single place to see status across all teams
Persona 03 · Repair Tech
Actioner
"Tell me everything I need to work on, in the order I should do it"
Goals
  • Know exactly what to do and in what order
  • Start a repair without having to research first
  • Trust that the steps have been validated by an expert
Pain points
  • Spends most of their time on triage, not repair
  • Runbooks are generic, outdated, and hard to find

What all three had in common

The personas helped us understand how different people worked. But regardless of which persona we were talking to, the same four systemic problems kept surfacing — not as isolated complaints, but as symptoms of the same broken environment.

01

Duplicate effort

A failure fixed in Dublin was solved again from scratch in Singapore the following week. There was no way to share solutions across teams.

02

Triage was the biggest time sink

Technicians spent most of their time not repairing, but figuring out where to start. Runbooks were generic, outdated, and scattered.

03

No single source of truth

Repair history lived across tickets, wikis, messages, and people's heads. When someone left the team, their knowledge left with them.

04

Processes changed faster than docs could keep up

"We've changed our process 10 times since you last interviewed us." Documentation was increasingly ignored.

Three principles to guide every decision

The four pain points pointed to a common root cause: teams were working in isolation, knowledge wasn't being captured, and nothing was built to scale. Before moving into ideation, I ran a working session with the broader design and product team to translate those findings into design principles: filters we'd run every decision through.

01

Visibility

Break down siloed work. Give teams a view of the entire repair ecosystem, not just their corner of it.

02

Codify knowledge

Turn expertise in people's heads into structured, reusable processes the whole system can learn from.

03

Self-service

Let repair experts evolve workflows themselves, without needing engineering support every time.

Two ideas worth testing

I facilitated a series of workshops with engineers, repair technicians, PMs, and subject matter experts, deliberately drawing on each discipline's perspective. Engineers helped us understand what was technically feasible without a full rebuild. Repair technicians told us which pain points cost them the most time. SMEs helped us understand the depth of knowledge we'd need to capture before any of it could be systematized.

We used a mix of whiteboarding, sketching, and dot-voting to generate and narrow down concepts, giving everyone a voice before converging on what to explore further. Two ideas kept rising to the top.

Cross-functional workshop session at Meta Dublin office
One of several cross-functional workshops held across Dublin and the Bay Area, bringing together engineers, repair technicians, and subject matter experts to align on the problem and explore solutions together.
Concept mockup: an issue card for "Power Rail Failures" showing 3 affected servers and a single "Start repair" action.

Grouping repairs into "Issues"

Failures with the same root cause grouped together, enabling bulk diagnosis and eliminating duplicate work.

Concept mockup: a "T16/T20 Power Failure" repair guide with four numbered steps.

Standardized "Action Plans"

Expert-authored, step-by-step repair plans replacing generic runbooks with validated, structured instructions.
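
To make the first concept a bit more concrete, here is a minimal sketch of how grouping failures into Issues might be modeled. The names here (RepairTicket, Issue, failureSignature, groupIntoIssues) are illustrative assumptions for this case study, not Repair Console's actual schema.

```typescript
// Hypothetical data model for the "Issues" concept: tickets that share a
// root-cause signature are grouped so they can be diagnosed once and repaired in bulk.

interface RepairTicket {
  id: string;
  serverId: string;
  site: string;             // e.g. "Dublin", "Singapore"
  failureSignature: string; // normalized symptom, e.g. "zion-t16-t20-power-failure"
  openedAt: Date;
}

interface Issue {
  id: string;
  title: string;            // e.g. "Zion T16/T20 Power Failures"
  failureSignature: string; // the shared root cause that groups the tickets
  tickets: RepairTicket[];
  actionPlanId?: string;    // the validated plan used to repair the whole group
}

// Group open tickets by failure signature, so a diagnosis made in Dublin is
// immediately attached to the same failure appearing in Singapore.
function groupIntoIssues(tickets: RepairTicket[]): Issue[] {
  const bySignature = new Map<string, RepairTicket[]>();
  for (const ticket of tickets) {
    const group = bySignature.get(ticket.failureSignature) ?? [];
    group.push(ticket);
    bySignature.set(ticket.failureSignature, group);
  }
  return Array.from(bySignature, ([signature, grouped], index) => ({
    id: `issue-${index + 1}`,
    title: signature,
    failureSignature: signature,
    tickets: grouped,
  }));
}
```

The key design choice is that the grouping key is a normalized failure signature rather than free-text ticket titles, which is what lets a fix found at one site be reused everywhere the same failure appears.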

On-site testing changed our direction

We prototyped early and tested on-site at Meta's data centers. Our first global dashboard was too broad; users were overwhelmed.

"I don't need to see what's happening everywhere. I need to know what I'm working on today."

— Repair technician, on-site testing session

We pivoted to a hierarchical view (global, regional, and site-level), letting each user type zoom to the right scope. Meanwhile, the Action Plans format went through four rounds of iteration, each one surfacing a new constraint we hadn't anticipated. The biggest tension throughout was flexibility versus structure: plans needed to handle genuinely complex, branching repairs while still being consistent enough to track and eventually automate. The format we landed on is sketched after the iteration overview below.

Mockup: a free-form "T16/T20 Power Failure" repair guide written as plain-text steps (check logs, AC cycle, and so on).
Iteration 01
Free-form wiki editor
  • Experts write repair steps as plain text
  • No consistent format across plans
  • Not machine-readable or trackable
Mockup: a structured "New Action Plan" form with fields for failure type, ordered steps, and expected outcome.
Iteration 02
Structured template
  • Required fields: failure type, steps, outcome
  • More consistent, but strictly linear
  • Couldn't handle branching repair logic
Mockup: a tree-based plan builder in which "Check logs on host" branches by log error into a power-board path (AC cycle chassis, replace power board) and a GPU path (GPU shuffle, replace GPU).
Iteration 03
Tree-based action builder
  • Each node is an action, branches are outcomes
  • Handles complex, multi-path repairs
  • Structured enough to track and automate
Mockup: the final two-panel repair plan view for "Zion T16/T20 - Power Failures", with step instructions and navigation on the left and Plan Properties, History, and Entity Info panels on the right.
Final concept
Two-panel guided execution
  • Step instructions on the left, server context on the right
  • Radio options route to the correct next step
  • Full action history tracked throughout
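
To show how the final tree-based format hangs together, here is a minimal sketch of how a branching Action Plan could be represented and executed. The types, node names, and example plan are illustrative assumptions loosely based on the mockups above, not the production data model.

```typescript
// Hypothetical representation of a branching Action Plan: each node pairs an
// action with an observation question, and each answer routes to the next node.

interface PlanNode {
  id: string;
  instruction: string;                      // what the technician should do
  question: string;                         // what they should check afterwards
  outcomes: Record<string, string | null>;  // answer -> next node id (null = repair complete)
}

interface ActionPlan {
  title: string;
  startNodeId: string;
  nodes: Record<string, PlanNode>;
}

// Illustrative plan loosely based on the power-failure mockups above.
const powerFailurePlan: ActionPlan = {
  title: "Zion T16/T20 - Power Failures",
  startNodeId: "check-logs",
  nodes: {
    "check-logs": {
      id: "check-logs",
      instruction: "Check logs on the host",
      question: "What was the error in the log?",
      outcomes: { "Power Rail Failure": "ac-cycle", "GPU Failure": "gpu-shuffle" },
    },
    "ac-cycle": {
      id: "ac-cycle",
      instruction: "AC cycle the chassis; replace the power board if the fault persists",
      question: "Is the server back online?",
      outcomes: { "Yes": null, "No": "escalate" },
    },
    "gpu-shuffle": {
      id: "gpu-shuffle",
      instruction: "Run a GPU shuffle; replace the GPU if the error recurs",
      question: "Did the GPU pass diagnostics?",
      outcomes: { "Yes": null, "No": "escalate" },
    },
    "escalate": {
      id: "escalate",
      instruction: "Escalate to the diagnostics team with the recorded action history",
      question: "Escalation filed?",
      outcomes: { "Yes": null },
    },
  },
};

// Executing a plan is just walking the tree: record the technician's answer,
// then follow the matching branch (null means the repair is complete).
function nextStep(plan: ActionPlan, currentNodeId: string, answer: string): PlanNode | null {
  const nextId = plan.nodes[currentNodeId].outcomes[answer];
  return nextId ? plan.nodes[nextId] : null;
}
```

Because every step and every answer is structured data rather than prose, the same representation can support guided execution, a complete action history, and, eventually, automation of well-understood branches.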

Key usability findings

Finding 01
8 of 10 preferred Repair Console
Users found it easier to author and execute repairs
  • Clearer than existing runbooks and ticketing tools
  • Faster to get started on a repair
  • Reduced time spent searching for the right steps
Finding 02
5 of 10 had scalability concerns
Plan maintenance at scale was an open question
  • Worried plans would go stale as hardware evolved
  • Unclear how to discover the right plan among hundreds
  • Concerned about who owns outdated or duplicate plans
Finding 03
7 of 10 wanted expert sign-off
Technicians wanted credibility signals before trusting a plan
  • Success rate and usage history were the most requested signals
  • Plans authored by recognized experts carried significantly more trust
  • Asked for publishing rules so that not just anyone could author a plan

Three surfaces, one unified platform

Repair Console brought three surfaces together for the first time, each designed for a specific user persona, all drawing from the same underlying data.

Surface 1: Global monitoring, for the Monitor

Designed for team leads who needed situational awareness across their fleet. Three progressive zoom levels let each user type see exactly the right scope: global trends, regional patterns, or a single campus.

Surface 2: Issue view, for the Pattern Finder

Designed for diagnostic techs who needed to see the bigger picture. Similar failures are automatically grouped into a single Issue, letting teams diagnose once and repair in bulk rather than working ticket by ticket.

Issue Details screen showing grouped tickets for Zion T16/T20 Power Failures, with status charts, properties panel, and team conversations
Issue Details. Four tickets are grouped under a single Zion T16/T20 Power Failures issue. The technician sees all affected servers, their priorities and statuses, and how long each has been unresolved, all in one view. The "Start Repair" button launches the Action Plan across the entire group. The Conversations panel (bottom right) was added after user research revealed that teams were coordinating via messaging threads that disappeared. Repair Console needed to be the record of communication, not just of actions.

Surface 3: Action Plans, for the Actioner

Designed for on-site repair techs who needed clear, unambiguous instructions. A branching tree format routes technicians to the right next step based on what they find, with full action history tracked in the right panel throughout.

Measurable impact across the fleet

We compared performance between teams using Repair Console and teams still on legacy tools. These were the key outcomes.

22%
Reduction in average repair time, from initial triage to server back online
31%
Increase in repair accuracy, measured by reduction in tickets reopened for the same issue
7→1
Repair ticketing systems consolidated into a single unified platform

The numbers tell part of the story. The bigger shift was less visible: for the first time, teams across different regions and functions were working from the same data, using the same processes, and building on each other's knowledge rather than starting from scratch. The DIMM defect that once took years to surface would now be detectable in weeks. That kind of compounding improvement is what the platform was really designed for.

What I'd do differently

Test in context earlier

We designed the global dashboard for an idealized user, not the real one. On-site testing surfaced this quickly, but we got there too late. Earlier field research would have saved weeks.

Distributed teams need deliberate design rituals

Keeping alignment across Dublin, Berlin, and San Francisco required short weekly syncs I hadn't planned for. Once established, they became one of the most valuable parts of the process.
