Defining the Problem in RFCs: Precision Language and How to Write an RFC Problem Statement
Tired of RFCs that sneak in solutions or leave stakeholders guessing about impact? In this lesson, you’ll learn to write a precise, solution-agnostic problem statement that quantifies symptoms, ties them to business outcomes, and locks in clear constraints—so three or more valid solution paths remain open. Expect a tight framework, high-signal phrasing patterns, and real-world examples, plus quick checks and exercises to stress-test your draft. Finish with a micro-template you can use to ship credible, review-ready RFCs fast.
Step 1 — Define the RFC problem statement and its boundaries
A well-written RFC problem statement is the anchor for informed decision-making. In an RFC, the problem statement is a concise, solution-agnostic description of a current shortcoming in the system or process that produces measurable risk, cost, or missed opportunity. Its purpose is to explain why action is needed now and to establish the criteria that any proposed solution must satisfy. When stakeholders read it, they should immediately see the gap between current performance and required performance, understand the scale and urgency of the gap, and grasp the constraints that any solution must respect. This clarity enables teams to evaluate multiple solution options against shared, explicit needs rather than personal preferences or implicit assumptions.
To maintain reliability and credibility, the problem statement must be factual and testable. It should report observations from logs, dashboards, incidents, user tickets, or cost data, and translate those into concrete business or user impacts such as breached SLOs, revenue variance, compliance exposure, or degraded developer velocity. A credible problem statement is also time-bounded and scoped to the smallest area where the issue manifests, avoiding the temptation to inflate scope for visibility or resources. The reader should find no hype, no marketing language, and no unstated assumptions.
Equally important are the boundaries of what a problem statement is not. It is not the solution, not the design, and not a roadmap. It omits implementation details and prescribes no technology choices. It avoids forward commitments like “we will migrate” or “we must adopt,” because such phrasing prematurely closes the space of possible solutions. The problem statement also avoids speculative causes unless supported by evidence; instead, it presents the present state and its verified consequences. These boundaries keep the conversation focused on validating the existence, scope, and impact of the issue before debating how to solve it.
A precise distinction becomes clear when comparing weak and strong formulations. A weak formulation substitutes a preferred tool for the problem itself, tipping stakeholders toward a single path and shutting down exploration. A strong formulation names the affected system under specific conditions, quantifies symptoms using observable metrics, traces those symptoms to business or user impact, and lists constraints that any solution must honor. By separating facts from preferences, the problem statement invites a fair evaluation of diverse options, including incremental fixes, architectural changes, or process adjustments.
In practice, this boundary discipline improves stakeholder alignment. Product managers can connect system symptoms to user and revenue outcomes, SREs can align observed failures to SLO performance, and engineering leads can situate constraints such as uptime commitments and regulatory obligations. When everyone sees the same problem with the same numbers in the same scope, approval flows faster and the quality of debate improves. The RFC’s later sections—goals, non-goals, solution options—then naturally derive from a shared, precise definition of the problem.
Step 2 — The 4 building blocks of a precise problem statement
A precise problem statement rests on four building blocks that collectively constrain scope, establish evidence, and make impact legible to both technical and non-technical readers.
1) Context and scope (where the problem lives). The opening should identify the system or component, the affected actors, the operational context, and the relevant workload characteristics. This includes naming the environment (production, staging), the time or event windows (e.g., month-end close), and the specific flow or subsystem (e.g., write path, batch job, authentication handshake). By localizing the problem to a specific area, you reduce ambiguity and prevent diffusion into adjacent systems that may not share the same failure mode. This sharp scope also helps reviewers verify the observation and replicate the conditions under which the issue occurs.
2) Observable symptoms and evidence. You must translate qualitative discomfort into quantitative, observable signals. Instead of “slow” or “often,” provide metrics: error rates, latency percentiles, throughput limits, resource starvation, or operational toil measured in hours or tickets. Cite sources: dashboards, incident timelines, logs, user reports, or cost tables. When you include exact numbers (ranges if necessary), you create falsifiable claims. If new data contradicts the statement, it should be straightforward to update the numbers or narrow the conditions. Avoid implied causality unless you can reference supporting data; focus on what is demonstrably happening under which conditions.
3) Business/user impact. Technical symptoms matter because they create consequences. Connect the system behavior to outcomes stakeholders recognize: breached SLOs, missed contractual obligations, revenue variance, margin pressure from cost inefficiency, compliance exposure, support burden, or churn risk. This translation gives the problem shape and urgency. It also becomes the foundation for evaluation criteria: if the issue causes a 0.4 percentage point churn increase or adds hundreds of hours of operational toil per quarter, then acceptable solutions must eliminate or sharply reduce that impact. Remember that impacts can be non-financial but still material—e.g., developer velocity or risk posture.
4) Constraints and assumptions (still solution-agnostic). Name external and internal boundaries that shape what counts as a viable resolution: regulatory deadlines, data residency requirements, uptime or latency commitments, budget ceilings, team capacity, and integration restrictions. State the assumptions you are making (e.g., growth trends, data volumes, or seasonal peaks) so reviewers can challenge them. Constraints do not prescribe a solution; they state the rules of the game. By surfacing them early, you prevent later surprises and reduce rework in solution evaluation.
These blocks can be woven into a single, coherent paragraph using a micro-template that keeps the statement compact while preserving precision: “In [system/area], [under condition X], we observe [quantified symptoms] supported by [evidence]. This results in [business/user impact]. We operate under [constraints/assumptions], which means the problem must be addressed without [forbidden approaches] and while maintaining [critical obligations].” This structure ensures that readers quickly see the where, what, so-what, and boundaries, and it keeps the language neutral with respect to specific tools or architectures. When filled with concrete details, it naturally resists vagueness and solutioning.
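To make the micro-template concrete, here is a minimal sketch that renders it from named fields. The field names and the sample values are illustrative placeholders, not a standard schema; the template string follows the wording above.

```python
# Hypothetical sketch: rendering the lesson's micro-template from named
# fields. Field names and sample values are illustrative placeholders.

MICRO_TEMPLATE = (
    "In {system}, {condition}, we observe {symptoms} supported by {evidence}. "
    "This results in {impact}. We operate under {constraints}, which means the "
    "problem must be addressed without {forbidden} and while maintaining "
    "{obligations}."
)

def fill_problem_statement(fields: dict) -> str:
    """Render the template; raises KeyError if any building block is missing."""
    return MICRO_TEMPLATE.format(**fields)

statement = fill_problem_statement({
    "system": "the events write path in the ingestion service",
    "condition": "during month-end close (last 48 hours of each month)",
    "symptoms": "p95 write latency above 450 ms in 18-24% of requests",
    "evidence": "Grafana dashboards and incident postmortems",
    "impact": "a missed 99.9% SLO in 3 of the last 5 weeks",
    "constraints": "EU data residency and a fixed quarterly budget",
    "forbidden": "new PII exposure",
    "obligations": "current uptime commitments",
})
print(statement)
```

A useful property of this structure: a missing building block fails loudly (a `KeyError`) rather than producing a statement with a silent gap.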
Step 3 — A 4-step workflow: how to write an RFC problem statement
A repeatable workflow helps you move from scattered signals to a polished, review-ready statement. The goal is consistency: every RFC problem statement should be reliable, compact, and testable, regardless of who authors it.
1) Gather signals. Start by collecting quantitative and qualitative inputs from multiple sources: SLO dashboards, error logs, incident postmortems, cost and usage reports, user complaints, and internal tickets. Convert anecdotes into numbers wherever possible—if a customer success manager says “billing delays happen a lot,” find the monthly incidence rate and quantify the financial impact. As you gather, identify the smallest affected surface area that still captures the full impact. This tight scope avoids conflating the core issue with loosely related annoyances and keeps later solution exploration focused.
2) Draft the statement using the micro-template. Aim for 4–6 sentences and keep it under 120–150 words. Use exact nouns for systems and flows (e.g., “events write path in the ingestion service”). Replace vague adjectives with measurable thresholds: instead of “significant latency,” specify “p95 write latency exceeds 450 ms.” When ranges are necessary, provide bounded ranges grounded in data. Attribute evidence to sources so reviewers can verify numbers without back-and-forth. Maintain a neutral tone and avoid verbs that implicitly prescribe change such as “migrate,” “adopt,” or “rewrite.” Precision in naming, measurement, and conditions does most of the work of making the statement credible and comprehensible.
3) Stress-test for precision and scope. Share the draft with reviewers and ask a simple question: “Can you propose three different solution approaches after reading this?” If the answer is no, the statement is likely too vague or solution-biased. Use a quick diagnostic: a) scan for solution words like “adopt,” “replace,” “switch,” b) remove unnecessary forward commitments or design decisions, and c) verify that each claim has a source or can be validated. Also check that the scope is crisp: if the statement references adjacent systems, ensure they are causally implicated with evidence; otherwise, trim them out. The stress-test should leave you with a concise core that multiple solution paths could satisfy.
4) Refine with stakeholders. Convene a 10-minute read-aloud session with a PM, an SRE, and a domain lead. Reading aloud forces clarity and exposes ambiguous terms. Invite quick, in-line objections and map each to one of the four building blocks: context (e.g., peak definition), evidence (e.g., missing incident link), impact (e.g., revenue estimate needs source), or constraints (e.g., regulatory deadline). Update the draft immediately, link sources, and freeze the statement when every objection is addressed in one of those blocks. This disciplined closure prevents endless revisiting and allows the RFC to proceed to solution exploration with a solid foundation.
Adopting this workflow across teams builds a shared habit of precision and prevents a common failure mode: RFCs that argue for a favorite design without a validated problem. It also standardizes the quality bar for data and clarity, making cross-team review easier and faster.
Step 4 — Phrasing patterns, pitfalls, and an assessment checklist
Language patterns help you write with economy and rigor. Use time-bounded conditions to narrow context, quantified symptoms to make claims testable, impact framing to translate technical issues into business terms, and constraint clarity to establish non-negotiables. For example, “During month-end close (last 48 hours of each month)…,” “p95 write latency exceeds 450 ms in 18–24% of requests,” “Leads to missed SLO (99.9%) in 3 of the last 5 weeks; projected churn impact 0.2–0.4 pp,” and “Must maintain data residency in EU; no additional PII exposure.” When communicating assumptions, signal them explicitly: “Assumes growth trend of 15% MoM persists through Q4.” These patterns not only improve readability but also teach reviewers how to interrogate your claims.
Avoid predictable pitfalls that undermine credibility. First, solutioning: naming tools, vendors, or architectures in the problem statement narrows exploration and can bias reviewers. Second, vague quantifiers: words like “often,” “usually,” “big,” and “slow” are interpretive; replace them with metrics and ranges tied to sources. Third, scope creep: resist bundling adjacent issues unless you can demonstrate a causal link; otherwise, create separate problem statements. Fourth, unsupported claims: any stated risk, cost, or impact should include traceable citations or links. Fifth, hidden constraints: budget, compliance, or capacity limits that surface later erode trust and force rework; state them upfront even if they feel uncomfortable.
Use a short assessment checklist as your final gate before sharing the RFC:
- Is the problem statement 4–6 sentences, solution-agnostic, and scoped to one system or flow?
- Are all key terms defined (peak, outage, session, tenant) so non-specialists can follow?
- Do the symptoms include numbers, time windows, and sources (dashboards, incidents)?
- Is the business/user impact explicit and traceable (SLO, revenue, compliance, velocity)?
- Are constraints and assumptions stated clearly and realistically, with any forbidden approaches noted?
- Could at least three distinct solution paths be valid given this statement?
- Would a non-domain stakeholder grasp the urgency and boundaries within 60 seconds?
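The checklist can also be treated as an all-or-nothing gate. The sketch below is an illustrative encoding, with the items paraphrased from the list above; the author supplies the yes/no answers, and the draft ships only when every item is affirmed.

```python
# Illustrative sketch: the assessment checklist as a final gate. Answers are
# supplied manually by the author; the gate passes only if all are true.
# Item wording is paraphrased from the lesson's checklist.

CHECKLIST = [
    "4-6 sentences, solution-agnostic, scoped to one system or flow",
    "All key terms defined for non-specialists",
    "Symptoms include numbers, time windows, and sources",
    "Business/user impact explicit and traceable",
    "Constraints and assumptions stated; forbidden approaches noted",
    "At least three distinct solution paths remain valid",
    "A non-domain stakeholder grasps urgency and boundaries in 60 seconds",
]

def ready_to_share(answers: list[bool]) -> bool:
    """True only when every checklist item is affirmed."""
    if len(answers) != len(CHECKLIST):
        raise ValueError("provide exactly one answer per checklist item")
    return all(answers)
```

Treating the gate as all-or-nothing is deliberate: a single unmet item (say, an unsourced revenue estimate) is exactly the kind of gap that erodes trust in review.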
When you apply these patterns and the checklist consistently, the problem statement becomes a shared contract. It tells solution designers what success must look like without telling them how to achieve it. This invites creativity while preserving accountability to measurable outcomes and fixed constraints. It also accelerates cross-functional alignment by giving each stakeholder a clear entry point: engineers verify evidence and feasibility, PMs validate impact and priority, and compliance or security teams confirm constraints.
Finally, remember that precision does not mean verbosity. A strong RFC problem statement is short but dense with meaning. Every sentence should perform a job: locate the issue, measure it, translate it to impact, and frame the non-negotiables. Keep the tone neutral, the claims checkable, and the scope tight. With these principles and the four-step workflow, you can reliably produce problem statements that guide productive solutioning and help secure timely, well-reasoned approval.
- An RFC problem statement must be concise, factual, and solution-agnostic: define the gap with testable evidence, not tools or decisions.
- Build it with four blocks: context/scope (where and when), observable symptoms with metrics and sources, business/user impact, and constraints/assumptions.
- Use quantified, time-bounded language and a micro-template to keep claims verifiable and neutral; avoid vague terms, speculation, and scope creep.
- Apply the workflow: gather signals, draft 4–6 sentences, stress-test for neutrality and multiple solution paths, and refine with stakeholders using an evidence-and-constraints checklist.
Example Sentences
- In the checkout service during peak Friday traffic, p95 payment authorization latency exceeds 600 ms for 22–28% of requests, as shown in Grafana panel 3.2.
- This behavior has breached the 99.9% latency SLO in 4 of the last 6 weeks and correlates with a 0.3–0.5 percentage point decline in conversion, per Looker dashboard REV-17.
- We must address the problem within EU data residency and a 2-hour monthly maintenance window while avoiding additional PII exposure.
- Assumes order volume continues to grow 12–15% MoM through Q4; if growth slows below 5%, impact estimates should be revised.
- Any solution must operate without vendor lock-in commitments and maintain current uptime (99.95%) and auditability requirements (SOX scope).
Example Dialogue
Alex: I’m drafting the RFC problem statement for the auth timeouts—can you sanity-check the scope?
Ben: Sure. What’s the precise context and evidence?
Alex: In production, during the first 30 minutes of each workday, p95 token refresh latency spikes to 900–1,200 ms for 18–24% of requests, confirmed by the Kibana query AUTH-42 and PagerDuty incidents #1187 and #1194.
Ben: And the impact?
Alex: We’ve missed the 99.9% auth SLO twice this month and support tickets increased by 35%, adding about 90 hours of ops toil; finance estimates a 0.2 pp churn risk if it persists.
Ben: Good. Note constraints: GDPR data residency, no new PII flows, budget cap of $60k this quarter, and keep it solution-agnostic—no ‘migrate to X’ language.
Exercises
Multiple Choice
1. Which sentence best maintains a solution-agnostic, evidence-based RFC problem statement?
- We must migrate to Vendor X because the system is too slow.
- In the invoicing batch job during month-end close, p95 processing time exceeds 45 minutes for 20–26% of runs, per Airflow logs JOB-221; this breached our 99.9% timeliness SLO twice last quarter.
- Our batch job is often slow and hurts the business, so we should replace it.
- The invoicing job is bad and causes customer frustration.
Show Answer & Explanation
Correct Answer: In the invoicing batch job during month-end close, p95 processing time exceeds 45 minutes for 20–26% of runs, per Airflow logs JOB-221; this breached our 99.9% timeliness SLO twice last quarter.
Explanation: A strong problem statement is factual, quantified, sourced, and solution-agnostic, linking symptoms to impact. The other options are vague or prescribe a solution.
2. Which item correctly states constraints without prescribing a solution?
- We will adopt a serverless architecture to fix the issue.
- Any fix must be open source and use PostgreSQL 14.
- Must maintain EU data residency, keep current uptime (99.95%), and operate within a 2-hour monthly maintenance window.
- We should rewrite the service in Go to meet latency goals.
Show Answer & Explanation
Correct Answer: Must maintain EU data residency, keep current uptime (99.95%), and operate within a 2-hour monthly maintenance window.
Explanation: Constraints state non-negotiable boundaries (compliance, uptime, windows) without naming a specific technology or solution.
Fill in the Blanks
In [system/area], under [condition/time window], we observe [quantified symptom] supported by [evidence]. This results in [business impact]. We operate under [constraints], which means the problem must be addressed without [___].
Show Answer & Explanation
Correct Answer: prescribing a specific solution or technology
Explanation: The problem statement should be solution-agnostic, avoiding language that commits to tools, vendors, or architectures.
Replace vague terms with metrics: instead of saying "the API is often slow," write "p95 latency exceeds ___ ms for ___% of requests," citing a dashboard or log source.
Show Answer & Explanation
Correct Answer: a specific number; a specific percentage
Explanation: Observable symptoms should be quantified with concrete thresholds and proportions, tied to verifiable sources.
Error Correction
Incorrect: During traffic spikes, we will migrate to Service Y because error rates are high, which probably causes churn.
Show Correction & Explanation
Correct Sentence: During traffic spikes, p95 error rate reaches 3.2–4.1% for the checkout write path, as shown in Grafana ERR-12; this correlates with a 0.2–0.4 pp conversion decline.
Explanation: The correction removes solutioning (“will migrate”) and speculation, replacing them with quantified, evidence-backed symptoms and impact, per the solution-agnostic rule.
Incorrect: The login system is usually slow and impacts users; fix it soon, but details can be figured out later.
Show Correction & Explanation
Correct Sentence: In production during 08:00–09:00 UTC, p95 login latency is 850–1,100 ms for 18–24% of requests (Kibana AUTH-42), breaching the 99.9% SLO in 3 of the last 5 weeks under GDPR and a $60k quarterly budget cap.
Explanation: The correction replaces vague language with precise context, quantified symptoms, explicit impact (SLO breach), and stated constraints, aligning with the four building blocks.