Precision English for RWE Documentation: Crafting a “template: data sources section for RWE” with Linkage, Coverage, and Bias Language
Struggling to write a data sources section that withstands audit without sounding promotional? In this lesson, you’ll learn to craft a precise, regulator-ready “template: data sources section for RWE,” with defensible linkage parameters, transparent coverage and representativeness, and clear bias language. You’ll get concise guidance, domain-approved phrasing, real-world examples, and targeted exercises (MCQs, fill‑in‑the‑blank, error fixes) to operationalize neutral, versioned statements that map to EQUATOR-aligned expectations. Finish with a scannable, journal-ready section that earns trust and shortens review cycles.
Purpose and Structure of a “template: data sources section for RWE”
A well-constructed data sources section is the backbone of reproducible real‑world evidence. Its purpose is to allow a reader to understand precisely what data were used, how they were obtained, how records were linked and transformed, and where bias could enter the pipeline. When this section is complete and neutral in tone, an external reviewer can audit the study, replicate the cohort construction, and evaluate the external validity of findings without reading between the lines. The emphasis is on traceability: sources, versions, dates, governance, and processes should be stated so that each step can be verified.
To achieve this, structure the section with canonical subheadings that mirror the lifecycle of data use in RWE:
- Source Overview and Governance
- Coding Systems and Phenotyping
- Linkage and Data Integration
- Data Quality: Missingness, Refresh Cadence, and Preprocessing
- Coverage and Representativeness
- Bias, Limitations, and Sensitivity Checks
Use neutral, declarative sentences. Avoid causal interpretation, promotional language, and open‑ended claims. Prefer standardized phrases that are easy to audit and compare across studies, such as: “We obtained…,” “The dataset includes…,” “Linkage was performed using…,” “Coverage spans…,” “Representativeness was assessed by…,” and “Potential biases include…”. Give parameters and versions wherever possible. Cite coding systems with their controlled vocabularies, and specify time windows in YYYY‑MM‑DD format. This voice keeps the prose factual and scannable while retaining the specificity required for independent verification.
Organizing your writing with these subheadings ensures that the reader can answer four core questions: What are the sources and their governance? How are clinical concepts encoded and phenotypes defined? How are records joined and curated across sources? What is the scope and quality of the data, including coverage and potential biases? A section that answers these questions with auditable detail will accelerate protocol review and reduce queries from methodologists, data partners, and regulators.
Domain‑Approved Phrasing by Subheading
1) Source Overview and Governance
Begin by naming the data sources, the organizations that supplied them, and the legal and ethical frameworks that govern their use. State whether you accessed an electronic health record (EHR) repository, claims data, a disease registry, or an NLP‑derived corpus from clinical notes. Specify whether the acquisition was a direct extract from a health system, a licensed dataset from a vendor, or a research network pull. Report the relevant data model—such as OMOP or PCORnet—to set expectations for standardized fields and vocabulary mappings.
Use a consistent template to capture these details. A clear, domain‑approved formulation is: “We obtained [data source(s): EHR/claims/registry/NLP corpus] from [organization(s)] under [IRB/DUA details]. The extract covered [start date–end date] and included [population scope, e.g., adults ≥18 receiving care in X systems]. Data governance adheres to [HIPAA/GDPR/other], and data are de‑identified to the [Safe Harbor/Expert Determination] standard.” Extend this with the version or date of the data pull, protocol identifiers, and any network‑level governance. If your dataset uses a common data model, name the version (e.g., OMOP v5.4) and the date of transformation. When appropriate, state whether expert determination for de‑identification was performed and by whom.
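Some teams also keep these provenance details in a small structured record alongside the manuscript so they can be checked programmatically. The sketch below is a minimal, hypothetical Python representation; the field names are not a required schema, and the example values are drawn from this lesson's examples where available, with placeholders elsewhere.

```python
from dataclasses import dataclass

@dataclass
class SourceGovernanceRecord:
    """Minimal provenance record for one data source (illustrative fields, not a required schema)."""
    source_type: str        # e.g., "EHR", "claims", "registry", "NLP corpus"
    supplier: str           # organization that supplied the extract
    irb_protocol: str       # IRB approval identifier
    dua_id: str             # data use agreement identifier
    extract_start: str      # ISO dates, YYYY-MM-DD
    extract_end: str
    population_scope: str   # e.g., "adults >=18 receiving care in X systems"
    deidentification: str   # "Safe Harbor" or "Expert Determination"
    data_model: str         # common data model and version, e.g., "OMOP v5.4"
    snapshot_date: str      # date of the data pull used for analysis

# Example values drawn from this lesson's example sentences where available; the rest are placeholders.
ehr_source = SourceGovernanceRecord(
    source_type="EHR",
    supplier="MetroHealth",
    irb_protocol="IRB #2024-117",
    dua_id="MH-OPX-23",
    extract_start="2018-01-01",
    extract_end="2024-06-30",
    population_scope="adults >=18 receiving care in MetroHealth systems",
    deidentification="Safe Harbor",
    data_model="OMOP v5.4",
    snapshot_date="2025-02-01",
)
```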
These details establish provenance and legal compliance. They also define the population universe accessible to the study. Keep claims about scope modest and concrete, tied to inclusion rules, encounter thresholds, and care settings actually present in the data.
2) Coding Systems and Phenotyping
In RWE, clinical concepts are encoded with controlled vocabularies. Readers must see exactly which vocabularies and versions were used and understand the logic that converts codes and text into phenotypes. List the coding systems relevant to diagnoses, procedures, medications, and laboratory results—commonly ICD‑10‑CM, SNOMED CT, CPT/HCPCS, RxNorm, and LOINC. If you rely on mappings across vocabularies, name them and note the version dates.
Describe the phenotyping approach with standardized, auditable language: “Clinical concepts were identified using [coding systems]. Phenotypes were defined by [logic: rule‑based algorithms/validated phenotyping algorithms/NLP pipelines], with code lists sourced from [authoritative source/literature] and versioned on [date]. Code‑list validation was [internal review/clinical adjudication/benchmark against gold standard], achieving [metrics if available].” Name algorithm references (e.g., established phenotypes from a consortium or published rule sets). For NLP pipelines, report the model type and version, the document domains used (e.g., progress notes, discharge summaries, radiology reports), and whether negation, temporality, and assertion status were handled. If you used assertion or section detection, describe the schemas and any confidence thresholds.
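A phenotype definition becomes auditable when the versioned code list and the inclusion rule are both explicit. The Python sketch below illustrates one way to apply a simple rule-based phenotype, assuming a pandas diagnosis table with hypothetical person_id and icd10cm_code columns; the code list contents and the two-code rule are illustrative only, not a recommended definition.

```python
import pandas as pd

# Hypothetical, versioned code list for a rule-based phenotype; in practice, source
# the list from an authoritative repository and cite its version in the text.
T2DM_CODELIST = {
    "name": "type_2_diabetes",
    "vocabulary": "ICD-10-CM",
    "version_date": "2023-10-01",
    "codes": {"E11.9", "E11.65", "E11.22"},
}

def apply_phenotype(diagnoses: pd.DataFrame, codelist: dict, min_code_count: int = 2) -> list:
    """Return person_ids meeting a simple rule: at least min_code_count qualifying codes.

    Assumes a diagnoses table with 'person_id' and 'icd10cm_code' columns (hypothetical schema).
    """
    hits = diagnoses[diagnoses["icd10cm_code"].isin(codelist["codes"])]
    counts = hits.groupby("person_id").size()
    return counts[counts >= min_code_count].index.tolist()

# Small illustrative table: person 1 meets the two-code rule, persons 2 and 3 do not.
dx = pd.DataFrame({
    "person_id": [1, 1, 2, 3],
    "icd10cm_code": ["E11.9", "E11.65", "E11.9", "I10"],
})
print(apply_phenotype(dx, T2DM_CODELIST))  # [1]
```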
This level of specificity lets reviewers judge misclassification risk and portability. It also supports sensitivity checks, such as testing alternative code lists or stricter inclusion logic. If metrics exist—precision, recall, F1, or agreement statistics—state them and indicate the reference standard used for evaluation. If formal metrics are not available, indicate the nature of clinical review or adjudication and the sample sizes used.
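When a reference standard exists, the metrics themselves are straightforward to compute and report. The sketch below shows precision, recall, and F1 from hypothetical adjudication counts; the numbers are placeholders, not results from any real validation.

```python
def validation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 for a phenotype judged against a reference standard.

    Counts would come from chart review or other gold-standard adjudication.
    """
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else float("nan")
    return {"precision": round(precision, 3), "recall": round(recall, 3), "f1": round(f1, 3)}

# Example: 90 true positives, 10 false positives, 15 false negatives from adjudication.
print(validation_metrics(90, 10, 15))  # {'precision': 0.9, 'recall': 0.857, 'f1': 0.878}
```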
3) Linkage and Data Integration
Many RWE studies synthesize multiple sources: EHR encounters, pharmacy dispensing, laboratory feeds, registries, or payer claims. Linkage choices affect cohort completeness and the risk of duplicate or mismatched records. State the sources linked, the identifiers or features used for linkage, and the method class—deterministic or probabilistic.
Use canonical phrasing: “Records were linked across [sources] using [deterministic keys: MRN, hashed SSN, payer member ID; and/or probabilistic linkage via [Fellegi–Sunter/ML‑based] with thresholds of [x]]. Linkage quality was assessed by [clerical review/sample match rate], with estimated precision/recall of [x/y] where calculable. De‑duplication used [rules], and person‑level IDs were [stable/refresh‑reconciled].” If crosswalk tables or common concept identifiers (e.g., OMOP person_id) were used, state this. If weights or thresholds were tuned, identify the criteria, sample size, and any holdout used to estimate error rates.
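The decision logic implied by an acceptance threshold and a clerical review band can be stated in a few lines. This Python sketch only classifies a pre-computed match probability into decision bands; it does not implement the Fellegi–Sunter model itself, and the thresholds simply echo the example parameters in this lesson.

```python
def classify_candidate_pair(match_probability: float,
                            accept_threshold: float = 0.95,
                            review_lower: float = 0.85) -> str:
    """Assign a linkage decision band to a candidate record pair.

    The probability would come from a Fellegi-Sunter or ML-based linkage model
    (not implemented here); thresholds mirror the lesson's example parameters.
    """
    if match_probability >= accept_threshold:
        return "accept"
    if match_probability >= review_lower:
        return "clerical_review"
    return "reject"

# Example: a pair scored at 0.91 falls in the clerical review band.
print(classify_candidate_pair(0.91))  # clerical_review
```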
Address how you handled mismatches and ambiguous links. Indicate whether you suppressed ties below a certain probability, escalated borderline cases for clerical review, or resolved conflicts via hierarchy rules. If identifiers change across refreshes, explain how you reconcile longitudinal identity and how many records are affected. If site‑level variation exists in identifier quality, note it and its implications for linkage error and coverage.
4) Data Quality: Missingness, Refresh Cadence, and Preprocessing
Data quality affects inference and reproducibility. Report the timing of data ingestion, the snapshot used for analysis, and the cadence of refreshes. Be explicit: “Data refreshes occurred on a [weekly/monthly/quarterly] cadence; this analysis uses the [YYYY‑MM‑DD] snapshot.” This positions any count differences or cohort drift within a known update cycle.
Describe missingness profiling for key fields (e.g., demographics, exposure variables, outcomes, dates, units). Avoid asserting strong missingness mechanisms without evidence; permissible phrasing includes “profiled as,” “treated as,” or “assumed under analysis,” with justification. If you categorize missingness as MCAR, MAR, or MNAR, link the classification to diagnostics, external benchmarks, or sensitivity analyses. State the rules you used for excluding records, imputing values, or carrying forward measurements, and identify the variables affected.
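Missingness profiling is usually a small, reportable computation. The sketch below, assuming pandas and hypothetical column names, produces the kind of per-field summary that can be cited under this subheading.

```python
import pandas as pd

def profile_missingness(df: pd.DataFrame, key_fields: list) -> pd.DataFrame:
    """Count and percentage of missing values for key analytic fields.

    Column names are hypothetical; substitute the fields profiled in your study.
    """
    summary = pd.DataFrame({
        "n_missing": df[key_fields].isna().sum(),
        "pct_missing": (df[key_fields].isna().mean() * 100).round(2),
    })
    return summary.sort_values("pct_missing", ascending=False)

# Small illustrative table; a real profile would run on the analytic cohort snapshot.
cohort = pd.DataFrame({
    "birth_date": ["1970-01-01", None, "1985-06-30", "1992-11-12"],
    "sex": ["F", "M", None, None],
    "outcome_date": ["2023-03-01", "2022-07-15", "2024-01-20", "2023-09-09"],
})
print(profile_missingness(cohort, ["birth_date", "sex", "outcome_date"]))
```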
List preprocessing steps that could alter values or selection: date harmonization, timezone alignment, unit normalization, outlier handling with explicit bounds, deduplication logic, text de‑identification, concept mapping, and value range checks. Name the pipeline or tool, including version numbers, configuration hashes, and execution dates, so a reviewer can trace the transformations end‑to‑end. Note any QC dashboards used and thresholds for flagging anomalies. If you applied conservative truncation of implausible measurements or filtered noisy text segments, define the heuristics and their provenance.
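A configuration hash is one lightweight way to make preprocessing settings citable. The Python sketch below hashes a hypothetical settings dictionary; the resulting value can be quoted in the data sources section next to the pipeline version so a reviewer can confirm the exact configuration used.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Compute a reproducible SHA-256 hash of a preprocessing configuration.

    Serializing with sorted keys makes the hash stable across runs, so the hash
    reported in the methods text identifies the exact settings that were applied.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical preprocessing settings; report the hash alongside the pipeline release tag.
preprocessing_config = {
    "unit_normalization": "SI",
    "outlier_bounds": {"heart_rate": [20, 300]},
    "dedup_rule": "latest_record_per_encounter",
    "timezone": "UTC",
}
print(config_hash(preprocessing_config)[:12])  # a short prefix is often quoted in text
```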
This subheading should make clear which steps are deterministic, which are probabilistic or model‑based, and which are policy decisions (e.g., excluding incomplete encounters). Precision in this section supports both audit and replication on a future snapshot.
5) Coverage and Representativeness
Coverage tells the reader where and when data capture is possible and what subpopulations are present. Specify geographic scope (states, regions), care settings (inpatient, outpatient, emergency department, specialty clinics), and the temporal window of available data. Use date ranges with explicit boundaries and note any known discontinuities or lags. If certain sites onboarded mid‑period or if feeds changed, document the inflection points.
Representativeness requires a comparison against external denominators. Describe the analytic cohort in terms of age bands, sex distribution, payer mix, and other relevant attributes, and then say how you benchmarked: “The analytic cohort represents [age ranges, sex distribution, payer mix], compared against [external benchmark: Census/HCUP/claims universe] using [standardized differences/weighting diagnostics].” Provide denominators and specify inclusion/exclusion criteria and minimum encounter thresholds. Avoid generalizing beyond capture scope; for example, if data are from insured populations in urban centers, state that clearly and avoid implying national generalizability without adjustment.
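For binary attributes such as membership in an age-sex stratum or a payer category, the standardized difference has a simple closed form. The sketch below shows the calculation; the 0.18 and 0.22 proportions are hypothetical, and the 0.1 flag mentioned in the comment is a common rule of thumb rather than a fixed standard.

```python
import math

def standardized_difference(p_cohort: float, p_benchmark: float) -> float:
    """Standardized difference for a binary attribute (e.g., share of an age-sex stratum).

    Absolute values above roughly 0.1 are often flagged as meaningful imbalance
    when comparing cohort composition against an external denominator.
    """
    pooled_var = (p_cohort * (1 - p_cohort) + p_benchmark * (1 - p_benchmark)) / 2
    return (p_cohort - p_benchmark) / math.sqrt(pooled_var)

# Example: 18% of the cohort vs. 22% of the Census benchmark in a given stratum.
print(round(standardized_difference(0.18, 0.22), 3))  # about -0.1
```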
If you applied reweighting or post‑stratification to mitigate nonrepresentativeness, describe the approach at a high level and name the diagnostics used to evaluate balance. If representativeness varies by site or phase, acknowledge heterogeneity and its implications for subgroup analyses. Transparency in this section helps the reader judge external validity and interpret effect heterogeneity.
6) Bias, Limitations, and Sensitivity Checks
Every RWE dataset carries risks of bias. Name plausible biases in neutral terms and separate observation from inference. Common concerns include selection bias due to care‑seeking patterns, misclassification from coding or NLP errors, left truncation (incomplete prior history) and right censoring (loss to follow‑up), and differential capture or coding intensity by site or payer. State these succinctly and tie them to the specifics of your sources.
Articulate what you did to probe robustness: “We conducted [sensitivity analyses: alternative code lists, stricter phenotype definitions, linkage threshold variation, reweighting] yielding [direction/magnitude] of changes.” Use hedged language when summarizing outcomes of these checks. If falsification endpoints, negative controls, or exposure misclassification stress tests were applied, note them and report whether conclusions were directionally stable. When possible, quantify how many records or events are affected by each sensitivity setting. This section builds trust by showing that you have looked for instability and by documenting the boundaries of inference supported by the data.
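One concrete way to report a sensitivity probe is to show how a cohort-defining count moves across parameter settings. The sketch below varies a linkage acceptance threshold over hypothetical pair scores; the same pattern applies to alternative code lists or stricter phenotype rules.

```python
def links_by_threshold(pair_scores: list, thresholds: list) -> dict:
    """Count accepted links at each candidate acceptance threshold.

    Reporting how link counts shift across thresholds shows reviewers how robust
    downstream cohort sizes are to this choice. Scores here are hypothetical
    outputs of a linkage model, not real data.
    """
    return {t: sum(score >= t for score in pair_scores) for t in thresholds}

# Example with illustrative scores and the thresholds named in the methods text.
scores = [0.99, 0.97, 0.93, 0.91, 0.88, 0.80]
print(links_by_threshold(scores, [0.90, 0.95]))  # {0.9: 4, 0.95: 2}
```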
Putting It Together: How to Assemble a Concise, Auditable Section
To assemble a coherent section, proceed in the recommended order and keep sentences concrete. Begin with provenance and governance to establish legitimacy and scope. Move to coding systems and phenotyping to define how clinical meaning is operationalized. Then describe linkage and integration to clarify record‑level identity and deduplication. Address data quality, refresh cadence, and preprocessing to make transformations traceable. Conclude with coverage and representativeness to situate the cohort in time, place, and population, and finish with a neutral statement of bias risks and sensitivity probes.
As you draft, favor parameterized statements. Replace vague phrases with dates, versions, thresholds, and rule names. Instead of “we used standard ICD codes,” write “diagnoses were identified using ICD‑10‑CM (2023‑10 release) mapped to SNOMED CT (2023‑09‑01), with mappings versioned in OMOP v5.4 vocabularies (downloaded 2023‑11‑15).” Instead of “records were linked probabilistically,” write “probabilistic linkage used a Fellegi–Sunter model with a 0.95 acceptance threshold and 0.85–0.95 clerical review band; 2% of candidate pairs underwent manual adjudication.” These details shorten review cycles and reduce ambiguity.
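If parts of the methods text are generated or checked programmatically, a parameterized sentence template keeps the prose synchronized with the analysis settings actually used. The sketch below is a hypothetical example built around the linkage numbers above.

```python
# Hypothetical template; the values come from the analysis configuration, so the
# methods sentence cannot drift from the parameters that were actually applied.
LINKAGE_SENTENCE = (
    "Probabilistic linkage used a Fellegi-Sunter model with a {accept:.2f} acceptance "
    "threshold and {review_low:.2f}-{review_high:.2f} clerical review band; "
    "{pct_reviewed:.0%} of candidate pairs underwent manual adjudication."
)

print(LINKAGE_SENTENCE.format(accept=0.95, review_low=0.85, review_high=0.95, pct_reviewed=0.02))
```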
Keep the tone neutral and non‑inferential. Avoid statements that imply effectiveness, safety, or causality. The data sources section should read as an operational report rather than a results narrative. When uncertainty exists—about missingness mechanisms, linkage error rates, or phenotyping accuracy—say so plainly and provide the closest available diagnostics or assumptions used in downstream analysis. Transparency about limits is a strength, not a weakness, in RWE documentation.
Finally, ensure cross‑references and traceability. If you cite a code list or phenotype algorithm, give a repository link or DOI, plus the version or commit hash. If you name a pipeline, include the release tag, configuration file name, and execution date. If you present coverage summaries or representativeness diagnostics, indicate where the full tables reside (e.g., an appendix or methods supplement). This documentation scaffolding enables others to reproduce your environment and decisions on a new snapshot or a different dataset with minimal friction.
By following this structure and language, you will produce a data sources section that is precise, auditable, and suitable for regulatory and peer review. The emphasis on linkage, coverage, and bias—with clear coding and preprocessing details—ensures that downstream analyses can be interpreted in the appropriate context and that the evidence chain from raw data to analytic cohort remains intact.
Key Takeaways
- Structure the data sources section with clear subheadings covering: Source/Governance; Coding Systems/Phenotyping; Linkage/Integration; Data Quality (missingness, refresh, preprocessing); Coverage/Representativeness; and Bias/Sensitivity checks.
- Use neutral, parameterized statements that specify sources, versions, dates, models, thresholds, and governance (e.g., IRB/DUA, OMOP version, coding vocab versions, snapshot dates) to ensure auditability and reproducibility.
- Explicitly detail phenotyping methods and linkage procedures (coding systems, algorithm logic, validation metrics; deterministic/probabilistic linkage keys, thresholds, de-duplication rules) to assess misclassification and identity accuracy.
- Document data quality profiling, preprocessing steps, coverage scope, representativeness benchmarking, and plausible biases with corresponding sensitivity analyses, avoiding causal or promotional claims.
Example Sentences
- We obtained EHR and pharmacy claims from MetroHealth and OptiRx under IRB #2024-117 and DUA MH-OPX-23, covering 2018-01-01 to 2024-06-30 for adults ≥18.
- Diagnoses were identified using ICD-10-CM (2023-10) mapped to SNOMED CT (2023-09-01) via OMOP v5.4 vocabularies (downloaded 2023-11-15).
- Records were linked across EHR and claims using deterministic MRN-to-member-ID crosswalks and a Fellegi–Sunter model with a 0.95 acceptance threshold and 0.85–0.95 clerical review band.
- Coverage spans 12 states with inpatient, outpatient, and ED encounters; representativeness was assessed against Census 2020 age-sex strata using standardized differences.
- Potential biases include left truncation due to incomplete prior history, misclassification from NLP negation errors in progress notes, and site-level variation in coding intensity.
Example Dialogue
Alex: I’m drafting the data sources section; can you confirm our linkage details?
Ben: Yes—records were linked using hashed SSN plus DOB deterministically, and probabilistic links scoring 0.90–0.95 went to clerical review.
Alex: Good. I’ll state that de-duplication followed OMOP person_id rules and that the 2025-02-01 snapshot was used.
Ben: Include coverage and bias too—coverage spans 2019-01-01 to 2024-12-31 across six hospitals, and potential biases include loss to follow-up after plan disenrollment.
Alex: Noted. I’ll add that representativeness was compared to state claims denominators using standardized differences.
Ben: Perfect—keep the tone neutral and specify versions for ICD-10-CM, RxNorm, and LOINC so reviewers can audit the phenotypes.
Exercises
Multiple Choice
1. Which sentence best matches the neutral, auditable style for the “Source Overview and Governance” subheading?
- We used a best-in-class dataset that captures nearly all U.S. care.
- We obtained EHR and pharmacy claims from MetroHealth and OptiRx under IRB #2024-117 and DUA MH-OPX-23, covering 2018-01-01 to 2024-06-30 for adults ≥18.
- Our data are comprehensive and definitely representative of the national population.
- The dataset was great and included everything we needed.
Show Answer & Explanation
Correct Answer: We obtained EHR and pharmacy claims from MetroHealth and OptiRx under IRB #2024-117 and DUA MH-OPX-23, covering 2018-01-01 to 2024-06-30 for adults ≥18.
Explanation: The lesson emphasizes neutral, parameterized statements with governance details, dates, scope, and source names; promotional or sweeping claims should be avoided.
2. Which option correctly states linkage details with parameters and methods?
- Records were linked probabilistically.
- Records were linked using a secret algorithm that worked well.
- Records were linked across EHR and claims using deterministic MRN-to-member-ID crosswalks and a Fellegi–Sunter model with a 0.95 acceptance threshold and 0.85–0.95 clerical review band.
- We think the records probably matched.
Show Answer & Explanation
Correct Answer: Records were linked across EHR and claims using deterministic MRN-to-member-ID crosswalks and a Fellegi–Sunter model with a 0.95 acceptance threshold and 0.85–0.95 clerical review band.
Explanation: Good practice specifies sources, method (deterministic/probabilistic), thresholds, and clerical review bands for auditability.
Fill in the Blanks
Diagnoses were identified using ___ (version/date) mapped to SNOMED CT (2023-09-01) via OMOP v5.4 vocabularies (downloaded 2023-11-15).
Show Answer & Explanation
Correct Answer: ICD-10-CM (2023-10)
Explanation: The explanation and examples require naming specific coding systems with version dates to support traceability.
Data refreshes occurred on a monthly cadence; this analysis uses the ___ snapshot.
Show Answer & Explanation
Correct Answer: 2025-02-01
Explanation: The template calls for explicit snapshot dates in YYYY-MM-DD format to anchor counts and cohort drift to a known refresh.
Error Correction
Incorrect: Coverage is basically nationwide and represents everyone equally.
Show Correction & Explanation
Correct Sentence: Coverage spans 12 states with inpatient, outpatient, and ED encounters; representativeness was assessed against Census 2020 age-sex strata using standardized differences.
Explanation: Avoid unsupported generalizations. Replace with concrete scope and an objective benchmarking method, per the lesson’s guidance.
Incorrect: We used standard ICD codes and linked records somehow; missingness was probably random.
Show Correction & Explanation
Correct Sentence: Diagnoses were identified using ICD-10-CM (2023-10) mapped to SNOMED CT (2023-09-01) via OMOP v5.4 vocabularies (downloaded 2023-11-15). Records were linked using hashed SSN plus DOB deterministically, with probabilistic links scoring 0.90–0.95 routed to clerical review. Missingness was profiled for demographics and outcomes; assumptions about how missing values were handled are reported in the analysis.
Explanation: The corrected version supplies specific vocabularies, versions, linkage keys and thresholds, and uses neutral phrasing about missingness profiling instead of unsubstantiated claims.