Cultivating Digital DNA (Data, Networks, and Applications) Enables a Thriving Real-World Evidence Ecosystem to Improve Patient Outcomes

An evolution of medical diagnostics, treatments, and services relies upon a multitude of trusted vendors collectively developing “Digital DNA” — scaled medical datasets, diverse provider networks, and cutting-edge software applications.

by Allison Robbins, CFA* | David M. Rubin, PhD | Junko Saber, MBA | Arash Hanifi, PhD

*corresponding author:


The decades-long transition from paper charts to electronic medical records (EMRs) in the United States has run parallel to the development of novel analytical tools that glean valuable insights from large datasets. These phenomena have now merged to enable a growing and evolving ecosystem of companies that aim to advance diagnostics, treatments, care delivery, and ultimately patient outcomes.

Figure 1. A “Learning Health System,” as defined by the U.S. Agency for Healthcare Research & Quality (AHRQ), pulls data from clinical practice and medical research, converts it into usable evidence, and reintroduces it into the system.

As shown in Figure 1, health IT experts have long envisioned a “learning health system” in which aggregated patient data provides real-time insights that improve care delivery.[1] Today, health systems, nonprofits, and companies collect, de-identify, and curate patient real-world data (RWD) from EMR systems and organize it into structured databases with standardized ontologies.[2] Modern software tools, such as natural language processing (NLP), tokenization, and machine learning, convert raw RWD, combined with other data types and analyses, into real-world evidence (RWE) that supports numerous health care use cases. To close the learning health system’s loop, third-party applications that integrate with EMRs, such as clinical decision support and medical education tools, educate and engage willing patients and clinicians with the knowledge previously acquired.

The Digital DNA (Data, Networks, and Applications) Framework

Figure 2. To address health care use cases, the real-world evidence ecosystem must develop sufficient “Digital DNA”: a composite of “data” quality and volume, “network” scale and diversity, and practical software “applications” that generate solutions and return them to the network. Some applications, such as certain SaaS and Software-as-a-Medical-Device (SaMD) tools, iteratively pull data from the health system and push solutions into it.

A financially sustainable, independent, and competitive ecosystem of trusted health tech vendors supported by thousands of customers benefits all health care stakeholders. The conceptual “Digital DNA” framework expands upon the learning health system concept and empowers health care stakeholders to invest in, partner with, and purchase from health tech companies that have the greatest potential for improving patient care. The Digital DNA framework encompasses RWE as well as its downstream uses, such as developing diagnostics, therapeutics, and SaaS offerings, and provides a scorecard to evaluate the strengths and weaknesses of health tech companies and their products for a given use case. Furthermore, it acknowledges that some software applications, such as SaaS and Software-as-a-Medical-Device (SaMD) tools, iteratively pull data from the health system and reintroduce solutions into it.

As Figure 2 illustrates, the Digital DNA framework encourages the RWE ecosystem to collectively develop a) high-quality, large-volume “data,” b) broad, diverse patient and provider “networks,” and c) practical software “applications” to transform RWD into RWE and medical products and reintroduce solutions to patients and providers. This effort should provide as much Digital DNA to as many stakeholders as efficiently, effectively, and ethically as possible.


Many health care use cases require a large volume of detailed medical data from a diverse set of patients. Because inclusion and exclusion criteria narrow the eligible cohort, the number of initial patients in a commercially available RWD dataset often needs to be significantly greater than that required for any given study’s statistical purposes.[3] For example, life science company evaluations of RWD datasets should consider the number of unique patients with i) certain diagnosis codes or biomarkers, ii) relevant treatment in the historical and forecasted database, iii) active status (i.e., recent visits), and iv) longitudinal data spanning several years. Furthermore, for regulatory use cases, RWD should be traceable/auditable, transparent, generalizable, timely, and scalable.[4] Other considerations are linkages to other types of datasets such as ‘omics, image, lab, claims, pharmacy, financial, and operational data, the Social Security Administration’s Death Master File (DMF), and patient-reported outcomes.
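The attrition described above can be sketched as a simple cohort funnel. The following is a minimal illustration, not a method taken from any vendor: the `Patient` fields, the thresholds (`active_within_days`, `min_history_days`), the treatment name `drug_x`, and the sample records are all assumptions made for the example. It counts how many unique patients remain after each of the four criteria is applied in turn.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Patient:
    # Hypothetical, simplified record layout used only for this sketch.
    patient_id: str
    diagnosis_codes: set
    treatments: set
    visit_dates: list


def cohort_funnel(patients, dx_code, treatment, as_of,
                  active_within_days=365, min_history_days=730):
    """Return [(criterion, unique patients remaining)] after each successive filter."""
    def days_since_last_visit(p):
        return (as_of - max(p.visit_dates)).days

    def history_span_days(p):
        return (max(p.visit_dates) - min(p.visit_dates)).days

    criteria = [
        ("all unique patients", lambda p: True),
        ("i) diagnosis code", lambda p: dx_code in p.diagnosis_codes),
        ("ii) relevant treatment", lambda p: treatment in p.treatments),
        ("iii) active status",
         lambda p: bool(p.visit_dates) and days_since_last_visit(p) <= active_within_days),
        ("iv) longitudinal data",
         lambda p: bool(p.visit_dates) and history_span_days(p) >= min_history_days),
    ]
    # De-duplicate on patient_id so each patient is counted once.
    remaining = list({p.patient_id: p for p in patients}.values())
    funnel = []
    for label, keep in criteria:
        remaining = [p for p in remaining if keep(p)]
        funnel.append((label, len(remaining)))
    return funnel


# Invented sample data; "C61" is used as an illustrative diagnosis code.
patients = [
    Patient("p1", {"C61"}, {"drug_x"}, [date(2018, 1, 10), date(2021, 6, 1)]),
    Patient("p2", {"C61"}, set(), [date(2021, 5, 1)]),
    Patient("p3", {"C61"}, {"drug_x"}, [date(2015, 1, 1), date(2018, 1, 1)]),
    Patient("p4", {"C50"}, {"drug_x"}, [date(2019, 1, 1), date(2021, 7, 1)]),
    Patient("p5", {"C61"}, {"drug_x"}, [date(2021, 1, 1), date(2021, 8, 1)]),
]

for label, count in cohort_funnel(patients, "C61", "drug_x", as_of=date(2021, 9, 1)):
    print(f"{label}: {count}")
```

In this toy dataset, five patients shrink to one after all four filters, which is why the starting population usually has to be much larger than the target study size.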

RWD datasets should also provide sufficient depth and completeness of data elements. For example, some life science oncology use cases require the following data variables to be structured fields: diagnosis codes, tumor stage and histology, performance status, biomarker status, standard lab values, clinician orders by date, medications, encounter dates and type, date of death, and patient-reported outcomes. Importantly, much of the useful information in EMR datasets must be extracted from unstructured free-text provider notes and genomic or pathology reports, likely via natural language processing (NLP). Mining free-text notes can be problematic without the appropriate legal and ethical framework. To protect patient privacy, extracted terms should be well defined and limited. Thus, an RWD company’s capacity for data depth and its demonstrated sensitivity to the related ethical considerations should be evaluated together. Another consideration is an RWD vendor’s approach to manual curation and its methodology for calculating derived data elements, such as progression-free survival, which may require access to raw image data and interpretation from a qualified pathologist to be accepted as evidence. Specialties such as oncology may be a better focus for early experimentation in this space, as regulations and frameworks need time to catch up to innovation.
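One way to read “well defined and limited” is as an allowlist: the extraction step returns only a fixed set of named variables and discards everything else in the note. The sketch below is a deliberately simple, rule-based illustration of that idea, not a description of any vendor’s NLP pipeline. The variable names, the regular expressions, and the sample note are assumptions made for the example; production systems would need far more robust language handling and validation.

```python
import re

# Hypothetical allowlist: only these named variables may leave the free text.
ALLOWED_PATTERNS = {
    "performance_status": re.compile(
        r"\bECOG(?:\s+(?:PS|performance status))?\s*(?:of|:|=)?\s*([0-4])\b",
        re.IGNORECASE,
    ),
    "tumor_stage": re.compile(r"\bstage\s+(IV|III|II|I)\b", re.IGNORECASE),
}


def extract_allowed_terms(note: str) -> dict:
    """Return only the allowlisted variables found in a note; all other text is dropped."""
    found = {}
    for name, pattern in ALLOWED_PATTERNS.items():
        match = pattern.search(note)
        if match:
            found[name] = match.group(1).upper()
    return found


# Invented sample note for illustration only.
note = "Mr. Smith, 67M, stage IV adenocarcinoma. ECOG PS: 1. Lives alone, recently widowed."
print(extract_allowed_terms(note))
```

Because the function can only emit the keys in `ALLOWED_PATTERNS`, names and psychosocial details in the note never reach the structured output.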


In the United States, the practice of medicine varies between regions, institutions, care settings, specialties and individual physicians. To create large and generalizable datasets for health care use cases, health tech companies should partner with a broad and diverse network of health care providers. Such networks should contain a significant number and variety of i) clinicians (both by specialty and credential) and ii) institutions (e.g. academic medical centers, community health systems, and independent clinicians). The socio-economic demographics of patients in resulting RWD datasets should reflect the broad general population. Similar considerations apply as RWD is collected globally.

Network-level data is critical to a holistic view of the patient journey. Certain RWD use cases require multidisciplinary data on patients collected over months and years, inevitably across multiple institutions and care providers who use different EMR vendors. Prostate cancer illustrates this complexity, as a patient may start their journey with a primary care provider, be referred to a urologist, and subsequently receive care at an oncology practice. A comprehensive view from diagnosis to treatment and the eventual patient health outcome necessitates access to RWD earlier in the patient journey, possibly even pre-diagnosis, as well as harmonized third-party and payer-level data. Often, this access relies on relationships with payers and contractual permission from patients and/or health systems.

Health record aggregators that work with health insurance companies tend to have a more thorough map of the entire patient journey than independent-practice and health-system EMR vendors or the health systems themselves.[5] However, EMR vendors and health systems have consistent and ongoing access to audit the source data and the ability to directly engage patients and providers. Population health vendors are uniquely positioned to evaluate the health outcomes and costs of a therapeutic intervention and treatment plan as well as engage with providers through education and workflow applications. These vendors require deeper, multidisciplinary data sources and have the ability to harmonize RWD to generate RWE insights.

Patient consent should be obtained whenever possible and practical, as it is the most ethical approach and allows for the most complete datasets. Clinicians and health systems should be informed of any use of their data, regardless of contractual requirements, and review by an Institutional Review Board should become standard practice.


The sheer amount of medical data becoming available is catalyzing many innovative software applications that can distinguish signal from noise and detect patterns currently invisible to the human eye and mind. For example, machine learning tools, such as causal or deep learning software, are now being applied to RWD, pathology slides and radiology images, revolutionizing medical insights and enabling novel diagnostic modalities, such as Software-as-a-Medical-Device (SaMD). Similarly, the combination of these analytical tools with gene expression and patient outcomes data is facilitating the identification of novel drug targets for currently untreatable diseases.

Additionally, EMR-integrated Software-as-a-Service (SaaS) applications are reintroducing RWE insights and support tools to relevant and willing clinicians and patients. These solutions range from clinical decision support and care management to medication adherence, drug safety monitoring, patient-reported outcomes, patient consent, and targeted clinical trial recruitment.

When possible, traditional evaluation methods should be used to validate these applications (e.g., AUROC, the area under the receiver operating characteristic curve, for software diagnostic tools; wet-lab validation for target identification analytics; and clinical trials for certain SaaS solutions).
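For readers less familiar with AUROC, it can be read as the probability that a diagnostic tool assigns a higher score to a randomly chosen positive case than to a randomly chosen negative case. The sketch below computes it directly from that definition, counting ties as half. The function name and the sample labels and scores are assumptions made for illustration; in practice an established statistics library would normally be used.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative cases.

    labels: iterable of 0/1 ground-truth values (1 = disease present).
    scores: iterable of model outputs, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied, 0 otherwise.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented example: two positive and two negative cases.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of positive and negative cases.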

Applying the Digital DNA Framework to Life Science Use Cases and Health Tech Companies

For any given life science use case, the ability of a health tech company to add value depends upon i) the Digital DNA signature required to address the use case, ii) the health tech company’s current and potential Digital DNA signature, iii) the relevant time frame, and iv) the company’s respect for patient privacy and transparency. Examples of life science use cases, required Digital DNA signatures, and health tech companies that fulfill those requirements are discussed below.
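The comparison between a use case’s required signature and a company’s signature (items i and ii above) can be pictured as a per-dimension gap analysis. The sketch below is purely illustrative and is not the authors’ scorecard: the 0 to 5 scale, the function names, and the example scores are all assumptions, and time frame and privacy (items iii and iv) are left out because they do not reduce naturally to a single number.

```python
DIMENSIONS = ("data", "networks", "applications")


def signature_gaps(required, offered):
    """Per-dimension shortfall; 0 where the company meets or exceeds the requirement."""
    return {d: max(0, required[d] - offered[d]) for d in DIMENSIONS}


def fit_score(required, offered):
    """Share of the required signature that the company covers, from 0.0 to 1.0."""
    total_required = sum(required[d] for d in DIMENSIONS)
    if total_required == 0:
        return 1.0
    shortfall = sum(signature_gaps(required, offered).values())
    return 1 - shortfall / total_required


# Hypothetical scores on an assumed 0-5 scale.
use_case = {"data": 5, "networks": 3, "applications": 2}
company = {"data": 4, "networks": 5, "applications": 2}

print(signature_gaps(use_case, company))
print(round(fit_score(use_case, company), 2))
```

Capping each gap at zero means that surplus strength in one dimension, such as a very broad network, does not mask a shortfall in another, such as data depth.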

Credit: Nobi_Prizue, iStock by Getty Images

To identify novel biomarkers, the fields of genomics, transcriptomics, proteomics, pathomics, and radiomics all require large amounts of heterogeneous RWD combined with highly detailed molecular data or images to train machine learning models. Such biomarkers can inform clinical trial design and enable new diagnostics. This tailored approach can reduce trial costs while improving the probability of success. Furthermore, the same dataset may be used to identify new drug targets. While public biobanks are critical for such a purpose, RWD companies with a particular focus on gathering detailed molecular and image data provide a useful supplement.

M2GEN, a spin-out from Moffitt Cancer Center, is a health tech company that has been rigorously addressing these use cases for many years. The company has developed a large and comprehensive clinico-genomic database through a network that gains lifetime consent from every patient involved. The database consists of longitudinal clinical data linked to deep molecular data, including whole exome and transcriptome data from matched germline and tumor samples.

M2GEN’s platform includes clinico-genomic data from more than 300,000 consented patients across its Oncology Research Information Exchange Network® (ORIEN®), an alliance of 18 cancer centers.[6],[7] ORIEN members benefit from the interaction as they derive data to accelerate their research, supporting peer-reviewed publications and grants, and fueling potential new scientific discoveries to benefit patients. Importantly, the ORIEN network is geographically diverse, including sites throughout the US, and racially diverse, including the Morehouse School of Medicine, which was originally part of Morehouse College, a historically Black college. ORIEN members consent and enroll patients into the IRB-approved protocol, Total Cancer Care®, and share de-identified data with each other and with life science companies via M2GEN. Patient consent allows M2GEN to track patients across health care providers and throughout their lifetime. Merck (NYSE: MRK) previously employed biomarker epidemiology data obtained from M2GEN to help select cancer indications when designing clinical trials for its PD-1 inhibitor pembrolizumab.[8]

Once large RWD datasets have been gathered, analytical software applications are required to curate, clean, and integrate them and apply artificial intelligence to discover novel drug targets and prognostic and drug response biomarkers. GNS Healthcare has a proprietary causal machine learning and simulation platform that transforms clinico-genomic patient data streams into “virtual patients” that can be simulated to reveal the causal drivers of disease progression and individual patient response to drugs.[9] This enables the rapid “screening” of hundreds to thousands of genes and proteins to discover and stratify novel drug targets. The in silico patient models are also used to predict the patient-by-patient results of the efficacy arm of a clinical trial in order to select optimal patient subpopulations for drugs in development.

Following identification of novel ‘omic biomarkers, the subsequent commercialization of related diagnostics back into the next-gen health system will often rely on cloud-based software systems with minimal clinician supervision, to provide real-time clinical reports, diagnoses, prognoses, and treatment recommendations. This so-called “Software-as-a-Medical-Device” (SaMD) field is expected to grow rapidly in the next several years.

PathAI is one such SaMD company. It provides artificial intelligence-powered research tools and services for pathology. It is currently developing software-based pathomic companion diagnostics (CDx) to determine whether a particular medical treatment is optimal for a given patient. Its digitization platform promises substantial improvements to the speed, accuracy, reproducibility, and cost of pathology analysis, leveraging modern approaches in machine and deep learning.[10] True to the Digital DNA framework, the company recently acquired an independent anatomic pathology laboratory services provider to introduce digital diagnostic products directly to practicing pathologists.[11]

Credit: aleksey-martynyuk, iStock by Getty Images

Real-world Efficacy and Safety of Pharmaceuticals

By analyzing the outcomes of off-label prescription drug use in patient medical records, RWD can suggest the efficacy (or lack thereof) of a drug in a therapeutic indication or novel subpopulation not yet approved by the FDA. It can also serve as a surveillance system for drug safety. During the COVID-19 pandemic, RWD was screened for hints of the “off-label” efficacy of certain already marketed therapeutics to treat the viral infection and its symptoms. One such study examined the outcomes of using histamine antagonists and aspirin in 22,560 COVID-19 patients using real-world data from the COVID-19 Research Network supplied by TriNetX, an RWE company.[12] In April 2019, the FDA approved a label expansion for Pfizer’s (NYSE:PFE) Ibrance (palbociclib) for a rare form of metastatic male breast cancer based on real-world use in male patients from the IQVIA insurance database, Flatiron Health EMR database, and Pfizer’s own global safety database.[13] Realizing RWD’s potential, the FDA recently created a framework for evaluating its use to support the approval of a new therapeutic indication for an already approved drug.[14] It also released related guidance documents and launched pilot programs with several RWE companies.[15],[16] This use case requires a broad network reach to identify patterns of off-label use or safety events in patients who are geographically dispersed or have rare diseases.

Ciox Health, a health information technology company that recently merged with Datavant, a privacy and data linking company, is particularly well suited for this use case.[17] It provides outsourced health information management services with release of information, coding, denials, registry, EMR conversion, and audit solutions. Payers and providers often engage Datavant/Ciox to aggregate health records across institutions, geographies, and EMR systems. The combined entity’s network now includes relationships with 2,000 U.S. hospitals, 15,000 clinics, 700,000 providers, 13 of the 15 largest U.S. payers, 30 life science companies, 70 academic institutions and non-profits, and 75 state, local, and federal government agencies.[18] The company claims this network is the nation’s largest health data ecosystem, enabling patients, providers, payers, health data analytics companies, patient-facing applications, government agencies, and life science companies to securely exchange authorized patient-level data.

Datavant/Ciox is also developing specialized software applications to process and structure the natively unstructured data in a patient clinical record in a privacy-preserving manner. Its proprietary biomedical Natural Language Processing (BioNLP) software extracts targeted variables from within the unstructured elements of the medical chart, such as surgical, pathology, and imaging reports, as well as discharge summaries, physician notes, and other clinician-scribed narrative text. It claims to do so while maintaining patient privacy, regulatory compliance, and the trust of health systems and clinicians. Furthermore, with the merger, it now claims access to proprietary tokenization technology to connect disparate clinical and non-clinical datasets in a HIPAA-compliant manner, which is of particular significance for mortality data.
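The general idea behind tokenization-based linkage can be sketched with a keyed hash: each data holder replaces identifying fields with the same irreversible token, and datasets are then joined on the token rather than on names or birth dates. The example below is a generic illustration using HMAC-SHA256 and is not a description of Datavant’s proprietary technology. The field choices, the normalization rules, the key, and the sample records are all assumptions; real deployments also involve key management, legal review, and de-identification safeguards that are out of scope here.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and keep only letters and digits."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed.lower() if ch.isalnum())


def make_token(first: str, last: str, dob_iso: str, key: bytes) -> str:
    """Derive a keyed, irreversible token from identifying fields."""
    message = "|".join([normalize(first), normalize(last), dob_iso]).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def link(records_a, records_b, key: bytes):
    """Join two record lists on their tokens and return pairs of non-identifying payloads."""
    index = {
        make_token(r["first"], r["last"], r["dob"], key): r["payload"]
        for r in records_a
    }
    pairs = []
    for r in records_b:
        token = make_token(r["first"], r["last"], r["dob"], key)
        if token in index:
            pairs.append((index[token], r["payload"]))
    return pairs


# Invented records: a clinical dataset and a mortality dataset.
KEY = b"demo-key-not-for-production"
clinical = [{"first": "José", "last": "O'Neil", "dob": "1950-02-03", "payload": {"stage": "IV"}}]
mortality = [{"first": " jose ", "last": "ONEIL", "dob": "1950-02-03", "payload": {"deceased": True}}]

print(link(clinical, mortality, KEY))
```

Normalizing before hashing lets small formatting differences between sources still produce the same token, while the secret key prevents anyone without it from recomputing tokens from known names.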

In April 2020, Datavant/Ciox announced a collaboration with LabCorp (NYSE:LH) on a comprehensive U.S.-based COVID-19 patient data registry to rapidly construct research-grade clinical cohorts for a wide range of epidemiological, clinical and observational uses.[19] Such a registry would leverage LabCorp’s de-identified datasets from its COVID-19 testing platform and additional longitudinal medical record data, compiled by Datavant/Ciox. In February 2021, Datavant/Ciox also announced a decentralized real-world COVID-19 trial in partnership with HealthVerity to surface and accelerate insights on vaccine safety.[20]

Syapse is another health system vendor that provides RWE insights. Syapse’s network includes close relationships with US health systems, providers, life sciences companies, and regulators. It integrates EMR, LIMS, PACS, and other systems to bring together structured and unstructured data. It can identify and address gaps in care and monitor outcomes, so that insights from root cause analysis can potentially drive successful interventions for better care.

Credit: Marcela Vieira, iStock by Getty Images

Patient and Clinician Education

To complete the learning health system loop, once RWD has been transformed into validated and recognized RWE, it can be reintroduced to the health care system, via SaaS applications integrated with EMRs or other platforms, to educate patients and clinicians. Such communication channels must follow regulatory guidelines and should have opt-in default settings.

Navigating Cancer, a health tech company particularly suited for this use case, has developed one of the most broadly deployed oncology patient relationship management (PRM) platforms in the U.S. Over 2,600 providers use the Navigating Care platform to care for over 1.5 million patients.[21] The PRM solution allows care teams to streamline interactions with patients, deliver real-time information, and efficiently triage and provide the appropriate coordinated care. PRM platforms are also used for pharmaceutical-sponsored education programs to enhance patient care. For example, Navigating Cancer assessed the impact of a pemetrexed educational program for lung cancer patients from 58 oncology practices delivered from 2014–2016 via its patient portal.[22] The program provided sequential messages about pemetrexed therapy and management of side effects. The study found that the program reached 47% of registered patients and was associated with a 50-day increase in duration of therapy and a 13% increase in one-year survival.

Modernizing Privacy Regulations

For good reason, industrial use of RWD is a contentious issue affecting numerous stakeholders, including, but not limited to, patients, clinicians, health systems, governments, payers, researchers, biobanks, and digital health and life science companies. While virtually everyone is invested in seeing RWD used to improve patient outcomes, there are valid concerns regarding patients’ rights to privacy, data ownership, transparency, and the general ethics of human subject research.

Because the large-scale aggregation of data is still an emergent phenomenon, the norms, practices, and standards for how best to transparently collect, protect, and share data with other stakeholders are not yet institutionalized. The Health Insurance Portability and Accountability Act of 1996 (HIPAA), the main federal legislation addressing the privacy of patient health data in the U.S., was written long before the advent of widespread EMRs and cloud-aggregated databases. Without updated federal legislation and appropriate guidelines, the RWE ecosystem is largely on its own to develop and follow best practices.

Evaluating a company’s commitment to privacy is difficult. Even de-identified datasets may retain some possibility of reidentification.[23] The thought that health records could become visible to others can have a chilling effect on patients’ willingness to disclose important information to their clinicians. Primary care records are particularly sensitive due to the large volume of psychosocial information contained in clinician notes and implied in lab tests and medication lists. HIPAA must be modernized to provide a clear, legally enforceable rulebook for the RWE ecosystem.

Companies that touch U.S. health care records in any capacity should also be carefully scrutinized to exclude direct or indirect ownership by foreign bad actors. The Foreign Investment Risk Review Modernization Act of 2018 (FIRRMA) expanded the authority of the Committee on Foreign Investment in the United States (CFIUS) to review certain non-controlling foreign investments involving “sensitive personal data of United States citizens that may be exploited in a manner that threatens national security.”[24] The rule covers transactions involving companies, funds, private equity firms and other investors that qualify as “foreign persons.”


Humanity can benefit greatly from the possibilities afforded by aggregated RWD. But a future where human health data is used to inform industry practice largely depends on enforcement of modernized privacy regulations and how well industry shoulders the responsibility for managing such sensitive information. For society to fully realize the opportunities of the learning health system, there is a present need to cultivate an independent, diverse and sustainable ecosystem of trusted Digital DNA vendors. These companies are setting the bar for data excellence, developing broad networks, and building applications to improve patient outcomes.

Author affiliations

Author contributions

AR and DR developed the conceptual Digital DNA framework. JS and AR developed a related scorecard for the Digital DNA signatures of life science use cases and health tech companies. AR, JS, and AH applied the framework to the analysis of use cases and companies and revised as needed. AR drafted the initial paper. AR, DR, JS, and AH revised the paper.



Copyright 2021 Green Shoots Consulting, LLC and Merck Global Health Innovation Fund, LLC. For reprints, contact the corresponding author.


[1] About Learning Health Systems. Content last reviewed May 2019. Agency for Healthcare Research and Quality, Rockville, MD.

[2] Adler-Milstein, Julia, and Jha, Ashish K. “HITECH Act Drove Large Gains In Hospital Electronic Health Record Adoption,” Health Affairs 36, no. 8 (2017): 1416–1422.

[3] Khozin, Sean et al. “Characteristics of Real-World Metastatic Non-Small Cell Lung Cancer Patients Treated with Nivolumab and Pembrolizumab During the Year Following Approval,” Oncologist. 2018 Mar;23(3):328–336. Epub 2018 Jan 9.

[4] Miksad RA, Abernethy AP. “Harnessing the Power of Real-World Evidence (RWE): A Checklist to Ensure Regulatory-Grade Data Quality,” Clin Pharmacol Ther. 2018 Feb;103(2):202–205. Epub 2017 Dec 6.

[5] Stern, Jacob “The Fragmentation of Health Data” originally posted Jul 31, 2018 on by Travis May

[6] “CD&R, Merck GHI, and McKesson Ventures Invest in Innovative Oncology Data and Informatics Company M2GEN,” M2GEN Press Release, 17 Mar, 2021.

[7] M2GEN website accessed 9/1/2021

[8] “M2GEN Announces New Collaboration with Merck to Advance Cancer Therapies,” M2GEN Press Release, 17 Mar, 2021.

[9] GNS Healthcare website accessed 9/1/2021

[10] PathAI website accessed 9/1/2021

[11] “PathAI Enters Into Clinical Diagnostics Through Acquisition of Poplar Healthcare Management,” PathAI Press Release, 26 Jul, 2021.

[12] Mura, Cameron et al. “Real-world evidence for improved outcomes with histamine antagonists and aspirin in 22,560 COVID-19 patients,” Signal Transduct Target Ther. 2021 Jul 14;6(1):267.

[13] “U.S. FDA Approves Ibrance® (Palbociclib) for the Treatment of Men with HR+, HER2-Metastatic Breast Cancer,” Pfizer Press Release, 4 Apr, 2019.

[14] “Framework for FDA’s Real-World Evidence Program,” U.S. FDA, December 2018.

[15] “Submitting Documents Using Real-World Data and Real-World Evidence to FDA for Drugs and Biologics Guidance for Industry,” U.S. FDA, May 2019. Docket ID: FDA-2019-D-1263.

[16] “Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products Guidance for Industry,” U.S. FDA, September 2021. Docket ID: FDA-2020-D-2307.

[17] “Datavant and Ciox Health Announce Merger, Creating the Largest Neutral and Secure Health Data Ecosystem,” Datavant Press Release, 9 Jun, 2021.

[18] Ciox Health website accessed 9/1/2021

[19] “Labcorp and Ciox Health Enter Collaboration to Create Comprehensive Patient Data Registry,” LabCorp Press Release, 9 Apr, 2020.

[20] “Ciox Health and HealthVerity Announce Partnership to Enable First-in-Kind Decentralized Clinical Trial,” Ciox Health Press Release, 10 Feb, 2021.

[21] “Patients Empowered with Relevant Education Experience an Improved Patient-Provider Relationship and Better Health Outcomes,” Navigating Cancer Press Release, 27 May, 2021.

[22] Howard, Scott C. et al. “Increasing the Duration and Efficacy of Intravenous Chemotherapy Using a Patient-Centered Digital Education Program,” Journal of Clinical Oncology. 2017 May;35(15): e18025. Epub 2017 May 30.

[23] Mandl, K, Perakslis, E “HIPAA and the Leak of ‘Deidentified’ EHR Data,” N Engl J Med. 2021 Jun;384:2171–2173.

[24] Mooney, Austin “Spotlight On Sensitive Personal Data As Foreign Investment Rules Take Force” McDermott Will & Emery White Paper, published online 2/18/2021 and accessed 9/1/2021