Drug Development Safety and Pharmacovigilance

Modern drug safety and pharmacovigilance dates back to the thalidomide disaster. The growing cost of drug development is driving pharmaceutical companies to identify potential safety issues earlier in the process. Valuable safety data are available in public databases and internal sources, but much of this is unstructured text. Linguamatics NLP transforms this text into actionable data that can be visualized and analyzed at every stage of the drug development process.


Modern drug safety and pharmacovigilance began in the early 1960s following the thalidomide disaster. Thalidomide, a drug widely used to treat morning sickness, was first marketed in 1957 and resulted in over 10,000 children in 46 countries being born with birth defects.

In the wake of thalidomide, the World Health Organization (WHO) set up the Programme for International Drug Monitoring (PIDM). Today, PIDM has more than 150 participating countries, with over 16 million adverse drug reaction (ADR) reports collected.

In parallel, the United States Congress passed the Kefauver-Harris Drug Amendments (1962). For the first time, these laws required drug makers to prove that their drugs were both effective and safe before the Food and Drug Administration (FDA) would approve them for sale.

These changes were the start of a wave of regulatory changes designed to ensure reliable evidence of drug safety, efficacy and chemical purity prior to market release.

While a lack of clinical efficacy is the major cause of drug attrition, a poor safety profile is also a significant factor in the failure of drugs during development. This may occur at any stage in the development process, from initial drug discovery to preclinical trials, clinical trials and post-marketing surveillance (pharmacovigilance).

Drug Development Pipeline

The diagram below shows the timing of the main safety assessment studies conducted during the drug development process.


Timing of main safety assessment studies during a general drug development process

Drug Discovery

Typically this involves highly parallelized processes for making new compounds and testing them in high-throughput screens. From this, a certain number of hits will be obtained and these will be whittled down by further analysis into a set of leads.

Preclinical Trials

This stage includes in vitro and in silico testing of the compounds to identify the best members of a series to take into clinical trials. It is also where the first stages of safety assessment are undertaken via toxicity testing in animals. If a drug shows promise in preclinical trials, a pharmaceutical company can request permission from the FDA to begin testing in humans (known as First-in-Man or FIM trials) by filing an Investigational New Drug (IND) application. In Europe, the European Medicines Agency (EMA) equivalent is an Investigational Medicinal Product Dossier (IMPD).

Phase 1 Clinical Trials

Phase 1 clinical trials are concerned primarily with establishing how a drug is absorbed, distributed, metabolized and excreted by the human body - a study known as pharmacokinetics (PK).

The dosage range of a new drug is determined by administering increasingly large doses to one or more groups of subjects, who are closely monitored for harmful side effects. The goal is to learn the maximum tolerated dose, that is, the highest dose that does not produce unacceptable side effects.

Phase 2 Clinical Trials

Phase 2 clinical trials are designed to answer the question: does drug X improve disease Y?

Subjects in a phase 2 clinical trial may benefit from their participation if they receive an active treatment. Most phase 2 clinical studies are randomized, with subjects assigned randomly (by chance and not by choice) to receive the experimental drug, a standard treatment or a placebo (a harmless, inactive substance). Since larger numbers of patients receive a treatment in phase 2 clinical trials, there is a greater chance to observe and compile information on potential side effects.

Phase 3 Clinical Trials

Phase 3 clinical trials are conducted at multiple centers with hundreds or thousands of patients for whom the drug is intended. Testing on large patient populations allows continuous generation of data on a drug’s safety and efficacy. As in phase 2, most phase 3 clinical trials are randomized and blinded. A drug in this phase can be studied for several years.

New Drug Application (NDA)

Once the Phase 3 clinical trials are complete, a pharmaceutical company can request FDA approval to market the drug within the USA. This is called a New Drug Application (NDA). The NDA contains all the scientific data that the company has gathered during clinical trials. Within the EU, pharmaceutical companies submit a Marketing Authorization Application (MAA).

Regulatory (GLP) Toxicology

These studies are performed to Good Laboratory Practice (GLP) standards and comprise those required by local regulatory authorities or ethics committees before a drug can be given to human subjects for the first time. Regulatory toxicology also covers the studies required to support a New Drug Application (NDA).

Post-market Surveillance (Pharmacovigilance)

Overseen by the FDA or EMA, post-market surveillance is designed to ensure the safety of a drug once it is released onto the market. Pharmacovigilance ensures that regulators monitor any adverse events reported by the public, who may be suffering from a wide range of medical conditions (far wider than those to which the drug would have been exposed during clinical trials).

Shortcomings of the Drug Development Process

There are a number of problems associated with the drug development process as it stands, but they can be distilled into three factors: cost, time and effectiveness.

Drug Development Costs

For years, the pharmaceutical industry has relied on development cost estimates from the Tufts Center for the Study of Drug Development (TCSDD), the most recent of which (2015) puts the cost of bringing a drug from discovery to market launch at $2.9 billion. This includes actual out-of-pocket costs averaging $1.4 billion, opportunity costs of nearly $1.2 billion and the cost of post-market studies amounting to $312 million.
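As a quick check on how the headline figure is built up, the three components quoted above can simply be added together. The snippet below is only an arithmetic illustration using the rounded values from the text; the variable names are our own.

```python
# Cost components quoted in the text, in millions of US dollars (rounded).
out_of_pocket = 1_400   # actual out-of-pocket costs
opportunity = 1_200     # opportunity costs ("nearly $1.2 billion")
post_market = 312       # post-market studies

total = out_of_pocket + opportunity + post_market
print(f"Total: ${total / 1000:.1f} billion")
```

Because the inputs are rounded, the sum only approximates the published estimate.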

Time to Market

On average, it takes 12 years to bring a new drug to market. This is one reason why the process is so expensive, as capital costs are magnified by the amount of time that money is tied up in a single project.


Almost 90% of drugs that start testing in patients don’t reach the market because they are unsafe or ineffective, and there is a pressing need to improve the understanding of safety issues during drug discovery, development and after launch. A successful drug development process demands that potential safety issues are recognized as early as possible.

At all stages of drug development, critical data is being generated and retrieved from unstructured text. Project teams need the most comprehensive view of all relevant data, and text mining plays a key role in access to actionable insights for drug safety.

Data Sources on Drug Development Safety

Valuable information can be gained by improved analysis and understanding of the wide variety of information available to researchers and clinicians. This data can come from internal data sources such as study reports, project reviews, clinical investigator brochures and case reports, or from external sources such as:-

  • MEDLINE®, which contains biomedical literature from around the world.
  • DailyMed, which publishes up-to-date drug labels for health care providers and the general public.

Linguamatics provides access to a range of ready-to-access content options (including all of the above) via our OnDemand or Connected Data Technology services. This content is linked to the relevant domain specific ontologies, and updated weekly to ensure that you always have up-to-date information. In addition, valuable information may be found on patient forums, social media and conference abstracts.

The ability to search intelligently across the hundreds of thousands of pages contained in these disparate sources is a prerequisite for efficient decision support. However, much of this data will only be available as unstructured text.

Transforming Unstructured Text

The Linguamatics Natural Language Processing (NLP) platform can transform unstructured text into actionable (structured) data that can be rapidly visualized and analyzed at every stage of the drug development process.

Query Flexibility

Linguamatics I2E can query and extract drug names, dosages, adverse events, safety indicators and context such as species and tissue (among other things) from large document collections. Queries can be defined using keywords and linguistic expressions. I2E has powerful table processing which enables accurate data to be extracted from preclinical toxicity or drug safety summaries.
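To make the idea of turning free text into structured records concrete, here is a minimal Python sketch. It is not I2E and does not use any Linguamatics API; it assumes tiny hand-made vocabularies (`DRUGS`, `ADVERSE_EVENTS`) and a simple regular expression for doses, all of which are hypothetical and chosen purely for illustration.

```python
import re

# Tiny hand-made vocabularies, for illustration only (assumptions, not real term lists).
DRUGS = {"olanzapine", "thalidomide"}
ADVERSE_EVENTS = {"weight gain", "sedation", "neuropathy"}

# A number followed by a dose unit, e.g. "10 mg" or "2.5 mg/kg".
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg/kg|mg|g)\b", re.IGNORECASE)


def extract_records(text):
    """Return one record per sentence that mentions both a drug and an adverse event."""
    records = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        drugs = sorted(d for d in DRUGS if d in lowered)
        events = sorted(e for e in ADVERSE_EVENTS if e in lowered)
        doses = [f"{value} {unit.lower()}" for value, unit in DOSE.findall(sentence)]
        if drugs and events:
            records.append({"drugs": drugs, "doses": doses, "events": events})
    return records


sample = (
    "Olanzapine 10 mg daily was associated with weight gain. "
    "No events were reported for placebo."
)
records = extract_records(sample)
```

Sentence-level co-occurrence like this only suggests a possible link between a drug and an event; deciding whether the text actually asserts a relationship needs the kind of linguistic analysis described in this section.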

Domain Ontologies

By plugging in ontologies, queries will automatically find synonyms or search for entire classes of items. Pre-defined smart queries can also be used; these are templates that hide complexity from the user by only exposing specific pre-defined options. In addition, queries can be combined to answer a set of questions simultaneously, for example by providing systematic profiles of compounds.
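The effect of an ontology on a query can be sketched in a few lines of Python. The mini-ontology below is hypothetical and hand-written for illustration (real domain ontologies are far larger and richer), and the `expand` function is our own name, not part of any product API. It shows how a single concept can be expanded into its synonyms plus everything in its class.

```python
# Hypothetical mini-ontology: each concept lists its synonyms and child concepts.
ONTOLOGY = {
    "hepatotoxicity": {
        "synonyms": ["liver toxicity", "liver injury"],
        "children": ["cholestasis", "hepatic steatosis"],
    },
    "cholestasis": {"synonyms": ["bile stasis"], "children": []},
    "hepatic steatosis": {"synonyms": ["fatty liver"], "children": []},
}


def expand(concept):
    """Return the concept, its synonyms, and the same for all descendant concepts."""
    terms = []
    stack = [concept]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the ontology
        seen.add(current)
        entry = ONTOLOGY.get(current, {"synonyms": [], "children": []})
        terms.append(current)
        terms.extend(entry["synonyms"])
        stack.extend(entry["children"])
    return sorted(set(terms))


search_terms = expand("hepatotoxicity")
```

A search for "hepatotoxicity" would then also match documents that only mention, say, "fatty liver", without the user having to list every variant by hand.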

A Choice of Formats

I2E presents the structured results in a choice of formats. These include web pages with results classified by drug, dosage and adverse event. Microsoft Excel spreadsheets and XML files are also supported, as are network graphs that allow the user to visualize direct and indirect relationships between entities. Results can also be presented in formats suitable for export to third-party databases.


Using I2E’s unique strengths, you can provide comprehensive, precise and accurate data to end-users: capture precise relationships, find concepts in their appropriate context, normalize and extract quantitative data and data in embedded tables.

Answering Key Questions on Drug Safety

I2E's query functionality lets us ask a variety of direct and indirect safety-related questions. What information are we looking for, and what questions need answering to ensure that a drug will be safe?

Direct Questions

  • Is there toxicity in the liver, heart, cellular metabolism or reproductive organs?
  • Is the drug carcinogenic or genotoxic or mutagenic?
  • What is the therapeutic dose level for this drug? Is it close to an unsafe dose level?
  • Has this compound been tested before, perhaps for a different disease?
  • Is the pathway completely predictable?
  • Are there any drug-drug interactions?
  • Is the drug metabolized in the body?

Indirect Questions

  • Is there any relationship between known toxins and the current drug?
  • Is there any relationship between toxic pathways and the drug in question?
    Potential common factors could include:-
    • Proteins
    • Tissues
    • Genes
    • Substructure
    • Distribution
  • Are there any toxicity-associated biomarkers related to the drug?
  • Is an animal viable without a knock-out gene that is implicated in a pathway?

Drug Similarity Questions

  • Do similar compounds have known toxicities?
  • Are there toxicology studies on similar compounds?
    Similarity could be:-
    • Structural: 2D or 3D
    • Mechanistic
    • Based on end effect

The flexibility of I2E enables you to answer these questions precisely by tailoring queries to extract exactly the information you require and then combining the results into the desired format.


Assessing Safety during Preclinical Trials

During preclinical trials, the critical question pharmaceutical developers seek to answer is whether the new drug is safe to be tested in humans, which is also the primary concern of regulatory agencies.

The safety assessment starts early and - as candidates advance from discovery to preclinical trials - more extensive tests have to be performed in vitro, in silico (the rapidly growing discipline of computational toxicology) and in vivo to gain a better understanding of their pharmacodynamic (PD) and pharmacokinetic (PK) behavior and to establish their pharmacologic, safety and toxicity profile.

Preclinical trials are the final hurdle prior to clinical trials, and only 12% of the candidates advance to Phase 1 clinical trials. From this point, the success rate increases at each clinical phase, with 17% at Phase 1, 27% at Phase 2, 58% at Phase 3 and 82% at the registration phase. On average, drug discovery and preclinical development take three to six years and account for 30% of costs per drug. Source: DiMasi, J.A., “Cost of Developing a New Drug Briefing,” Tufts Center for the Study of Drug Development, Nov 18, 2014.

Early Identification of Potential Safety/Toxicity Issues

I2E enables early identification of potential safety issues. This is crucial to optimizing investment in R&D and avoiding failures later in the drug development process. Assessment and prediction of the potential for adverse side effects from a particular compound, lead series or biologic molecule is important both in drug discovery and preclinical trials, as well as later clinical trials.

However, much of the relevant safety information is locked up in textual documents, either from medical and scientific literature or within internal study reports. The challenge is therefore to mine available literature sources - both internal and external - and to find and extract relevant information in a timely manner. In addition, I2E can be used to hypothesize indirect relationships, for example by finding mechanisms linking a compound to an adverse effect through a protein or biological process. 
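One way to picture how an indirect relationship can be hypothesized is to treat extracted facts as edges in a graph and look for a chain connecting a compound to an adverse effect. The sketch below is an illustration under stated assumptions, not a description of how I2E works internally: the relations are invented placeholders, and `find_path` is a plain breadth-first search that we name here for the example.

```python
from collections import deque

# Hypothetical extracted relations: (subject, relation, object). Placeholder names only.
RELATIONS = [
    ("compound X", "inhibits", "protein A"),
    ("protein A", "regulates", "process B"),
    ("process B", "is implicated in", "adverse effect C"),
    ("compound X", "binds", "protein D"),
]


def find_path(start, goal, relations):
    """Breadth-first search for the shortest chain of relations from start to goal."""
    graph = {}
    for subject, relation, obj in relations:
        graph.setdefault(subject, []).append((relation, obj))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(node, relation, neighbour)]))
    return None  # no chain connects the two entities


path = find_path("compound X", "adverse effect C", RELATIONS)
```

Here the compound is linked to the adverse effect through a protein and a biological process. Any such chain is a hypothesis to be reviewed by a scientist, not evidence of causation.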

Linguamatics I2E can query and extract dosages, drug names, tissues and safety indicators from large document collections, to answer questions such as:-

  • Are there known adverse or toxic events related to this compound, in model animal species or in humans? In the liver, heart, reproductive organs?
  • Is the therapeutic dose level for this drug close to an unsafe dose level previously reported?
  • What is the pathway of the drug target, and are there known safety issues around this pathway or target?
  • Are there any drug-drug interactions? How is the compound metabolized in the body?
  • Are there similar compounds, either structurally or by mechanism of action, that have known adverse events?

According to our customers, I2E reduces the time spent searching and reviewing safety and toxicity information by up to 70%.

Extraction of numerical information to investigate toxicity issues 

Download the application note at the end of this page to learn more on Text Search and Mining for Safety/Toxicity.

Preclinical Safety and Toxicity

I2E's powerful linguistic processing capabilities can be used to extract numeric dosage data associated with toxicities and adverse reactions from MEDLINE® abstracts. Read this application note to learn how to use I2E to identify potential safety issues for drugs at specific dosages.

Financial Advantages of Early Detection

Extracting potential safety and toxicity issues from existing literature can be enormously beneficial financially if done early enough in the drug development process. Access this webinar on extracting safety and toxicity knowledge with I2E and learn how Linguamatics' agile text mining platform can aid this process significantly.

Legacy Safety Reports - Case Study

Better access to the high-value information in legacy safety reports remains a cherished goal for those involved in preclinical safety assessment. Locked away in these historical data are answers to questions such as: Has this particular organ toxicity been seen before, in what species and with what chemistry? Could new biomarker or imaging studies predict the toxicity earlier? What compounds could be leveraged to help build capabilities?

Learn how Merck developed an I2E workflow that extracted the key findings from safety assessment ante- and post-mortem reports, final reports and protocols.

Chronic Toxicology Studies - Case Study

Learn how Merck used I2E to review 32 chronic toxicology studies in non-rodents (22 studies in dogs and 10 in non-human primates) and 27 chronic toxicology studies in rats dosed with Merck compounds to determine the frequency at which additional target organ toxicities are observed in chronic toxicology studies as compared to sub-chronic studies of 3 months in duration.

Clinical trials provide the evidence on which every new drug is approved. Approximately 25% of drugs that fail in clinical trials do so for safety reasons, for example, exhibiting unacceptable toxicity levels in patients.

The regulatory landscape for clinical trials has also evolved with increased requirements for risk management plans, risk evaluation and minimization strategies. As the industry transitions from passive to active safety surveillance, there will be a greater demand for monitoring data from a wide variety of sources, much of which will only be available as unstructured text.

Many pharmaceutical and biotech companies are realizing the benefits of applying Artificial Intelligence (AI) technologies such as NLP to internal clinical safety data silos, as well as publicly available clinical safety-related data. The following three case studies provide some examples.

Searching Clinical Investigator Brochures for Safety Assessments

One top-10 pharmaceutical company has provided access to their silo of Clinical Investigator Brochures using Linguamatics Portals. This means the safety assessment teams can, with just a couple of clicks, get answers to questions such as:

  • Which compounds have we ever studied that have shown kidney effects in any species?
  • Which compounds in our pipeline have toxicology studies with >1 non-rodent species?
  • Which compounds cause liver enzyme elevations in both preclinical and clinical studies?

Behind the scenes, the documents have been processed and indexed, ontologies applied, appropriate document regions identified and NLP queries run; key concepts such as chemicals and diseases have been standardized, and mutations and dosages normalized.

This workflow, and I2E’s easy-to-access Portal, means that the safety data otherwise buried in valuable internal reports can be re-used to answer critical questions for other drug development teams.

Implementing an Adverse Event Reporting System (AERS)

Agios is in the process of implementing an Adverse Event Reporting System (AERS). The case study shows how Agios is using data generated by Linguamatics NLP text mining to help understand the progression of adverse events in ongoing clinical trials.



Comparing Real World Evidence and FDA Data - Case Study

AstraZeneca set out to test the hypothesis that Real World Evidence (RWE) adverse reaction information from patients could effectively supplement information from clinical trials, giving scientists and clinicians a more accurate, well-rounded description of safety data for particular treatments. The Linguamatics I2E text-mining solution played a key role in rapidly assembling comparable data sets.

AstraZeneca's case study illustrates how they were able to use I2E to examine differences in the frequency of nausea adverse reactions (ARs) between data self-reported online by patients and data from FDA Drug Product Labels.



Data Mining at the FDA

Linguamatics I2E was included in a recent paper from the FDA on the “Use of data mining at the Food and Drug Administration” (Duggirala et al, 2016, J Am Med Inform Assoc).

This FDA review covers a very broad range of text and data mining approaches, across both FDA databases (e.g. MAUDE, VAERS) and external data such as MEDLINE®, clinical study data and social media.

The FDA review describes the use of I2E “to study clinical safety based on chemical structure information contained in medical literature. Linguamatics I2E enables custom searches using natural language processing to interpret unstructured text. The ability to predict the clinical safety of a drug based on chemical structures is becoming increasingly important, especially when adequate safety data are absent or equivocal.”

Learn more about Data Mining at the FDA.

Pharmacovigilance and Post-market Surveillance

In recent years, regulatory authorities such as the FDA and EMA have placed an increased emphasis on the safety of marketed drugs, particularly the tracking and reporting of adverse events.

Pharmaceutical companies are expected to regularly screen the worldwide scientific literature for potential adverse drug reactions, at least every two weeks. The use of text mining and other tools to streamline the literature review process for pharmacovigilance is more crucial than ever in order to ensure patient safety, without overloading drug safety teams.

Systematic literature searching for pharmacovigilance signals

Eric Lewis (Safety Development Leader at GlaxoSmithKline) talked at a recent Linguamatics Text Mining Summit about the challenges of reviewing medical literature for safety signals. For example, he searched the literature for a sample of just 20 marketed products across a 300-day period. Eric found that there were on average 60 new references per day (with a total of over 11,000 documents), and that manual review time was 1.2 to 1.6 minutes per abstract. Extrapolating to a typical pharmaceutical company product portfolio of 200 marketed products, he showed that this volume of literature would take over 2,200 hours to review, which is hugely time-consuming.
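The extrapolation can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes, as a simplification of our own, that the number of documents scales linearly with the number of products and that the 11,000-document total is the right starting point.

```python
sample_products = 20
sample_documents = 11_000          # "over 11,000 documents" across the 300-day period
portfolio_products = 200
minutes_per_abstract = (1.2, 1.6)  # reported range of manual review time

# Assumption: document volume grows in proportion to the number of products.
portfolio_documents = sample_documents * portfolio_products // sample_products

# Convert total review minutes to hours for the low and high ends of the range.
low_hours, high_hours = (portfolio_documents * m / 60 for m in minutes_per_abstract)
print(f"{portfolio_documents:,} documents, {low_hours:,.0f} to {high_hours:,.0f} hours")
```

At the lower bound of 1.2 minutes per abstract this gives roughly the 2,200 hours quoted, for the same 300-day window; at 1.6 minutes per abstract the figure is higher still.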


Leveraging NLP for Adverse Event Reporting

Eric went on to describe how, by using NLP, it is possible to use linguistic processing to focus more specifically on potential drug-related adverse events. This is achieved by searching for the most appropriate relationships between a drug and an adverse event.

Eric presented a specific search to find the adverse events associated with the selective androgen receptor modulator, Enobosarm (an investigational drug also known as MK-2866 or Ostarine). Searching manually across literature databases, Eric pulled out 132 abstracts, but manual review (3 hours) found that only about 30% of these were relevant and actually described an association with an adverse event. However, using I2E to index and query for a precise and accurate pattern took just a few minutes, clearly demonstrating the value of I2E within a pharmacovigilance application.


Pharmacovigilance, clinical safety and regulatory

Organizations increasingly require auditable methods to check whether signals indicating adverse or toxicity-related events appear in clinical records. If events do occur, companies need to be able to react fast to find out if they are caused by the drug, are side effects of the original disease or are the result of external factors.

Text mining can be used both to review clinical reports and to understand potential mechanisms of action. In particular, the Linguamatics NLP platform has been used to highlight different adverse event profiles at different dosages. Researchers can also search medical records for particular adverse effects, and code the effects found. The linguistic capabilities of I2E are critical in providing a distinction between new effects, a history of an effect, the lack of an effect, or the lack of a history of an effect.
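To show why this distinction matters, and why keyword matching alone cannot make it, here is a deliberately simple rule-based sketch. It is not I2E's method: the cue lists are short illustrative assumptions, the function name `classify` is our own, and real clinical text needs far more robust handling of scope and phrasing.

```python
import re

# Cue phrases are illustrative assumptions; a production rule set would be far larger.
NEGATION = re.compile(r"\b(no|denies|without|not)\b", re.IGNORECASE)
HISTORY = re.compile(r"\b(history of|previous|prior)\b", re.IGNORECASE)


def classify(sentence, effect):
    """Label a mention of `effect` with one of four assertion categories."""
    position = sentence.lower().find(effect.lower())
    if position == -1:
        return "not mentioned"
    # Only inspect the text before the effect mention, where cues usually appear.
    prefix = sentence[:position]
    negated = bool(NEGATION.search(prefix))
    historical = bool(HISTORY.search(prefix))
    if negated and historical:
        return "no history of effect"
    if historical:
        return "history of effect"
    if negated:
        return "no effect"
    return "new effect"


labels = [
    classify("Patient developed nausea after the second dose.", "nausea"),
    classify("Patient has a history of nausea.", "nausea"),
    classify("Patient reports no nausea.", "nausea"),
    classify("No history of nausea.", "nausea"),
]
```

All four sentences contain the word "nausea", yet each falls into a different one of the four categories, which is exactly the distinction a safety reviewer needs.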

Reviewing Medline reports to monitor potential adverse effects 

Structuring Adverse Event Reports

Linguamatics I2E provides powerful linguistic tools to capture numerical data. The screenshot below shows structured normalized dose information for Olanzapine, extracted from recent MEDLINE® abstracts.

The ability to capture both dose details and causal relationships (e.g. does drug x cause adverse event y) in structured form means that safety assessment teams can quickly and effectively review literature reports for safety signals.
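Normalizing doses means expressing values written in different units on a single scale so that they can be compared. The following is a minimal sketch of that idea, assuming a small set of mass units and a simple pattern; the names `TO_MG` and `normalize_doses` are hypothetical, and this is not how I2E is implemented.

```python
import re

# Conversion factors to milligrams for a few common mass units (illustrative subset).
TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001, "ug": 0.001}

# A number followed by one of the units above, e.g. "10 mg", "0.5 g", "250 mcg".
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|ug|mg|g)\b", re.IGNORECASE)


def normalize_doses(text):
    """Return every dose found in the text, converted to milligrams."""
    return [float(value) * TO_MG[unit.lower()] for value, unit in DOSE.findall(text)]


doses_mg = normalize_doses("Olanzapine was given at 10 mg, then 0.5 g, then 250 mcg.")
```

Once all doses are in milligrams they can be sorted, filtered or charted together, regardless of how each abstract happened to write them.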

The Application of Text Mining and Real World Evidence at Pharmaceutical Company Pfizer

Real world data (RWD) is becoming increasingly available in the fields of pharmacovigilance and post-market surveillance. It provides top pharma companies, such as Pfizer, with a rich seam of data for real world data analytics. AI text mining technology, such as the Linguamatics NLP platform, allows users to extract structured information from diverse real world data sources, including Voice of the Customer (VoC) data from patient surveys, customer complaints databases and focus groups, and to regularly monitor specialized literature for potential adverse events not reported on drug labels.

For instance, Pfizer used Linguamatics NLP to categorize and tag call center feeds for key metadata such as caller demographics and reasons for calling. This allowed them to deepen their understanding of drug-disease associations by looking for information on pre-existing conditions and relating these to the potential reported side effects versus ADRs.

