Background

ResProt

JMIR Res Protoc

JMIR Research Protocols

1929-0748

JMIR Publications

Toronto, Canada

v8i5e11448

31066711

10.2196/11448

Protocol

The Adverse Drug Reactions From Patient Reports in Social Media Project: Protocol for an Evaluation Against a Gold Standard

Kuter

Nicola

Eysenbach

Gunther

Kass-Hout

Taha

Cossin

Sébastien

Arnoux-Guenegou

Armelle

MSc, PhD 1

http://orcid.org/0000-0003-3427-7086

Girardeau

Yannick

MSc, MD 1 2

http://orcid.org/0000-0003-3980-0104

Chen

Xiaoyi

MSc, PhD 1

INSERM U1138 - Team 22 Information Sciences to Support Personalized Medicine Centre de Recherche des Cordeliers

15 Rue de l'école de Médecine

Esc D, 1e étage

Paris, 75006

France 33 71196369 chenxiaoyi619@gmail.com

http://orcid.org/0000-0002-7378-5158

Deldossi

Myrtille

MSc 3

http://orcid.org/0000-0001-8268-3922

Aboukhamis

Rim

M Pharm 4

http://orcid.org/0000-0003-3066-0125

Faviez

Carole

MSc 5

http://orcid.org/0000-0002-1500-0236

Dahamna

Badisse

MSc 6

http://orcid.org/0000-0003-0762-2518

Karapetiantz

Pierre

MSc 1

http://orcid.org/0000-0001-6486-9838

Guillemin-Lanne

Sylvie

MSc 3

http://orcid.org/0000-0003-3528-2514

Lillo-Le Louët

Agnès

MD, Prof Dr 4

http://orcid.org/0000-0001-8135-7340

Texier

Nathalie

M Pharm 5

http://orcid.org/0000-0003-3749-254X

Burgun

Anita

MD, PhD 1 2 7

http://orcid.org/0000-0001-6855-4366

Katsahian

Sandrine

MD, PhD, Prof Dr 1 7 8 9

http://orcid.org/0000-0002-7261-0671

1 INSERM U1138 - Team 22 Information Sciences to Support Personalized Medicine Centre de Recherche des Cordeliers

Paris

France 2 Département d'Informatique Médicale Hôpital Européen Georges-Pompidou Assistance Publique - Hôpitaux de Paris

Paris

France 3 Innovative Projects - Text Mining Expert System

Paris

France 4 Centre Régional de Pharmacovigilance Hôpital Européen Georges-Pompidou Assistance Publique - Hôpitaux de Paris

Paris

France 5 Kappa Santé

Paris

France 6 Service d'Informatique Biomédicale, D2IM Centre Hospitalier Universitaire de Rouen

Rouen

France 7 INSERM U1138 - Team 22 Information Sciences to Support Personalized Medicine Paris Descartes University, Sorbonne Paris Cité

Paris

France 8 Clinical Research Unit Hôpitaux Universitaires Paris Ouest Hôpital Européen Georges-Pompidou Assistance Publique - Hôpitaux de Paris

Paris

France 9 INSERM CIC1418 Clinical Epidemiology Hôpital Européen Georges-Pompidou

Paris

France

Corresponding Author: Xiaoyi Chen chenxiaoyi619@gmail.com

052019

07 05 2019

8 5

e11448

29 6 2018 1 8 2018 16 11 2018 21 12 2018

©Armelle Arnoux-Guenegou, Yannick Girardeau, Xiaoyi Chen, Myrtille Deldossi, Rim Aboukhamis, Carole Faviez, Badisse Dahamna, Pierre Karapetiantz, Sylvie Guillemin-Lanne, Agnès Lillo-Le Louët, Nathalie Texier, Anita Burgun, Sandrine Katsahian. Originally published in JMIR Research Protocols (http://www.researchprotocols.org), 07.05.2019.

2019

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.

Background

Social media is a potential source of information on postmarketing drug safety surveillance that still remains unexploited nowadays. Information technology solutions aiming at extracting adverse reactions (ADRs) from posts on health forums require a rigorous evaluation methodology if their results are to be used to make decisions. First, a gold standard, consisting of manual annotations of the ADR by human experts from the corpus extracted from social media, must be implemented and its quality must be assessed. Second, as for clinical research protocols, the sample size must rely on statistical arguments. Finally, the extraction methods must target the relation between the drug and the disease (which might be either treated or caused by the drug) rather than simple co-occurrences in the posts.

Objective

We propose a standardized protocol for the evaluation of a software extracting ADRs from the messages on health forums. The study is conducted as part of the Adverse Drug Reactions from Patient Reports in Social Media project.

Methods

Messages from French health forums were extracted. Entity recognition was based on Racine Pharma lexicon for drugs and Medical Dictionary for Regulatory Activities terminology for potential adverse events (AEs). Natural language processing–based techniques automated the ADR information extraction (relation between the drug and AE entities). The corpus of evaluation was a random sample of the messages containing drugs and/or AE concepts corresponding to recent pharmacovigilance alerts. A total of 2 persons experienced in medical terminology manually annotated the corpus, thus creating the gold standard, according to an annotator guideline. We will evaluate our tool against the gold standard with recall, precision, and f-measure. Interannotator agreement, reflecting gold standard quality, will be evaluated with hierarchical kappa. Granularities in the terminologies will be further explored.

Results

Necessary and sufficient sample size was calculated to ensure statistical confidence in the assessed results. As we expected a global recall of 0.5, we needed at least 384 identified ADR concepts to obtain a 95% CI with a total width of 0.10 around 0.5. The automated ADR information extraction in the corpus for evaluation is already finished. The 2 annotators already completed the annotation process. The analysis of the performance of the ADR information extraction module as compared with gold standard is ongoing.

Conclusions

This protocol is based on the standardized statistical methods from clinical research to create the corpus, thus ensuring the necessary statistical power of the assessed results. Such evaluation methodology is required to make the ADR information extraction software useful for postmarketing drug safety surveillance.

International Registered Report Identifier (IRRID)

RR1-10.2196/11448

social media drug-related side effects and adverse reactions natural language processing data mining MedDRA Racine Pharma

Introduction Background

The detection of new adverse drug reactions (ADRs) has been based on postmarketing surveillance by government agencies (national authorities) derived from spontaneous reporting by health care professionals and patients. The US Food and Drug Administration (FDA) and the European Medicines Agency collect ADR case reports through the FDA’s Adverse Event Reporting System (FAERS) [1-4] and the EudraVigilance system, respectively. These systems are useful tools for drug agencies, which mine these huge amounts of structured data to look for new safety concerns that might be related to a marketed product [5-8].

In the last 20 years, internet and social media have become an integral part of people’s daily life. Social media is now often used to communicate with other persons having the same health concerns and share information regarding their health conditions, feelings, medications, and many other aspects [9]. Social media is, therefore, a potential provider of information on ADRs. In 2005, the International Society of Drug Bulletins already recognized such a use: Patient reporting systems should periodically sample and evaluate the scattered drug experiences patients report on the internet [10]. This new source of knowledge captured the interest from health informatics, statistics, and public health researchers. Although in its infancy, related scientific literature increased in the last decade [4,11-13].

Objectives

In this context, the French Ministry of Industry funded and launched the Adverse Drug Reactions from Patient Reports in Social Media (ADR-PRISM) project. The objective of ADR-PRISM was to make available the contents about ADR, informal and embedded in forums and discussions on the Web, to the actors involved in drug safety (drug companies, agencies, and pharmacovigilance experts). In the end, the tool developed in the ADR-PRISM project should generate hypotheses about new or poorly documented adverse events (AEs). To reach its goals, the ADR-PRISM consortium gathers a company developing text mining softwares (Expert System), a company specialized in pharmaco-epidemiology (Kappa Santé), 3 academic research groups providing expertise in medical informatics and statistics (National Institute of Health and Medical Research & Cordeliers Research Centre: umrs 1138 team 22 dedicated to Information Sciences to support Personalized Medicine and Laboratory in Medical Informatics and Knowledge Engineering in e-Health, and Biomedicine informatics, Service Catalogue and Index of French Language medical websites SIBM-CISMeF), 2 experts in pharmacovigilance (regional center of pharmacovigilance), as well as Vidal group that supplies the drug database used in most drug prescription systems in France.

From a natural language processing (NLP) perspective, we considered ADR as relationships between drug and AE concepts. On the basis of that, the NLP-based Skill Cartridge for pharmacovigilance developed by Expert System for ADR-PRISM includes a relation extraction module based on (named) entity recognition combined with rules and regular expressions. Before applying it to large collections of forum discussions, we designed a protocol to assess the performance of this ADR information extraction module. The objective of this study is to present this protocol.

Methods Synopsis

With the objective of sharing a methodology that guarantees the confidence in the results in ADR-PRISM, we followed a way of reasoning inspired by the standards widely adopted in clinical research such as the Standards for Reporting Diagnostic Accuracy studies [14]. These items are displayed in Textbox 1.

Synopsis of the protocol.

Rationale

Nowadays, patients extensively use social media. They report on the adverse events they feel because of their medications (further called adverse drug reaction [ADR]) on health forums. Promising studies exist on the extraction of ADR information from social media with natural language processing (NLP) or machine learning tools.

The consortium Adverse Drug Reaction from Patient Reports in Social Media (ADR-PRISM) has been constituted to create a tool extracting the ADR information from social media. Teams specialized in text mining NLP and pharmacovigilance participate in the consortium.

Before applying the ADR-PRISM’s tool on a larger scale (millions of posts) to draw conclusions on ADRs, our goal was to adapt the principles adopted in clinical research to assess the ADR-PRISM. For example, evaluation was done against a gold standard based on manual annotation of the posts.

Primary objective

To estimate robustly the performance of the ADR information extraction tool against gold standard.

Primary expected results

Precision, recall, and f-measure for ADR extraction.

Secondary objectives

To verify the quality of the gold standard and to evaluate the impact of various conditions (eg, different granularities and sentence constraint of the tool) on the performance of the ADR information extraction tool.

Secondary expected results

Kappa metrics for interannotator agreement; and precision, recall, and f-measure for ADR extraction in various conditions.

Database

Posts from the Kappa Santé Detec’t database published between January 1, 2007, and October 28, 2016.

Eligibility criteria

Posts from the Kappa Santé Detec’t database that contain at least one drug’s or molecule’s (active substance) name, and posts randomly selected from the rest of the Kappa Santé Detec’t database; a list of drugs and adverse events (AEs) of interest is established by drug safety experts; and ADR: any explicit and positive relationship between a drug and a potential AE where either the drug or the AE or both belong to the list.

Index test method

ADR information extraction tool: this tool classifies each co-occurrence of drug and AE in a post as positive ADR or negative ADR or no ADR, and maps the drug and the AE expression to Racine Pharma and Medical Dictionary for Regulatory Activities (MedDRA), respectively.

Reference method

Gold standard: manual annotations of the co-occurrence of drug and AE as positive ADR or negative ADR or no ADR including the mapping of the drug and the AE expression with a Racine Pharma and a MedDRA entity, respectively. Manual annotations will be provided by 2 annotators with experience in medical terminology.

Sample size

A 95% CI, with a total width of 0.1, is used to determine the sample size.

Statistical analyses

Recall, precision, and f-measure calculations for evaluation of the performances of the ADR extraction information tool and interannotator agreement for the evaluation of the gold standard with a Cohen kappa.

Design and Ethics Design

This project is based on retrospective data collected among threads of discussion accessible on social media (messages from French health forums). A total of 2 experts in pharmacovigilance helped delineating the project. The objective is twofold: (1) focus on certain pharmaceutical products of interest and identify the related ADRs and (2) detect the emergence of general potential problem in public health. On this basis, the extraction of ADR information is expected to (1) perform well on specific concepts for use case study and (2) be able to extract all potential concepts correctly for screening purpose.

To evaluate the ADR information extraction, we therefore build up the global process in 2 phases (Figure 1). In phase 1, we implemented an iterative process of validation and improvement of the Skill Cartridge for pharmacovigilance. The objectives were (1) to correct the most frequent errors and to train the ADR information extraction module and (2) to estimate the performance indicators and the time required for manual annotation for phase 2 sample size determination. Then, in phase 2, we will conduct a definite assessment of the performance of the ADR information extraction module.

In the rest of the paper, we use the term AE for medical events that are present in text for a potential ADR, whereas the term ADR is used when a relation between a drug and an AE is established.

Figure 1

Study overview. ADR: adverse drug reaction.

Ethics

This research did not involve experiment on either humans or animals. Ethics and guarantee of data privacy constituted an integral and dedicated working group set up for the ADR-PRISM project. To comply with national regulations, we first registered data collection to the French Data Protection Agency (CNIL for Commission Nationale Informatique et Liberté), which is known as a normal notification in technical terminology, on December 23, 2015. We later submitted an authorization request on March 30, 2016, regarding data analysis and validation of the approach adopted about ethics and confidentiality to the same agency. ADR-PRISM’s consortium detailed approach is explained in the study by Bousquet et al [15].

The ADR-PRISM project was supported by an ethics advisory board, which was composed of scientists with different scientific expertise: drug safety, public health, and ethics. Their role was to give independent advice regarding ethical issues to the project consortium. The board approved the whole study design, including the protocol presented in the paper.

Adverse Drug Reactions From Patient Reports in Social Media Adverse Drug Reaction Extraction Resources

We used 3 lexicons to represent drug information and AEs. We codified AEs with the Medical Dictionary for Regulatory Activities (MedDRA) v15.1 [16] classification. Racine Pharma, maintained by the SIBM-CISMeF [17], provided an extensive source of drug names that covered all medications available on the French market, including brand names and active ingredients. Racine Pharma entries were mapped to the Anatomical Therapeutic Chemical (ATC) system [18], which was used as a classification system for drugs.

The corpus of messages was extracted from the Detec’t database [19], a database developed by Kappa Santé that collects messages from several French forums using a Web crawler. We limited ourselves to a list of 5 major health forums in French obtained using netscoring [9]. Message extraction was based on a named entity recognition module using a drug lexicon designed by Kappa Santé and a fuzzy matching algorithm. The lexicon was based on Racine Pharma, the ATC classification, and a list of medications extracted from the French National Health Insurance database.

Drugs, Adverse Events, and Adverse Drug Reaction Extraction

ADR extraction was performed in 2 steps: first, named entity recognition modules were used to identify drug names and AEs in posts, and then, a relation extraction algorithm was applied to these entities (Figure 2).

Figure 2

Adverse drug reaction pipeline. ADR: adverse drug reaction; ATC: Anatomical Therapeutic Chemical (in ATC classification); MedDRA: Medical Dictionary for Regulatory Activities.

Drugs and Adverse Events Extraction

Regarding drugs, Expert System has developed a named entity recognition module capable of identifying words or tokens (occurrences) listed in Racine Pharma and extracting their positions from forum posts.

We mapped the extracted expressions from Racine Pharma to the chemical substance level (5th level) of the ATC (last updated version: December 19, 2016) [18]. Considering that the same active ingredient could be found under different trade names, we pooled all mapped expressions within the same ATC chemical subgroup (4th level) to define the drug concepts.

Expert System has developed another module based on MedDRA to identify the AEs and extract their position in the posts. Fuzzy-matching and enrichment of thesaurus with colloquial terms enabled to take into account the characteristics of the posts on health forums.

Drug and Adverse Event Relationship Extraction (Skill Cartridge for Pharmacovigilance)

An NLP module has been developed by Expert System to capture the specific information regarding the relationship established between a drug and an AE by the post’s author. The module combined a set of rules and regular expressions, with a Patient lexicon constituted to ensure that the post’s author set out a situation of a person taking the drug and experiencing the symptom and excludes general information regarding a drug or an AE. This lexicon contains terms such as I, me, my, cause, test, take, feeling, because of, provoke, intolerance, and allergies. This NLP module also included negation detection.

The algorithm can be summarized as follows:

Text was split into sentences based on the punctuation mark.

For each sentence, (named) entity recognition modules extracted drugs and AEs.

For each pair of drug and AE co-occurring in a sentence, if the NLP module extracted specific information regarding a relation, then the co-occurrence was classified as follows:

Explicit and positive ADR (eg, Abilify causes me such fears that I cannot concentrate to read, work, etc...);

Explicit and negative ADR (eg, I took some Doliprane and I didn’t feel any nausea);

If no specific information was identified, the co-occurrence was classified as no ADR (eg, Usually, Focalin’s adverse effects are loss of appetite, insomnia, and naso-pharyngitis).

Phase 1: Iterative Improvement of the Adverse Drug Reaction Information Extraction Module

Phase 1 consisted of an iterative process of validations and improvements of the tool. The review on the corpus was constituted by manual annotation of the messages by 2 experts in pharmacovigilance. The manual annotation process used a tool developed by Expert System for in-house testing purposes [20]. This tool did not integrate functionalities for blind manual annotation. No gold standard was established at this step. All along the process, both annotators could see the drugs, AEs, and ADRs extracted by the tool, and their task consisted of validating or invalidating these extractions and, if needed, complementing the annotations performed by the system. The annotators were asked to (1) check all drugs and all AEs in the post, irrespective of whether they have been involved in an ADR relationship, and (2) identify all drug-AE relationships in the same sentence. The output of phase 1 was a first estimation of precision, recall, and f-measure, assessing the capacity to extract information regarding drugs, AEs, and ADRs. More details about the corpus construction, evaluating concept selection, sample size determination, and the preliminary results of this phase were reported in the study by Chen et al [21].

Phase 2: Evaluation Corpus for Evaluation Evaluation Dataset

For phase 2, the selection of datasets is given in Figure 3.

On October 28, 2016, Kappa Santé Detec’t database contained about 23 millions of posts, and approximately 21 million posts were published after January 1, 2007. Messages about drugs, for which marketing authorization has been withdrawn before the cut-off date, were considered as some AEs might appear at long term or patients might discuss about old drugs and their safety.

The software developed by Kappa Santé for purpose of Web discussions collection offered the possibility of preidentifying the posts containing at least one pharmaceutical or molecule name. We considered the dataset of all posts containing at least one preidentified compound or brand name and combined it with a complementary set of posts randomly sampled from the remainder of the Kappa Santé Detec’t database.

The drug, AE, and ADR extraction modules developed for ADR-PRISM were executed on both sets of posts, and the output of the extraction module was considered as the evaluation dataset.

Figure 3

Phase 2 messages selection flowchart. ADR: adverse drug reaction; ATC: Anatomical Therapeutic Chemical (in ATC classification); MedDRA: Medical Dictionary for Regulatory Activities.

Drug Selection

The evaluation focused on 3 categories of drugs: (1) the most frequently extracted drugs in the corpus, (2) the most sold drugs in France, and (3) the drugs of interest of pharmacovigilance because of recent safety alerts. For the last 2 categories, the frequencies of the recognized named entities were also taken into account to cope with the necessary sample size.

The most sold drugs in 2013 in France are listed below, in their French commercial names, including the 10 most sold drugs with mandatory medical prescription for dispensation and the 10 most sold drugs with optional medical prescription for dispensation [22]:

Mandatory medical prescription for dispensation: Levothyrox, Uvedose, Lamaline, Dafalgan codéiné, Méthadone, Crestor, Pivalone, Seresta, Emla patch, and Seroplex

Optional medical prescription for dispensation: Doliprane, Dafalgan, Efferalgan, Kardégic, Spasfon, Gaviscon, Dexéryl, Météospasmyl, Biseptine, and Eludril.

The 2 experts additionally worked out a list of pairs of 1 drug concept and 1 AE concept corresponding to alerts that emerged in the last years (Table 1). This list of use cases has been further employed as a basis for the concept selection for evaluation. The concept selection in phase 1 (iterative improvement) [21] was based on the same consideration; therefore, those concepts on which the extraction tool performed well should be excluded from the selection for evaluation.

Table 1

List of drug and adverse event concepts selected as pharmacovigilance use cases.

Active drug ingredient: French names (date of marketing in cases where drug has been withdrawn)	Adverse events of interest	Corresponding Medical Dictionary for Regulatory Activities term	Media coverage or alert date
Gardasil	Autoimmune disease, complex regional pain syndrome, and postural orthostatic tachycardia syndrome	Autoimmune disorders (HLGT^a) + complex regional pain syndrome (PT^b) + postural orthostatic tachycardia syndrome (PT)	2013
Meningitec	Quality defect and any adverse event	All terms	2015
Methylphénidate: Ritaline, Concerta, Medikinet, Quasym, and Ritaline	Neurological and cardiac effects	Cardiac disorders (SOC^c) + nervous system disorders (SOC)	2009 (recommendation)
Acétate de cyprotérone: Diane 35, Cypropharm (2010-2013), Elleacnelle (2010-2013), Evepar, Holgyme (2002-2015), Lumalia (2003-2013), Minerva, and Arielle (1997-2011)	Thromboembolic risk	Embolism and thrombosis (HLGT)	2012 (alert)
Fluoxétine: Prozac, Elindra (2001-2008), Fluctine (1998-2008), and Fontex (2001-2008)	All adverse events	All terms	Mostly 2013
Méthadone	Prolonged time from the start of the Q wave to the end of the T wave during electrocardiogram (approximates the time taken from when the cardiac ventricles start to contract to when they finish relaxing)	QT interval prolongation	2007
Sofosbuvir: Harvoni, Epclusa, and Sovaldi	Bradycardia	Bradycardia (PT)	2013
Codeine: Codenfan, Codoliporane, Migralgine, and Néocodion et Prontalgine	Respiratory disorders	Respiratory disorders (HLGT)	2012 re-evaluation
Hydroxyzine: atarax	Rhythm disorders	Cardiac arrhythmias (HLGT)	2014
Nicorandil: Adancor and Ikorel	Skin ulceration	Skin ulcer (HLT^d)	2012
Midodrine: Gutron	Hypertension	Blood pressure increased (PT)	2013
Galantamine: Reminyl, Galanthen (2012-2015), and Galema (2011-2015)	Skin reaction	Skin and subcutaneous tissue disorders (SOC)	2014
Crizotinib: Xalkori	Heart failure	Heart failures HLGT	2014
Valproate de sodium: Depakine, Imaslav, and Micropakine	Teratogenic effects	Congenital, familial, and genetic disorders (SOC)	2012?
Isotrétinoine: Curacné, Acnetrait, Contracné, Procuta, and Roaccutane	Teratogenic effects + psychiatric disorders	Congenital, familial, and genetic disorders (SOC) and psychiatric disorders (SOC)	2002
Fingolimod: Gilenya	Leukoencephalopathy	Toxic leukoencephalopathy (PT)	2014
Aripiprazole: Abilify	Suicidal behavior	Suicidal and self-injurious behavior (HLGT)	2013 (words of warning: 2016)

^aHLGT: high-level group term.

^bPT: preferred term.

^cSOC: system organ class.

^dHLT: high-level term.

Adverse Event Concepts Selection

The AE concepts for evaluation corresponded to preferred term (PT) level in the MedDRA hierarchy, and we focused similarly on 2 categories: (1) the most frequent extracted PTs and (2) the PTs of interest of pharmacovigilance (Table 1), guided also by the frequencies of the recognized entities in the corpus to cope with the necessary sample size.

Adverse Drug Reaction Definition

We defined an ADR as any explicit and positive relationship between a drug and an AE where the drug (respectively the AE) belonged to the list of concepts previously selected.

Corpus for the Evaluation (Random Sample From Evaluation Dataset)

Finally, a random sample from the evaluation dataset (ie, among all the ADR extracted by the ADR information extraction module) constituted our corpus for evaluation.

Gold Standard Adverse Drug Reactions From Patient Reports in Social Media Manual Annotation Platform

We developed a Web application dedicated to manual annotations valuable as gold standard, for the project purposes in phase 2.

This application was based on Java Servlets and JavaScript libraries and connected in Java Database Connectivity to the dataset of posts selected for the gold standard annotation. We used a self-completion mechanism to attach portions of message to Racine Pharma and MedDRA terms. We used drag-and-drop operations to fill in a table containing the manual review of the co-occurrences, where each line was dedicated to 1 co-occurrence. For each co-occurrence, the drug and the AE were dragged-and-dropped in the first and the second column, respectively. In the third column, the manual annotator was given a drop-down menu that presented 3 possibilities for defining the co-occurrence: explicit positive ADR, explicit negative ADR, or no ADR. We finally offered the possibility to export this table containing the manual annotations obtained via the application. The interface of this application is shown in Figure 4.

Figure 4

Interface of the manual annotation platform. ID: identifier; MedDRA: Medical Dictionary for Regulatory Activities.

Gold Standard Construction

An annotator guideline has been established to standardize manual annotations. A total of 2 experts in medical terminologies have annotated all ADR relationships for each post in the evaluation corpus, and then, the manual annotation was considered as gold standard. Different from phase 1, here we expected that the annotators annotated all causal relationships between a drug and an AE, even if the drug and AE were not in the same sentence, which will allow us further evaluate the impact of the sentence boundary constraint of the extraction tool.

Each annotator annotated 55.1% (261/474) of the posts in the evaluation corpus; thus, a subset of 10.1% (48/474) of the posts was annotated by both of them. The interannotator agreement could be assessed on these double-annotated posts. In case of disagreements, the 2 annotators discussed to achieve a consensus. If a lot of disagreements had occurred, the annotators would have been asked to learn again the guideline and revise their annotations. All along the process, both annotators were double-blinded: first, from the ADRs identified by the ADR information extraction module, and second, from the other annotator’s annotations.

Statistical Analyses Primary Analysis

To assess the efficacy of the ADR information extraction module as compared with the gold standard, we will globally calculate the recall, precision, and f-measure.

In a post, if the ADR information extraction module identifies the same expression from Racine Pharma at the same position; and the same AE expression at PT level from MedDRA hierarchy at the same position; and the same type of relationship, as did the gold standard, then we will count the extracted ADR as a true positive (Figure 5).

Figure 5

Recall, precision, and f-measure definitions. ADR: adverse drug reaction; ADR-PRISM: Adverse Drug Reaction from Patient Reports in Social Media; FN: false negative, manually annotated ADR by the gold standard that is not extracted by the ADR information extraction tool; FP: false positive, positive explicit relationship extracted by the ADR information extraction tool that is not manually annotated by gold standard; TN: true negative; TP: true positive.

F-Measure Gold Standard’s Evaluation

The interannotator agreement will be assessed by a hierarchical Kappa [23]. A hierarchical Kappa will enable to take into account the situation where the 2 annotators will disagree in either the level or the expression in ATC or MedDRA terminologies but will agree on higher levels than the annotated ones. Separate calculations for interannotator agreements by hierarchical Kappa on drug and AE expressions will be provided as well.

Secondary Analyses

We will complete these principal results with the following analyses.

We will take into account MedDRA and ATC granularities by reproducing precision, recall, and f-measure at each level of the hierarchies (system organ class [SOC], high-level group term [HLGT], high-level term [HLT], and preferred term [PT] of MedDRA and anatomical main group, therapeutic subgroup, pharmacological subgroup, and chemical subgroup of ATC).

We will provide a relaxed definition of true-positive ADRs combining the following 3 conditions: (1) the positions in the post of the extracted expressions for drug and AE both match the positions of the drug and AE expressions manually annotated, (2) the extracted and the manually annotated expressions for AE from MedDRA hierarchy are found in the same SOC levels, and (3) the classification of the identified relationship match the classification of the manually annotated relationship. We will calculate recall, precision, and f-measure with this definition for ADRs, drug, and AE identification separately.

We will provide a global estimation of the performance of ADR information extraction module by taking into account other types of messages. These messages are those without information on at least one ADR concept, that is, messages in which (1) ADR information is extracted but is not concerned by the drug- or AE-selected concepts, (2) no ADR information is extracted but only co-occurrences of information about 1 drug and 1 AE, (3) only AE information is extracted, (4) only drug information is extracted, and (5) neither drug nor AE information is extracted. After having sampled each type of message described above, we will then calculate global recall or precision or f-measure on all types of messages by marginal calibration.

All analyses will be performed in R software [24].

Sample Size Calculation

Recall can be estimated from a random sample of ADR concepts annotated by the gold standard, whereas precision can be estimated from a random sample of ADR concepts extracted by the ADR information extraction module. However, creating the gold standard on the whole evaluation dataset before sampling in it for the recall was an overwhelming task. At the same time, a vast majority of messages did not contain any information on the drug concept. Thus, we expected a very low proportion of ADR concepts identified by the ADR information extraction module in the evaluation dataset. Therefore, the recall estimated from a random sample of ADR concepts annotated by the gold standard was mathematically approximated by the recall estimated on ADR concepts sampled for the precision.

The sample size required for different expected precision or recall is provided in Table 2. By hypothesizing a global recall of 0.5, we needed at least 384 identified ADR concepts to obtain a 95% CI, having a total width of 0.1. The final determination of sample size was based on the previously estimated precision and recall.

Table 2

Sample size required according to the expected precision and recall and total width of confidence intervals (CIs).

Range of CI	Expected precision or recall values
	0.1	0.2	0.3	0.4	0.5	0.6	0.7	0.8	0.9
90% CI	35	61	81	92	96	92	81	61	35
95% CI	138	246	323	369	384	369	323	246	138

Differences Between Phase 1 and Phase 2

The main differences between phase 1 and phase 2 are threefold (Table 3). First, in phase 1, the annotations targeted all drugs and AEs even if not involved in an ADR relation, whereas in phase 2, annotation focuses on the ADR relations. Second, in phase 2 (and only in phase 2) a gold standard based on blind manual annotation by 2 experts in medical terminologies was established and the interannotator agreement was measured. In contrast, the manual annotations of phase 1 were not blind, and most of the work consisted of validating the automatic annotations; thus, the interannotator agreement was not addressed. Finally, in phase 2, additional parameters such as the granularity and the segmentation of the messages (sentence boundaries) will be analyzed. The precision, recall, and f-measure will be calculated with a gold standard at the message level and might be compared with precisions, recalls and f-measures calculated at the sentence level and for all MedDRA and ATC granularities.

Table 3

Common features and differences between phase 1 and phase 2.

Issues	Phase 1	Phase 2
Evaluation focus	Drugs and adverse events entity recognition	Adverse drug reaction relationships recognition
Gold standard (manual annotations)	Not blind	Blind
Gold standard annotators	Pharmacovigilance experts	Experts in medical terminologies
Interannotator agreement	No	Yes
Granularity issue	No	Yes
Sentence boundary issue	No	Yes

Results

At the time of publishing this evaluation protocol, several steps of this project would have been completed. The corpus for evaluation has been constituted, and the NLP tools have identified and extracted the information about ADR inside this corpus. We selected the ADR concepts and constituted the samples of the entities necessary to set up the gold standard. The 2 annotators completed the annotations process.

Data analyses for assessing the interannotator agreement and the performance of the ADR information extraction module as compared with the gold standard are ongoing, and the study results are expected internally before the end of 2018.

Discussion Summary

With the objective of using messages on health forums as a new source for drug safety, the systems developed to mine the messages must follow strict evaluation rules. This is even more important as these systems might be used to support decision making. The protocol presented in this paper has been designed to evaluate the ADR information extraction Skill Cartridge developed in ADR-PRISM in a pharmacovigilance perspective.

Study Strengths

The protocol presented in this paper is an attempt to apply clinical research–level guidelines [14] to the assessment of such systems.

First, we paid particular attention to the establishment and the validation of the gold standard. The gold standard was established by 2 trained specialists of medical terminologies who annotated selected messages. Manual annotation was performed in a double-blind manner, namely, from both the other annotator and the ADR identified by the ADR information extraction module. The annotated corpus, therefore, constituted a valuable gold standard. On the basis of the study by Sarker et al [4] (see Multimedia Appendix 1), we could find 13 studies where a gold standard, that is, manual annotations used for evaluation, was implemented [25-37]. Conversely, in 8 studies, there was no gold standard [38-45], and the extracted ADRs were compared with already known AEs from FAERS [39-41,43,44] or drug label declared to FDA [38,45] or even AE described in websites [42]. Moreover, in the ADR-PRISM protocol, a common subset of randomly selected messages was annotated in a blind manner by the 2 experts; interannotator agreement evaluation will also ensure ourselves about the quality of this gold standard. In most of studies, the authors gauged the interannotators agreement [25,27,30-32,34,35]. However, it is not systematic [26,28,29,33,36,37].

Second, by calculating a sample size of messages collected from social media, to assess recall, precision, and f-measure, we guaranteed the statistical power to place reliance on our study results.

The chosen terminologies are another crucial aspect of this work. On the one hand, the Racine Pharma thesaurus, with 5164 entries, exhaustively covers a large range of drug names and active ingredients. On the other hand, the MedDRA hierarchy is used daily by drug safety experts and considered expressive for this task. By using these terminologies, we expect to increase the sensitivity of the ADR-PRISM Skill Cartridge for pharmacovigilance. Our choice to map Racine Pharma to ATC was guided by 2 aspects. First, ATC like MedDRA has a hierarchical structure. Thus, we will be able to evaluate the ADR information extraction tool based on different hierarchical levels of these terminologies. Second, ATC and MedDRA are widely used and internationally agreed reference terminologies. Hence, we expect to provide strong and reproducible results.

The ADR information extraction module was not only based on drug and AE information identification but also on rules and regular expressions. As such, we expect to discard noninformative sentences, addressing general information about the drug or drug indications. We would also be able to identify unexpected positive effects, as for example, headaches that would be reduced by a drug without indication for the treatment of this kind of pain. Only few studies have been able to take this aspect into account [27,29,30,32,34,46].

Study Limitations

Despite its positive aspects, the study exhibits several limitations.

The mapping between the terms listed in the drug thesaurus Racine Pharma and the terms referenced in the chemical substance in the ATC hierarchy was incomplete. Among the 5164 terms in Racine Pharma, 852 (16.5%, 852/5164) could not be mapped to ATC, for example, some phytotherapies (St John’s wort herbal tea, Silver birch juice, extract of licorice root, arum triphyllum compound, arnica, etc). This could have a negative impact on the recall scores calculated according to the ATC levels.

In social media’ posts, slangs and colloquial languages are frequent; likewise, syntactic rules are approximate. We chose to use fuzzy-matching and thesaurus enrichment to take into account this bias. However, this approach is inherently nonexhaustive with negative impact on the recall of the ADR information extraction module.

Regarding the gold standard, 2 experts in medical terminologies trained in drug safety performed manual annotations. However, contrary to the phase 1 manual review, none of them could be considered as a pharmacovigilance expert.

The medical informatics community needs shared open corpora to evaluate their methods and tools. Recent efforts have led to making several datasets more accessible and the evaluation of the methods more standardized, for example, the Multiparameter Intelligent Monitoring in Intensive Care corpus [47] and the Informatics for Integrating Biology and the Bedside challenges [48]. Drug safety could benefit from open-access validation datasets made available for research purposes [25,35]. However, several obstacles remain. For example, the ADR-PRISM consortium got approval to reuse posts for research purposes, but the approval is restricted to the project.

Conclusions

The objective of this study is to present the scientific approach developed in the first stage of the ADR-PRISM project. At this stage, our principal objective was to evaluate the performance of the ADR information extraction Skill Cartridge against a gold standard constituted by human annotations. To address this question, our goal was to adapt the principles adopted in clinical research to assess sound and trustworthy measurements of precision, recall, and f-measure.

With the statistical theory, we calculated a sample size. This guaranteed enough ADR information and sufficient narrowness of the confidence interval, to scientifically conclude on our principal objective. We also avoided unnecessarily large extractions for which the process of manual annotations is nothing but both wasteful and time-consuming.

Multimedia Appendix 1

Designs of articles related to ADR information extraction on social media (replicated from Sarker et al).

Abbreviations

ADR

adverse drug reaction

ADR-PRISM

Adverse Drug Reaction from Patient Reports in Social Media

adverse event

ATC

Anatomical Therapeutic Chemical (in ATC classification)

FAERS

Food and Drug Administration’ adverse event reporting system

FDA

Food and Drug Administration

HLGT

high-level group term

HLT

high-level term

MedDRA

Medical Dictionary for Regulatory Activities

NLP

natural language processing

preferred term

SIBM-CISMeF

Biomedicine informatics service catalogue and index of French language medical websites (French Service d’Informatique Biomédicale-Catalogue et Index des Sites Médicaux de langue Française)

SOC

system organ class

This work was labeled in 2013 by the competitiveness cluster Cap Digital and funded by the French General Direction for Enterprises (Direction Générale des Entreprises) and territorial collectivities (Ile de France and Haute Normandie) under the 16th French Unique Interministerial Fund request for proposal through the ADR-PRISM project.

None declared.

Bates

Evans

Murff

Stetson

Pizziferri

Hripcsak

Detecting adverse events using information technology

J Am Med Inform Assoc 2003 10 2 115 28

12595401

PMC150365

Golomb

McGraw

Evans

Dimsdale

Physician response to patient reports of adverse drug effects: implications for patient-targeted adverse effect surveillance

Drug Saf 2007 30 8 669 75

10.2165/00002018-200730080-00003

17696579

3083

Hazell

Shakir

Under-reporting of adverse drug reactions : a systematic review

Drug Saf 2006 29 5 385 96

10.2165/00002018-200629050-00003

16689555

2953

Sarker

Ginn

Nikfarjam

O'Connor

Smith

Jayaraman

Upadhaya

Gonzalez

Utilizing social media data for pharmacovigilance: a review

J Biomed Inform 2015 04 54 202 12

10.1016/j.jbi.2015.02.004

25720841

S1532-0464(15)00036-2

PMC4408239

Bate

Lindquist

Edwards

Olsson

Orre

Lansner

De Freitas

A Bayesian neural network method for adverse drug reaction signal generation

Eur J Clin Pharmacol 1998 06 54 4 315 21

9696956

DuMouchel

Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system

The American Statistician 1999 08 53 3 177

10.2307/2686093

Szarfman

Machado

O'Neill

Use of screening algorithms and computer systems to efficiently signal higher-than-expected combinations of drugs and events in the US FDA's spontaneous reports database

Drug Saf 2002 25 6 381 92

10.2165/00002018-200225060-00001

12071774

250601

van Puijenbroek

Bate

Leufkens

Lindquist

Orre

Egberts

A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions

Pharmacoepidemiol Drug Saf 2002 11 1 3 10

10.1002/pds.668

11998548

Katsahian

Simond

Leprovost

Lardon

Bousquet

Kerdelhué

Abdellaoui

Texier

Burgun

Boussadi

Faviez

Evaluation of internet social networks using net scoring tool: a case study in adverse drug reaction mining

Stud Health Technol Inform 2015 210 526 30

10.3233/978-1-61499-512-8-526

25991203

ISDB

International Society of Drug Bulletins 2005 01

2018-06-29

ISDB EU

ISDB EU: Berlin Declaration on Pharmacovigilance https://www.akdae.de/Arzneimittelsicherheit/Weitere/Archiv/2005/85_200501252.pdf

Thai

Taewijit

Data-driven approach to detect and predict adverse drug reactions

Curr Pharm Des 2016 22 23 3498 526

27157416

CPD-EPUB-75526

Wong

Plasek

Montecalvo

Zhou

Natural language processing and its implications for the future of medication safety: a narrative review of recent advances and challenges

Pharmacotherapy 2018 06 09 38 8 822 41

10.1002/phar.2151

29884988

Tricco

Zarin

Lillie

Pham

Straus

Utility of social media and crowd-sourced data for pharmacovigilance: a scoping review protocol

BMJ Open 2017 12 19 7 1 e013474

10.1136/bmjopen-2016-013474

28104709

bmjopen-2016-013474

PMC5253521

Bossuyt

Reitsma

Bruns

Gatsonis

Glasziou

Irwig

Lijmer

Moher

Rennie

de Vet

Kressel

Rifai

Golub

Altman

Hooft

Korevaar

Cohen

STARD Group

STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies

BMJ 2015 10 28 351 h5527

10.1136/bmj.h5527

26511519

PMC4623764

Bousquet

Dahamna

Guillemin-Lanne

Darmoni

Faviez

Huot

Katsahian

Leroux

Pereira

Richard

Schück

Souvignet

Lillo-Le

Texier

The adverse drug reactions from patient reports in social media project: five major challenges to overcome to operationalize analysis and efficiently support pharmacovigilance process

JMIR Res Protoc 2017 09 21 6 9 e179

10.2196/resprot.6463

28935617

v6i9e179

PMC5629348

Board

Medical Dictionary for Regulatory Activities 2018-06-27

https://www.meddra.org/

Catalog and Index of Medical Sites French-language 2018-06-27

http://www.chu-rouen.fr/cismef/outils/outils-et-services/

World Health Organization 2018-06-27 ATC/DDD Index 2017 https://www.whocc.no/atc_ddd_index/

Kappa Santé 2018-06-27 Pharmaco-epidemiology Interventions in public and numeric health https://www.kappasante.com/

Kostadinov

GitHub 2018-06-27 TEMIS LUXID 7.0.1 SKILL CARTRIDGE DEVELOPMENT CYCLE http://fabian-kostadinov.github.io/2015/10/04/temis-luxid701-skill-cartridge-development-cycle/

Chen

Deldossi

Aboukhamis

Faviez

Dahamna

Karapetiantz

Guenegou-Arnoux

Girardeau

Guillemin-Lanne

Lillo-Le-Louët

Texier

Burgun

Katsahian

Mining adverse drug reactions in social media with named entity recognition and semantic methods

Stud Health Technol Inform 2017 245 322 6

29295108

Cavalié

Djeraba

ANSM [National Agency for the Safety of Medicines and Health Products] 2014

2018-06-27

[Analysis of drug sales in France in 2013] https://ansm.sante.fr/var/ansm_site/storage/original/application/3df7b99f8f4c9ee634a6a9b094624341.pdf

Kiritchenko

Matwin

Nock

Famili

Learning Evaluation in the Presence of Class Hierarchies: Application to Text Categorization

2006

19th Conference of the Canadian Society for Computational Studies of Intelligence

June 7-9, 2006

Québec City, Québec, Canada

Berlin, Heidelberg

Springer

395 406

R Core Team

R: A language and environment for statistical computing 2016

Vienna, Austria

R Foundation for Statistical Computing

https://www.r-project.org/

Ginn

Pimpalkhute

Nikfarjam

Patki

O'Connor

Sarker

Smith

Gonzalez

Arizona State University 2014

2019-04-01

Mining Twitter for adverse drug reaction mentions: a corpus and classification benchmark http://www.nactem.ac.uk/biotxtm2014/papers/Ginnetal.pdf

Hadzi-Puric

Grmusa

Automatic Drug Adverse Reaction Discovery from Parenting Websites Using Disproportionality Methods

2012

IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

August 26-29, 2012

Istanbul, Turkey

792 797

10.1109/ASONAM.2012.144

Leaman

Wojtulewicz

Sullivan

Skariah

Yang

Gonzalez

Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks

2010

Workshop on Biomedical Natural Language Processing (BioNLP 2010) ACL 2010

July 15, 2010

Uppsala, Sweden

117 125

Liu

Chen

AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums

2013

International Conference on Smart Health

August 3-4, 2013

Beijing, China

134 50

10.1007/978-3-642-39844-5_16

Nikfarjam

Gonzalez

Pattern mining for extraction of mentions of adverse drug reactions from user comments

AMIA Annu Symp Proc 2011 2011 1019 26

22195162

PMC3243273

Nikfarjam

Sarker

O'Connor

Ginn

Gonzalez

Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features

J Am Med Inform Assoc 2015 05 22 3 671 81

10.1093/jamia/ocu041

25755127

ocu041

PMC4457113

O'Connor

Pimpalkhute

Nikfarjam

Ginn

Smith

Gonzalez

Pharmacovigilance on Twitter? Mining tweets for adverse drug reactions

AMIA Annu Symp Proc 2014 2014 924 33

25954400

PMC4419871

Patki

Sarker

Pimpalkhute

Nikfarjam

Ginn

Oconnor

Smith

Gonzalez

Arizona State University 2014

2019-04-01

Mining Adverse Drug Reaction Signals from Social Media: Going Beyond Extraction https://tinyurl.com/y4xwmjco

Sampathkumar

Chen

Luo

Mining adverse drug reactions from online healthcare forums using hidden Markov model

BMC Med Inform Decis Mak 2014 14 91

10.1186/1472-6947-14-91

25341686

1472-6947-14-91

PMC4283122

Sarker

Gonzalez

Portable automatic text classification for adverse drug reaction detection via multi-corpus training

J Biomed Inform 2015 02 53 196 207

10.1016/j.jbi.2014.11.002

25451103

S1532-0464(14)00231-7

PMC4355323

Segura-Bedmar

Revert

Martínez

Detecting drugs and adverse events from Spanish social media streams

Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) 2014

5th International Workshop on Health Text Mining and Information Analysis (Louhi)

April 26-30, 2014

Gothenburg, Sweden

106 15

Yang

Wang

Kiang

Identification of Consumer Adverse Drug Reaction Messages on Social Media

PACIS 2013 Proceedings 2013

The Pacific Asia Conference on Information Systems

June 18-22, 2013

Jeju Island, Korea

Yates

Goharian

ADRTrace: detecting expected and unexpected adverse drug reactions from user reviews on social media sites

2013

European Conference on Information Retrieval

March 24-27, 2013

Moscow, Russia

10.1007/978-3-642-36973-5_92

Benton

Ungar

Hill

Hennessy

Mao

Chung

Leonard

Holmes

Identifying potential adverse effects using the web: a new approach to medical hypothesis generation

J Biomed Inform 2011 12 44 6 989 96

10.1016/j.jbi.2011.07.005

21820083

S1532-0464(11)00123-7

PMC4404640

Bian

Topaloglu

Towards large-scale Twitter mining for drug-related adverse events

SHB12 (2012) 2012 10 29 2012 25 32

10.1145/2389707.2389713

28967001

PMC5619871

Chee

Berlin

Schatz

Predicting adverse drug events from personal health messages

AMIA Annu Symp Proc 2011 2011 217 26

22195073

PMC3243174

Freifeld

Brownstein

Menone

Bao

Filice

Kass-Hout

Dasgupta

Digital drug safety surveillance: monitoring pharmaceutical products in Twitter

Drug Saf 2014 05 37 5 343 50

10.1007/s40264-014-0155-x

24777653

PMC4013443

Jiang

Zheng

Mining Twitter Data for Potential Drug Effects

2013

The 9th International Conference on Advanced Data Mining and Applications (ADMA )

2013

Hangzhou, China

10.1007/978-3-642-53914-5_37

Yang

Jiang

Postmarketing drug safety surveillance using publicly available health-consumer-contributed content in social media

ACM Trans Manage Inf Syst 2014 04 01 5 1 1 21

10.1145/2576233

Yang

Jiang

Zhang

Social media mining for drug safety signal detection

2012

Proceedings of the international workshop on Smart healthwellbeing

Octiber 29, 2012

Maui, Hawaii, USA

Yeleswarapu

Rao

Joseph

Saipradeep

Srinivasan

A pipeline to extract drug-adverse event pairs from multiple data sources

BMC Med Inform Decis Mak 2014 14 13

10.1186/1472-6947-14-13

24559132

1472-6947-14-13

PMC3936866

Bollegala

Maskell

Sloane

Hajne

Pirmohamed

Causality patterns for detecting adverse drug reactions from social media: text mining approach

JMIR Public Health Surveill 2018 05 09 4 2 e51

10.2196/publichealth.8214

29743155

v4i2e51

PMC5966656

Saeed

Villarroel

Reisner

Clifford

Lehman

Moody

Heldt

Kyaw

Moody

Mark

Multiparameter intelligent monitoring in intensive care II: a public-access intensive care unit database

Crit Care Med 2011 05 39 5 952 60

10.1097/CCM.0b013e31820a92c6

21283005

PMC3124312

Zhu

Cherry

Kiritchenko

Martin

de Bruijn

Detecting concept relations in clinical text: insights from a state-of-the-art model

J Biomed Inform 2013 04 46 2 275 85

10.1016/j.jbi.2012.11.006

23380683

S1532-0464(12)00180-3