This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.
Patient narrative data in online health care forums (communities) are receiving increasing attention from the scientific community for implementing patient-centered care. Natural language processing (NLP) methods are attracting growing interest because of the enormous data volume. However, state-of-the-art NLP still cannot meet the need for high-resolution analysis of patients’ narratives. Manual qualitative analysis still plays a pivotal role in answering complicated research questions through the analysis of patient narratives.
This study aimed to develop a systematic framework for qualitative analysis of patient-generated narratives in online health care forums.
Our systematic framework consists of 4 phases: (1) data collection, (2) data preparation, (3) content analysis, and (4) interpretation of the results. The data collection and data preparation phases build on text mining methods for identifying appropriate online health forums, differentiating posts of patients from those of other stakeholders, protecting patients’ privacy, sampling, and choosing the unit of analysis. The content analysis phase is built on the framework method, which facilitates and accelerates the identification of patterns and themes by an interdisciplinary research team. Finally, the interpretation phase focuses on measuring data quality and interpreting the findings regarding the dimensions and aspects of patients’ experiences and concerns in their original contexts.
We demonstrated the usability of the proposed systematic framework using 2 case studies: one on determining factors affecting patients’ attitudes toward antidepressants and another on identifying the disease management strategies of patients with diabetes facing financial difficulties. The framework provides a clear step-by-step process for systematic content analysis of patient narratives and produces high-quality structured results that can be used for describing patterns or regularities in patients’ experiences, generating and testing hypotheses, and identifying areas of improvement in health care systems.
The systematic framework is a rigorous and standardized method for qualitative analysis of patient narratives. Findings obtained through such a process indicate authentic dimensions and aspects of patient experiences and shed light on patients’ concerns, needs, preferences, and values, which are the core of patient-centered care.
RR1-10.2196/13914
Online health forums (communities) are increasingly accessible and convenient platforms for patients and caregivers to share health care experiences together with concerns of diagnosis, treatment, and outcomes. Almost 30% of the US population actively share and discuss health-related experiences on various online health forums, such as askapatient.com, patientslikeme.org, and dailystrength.org [
There are 2 methods for analyzing patient-generated narrative data: natural language processing (NLP) and qualitative content analysis. NLP is a set of methods and techniques to process human language components, such as identifying sentence structure and recognizing sentence meaning as humans do [
The second method for analysis of patient-generated narratives is qualitative content analysis. Qualitative content analysis is a research method designed to identify the thematic structure of text documents by subjectively interpreting the context of the text [
In this study, we propose an efficient and cost-effective systematic framework built on text mining and qualitative content analysis approaches for analyzing patients’ narratives. This framework comprises 4 phases: (1) data collection, (2) data preparation, (3) content analysis, and (4) interpretation of findings. Data collection and data preparation phases utilize text mining methods to facilitate and accelerate the process of data collection and preparation for content analysis. The content analysis phase utilizes the framework method [
In this study, we first provide a brief introduction to the framework method [
The framework method was developed in the qualitative research unit at the National Centre for Social Research in the United Kingdom in 1980 for analyzing narrative data related to policy [
Structure of the framework method. Sentences from patient posts (unit of analysis) organized in the rows and themes (adverse drug reactions) organized in the columns. Value “1” indicates to which theme a sentence was assigned.
Themes in the framework method are generated through a deductive approach, an inductive approach, or a combination of the two. The selected approach depends on the research question, the study’s aim, and the available knowledge about the phenomena under study. The deductive approach is used when the aim of the study is to retest the existing model, concept, or knowledge in a new context [
Building on our previous experiences and lessons learned from the content analysis of patients’ narratives, we propose a systematic framework to address the subjective nature of patients’ narratives [
In the following sections, we demonstrate the implementation of each step of this systematic framework using 2 studies: (1) a case study that identified factors affecting patients’ attitudes toward antidepressants; we use the short title
A schematic view of the systematic framework designed for content analysis of patient narrative data.
The data collection phase of the systematic framework included selecting health forums, protecting patients’ privacy, and collecting data from these health forums.
Health forums vary in covering patients’ experiences with the health care system. Researchers can select a health care forum for their study using (1) description of the forum or community, (2) initial analysis of randomly selected posts, and (3) analysis of medical concepts using text mining tools such as MetaMap [
Although data in the forums are mostly anonymous and publicly available, further protecting patients’ privacy and requesting permission for data collection from the owners of the data are recommended. Researchers need to submit the study to the institutional review board (IRB) of their affiliated institute for approval; such submissions usually receive an exemption. In addition, deidentification of the data is recommended. For example, in both projects, we formulated regular expressions to eliminate email addresses, phone numbers, and URLs from posts. For the project
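The regular-expression step described above can be sketched as follows. This is a minimal illustration, not the expressions used in the case studies: the three patterns, the `deidentify` function name, and the choice to substitute placeholder tokens (rather than delete the match outright) are all assumptions, and the phone pattern only targets common North American formats.

```python
import re

# Illustrative patterns only (assumptions); real forum data needs patterns
# that are tuned and validated against a manually checked sample of posts.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b"
)


def deidentify(post: str) -> str:
    """Replace URLs, email addresses, and phone numbers with placeholder tokens."""
    post = URL.sub("[URL]", post)      # URLs first, so embedded addresses go with them
    post = EMAIL.sub("[EMAIL]", post)
    post = PHONE.sub("[PHONE]", post)
    return post


example = "Email me at jane.doe@example.com or call (555) 123-4567, see http://example.org/page."
print(deidentify(example))
```

Keeping a placeholder such as `[EMAIL]` preserves the sentence structure for later coding; replacing the tokens with an empty string would match the wording "eliminate" more literally.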
Patient posts in the online health forums are mostly stored in the HTML format. To collect these data, the research team may use the application programming interface (API) specifically developed for the forum or community. If the API is not available, the research team may customize the existing open-source Web crawlers or develop a new one to collect data. For example, we used Beautiful Soup [
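As a rough illustration of pulling post text out of forum HTML, the sketch below uses Python's standard-library `html.parser` as a dependency-free stand-in for a dedicated parsing library. The markup is hypothetical: it assumes each post sits in a `<div>` whose `class` attribute contains `post`, which will differ from forum to forum.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0    # <div> nesting depth inside the current post; 0 means outside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1                       # nested <div> inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1                        # a new post starts here
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:                   # the post's own <div> just closed
                text = "".join(self._chunks)
                self.posts.append(" ".join(text.split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html: str) -> list[str]:
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Fetching the pages themselves is deliberately left out; any crawler should respect the forum's terms of use and the permissions discussed above.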
The data preparation phase consists of 3 steps: differentiating patient posts from those of other stakeholders, sampling, and defining the unit of analysis.
Patients’ interests and perspectives on treatment differ from those of clinicians and caregivers, who also share experiences and concerns about their patients in online health forums. Distinguishing patients’ experiences from those of other stakeholders can be achieved with text mining approaches such as unsupervised algorithms for text clustering [
Having a representative sample of online forum content is pivotal for statistical reliability and generalizability of the findings. To increase the likelihood of having a representative sample, the research team may utilize retrieval methods such as phrase-based vector space model [
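To make the idea of vector-space retrieval concrete, the sketch below ranks posts against a query using plain bag-of-words TF-IDF vectors and cosine similarity. It is a simplified stand-in, not the phrase-based model mentioned above: it works on single words, and the tokenizer, the smoothed IDF weighting, and the function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank_posts(query, posts):
    """Return (score, post_index) pairs, best match first (TF-IDF cosine similarity)."""
    docs = [Counter(tokenize(p)) for p in posts]
    n = len(docs)
    # Document frequency: in how many posts each term occurs
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}  # smoothed IDF

    def vec(counts):
        # Terms unseen in the corpus carry no weight
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(Counter(tokenize(query)))
    scored = [(cosine(q, vec(d)), i) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Posts with a score of 0 share no vocabulary with the query; where to set the relevance cutoff above 0 is a study-specific decision.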
If the size of retrieved relevant patient posts is extremely large, probability sampling methods (such as simple random sampling, stratified random sampling, or cluster sampling) are useful to obtain a robust sample size
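Two of the probability sampling methods named above can be sketched with the standard library. The function names, the fixed random seed (used here only for reproducibility), and the proportional allocation rule in the stratified version are assumptions; the stratum key, such as the drug a post discusses, is hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(posts, k, seed=0):
    """Draw k posts uniformly at random without replacement."""
    return random.Random(seed).sample(posts, k)


def stratified_sample(posts, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for post in posts:
        strata[stratum_of(post)].append(post)
    sample = []
    for key in sorted(strata):                  # fixed order keeps runs reproducible
        members = strata[key]
        k = max(1, round(len(members) * fraction))   # at least 1 post per stratum
        sample.extend(rng.sample(members, k))
    return sample
```

Stratifying is mainly useful when some groups of posts are rare and a simple random sample might miss them.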
Determining the sample size is another concern in content analysis studies. There is no single formula for determining the sample size. The sample size depends on the available time and financial resources and on data heterogeneity. Researchers may use the standard sampling formula for computing the sample size [
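One commonly used option is Cochran's formula for estimating a proportion, n0 = z² p (1 − p) / e², with a finite population correction n = n0 / (1 + (n0 − 1) / N). The sketch below implements that formula as an example only; it is an assumption that this matches the formula a given study would adopt, and the defaults (95% confidence, 5% margin of error, p = 0.5 as the most conservative proportion) are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, margin=0.05, confidence=0.95, proportion=0.5):
    """Cochran's sample size with finite population correction, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided critical value
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)                 # correct for finite N
    return math.ceil(n)


print(sample_size(10_000))   # number of posts to sample from 10,000 retrieved posts
```

The result is a statistical lower bound; time, budget, and the heterogeneity of the narratives still determine what is practical.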
A unit of analysis is the smallest unit in the data sample containing information regarding the research question. Graneheim et al discussed that the unit of analysis should be large enough to convey a whole perspective and small enough to be kept in mind as a context for meaning unit during the analysis process [
For both case studies, the initial analysis showed that patients’ comments were composed of multiple sentences that covered various dimensions and aspects of experiences and concerns. Therefore, we used sentences as the unit of analysis. In addition, data analysis at the level of sentences ensured that no important segment of patient narratives was missed. Splitting patient posts into sentences is not an easy task because of colloquial language and grammatical and punctuation errors. Therefore, we preprocessed the data to remove noisy patterns and then split the patient posts into sentences using open-source Natural Language Toolkit [
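The two steps described above, removing noisy patterns and then splitting posts into sentences, can be sketched as follows. This uses standard-library regular expressions as a simplified stand-in for a dedicated sentence tokenizer; the specific noise rules (collapsing repeated punctuation, restoring a missing space after a sentence end) and the function names are assumptions.

```python
import re


def preprocess(post: str) -> str:
    """Reduce common noise patterns before sentence splitting."""
    post = re.sub(r"([.!?])\1+", r"\1", post)          # "!!!" -> "!", "..." -> "."
    post = re.sub(r"([.!?])(?=[A-Z])", r"\1 ", post)   # "started.I" -> "started. I"
    return re.sub(r"\s+", " ", post).strip()           # collapse whitespace


def split_sentences(post: str) -> list[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", preprocess(post)) if s]


print(split_sentences("It worked at first!!!  Then the nausea started...I stopped after 2 weeks."))
```

A rule this simple will split incorrectly after abbreviations such as "Dr." and will miss sentences that lack end punctuation, so the output should be spot-checked against the original posts.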
After preparing the patient posts, the next step is to define themes for content analysis. The framework method allows different approaches for generating themes: deductive, inductive, and a combination of the two. In this section, we illustrate the step-by-step procedure of generating themes using the deductive-inductive approach for the 2 case studies. This approach allowed us to retest the available knowledge in the literature in the context of patient narratives while leaving space for discovering new aspects of patient experiences in online health forums.
In this section, we explain the process of generating themes for the case study
Our literature review showed that existing knowledge in the literature is useful for generating themes to analyze and summarize patient experiences with antidepressants in online forums. Accordingly, we conducted a systematic literature review to identify significant factors affecting patients’ attitudes toward antidepressants. We identified 5 main themes including pharmacological treatment, health care system, social-cognitive and psychological factors, patient-related factors, and depression that influence patients’ attitudes toward antidepressants. For each theme, we identified subthemes.
To start coding patient posts using the predefined themes, developing guidelines with clear operational definitions for each theme is necessary. Operational definitions should include well-defined statements with explicit inclusion and exclusion criteria describing the segment of a text assigned to a theme. Each statement must be accompanied by 1 or more examples extracted from patient posts.
Themes generated with the deductive approach were used for generating the initial analytical framework. We constructed the framework by organizing predefined themes in the columns and sentences of patient posts (unit of analysis) in the rows. Each patient post was split into sentences, and each sentence was identified using the post ID and a sentence index indicating the position of the sentence in the patient post.
Themes generated using the deductive approach for the case study “attitude to antidepressants”.
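The layout of the analytical framework, with sentences in rows, themes in columns, and each sentence identified by post ID and sentence index, can be sketched as a list of dictionaries exported to CSV for use in a spreadsheet. The two theme names are predefined codes from the antidepressant case study; the function names and the 0/1 cell encoding are assumptions for illustration.

```python
import csv
import io

THEMES = ["Perceived effectiveness", "Side effects"]   # 2 of the predefined codes


def build_framework(sentences, themes=THEMES):
    """sentences: iterable of (post_id, sentence_index, text) tuples."""
    return [
        {"post_id": pid, "sentence_index": idx, "sentence": text,
         **{theme: 0 for theme in themes}}              # 0 = not assigned
        for pid, idx, text in sentences
    ]


def assign(row, theme):
    """Mark a sentence as belonging to a theme (value 1 in that column)."""
    if theme not in row:
        raise KeyError(f"unknown theme: {theme}")
    row[theme] = 1


def to_csv(rows):
    """Serialize the framework so it can be opened in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Adding a theme discovered later then amounts to adding one more column initialized to 0.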
Example of operational definition for the themes for the project
| Pharmacological treatment factors (predefined codes) | Description |
| --- | --- |
| Perceived effectiveness | The patient’s subjective assessment of how helpful the antidepressant is in reducing depression symptoms, enhancing emotional and cognitive functioning, and improving overall quality of life. |
| Side effects | Any adverse reactions that the patient attributes to antidepressant intake. These may include physiological side effects, emotional syndromes, cognitive impairment, and limitations on daily functioning and quality of life. |
As patients in the forums have the freedom to anonymously share their experiences and concerns in the lay language without any limitations, it is likely that patient posts include information that may not fit into the predefined themes in the initial analytical framework. Therefore, in this step, although we coded sentences (of patient posts) using predefined themes, we used the inductive approach to generate new themes for sentences that could not be assigned to the predefined themes.
It is not necessary to use the whole sample for inductive analysis. Researchers may select a random portion of the sample (eg, 30%), depending on the availability of resources, the size of the sample, and the level of heterogeneity in patient narratives. For example, in the study
Generated themes using inductive approach for the case study “attitude to antidepressants”.
Some of the themes generated with the deductive approach may not fit the patient posts. For example, we could not find any sentences in the subsample of the study
Themes generated using inductive and deductive approaches need to be refined before developing the final analytical framework. Theme refining can be conducted by creating rules such as setting a threshold on the number of sentences that should be assigned to a theme. For example, for the study
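A threshold rule of this kind can be sketched as a simple count over the coded sentences. The function name, the input format, and the decision to return low-frequency themes for review (to be merged or dropped by the research team) rather than discard them automatically are assumptions; the threshold value is study specific.

```python
from collections import Counter


def refine_themes(coded_sentences, min_sentences):
    """coded_sentences: iterable of (sentence_id, theme) pairs.

    Returns (kept, flagged): themes meeting the threshold, and themes
    below it that the team should review, each with its sentence count.
    """
    counts = Counter(theme for _, theme in coded_sentences)
    kept = {t: c for t, c in counts.items() if c >= min_sentences}
    flagged = {t: c for t, c in counts.items() if c < min_sentences}
    return kept, flagged
```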
To maintain consistency and uniformity of coding patient posts using the final themes across the sample, developing guidelines is necessary. Guidelines should include the aim of the project, operational definitions of the themes with specific examples from patient posts, and inclusion and exclusion criteria for assigning a unit of analysis to themes. The operational definition of a theme should include a clear and precise statement that enables the annotators to recognize the segments of a patient post that fit the theme. For example, theme
The guidelines should also include instructions on coding the unit of analysis using themes. For example, whether the unit of analysis can be assigned to more than 1 theme or whether the unit of analysis should be interpreted in the context of the patient post are important questions. Clear answers to these questions facilitate the process of coding and increase the quality of the generated structured data. Finally, it would be useful if the coding guidelines included the list of qualifications for hiring annotators and the estimated time for training.
The research team should select a coding environment that facilitates construction of the analytical framework and the process of coding. For both case studies explained in this study, we used a spreadsheet to construct the analytical framework (see
Overall, the final themes should meet the following criteria: (1) valid: themes should accurately reflect what is being measured; (2) mutually exclusive: the operational definitions of the themes should not overlap; and (3) exhaustive: themes should cover all the aspects of the data related to the research question.
Final themes for the case study “attitude to antidepressants”.
Before researchers summarize and interpret the data, they should evaluate the quality of the structured data produced. Because assigning an observation from patient narratives to a theme is a subjective process, annotators (coders) may disagree. In this section, we explain measures for computing interannotator agreement (IAA) and then discuss summarizing and interpreting the findings.
Cohen kappa is the most popular method for computing IAA. It measures the agreement between 2 annotators who annotate N items (eg, 100 sentences) into M mutually exclusive themes (eg, 10 themes) and corrects the result for the agreement that would be expected by chance [
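In its usual form, Cohen kappa is computed as kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of items on which the 2 annotators agree and p_e is the agreement expected by chance from each annotator's theme frequencies. The sketch below is a minimal standard-library implementation; the function name and the handling of the degenerate case p_e = 1 are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen kappa for 2 annotators who each assigned 1 theme to every item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal theme proportions
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / n ** 2
    if expected == 1:
        return 1.0   # both annotators used the same single theme for every item
    return (observed - expected) / (1 - expected)
```

For instance, if 2 annotators agree on 8 of 10 sentences and each uses 2 themes equally often, p_o = 0.8 and p_e = 0.5, giving kappa = 0.6.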
To improve the quality of produced structured data, researchers may decide to annotate each document (eg, patient post) by more than 2 annotators. In this case, Fleiss kappa (an adaptation of Cohen kappa for 3 or more raters) should be used for computing the IAA [
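Fleiss kappa follows the same (P − P_e) / (1 − P_e) form, where P is the mean per-item agreement among all pairs of annotators and P_e is the sum of squared overall theme proportions. The sketch below assumes every item is rated by the same number of annotators and takes a count matrix as input; the function name and input layout are assumptions.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators who assigned item i to theme j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of annotators")
    n_themes = len(ratings[0])
    # Overall proportion of assignments that went to each theme
    p_theme = [sum(row[j] for row in ratings) / (n_items * n_raters)
               for j in range(n_themes)]
    # Per-item agreement: share of annotator pairs that agree on the item
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]
    p_bar = sum(p_item) / n_items
    p_e = sum(p * p for p in p_theme)
    if p_e == 1:
        return 1.0   # all annotators used one theme for every item
    return (p_bar - p_e) / (1 - p_e)
```

Each row of the input can be produced by counting, for one sentence, how many annotators chose each theme.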
There are other methods for calculating IAA, such as pairwise agreement. If the annotation task requires identifying terms or phrases and determining their correct boundaries (eg, identifying a sign or symptom) in the patient’s posts, the pairwise agreement would be an appropriate measure. The kappa coefficient would not be suitable because the chance agreement is effectively 0 in this case. Please see Zolnoori et al [
If the structured data developed during the process of content analysis is rich enough, it can provide substantial insight into patients’ concerns, needs, preferences, and attitudes. Interpretation of the result could start with a general description of the themes, followed by reporting the most frequent and infrequent identified patterns, and finally reporting the unexpected patterns in data.
The findings of content analysis can go beyond a simple description of themes. In fact, they can be used for describing patterns or regularities, generating and testing hypotheses, describing a phenomenon and the associated factors, identifying problematic areas in health care systems, or even developing predictive models to predict a specific patient behavior, such as medication nonadherence. For example, for the project
Qualitative content analysis approaches are nonlinear, iterative processes that are more complicated than quantitative approaches because they are less structured and standardized. There is no single guideline for content analysis. Selecting a specific approach strongly depends on the aim of the study, the research question, and the type of qualitative data. Researchers collecting and analyzing qualitative data, such as patient narratives, often wish to have a systematic approach that includes detailed instructions on how to conduct qualitative research efficiently.
This study provided a systematic framework for the content analysis of patient-generated narratives in online health forums (communities). The systematic framework was built on text mining approaches for data collection and data preprocessing and on qualitative content analysis using the framework method with the deductive-inductive approach for theme generation. We showed the feasibility and usefulness of the proposed systematic framework using 2 case studies: (1) a published study with a focus on identifying factors affecting patients’ attitudes toward antidepressants [
The core component of the proposed framework (phase 3) is the framework method for qualitative content analysis [
We used a combined deductive-inductive approach to develop themes for both case studies. However, the proposed systematic framework applies equally to studies that use only an inductive or only a deductive approach for data analysis. Our literature review showed that studies focusing on the qualitative content analysis of patient narratives in online health forums mostly used an inductive approach for theme generation. However, we showed that deductive analysis can accelerate the inductive analysis of patient narratives and help identify new patterns and themes. There are major differences between the systematic framework proposed in this study and the content analysis frameworks suggested by other papers. Please see
We acknowledge some limitations with our proposed framework:
It is not appropriate for the analysis of very heterogeneous patient narratives, for example, if the patients’ experiences and the discussions in health forums are very diverse and cover a wide range of health topics. The systematic framework is most suitable for studies with research questions targeting a specific patient cohort with shared health concerns and experiences (eg, medications’ effects or difficulty in accessing health services).
It is not suitable for qualitative studies aiming at developing a theory or analysis of the structure of the experiences or language or the social context associated with the language. The research team may adopt other qualitative approaches to achieve the aims, such as approaches for developing theories derived from the data (eg, grounded theory) [
Although it provides detailed instructions for analyzing patient posts, which may save time and resources, it is, like other qualitative analysis methods, still time consuming and resource intensive once the time needed for developing guidelines and training annotators is included. This time needs to be factored into the study methods and approach.
Previous experiences with and lessons learned from the content analysis:
Qualitative content analysis may seem confusing and complicated for novice researchers. They may find this process to be chaotic and grapple with the qualitative research terms and concepts, such as patterns, categories, and themes. But experiencing chaos during the analysis is normal. Qualitative researchers need to be open to the complexity of content analysis [
During the content analysis process, it is very helpful to review the research questions constantly. Frequently referring to the research question and aim of the study helps researchers stay focused only on the dimensions of the dataset that answer the research question. It is also very important to take note of new ideas and identified themes throughout the analysis. If the data analysis is conducted in an Excel sheet, assigning a column to notes and ideas is very useful.
Content analysis is a very time-consuming process and unexpectedly challenging [
It is important to avoid any preunderstanding of the dataset to minimize the risk of bias during the process of content analysis and interpretation of the results [
It is important to have a weekly meeting to discuss new ideas and identified patterns in the group. All team members should be open and receptive to new ideas. The research team should proceed with defining and updating the analytical framework based on the summary of the meeting discussion each week.
Creating a table or figure that traces the process of analysis from the raw data to meaningful units of analysis and then to the identified themes, with examples from patients’ posts, is very useful. Including the figure or table in the manuscript demonstrates the validity of the study and helps reviewers and readers appreciate the study’s findings.
Exploring patient-reported experiences and concerns in online health care forums (communities) and translating such content into meaningful concepts (themes) has become a challenge for health care researchers and health care providers. In this study, we introduced a systematic framework as a rigorous and standardized method to collect patient-reported experiences from online forums and convert their content to themes that are reliable and easily interpretable. The framework was built on the text mining approaches and the framework method with the deductive-inductive approach that benefit both researchers and clinicians by minimizing the cost, time, and human errors during the process of data processing and analysis. We showed the reliability and efficiency of this framework using 2 case studies: one identifying factors associated with patients’ attitude toward antidepressants and the other identifying solutions and strategies of patients with diabetes facing financial difficulties to access medications and supplies. Finding meaningful information through such a process indicates authentic dimensions and aspects of patient experiences and sheds light on patients’ concerns, needs, preferences, and values, which are the core of patient-centered care.
Definition of HTML, API, Web crawler, Python package, and XML.
Kinship terminology and UMLS (Unified Medical Language System).
Example of a formula for determining the sample size.
Examples of regular expression codes for reducing noise patterns in patients' posts.
Case study #2: strategies and solutions of patients with diabetes facing financial difficulties for accessing medications and supplies.
Operational definitions for all predefined themes for the case study “identifying factors affecting patients’ attitudes towards antidepressants”.
New themes generated through inductive approach for the case study “identifying factors affecting patients’ attitudes towards antidepressants”.
Rules for refining the identified themes.
Guidelines for annotating the patients’ comments for the case study “diabetes patients’ solutions to access medications and supplies in the context of financial difficulties”.
Final analytical framework for the case study “identifying the underlying factors affecting patient attitudes toward antidepressants”.
Interpretation of the obtained results from the Cohen kappa.
Descriptive interpretation of the findings of the case study “access to diabetes medications”.
The major difference between the systematic framework and other content analysis frameworks.
ADR: adverse drug reaction
API: application programming interface
IAA: interannotator agreement
IRB: institutional review board
NLP: natural language processing
UMLS: Unified Medical Language System
This publication was supported by CTSA grant number TL1 TR002380 from the National Center for Advancing Translational Science. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.
None declared.