This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.

In recent years, remarkable progress has been made in deep learning technology and successful use cases have been introduced in the medical domain. However, few studies have considered high-performance computing as a means to fully exploit the capability of deep learning technology.

This paper aims to design a solution that accelerates automated Gram stain image interpretation by means of a deep learning framework without additional hardware resources.

We will apply and evaluate 3 methodologies, namely fine-tuning, an integer arithmetic–only framework, and hyperparameter tuning.

The choice of pretrained models and the ideal setting for layer tuning and hyperparameter tuning will be determined. These results will provide an empirical yet reproducible guideline for those who consider a rapid deep learning solution for Gram stain image interpretation. The results are planned to be announced in the first quarter of 2021.

Making a balanced decision between modeling performance and computational performance is the key for a successful deep learning solution. Otherwise, highly accurate but slow deep learning solutions add little value to routine care.

DERR1-10.2196/16843

In recent years, remarkable progress has been made in deep learning due to the emergence of big data processing technology. Deep learning is a family of machine learning methods built from multiple neurons arranged in multiple layers. A neuron is a mathematical function with weights and biases, known as parameters. It receives real numbers from the neurons in the previous layer, generates another real number, and transmits it to the neurons in the next layer. The parameters of these neurons are determined by an optimization algorithm, such as stochastic gradient descent, which uses the gradients computed by backpropagation to search for a minimum of a loss function. Because such a network is able to learn the intrinsic data features without handcrafted feature engineering, deep learning has been more successful on image data than conventional techniques.
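As a minimal illustration of the neuron and the gradient-based parameter update described above, the following sketch trains a single sigmoid neuron on a toy logical OR problem. The function names, the learning rate, and the toy data are assumptions made for this example only and are not part of the study.

```python
import math
import random


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(inputs, target, weights, bias, lr=0.5):
    """One stochastic gradient descent update for the loss 0.5 * (y - target)^2."""
    y = neuron(inputs, weights, bias)
    # Chain rule through the sigmoid (backpropagation for a single neuron).
    delta = (y - target) * y * (1.0 - y)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias


def train_or_gate(steps=5000, seed=0):
    """Fit the neuron to logical OR, drawing one random sample per step."""
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(steps):
        inputs, target = rng.choice(data)
        weights, bias = sgd_step(inputs, target, weights, bias)
    return weights, bias
```

A deep network stacks many such neurons and applies the same chain rule layer by layer; only the bookkeeping grows.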

Gram stain is a laboratory procedure for the rapid classification and identification of microbial pathogens. Unlike other microbiology processes that can be fully automated [

Smith et al [

Despite the high accuracy achieved by Smith et al, there are still many open questions to be addressed. With regard to modeling, transfer learning could be improved with fine-tuning [

This study aims to design a rapid deep learning solution for Gram stain interpretation without acquiring additional hardware resources and to determine the optimal proportion of layers to fine-tune. The hypothesis and the study design to evaluate the hypothesis will be explained in the following section.

This section addresses the hypothesis, study design, data collection and description, study population, statistical considerations for nonbiased model construction and evaluation, and tools in detail.

This study does not investigate a clinical hypothesis but performs an empirical evaluation of a deep learning framework for Gram stain image interpretation. The hypothesis to be tested is that the optimization of a deep learning framework will perform better than a scale-up strategy with a single GPU. In order to test this hypothesis, two strategies will be examined, as shown in

Two strategies will be compared. The lineage of model A is the implementation of the scale-up strategy (highlighted in blue), whereas the lineage of model B is the implementation of the optimization strategy (depicted in green). Model A is the base model with the FPA framework, while model B replaces the floating-point arithmetic with IAO. Each model is built on top of its predecessor model. For instance, model A1 is equipped with a single GPU and model B1 is equipped with the optimal minibatch size. FPA: floating-point arithmetic. GPU: graphics processing unit. IAO: integer arithmetic only.
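To make the idea of replacing floating-point arithmetic with integer arithmetic concrete, the following sketch shows a generic affine (uniform) 8-bit quantization in plain Python. It is not the framework used in this study and not a TensorFlow API; the function names and the 8-bit unsigned range are assumptions chosen for illustration.

```python
def quantize_params(x_min, x_max, num_bits=8):
    """Scale and zero point that map [x_min, x_max] onto unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero maps to an integer code.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate case: all values are zero
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to integer codes, clamped to the representable range."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate real values from integer codes."""
    return [scale * (q - zero_point) for q in q_values]


def int_dot(qa, za, qb, zb):
    """Dot product accumulated purely in integers."""
    return sum((a - za) * (b - zb) for a, b in zip(qa, qb))
```

Under these assumptions, a floating-point dot product of weights and activations is approximated by `scale_w * scale_a * int_dot(...)`, so the inner loop runs on integers and only a single rescaling per output remains in floating point.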

In order to avoid model bias, 4 pretrained models (Mobilenet [

Once the base model is implemented with Tensorflow [

Specification of the hyperparameter space for the model B family. Minibatch size and dropout rate are discretized to avoid an exhaustive search.

Model | Hyperparameter | Original value range | Discretized values |
B1 | Minibatch size | {1-infinity} | {32, 64, 128, 256, 512} |
B2 | Batch normalization | {on, off} | {on, off} |
B3 | Weight normalization | {on, off} | {on, off} |
B4 | Dropout rate | {0-1} | {0, 0.1, 0.2, 0.3, 0.4, 0.5} |
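The effect of restricting the hyperparameter space can be sketched as follows, assuming that each model in the B lineage varies one hyperparameter while keeping the settings chosen for its predecessors. The dictionary, the function names, and the `evaluate` callback are hypothetical and serve only to illustrate the bookkeeping.

```python
from math import prod

# Values copied from the table above; the dictionary itself is illustrative.
SEARCH_SPACE = {
    "minibatch_size": [32, 64, 128, 256, 512],       # model B1
    "batch_normalization": [True, False],            # model B2
    "weight_normalization": [True, False],           # model B3
    "dropout_rate": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],  # model B4
}


def exhaustive_runs(space):
    """Number of training runs needed to try every combination."""
    return prod(len(values) for values in space.values())


def stagewise_runs(space):
    """Number of training runs when one hyperparameter is varied at a time."""
    return sum(len(values) for values in space.values())


def stagewise_sweep(space, evaluate):
    """Vary one hyperparameter at a time and keep the cheapest setting.

    `evaluate` is a hypothetical callback that returns a cost for a
    configuration (for example, training time); lower is better.
    """
    chosen = {name: values[0] for name, values in space.items()}
    measurements = {}
    for name, values in space.items():
        costs = {v: evaluate(dict(chosen, **{name: v})) for v in values}
        measurements[name] = costs
        chosen[name] = min(costs, key=costs.get)
    return chosen, measurements
```

With the values in the table, a full grid would require 5 × 2 × 2 × 6 = 120 training runs, whereas the stagewise sweep requires 5 + 2 + 2 + 6 = 15. The recorded `measurements` also expose how the cost varies with each individual hyperparameter.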

The objective of this study is to understand the relation between computation time and each hyperparameter, not to create a hyperparameter optimization [

This study will use 8728 Gram stain images from between 2015 and 2018 for modeling and images generated in 2019 for testing. Data are archived in a workstation at the Institute for Clinical Chemistry at the Medical Faculty Mannheim of Heidelberg University, Germany. Sample images and labels are shown in

A sample image of Gram stain data. The image label does not have a link to personal information.

The label data corresponding to the image are stored in a central database for reporting purposes and extracted for this study. Each image is associated with 2 labels: (1) Gram stain class (ie, either gram-positive or gram-negative) and (2) a class for the genus. The genus label includes 5 of the most frequently encountered germs: (1)

The population for this study is a group of sepsis patients, whose blood samples contain at least one harmful bacterium, such as

This section will address and describe the 3 underlying statistical considerations towards a solid study design: (1) the class balancing strategy for the input data set, (2) the proper split ratio for training and evaluation, and (3) the metric for the model evaluation.

Imbalanced input data sets are a common limiting factor that degrades model quality. Chawla et al [

The data set will be split into a training set, a hold-out development set, and a test set. The hold-out development set is kept separate from the test set: the development set will only be used for tuning the model parameters, so that the evaluation on the test set is not biased. For deep learning algorithms, the training set is sometimes increased to as much as 99% of the entire data set when there are more than a million data points. However, this study will follow common practice in machine learning, in which the splitting ratio is 60%, 20%, and 20% [
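A minimal sketch of a 60%, 20%, and 20% split is given below. The function name, the fixed random seed, and the use of a simple shuffle (rather than, for example, a stratified or time-based split) are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle and split items into training, development, and test sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * ratios[0])
    n_dev = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test
```

Applied, purely as an example, to 8728 items, this yields 5236 training, 1745 development, and 1747 test items, with no item appearing in more than one set.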

Cross-validation is not used in this study for model validation. Cross-validation estimates the performance of the model statistically, but it is rarely the method of choice for evaluating a deep learning model because every fold requires a complete training run. For instance, a 10-fold cross-validation builds a model with 9 folds, tests it with the held-out data (1 fold), and repeats this 10 times. When we evaluate the model with 100 whole-slide (28,032×28,032 pixels) images, each round will take at least 900 minutes with a workstation powered by Intel Core i7 with 32 GB of RAM and an Nvidia GTX 1070 GPU, which is the same hardware setting and the same image size used in the study by Smith et al [

Despite the considerable effort devoted to deep learning research, few studies consider computational efficiency; most focus solely on model evaluation. In order to provide more insightful information, this study will evaluate models with classical metrics, such as accuracy, the confusion matrix, and the area under the curve, as well as with the training and testing times that models need to achieve the target accuracy proposed by the Stanford Data Analytics for What’s Next project team [
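The metrics named above can be sketched in plain Python for a binary classifier as follows. Treating label 1 as the positive class is an arbitrary convention for this example, and `train_one_epoch` and `evaluate` are hypothetical callbacks standing in for a real training loop.

```python
import time


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def time_to_accuracy(train_one_epoch, evaluate, target, max_epochs=100):
    """Epoch count and wall-clock seconds until `evaluate()` reaches `target`.

    Returns (None, elapsed) if the target is not reached within max_epochs.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target:
            return epoch, time.perf_counter() - start
    return None, time.perf_counter() - start
```

Reporting the time to reach a fixed target accuracy, rather than accuracy alone, allows a fast but slightly less accurate model to be compared fairly with a slow but highly accurate one.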

This study will use Tensorflow [

All solutions will be developed and deployed in the data center at the Heinrich-Lanz-Center for Digital Health. The hardware configuration for this study is one Intel Xeon Silver 4110 CPU (Intel Corp), one Tesla V100 PCIe 32 GB GPU (Nvidia Corp), and 189 GB memory. The server is virtualized by Docker technology (Docker Inc) [

This study will provide an empirical guideline on how to accelerate a high-performance deep learning model without losing predictive power. Concretely, 2 results will be highlighted: (1) the performance improvement of an integer arithmetic–only deep learning framework for Gram stain image classification and (2) the optimal setting of fine-tuning and hyperparameters for 4 pretrained models (Mobilenet version 1 and 2 and Inception version 3 and 4). All models and the code for training and evaluation will be freely accessible in a public repository for reproducible research.

As of October 2019, this study has been approved by the institutional review board of Medical Faculty Mannheim of Heidelberg University, and the image data for the retrospective data analysis are available. The results are planned to be announced in the first quarter of 2021.

Distributed computing across multiple machines will not be covered in this study. Although it is the usual method to process big data, it is not always the most efficient choice to process the data. According to Boden et al [

This study does not aim to propose a novel neural network architecture, which would require many days of GPU processing time on state-of-the-art computational infrastructure that is not available within the scope of this project. Moreover, the design of better-performing architectures for image classification is a saturated topic, as many researchers have devoted their efforts to this problem over the last decade. Nevertheless, for those who are interested in this topic, Elsken et al published a state-of-the-art review paper [

An insufficient amount of image data could lead to an underpowered deep learning solution. The proper input data size is still an open question in the computer vision community. The answer is, “it depends.” It depends on the number of classes, image size, image quality, and complexity of the problem. For instance, classifying a black image versus a white image demands fewer input data compared with classifying a gram-positive image versus a gram-negative image.

In medical data analysis, power analysis is widely applied for determining the minimum sample size required. Unfortunately, power analysis is not applicable to unstructured data such as images. A rule of thumb for a good input size is 1000 images per class [

Although this study does not use any personal information for data analysis, the name of the input data consists of a unique identifier for the experiment. This experiment identifier harbors a remote risk of linking back to personal information in the database. In the interest of data protection, this identifier is anonymized and securely stored at the Heinrich-Lanz-Center for Digital Health, which is protected by the hospital network firewalls. Unlike data pseudonymization, which transforms the identifier, data anonymization is an irreversible technique that removes the identifier permanently. The anonymized data will be archived for reproducibility.
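The distinction between pseudonymization and anonymization can be illustrated with the following sketch. It does not describe the procedure actually used in this study; the keyed hash, the record layout with `name` and `label` fields, and the generated file names are assumptions for illustration only.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier, secret_key):
    """Transform the identifier with a keyed hash.

    The same identifier always yields the same token, so whoever holds the
    key can re-link records by recomputing the token.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def anonymize(records):
    """Remove the identifier: shuffle the records and rename them by position.

    No mapping from old to new names is kept, so the step cannot be undone
    from the output alone.
    """
    shuffled = list(records)
    secrets.SystemRandom().shuffle(shuffled)
    return [{"name": f"image_{i:06d}", "label": r["label"]}
            for i, r in enumerate(shuffled, start=1)]
```

In this sketch, the pseudonymized token remains a function of the original identifier, whereas the anonymized name carries no information about it.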

The study will comply with the latest version of the Declaration of Helsinki [

CPU: central processing unit

GPU: graphics processing unit

None declared.