Orlando G. Díaz-Ramos
Jonathan Katz
Ashley Gordon
Archan Khandekar
Hemendra N. Shah
Robert Marcovich
Julio Ojalvo
Sarvesh Saini
Ubbo Visser
Aravind Rathinam

Universidad Central del Caribe School of Medicine, Bayamón, Puerto Rico
University of Miami, Miami, Florida

Introduction

Computer vision for assistive technologies in ureteroscopy with laser lithotripsy (URS) often requires kidney stone identification as a first step in data processing and model training. Prior studies have developed models for this task, but diverse labeled data are needed to make these models more robust. In this study, we employ the “You Only Look Once” (YOLO) segmentation model, evaluate its performance, and share our experimental pipeline for the benefit of future researchers.

Methods

We performed an IRB-approved study in which we collected images from four URS procedures performed with the LithoVue™ ureteroscope. We extracted frames during laser lithotripsy at 30 frames per second and then selected the clearest images from each video while curating representative images from throughout each lithotripsy. We manually segmented the images using the VGG Image Annotator. We used 80% of the data to train a YOLOv8 segmentation model, 10% to test it, and 10% to validate it. Results were reported using mean average precision (mAP) and F1 score. The mAP50 metric reflects the model's ability to correctly identify and outline a stone when the predicted and ground-truth masks overlap by at least 50% (intersection over union ≥ 0.5). We compared our model with a previously published open-source kidney stone segmentation model.
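As a concrete illustration of the overlap criterion behind mAP50, the sketch below computes intersection over union (IoU) for two binary segmentation masks. The masks and sizes are invented for the example and are not drawn from our dataset.

```python
import numpy as np

def mask_iou(pred: np.ndarray, true: np.ndarray) -> float:
    """Intersection over union of two boolean segmentation masks."""
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(inter) / float(union) if union else 0.0

# A predicted stone counts as a match at the mAP50 threshold
# only when its IoU with the ground-truth mask is at least 0.5.
pred = np.zeros((10, 10), dtype=bool)
pred[2:8, 2:8] = True    # 6x6 predicted region
true = np.zeros((10, 10), dtype=bool)
true[4:10, 4:10] = True  # 6x6 ground-truth region
print(mask_iou(pred, true))  # 16 shared pixels / 56-pixel union ≈ 0.286
```

In this toy case the prediction would be rejected at the mAP50 threshold despite substantial overlap, which shows why mask placement, not just presence, drives the metric.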

Results

The segmentation model achieved an mAP50 of 0.979 on the validation subset and an F1 score of 0.94. In comparison, the previously published open-source model had an F1 score of 0.58 when applied to the same validation subset. The experimental pipeline was published on GitHub under an open-source license.
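For readers reconstructing the comparison, the F1 score is the harmonic mean of precision and recall. The detection counts below are hypothetical and chosen only to illustrate how a score of 0.94 arises; they are not taken from our results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 47 true positives, 3 false positives, 3 false negatives.
print(round(f1_score(tp=47, fp=3, fn=3), 2))  # → 0.94
```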

Conclusions

Our model outperformed another open-source segmentation model, with a significantly higher F1 score (0.94 vs. 0.58). The difference may be partially explained by variations in object mask representation. These results demonstrate that, for novel settings such as a new ureteroscope, segmentation models can be trained with data from as few as four procedures. However, models trained on datasets from other settings performed worse when applied to ours, highlighting the need for more robust and scalable models.