Video-Based Surgical Tool-Tip and Keypoint Tracking Using Multi-Frame Context-Driven Deep Learning Models

ISBI 2025

Rice University, Houston, TX, USA

Background

Robotic-assisted minimally invasive surgery (RMIS) is a rapidly growing field and is widely expected to shape the future of surgery. Tracking the trajectories of surgical tools during surgery is an important task that enables several downstream applications, such as (1) automated surgical skill and expertise assessment, (2) real-time delineation of safe working zones, and (3) augmented reality (AR) applications for surgical navigation.

Existing methods for surgical tool tracking rely mainly on kinematic data from the surgical robot, which is not always available. Vision-based methods are more flexible and can be used with a wider range of surgical robots; however, existing vision-based methods are limited by the need for high-quality annotations and by their lack of multi-frame context. In this work, we propose a novel multi-frame context-driven deep learning model for surgical tool-tip and keypoint tracking that leverages information from previous frames to improve tracking performance.

Methodology

We track surgical tool keypoints (e.g., tool-tips and tool jaw-bases) in a two-stage manner: (1) we first segment the keypoint regions (referred to as keypoint ROI segments) in the current frame using a multi-frame context-driven deep learning model, and (2) we then estimate each keypoint location as the centroid of the corresponding output segmentation blob.
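As a minimal sketch of stage (2), a keypoint can be recovered from a predicted ROI probability map as its blob centroid; the function name and threshold below are our own illustration, not taken from the paper.

import numpy as np

def keypoint_from_roi_mask(mask: np.ndarray, threshold: float = 0.5):
    """Estimate a keypoint location as the centroid of a predicted ROI blob.

    mask: (H, W) array of per-pixel keypoint-ROI probabilities for one keypoint class.
    Returns (x, y) pixel coordinates, or None if no pixel exceeds the threshold.
    """
    ys, xs = np.nonzero(mask > threshold)
    if xs.size == 0:
        return None  # keypoint not detected in this frame
    return float(xs.mean()), float(ys.mean())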


Our keypoint ROI segmentation model leverages multi-frame context from previous frames to improve segmentation performance. Using a K-frame window as input, we predict the following intermediate maps: (1) K per-frame keypoint ROI segmentation maps, (2) K-1 optical flow maps (using a pretrained RAFT model), and (3) K depth maps (using a pretrained Depth-Anything-v2 [2] model).
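A minimal sketch of how the intermediate maps for one K-frame window could be produced is shown below; the function name and the callable interfaces for the flow and depth models are our own assumptions, not the actual RAFT or Depth-Anything-v2 APIs.

import torch

def compute_intermediate_maps(frames, seg_backbone, flow_model, depth_model):
    """Produce intermediate maps for one K-frame window (oldest frame first, current frame last).

    frames:       (K, 3, H, W) tensor of video frames.
    seg_backbone: per-frame keypoint-ROI segmentation network, frames -> (K, C, H, W) logits.
    flow_model:   optical-flow estimator (e.g. a pretrained RAFT), treated here as a
                  callable (img1, img2) -> (N, 2, H, W) flow; an assumed interface.
    depth_model:  monocular depth estimator (e.g. Depth-Anything-v2), treated here as a
                  callable frames -> (K, 1, H, W) depth maps; an assumed interface.
    """
    K = frames.shape[0]
    seg_maps = seg_backbone(frames)        # (1) K per-frame keypoint ROI segmentation maps
    depth_maps = depth_model(frames)       # (3) K depth maps
    # (2) K-1 flow maps; we assume the current->previous direction so that previous-frame
    # maps can later be backward-warped into the current frame (used by MFCNet-W).
    curr = frames[-1:].expand(K - 1, -1, -1, -1)
    prev = frames[:-1]
    flow_maps = flow_model(curr, prev)     # (K-1, 2, H, W)
    return seg_maps, flow_maps, depth_maps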

These intermediate maps are passed to our fusion network, MFCNet, for which we propose two variants (see the sketch after this list):
MFCNet-Basic (MFCNet-B): All intermediate maps from the segmentation backbone, the optical flow model, and the depth estimator are concatenated and passed through a 4-layer CNN.
MFCNet-Warp (MFCNet-W): Using the K-1 optical flow maps, the intermediate segmentation outputs of the previous frames and their corresponding depth maps are warped to the current frame, then concatenated with the current frame's segmentation output and depth map before passing through the 4-layer CNN.
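Below is a minimal sketch of the two building blocks described above: flow-based backward warping of previous-frame maps into the current frame (as used by MFCNet-W) and a hypothetical 4-layer CNN fusion head over the concatenated maps. Layer widths, the flow-direction convention, and all names are our own assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_to_current(maps_prev, flow_curr_to_prev):
    """Backward-warp previous-frame maps (segmentation or depth) into the current frame.

    maps_prev:          (N, C, H, W) maps from the N = K-1 previous frames.
    flow_curr_to_prev:  (N, 2, H, W) flow that, for each current-frame pixel, points to its
                        location in the previous frame (assumed convention; with the
                        opposite convention the flow must be inverted).
    """
    N, _, H, W = maps_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(maps_prev.device)      # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow_curr_to_prev                        # sampling locations
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                                  # (N, H, W, 2)
    return F.grid_sample(maps_prev, grid, align_corners=True)

class MFCNetFusion(nn.Module):
    """Hypothetical 4-layer CNN head that fuses the concatenated intermediate maps."""

    def __init__(self, in_channels, num_keypoints, width=64):
        super().__init__()
        layers, ch = [], in_channels
        for _ in range(3):
            layers += [nn.Conv2d(ch, width, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            ch = width
        layers.append(nn.Conv2d(ch, num_keypoints, kernel_size=3, padding=1))  # final ROI logits
        self.net = nn.Sequential(*layers)

    def forward(self, stacked_maps):  # (B, in_channels, H, W) concatenated maps
        return self.net(stacked_maps)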

Pipeline

Acknowledgement

This work was supported by a Rice-Houston Methodist Seed Grant.