Copenhagen, Denmark
Onsite/Online

ESTRO 2022

Session Item

Monday
May 09
10:30 - 11:30
Poster Station 2
20: Head and neck
Annett Linge, Germany
Poster Discussion
Clinical
Explainability of deep learning-based HPV status prediction in oropharyngeal cancer
Agustina La Greca, Switzerland
PD-0820

Abstract

Explainability of deep learning-based HPV status prediction in oropharyngeal cancer
Authors:

Agustina La Greca1,2, Chiara Marchiori3, Marta Bogowicz1, Javier Barranco-García1, Ender Konukoglu4, Oliver Riesterer5,1, Panagiotis Balermpas1, Cristiano Malossi3, Matthias Guckenberger1, Janita E. van Timmeren1, Stephanie Tanadini-Lang1

1University Hospital Zurich, University of Zurich, Department of Radiation Oncology, Zurich, Switzerland; 2ETH Zurich, Department of Information Technology and Electrical Engineering, Computer Vision Laboratory, Zurich, Switzerland; 3IBM Research Zurich, AI Automation, Zurich, Switzerland; 4ETH Zurich, Department of Information Technology and Electrical Engineering, Computer Vision Laboratory, Zurich, Switzerland; 5Cantonal Hospital Aarau, Center for Radiation Oncology KSA-KSB, Aarau, Switzerland

Purpose or Objective

Patients with human papillomavirus (HPV)-positive oropharyngeal tumors have a more favorable prognosis than patients with HPV-negative tumors and may therefore be candidates for treatment de-escalation. In clinical practice, HPV diagnosis requires the analysis of biopsy samples, while medical image analysis tools have been proposed in the literature as complementary non-invasive methods. In this study, we aimed to assess the diagnostic accuracy and explainability of deep learning (DL) for HPV status prediction in computed tomography (CT) images of oropharyngeal cancer (OPC) patients.

Material and Methods

One internal (n1=96) and two public cohorts (n2=498; n3=146) of OPC patients were employed. The dataset was split in a stratified fashion based on HPV status into training (60%), validation (20%) and test (20%) sets. All CT scans were resampled to an isotropic voxel size of 2 mm and a sub-volume of 96x96x96 voxels was cropped. In the axial direction, the sub-volume spanned from the nasal columella to 96 voxels below, i.e., approximately the start of the lungs. On the axial plane, the crop was centered on the center of mass of the first cranial slice. ModelsGenesis, a publicly available 3D model pre-trained on lung CT, was fine-tuned to perform the classification task. The model with the highest F1-score on the validation set was selected and applied to the test set. Class activation maps (CAMs) of the test subjects belonging to the internal dataset (n=25) were obtained post-hoc by means of two explainability methods, Grad-CAM and Score-CAM. The CAMs were subsequently thresholded at their 70th and 90th percentile values to select the most important regions (CAM70th and CAM90th), and their volumetric overlap with the gross tumor volume (GTV) was calculated using the Szymkiewicz–Simpson overlap coefficient for the primary tumor (GTVpt) and the affected lymph nodes (GTVln), separately and together (GTVall).
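
The thresholding and overlap step described above can be illustrated with a minimal sketch. This is not the authors' code: the function names threshold_cam and overlap_coefficient are hypothetical, and the sketch assumes that the percentile is computed over all voxels of the cropped volume and that voxels at or above the cutoff are retained. The Szymkiewicz–Simpson overlap coefficient of two sets A and B is |A ∩ B| / min(|A|, |B|), so it equals 1 whenever the smaller region lies entirely inside the larger one.

```python
import numpy as np


def threshold_cam(cam: np.ndarray, percentile: float) -> np.ndarray:
    """Boolean mask of voxels whose CAM value is at or above the given percentile.

    Assumption: the percentile is taken over every voxel in the volume.
    """
    cutoff = np.percentile(cam, percentile)
    return cam >= cutoff


def overlap_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Szymkiewicz-Simpson overlap: |A ∩ B| / min(|A|, |B|)."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    smaller = min(int(a.sum()), int(b.sum()))
    if smaller == 0:
        # Undefined when either region is empty.
        return float("nan")
    return float(np.logical_and(a, b).sum()) / smaller


# Toy 10x10x10 volume standing in for a CAM (synthetic values, not study data).
cam = np.arange(1000, dtype=float).reshape(10, 10, 10)
cam90 = threshold_cam(cam, 90)  # keeps the 100 highest-valued voxels

# Hypothetical GTV mask that covers half of the thresholded CAM region.
gtv = np.zeros(1000, dtype=bool)
gtv[850:950] = True
gtv = gtv.reshape(10, 10, 10)

partial_overlap = overlap_coefficient(cam90, gtv)

# A GTV mask fully contained in the thresholded CAM region.
gtv_inside = np.zeros((10, 10, 10), dtype=bool)
gtv_inside[9, :, :5] = True
full_overlap = overlap_coefficient(cam90, gtv_inside)
```

Because the denominator is the size of the smaller region, a small GTV fully covered by a large CAM70th region scores 1 regardless of how much of the CAM lies outside the tumor, which is worth keeping in mind when comparing the 70th and 90th percentile results.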

Results

The model achieved an AUC/accuracy/F1-score of 0.89/0.82/0.78, 0.83/0.77/0.70 and 0.87/0.79/0.74 on the training, validation, and test cohorts, respectively. Figure 1 shows the visual explanation obtained after applying Grad-CAM for two test subjects. Among the 25 internal test cases, 19 were correctly classified. An overlap between GTVall and Grad-CAM70th of at least 0.8 was observed in 21 cases, while the same was true for 24 cases using Score-CAM70th. The overlap coefficients of GTVall with Grad-CAM90th and Score-CAM90th were at least 0.5 for 13 subjects. The mean overlap coefficients of the GTVpt, GTVln and GTVall with the different CAMs are shown in Table 1.



Conclusion

Two explainability methods were employed to explore which CT regions were the most relevant in HPV status prediction by a 3D DL model. Our study showed promising classification performance and volumetric overlap between the resulting heatmaps and the GTVpt and GTVln. These findings help address reliability concerns about DL in diagnostics and bring its application in a clinical setting closer.