Using supervised machine learning methods to automatically extract diagnostic information from skin cancer pathology reports for QIMR Berghofer Medical Research Institute.
Problem
Keratinocyte cancers are the most common cancers in Caucasian populations. In most jurisdictions these cancers are not routinely registered, so estimates of incidence are derived from administrative data that do not distinguish between basal cell carcinoma, squamous cell carcinoma and other diagnoses. Automated extraction of diagnostic information from pathology reports would provide timely and affordable incidence data at a population level.
Solution
We employed supervised learning methods to develop algorithms to classify diagnosis (BCC, SCC, keratoacanthoma and intraepidermal carcinoma), number of lesions, and site of lesions from free-text pathology reports. The resulting algorithms were incorporated into a web application capable of processing large numbers of pathology reports.
The training dataset included all pathology reports for participants (including non-skin lesions, benign skin lesions and melanoma). Separate supervised machine learning algorithms were developed for each classification task (i.e., diagnosis and site).
We developed a web application to upload pathology reports and analyse the free text on a local server. This web application can parse and analyse reports in the range of formats used by various laboratories.
To assess ‘real-world’ performance of the algorithms, we compared algorithm-derived output against ‘gold-standard’ data.
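A comparison of this kind can be sketched as per-label precision and recall, computed report by report. This is a minimal illustration only: the function name `per_label_scores`, the choice of metrics and the toy data are assumptions, not the measures or data used in the project.

```python
from collections import Counter

def per_label_scores(predicted, gold):
    """Compare predicted label sets with gold-standard label sets, one pair per report."""
    counts = Counter()
    labels = set()
    for pred, true in zip(predicted, gold):
        labels |= pred | true
        for label in pred & true:   # correctly assigned labels
            counts[label, "tp"] += 1
        for label in pred - true:   # labels assigned but not in the gold standard
            counts[label, "fp"] += 1
        for label in true - pred:   # gold-standard labels that were missed
            counts[label, "fn"] += 1

    scores = {}
    for label in sorted(labels):
        tp, fp, fn = (counts[label, key] for key in ("tp", "fp", "fn"))
        scores[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return scores

# Hypothetical toy data: each set holds the diagnoses for one report.
gold = [{"BCC"}, {"BCC", "SCC"}, {"SCC"}, set()]
predicted = [{"BCC"}, {"BCC"}, {"SCC", "BCC"}, set()]

scores = per_label_scores(predicted, gold)
for label, values in scores.items():
    print(label, values)
```

Treating each report as a set of labels lets a report with several lesions be scored partly right, rather than simply right or wrong.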
Challenges and Roadblocks
Pathology reports often discuss multiple lesions, which makes it very challenging to extract structured information from them. We implemented a multi-label classification algorithm, as this delivered a significant improvement over more traditional approaches.
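To illustrate what multi-label classification means here, the sketch below uses binary relevance: one yes/no classifier per diagnosis, so a single report can receive several labels or none. The specific technique (binary relevance with a tiny naive Bayes model), the class and function names, and the toy reports are all assumptions for illustration; the project's actual algorithm is not described in this summary.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class BinaryNaiveBayes:
    """Tiny multinomial naive Bayes for a single yes/no label."""

    def fit(self, docs, targets):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(targets)
        for doc, target in zip(docs, targets):
            self.word_counts[target].update(tokenize(doc))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, cls):
        total_words = sum(self.word_counts[cls].values())
        total_docs = sum(self.doc_counts.values())
        # Laplace smoothing on both the class prior and the word likelihoods.
        score = math.log((self.doc_counts[cls] + 1) / (total_docs + 2))
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are ignored
                score += math.log(
                    (self.word_counts[cls][tok] + 1) / (total_words + len(self.vocab))
                )
        return score

    def predict(self, doc):
        tokens = tokenize(doc)
        return self._log_score(tokens, True) > self._log_score(tokens, False)

def train_binary_relevance(docs, label_sets):
    """Train one independent binary classifier per label."""
    labels = sorted(set().union(*label_sets))
    return {
        label: BinaryNaiveBayes().fit(docs, [label in s for s in label_sets])
        for label in labels
    }

def predict_labels(models, doc):
    """Return every label whose classifier votes yes; may be empty."""
    return {label for label, model in models.items() if model.predict(doc)}

# Hypothetical toy training reports and their label sets.
train_docs = [
    "basal cell carcinoma nodular type",
    "squamous cell carcinoma well differentiated",
    "specimen a basal cell carcinoma specimen b squamous cell carcinoma",
    "benign naevus no malignancy",
]
train_labels = [{"BCC"}, {"SCC"}, {"BCC", "SCC"}, set()]

models = train_binary_relevance(train_docs, train_labels)
multi = predict_labels(
    models, "specimen 1 basal cell carcinoma specimen 2 squamous cell carcinoma"
)
benign = predict_labels(models, "benign naevus")
print(sorted(multi), sorted(benign))
```

A conventional single-label classifier would have to pick one diagnosis for the two-specimen report; the per-label design avoids that forced choice.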
Outcomes
Supervised learning methods were used to develop a web application capable of accurately and rapidly classifying large numbers of pathology reports for keratinocyte cancers and related diagnoses. In the absence of population-based skin cancer registration, this solution assists with accurately measuring subtype-specific skin cancer incidence.