Latent topic-driven cyber intelligence model for tactics, techniques, and procedures (TTPs) detection using hybrid framework and Birch-inspired optimisation.
Musaed Mutared Alanazi, Ainuddin Wahid Abdul Wahab, Mohd Yamani Idna Idris
Abstract
Early disruption of Advanced Persistent Threat (APT) campaigns hinges on recognising the attackers' tactics, techniques and procedures (TTPs), which remain stable even after individual indicators of compromise (IOCs) are rotated. Most prior studies derive such patterns from vendor-curated cyber-threat-intelligence reports, sources that can reflect commercial or geopolitical bias. To reduce this bias, a corpus of 2,097 malware samples confidently attributed to ten established APT groups was assembled, and every sample was detonated in a high-fidelity sandbox. The sandbox generates lengthy, unstructured text traces that capture both static artefacts and dynamic behaviour. These raw traces are then processed by LTDCT-TTPDBIO, a latent-topic model tuned with a Birch-inspired optimiser and paired with a random-forest classifier, which converts them into precise MITRE ATT&CK labels in about 1.45 min per sample. This latency is roughly 4.5 times shorter than that of the fastest baseline and six to eight times shorter than that of recent transformer- and graph-based approaches, yet the model still delivers the best detection quality among the methods compared. With an 80-20 train-test split, it reaches an accuracy of 95.33%, a precision of 97.32%, a recall of 94.61% and an F1-score of 95.65%; with a 70-30 split, the F1-score climbs to 95.78%. These figures outperform the eight baseline algorithms evaluated in this study, the best of which reaches an F1-score of 93.38%. The resulting structured behaviour ground-truth dataset quantifies how malware and tactics are distributed across the ten APT groups and pinpoints the most frequently observed techniques together with their defensive implications. The contributions are as follows: the bias-reduced malware-TTP corpus is published; LTDCT-TTPDBIO is introduced to turn raw sandbox logs into precise ATT&CK labels with low latency; and an extensive comparative evaluation is provided that links TTP prevalence to APT groups' capabilities.
Efficient extraction of TTPs from raw sandbox text therefore offers a durable and bias-resistant foundation for proactive APT defence.
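As a reading aid for the reported accuracy, precision, recall and F1-score, the following minimal sketch shows how such figures can be computed for per-sample technique labels. It is not the authors' evaluation code. The abstract does not state the averaging scheme, so macro averaging over labels is an assumption here, and the function name `macro_scores` and the toy technique IDs are purely illustrative.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Assumes one label per sample and macro averaging over labels;
    both are assumptions, not details taken from the paper.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def ratio(num, den):
        # Treat an undefined ratio (zero denominator) as 0.0.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = ratio(tp[label], tp[label] + fp[label])
        rec = ratio(tp[label], tp[label] + fn[label])
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(ratio(2 * prec * rec, prec + rec))

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: illustrative ATT&CK technique IDs, not results from the study.
y_true = ["T1059", "T1059", "T1055", "T1547", "T1547", "T1547"]
y_pred = ["T1059", "T1055", "T1055", "T1547", "T1547", "T1059"]
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"acc={accuracy:.4f} prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```

Under macro averaging, the F1-score is the mean of the per-label F1 values, so it generally differs from the harmonic mean of the averaged precision and recall. In the toy data above the two quantities are about 0.656 and 0.693 respectively.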