Family Ties: A Close Look at the Influence of Static Features on the Precision of Malware Family Clustering
Nov 7, 2025
Antonino Vitale
Kevin Van Liebergen
Juan Caballero
Savino Dambra
Platon Kotzias
Simone Aonzo
Abstract
Malware family clustering plays a crucial role in many security tasks, including malware analysis, classification, labeling, triage, threat hunting, and lineage studies. This work takes a close look at the influence on malware family clustering of 11 popular static similarity features, including whole-file fuzzy hashes (e.g., SSDeep, TLSH), structural hashes (e.g., PE Hash, Import Hash, VirusTotal’s VHash), certificate-based features, and icon-based features. Our goal is not to propose new features or clustering approaches. Instead, we aim to measure how often these 11 features make clustering errors, i.e., cluster together samples belonging to different malware families. We also investigate the root causes behind those errors, which often lead to misinterpretations of malware relationships, hinder effective threat detection, and propagate inaccuracies in downstream analyses. To study this phenomenon, we leverage three public datasets comprising 79,993 labeled Windows malware samples. We cluster those samples by using each of the analyzed features, measure their accuracy with a focus on their precision, and examine the reasons that caused some clusters to contain samples from different families. Our analysis identifies intrinsic limitations of some of the features and highlights the severe impact of EXE-building tools (like software protectors, installers, and self-extracting archives) on malware clustering. Finally, we discuss mitigations and evaluate potential improvements to address the problems we observed. Our findings provide a critical foundation for improving static malware clustering methodologies by emphasizing the importance of dataset curation and feature refinement for robust and precise clustering outcomes.
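As a rough illustration of the measurement described in the abstract, the sketch below groups samples that share the same value of a static feature and then computes clustering precision. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define precision, so the code assumes the common majority-family definition (the fraction of clustered samples that belong to the most frequent family in their cluster). It also assumes exact-match grouping, which suits hashes such as Import Hash but not fuzzy hashes like SSDeep or TLSH, which need a similarity threshold. The sample IDs, family names, feature values, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_feature(samples, feature):
    """Group sample IDs that share the exact same value of `feature`."""
    clusters = defaultdict(list)
    for sample_id, info in samples.items():
        value = info.get(feature)
        if value is not None:  # skip samples that lack the feature
            clusters[value].append(sample_id)
    return dict(clusters)


def clustering_precision(clusters, families):
    """Fraction of clustered samples in their cluster's majority family.

    Assumed definition; the abstract does not spell out the metric.
    """
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(families[s] for s in members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return majority / total


# Toy data: hypothetical IDs, family labels, and import-hash values.
samples = {
    "s1": {"family": "famA", "imphash": "aaa"},
    "s2": {"family": "famA", "imphash": "aaa"},
    "s3": {"family": "famB", "imphash": "aaa"},  # e.g., same installer stub
    "s4": {"family": "famB", "imphash": "bbb"},
    "s5": {"family": "famC", "imphash": None},   # feature not available
}
families = {sid: info["family"] for sid, info in samples.items()}

clusters = cluster_by_feature(samples, "imphash")
precision = clustering_precision(clusters, families)
print(precision)  # 0.75: one of the four clustered samples is in a mixed cluster
```

In this toy example, the cluster keyed by "aaa" mixes two families, which is the kind of clustering error the paper studies; a shared stub from an EXE-building tool is one plausible way such a collision could arise.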
Type
Publication
eCrime 2025