Family Ties: A Close Look at the Influence of Static Features on the Precision of Malware Family Clustering

Nov 7, 2025

Antonino Vitale

Kevin Van Liebergen

Juan Caballero

Savino Dambra

Platon Kotzias

Simone Aonzo


Abstract

Malware family clustering plays a crucial role in many security tasks, including malware analysis, classification, labeling, triage, threat hunting, and lineage studies. This work takes a close look at how 11 popular static similarity features influence malware family clustering. These features include whole-file fuzzy hashes (e.g., SSDeep, TLSH), structural hashes (e.g., PE Hash, Import Hash, VirusTotal’s VHash), certificate-based features, and icon-based features. Our goal is not to propose new features or clustering approaches. Instead, we aim to measure how often these 11 features make clustering errors, i.e., cluster together samples belonging to different malware families. We also investigate the root causes behind those errors, which often lead to misinterpretations of malware relationships, hinder effective threat detection, and propagate inaccuracies in downstream analyses. To study this phenomenon, we leverage three public datasets comprising 79,993 labeled Windows malware samples. We cluster those samples using each of the analyzed features, measure the accuracy of the resulting clusters with a focus on precision, and examine why some clusters contain samples from different families. Our analysis identifies intrinsic limitations of some of the features and highlights the severe impact of EXE-building tools (such as software protectors, installers, and self-extracting archives) on malware clustering. Finally, we discuss mitigations and evaluate potential improvements to address the problems we observed. Our findings provide a critical foundation for improving static malware clustering methodologies by emphasizing the importance of dataset curation and feature refinement for robust and precise clustering outcomes.
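The measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an exact-match feature (for example, an Import Hash string), so that samples sharing the same value fall into the same cluster, and it assumes the commonly used definition of clustering precision: the fraction of samples that belong to the majority family of their cluster. The function names, sample identifiers, feature values, and family labels are all hypothetical. Fuzzy hashes such as SSDeep or TLSH would additionally need a similarity threshold, which this sketch does not cover.

```python
from collections import Counter, defaultdict


def cluster_by_feature(samples):
    """Group (sample_id, feature_value, family) tuples by exact feature value."""
    clusters = defaultdict(list)
    for sample_id, feature_value, family in samples:
        clusters[feature_value].append((sample_id, family))
    return dict(clusters)


def clustering_precision(clusters):
    """Fraction of samples that match the majority family of their cluster."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(family for _, family in members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return majority / total


def mixed_clusters(clusters):
    """Return only the clusters that contain more than one family."""
    return {
        value: members
        for value, members in clusters.items()
        if len({family for _, family in members}) > 1
    }


# Hypothetical toy data: (sample id, feature value, ground-truth family).
toy_samples = [
    ("s1", "hashA", "family_x"),
    ("s2", "hashA", "family_x"),
    ("s3", "hashA", "family_y"),  # clustering error: a different family shares hashA
    ("s4", "hashB", "family_y"),
    ("s5", "hashC", "family_x"),
]

toy_clusters = cluster_by_feature(toy_samples)
toy_precision = clustering_precision(toy_clusters)  # (2 + 1 + 1) / 5
toy_mixed = mixed_clusters(toy_clusters)            # only the hashA cluster
```

In this toy run, the cluster keyed by `hashA` is the kind of mixed-family cluster whose root cause would then be inspected by hand, for instance to check whether a shared installer or protector produced the common feature value.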

Type

Publication

eCrime 2025

Last updated on Sep 13, 2023

Computer Science - Cryptography and Security
