I’ve been extremely excited about the MITRE ATT&CK evaluation since MITRE decided to open it up to vendors earlier this year. The endpoint detection and response (EDR) market represents the direction of endpoint security, yet the state of endpoint efficacy testing has been underwhelming.

• Antimalware testing has become a standard part of the endpoint protection (EP) space, but it’s frequently been observed that a majority of vendors score over 99% in efficacy testing. This isn’t a comparison; it’s a benchmark.
• NSS Labs has done some testing of EDR products, but its latest report failed to include multiple noteworthy vendors. Correlation is not causation, but it should also be noted that NSS is currently in a lawsuit against many of these conspicuously missing vendors, seeking to block attempts by the Anti-Malware Testing Standards Organization (AMTSO) to create what AMTSO claims is a standardized and transparent testing methodology.
• Analyst firms, such as Forrester, don’t do rigorous efficacy testing. When I do a Forrester Wave™, I’m comparing features, strategy, and client satisfaction, and providing demo scripts that allow me to infer the efficacy of these solutions, but I’m not creating test environments and throwing exploits at the systems.

I was disappointed at the lack of fanfare that accompanied the release of these results last week. My initial excitement about this testing was that I would have fair and transparent test results to use as an individual evaluation criterion in my upcoming Forrester Wave evaluations on EDR. I got exactly what I wanted from these test results: a detailed, technical account of how these products performed under attack simulation, which allows me to make my own assessment of their efficacy. I also realized that, without a scoring or ranking system, this evaluation was inaccessible to many buyers. Beware the old curse, “May you get everything you want.” To support the community at large, and hopefully bring more visibility to what MITRE has accomplished with this evaluation, I’ve gone through the results and developed a repeatable methodology for scoring the vendors based on the 56 ATT&CK techniques analyzed using 136 procedures in the evaluation.

Methodology

I began by parsing the JSON-formatted data provided with each evaluation and dumping the qualitative descriptions recorded for each procedure (a rough sketch of this step appears after the rubric below). Using these qualitative descriptions and the documentation available on the evaluation website, I developed the following scoring criteria (similar to how I would approach a Wave):

5 – Alerting. An adversary attacks your system, is detected, and an alert is generated in response. This is what you expect to be paying for when you invest in these products.

3 – Delayed detection or real-time enrichment. The product couldn’t generate an alert in real time, but it brings the issue to your attention eventually. This is probably coming from a managed service or some other post-processing that generates alerts. Alternatively, an operation didn’t warrant its own alert, but detection did happen in real time, and this information was associated with another alert for further context.

1 – Threat-hunting capabilities. The telemetry exists to allow a threat hunter to detect the adversary after the fact. There’s no alerting because there’s no detection, but at least the data exists to reconstruct the crime scene while you’re sending out breach notifications.

0 – No detection, or the product requires configuration changes to expose data not usually available to the user. These types of configuration changes have value but are frequently only deployed during an active digital forensics investigation by the vendor itself. Since you’re not making these configuration changes in your day-to-day environment, I’m not providing additional credit for this nuance.
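
To make the parsing step and the rubric concrete, here’s a minimal sketch in Python. The JSON field names (`Techniques`, `Steps`, `DetectionCategories`, `Notes`) and the detection-type labels in the lookup table are placeholders I’ve assumed for illustration; the actual evaluation files use MITRE’s own schema and category names, so the keys would need to be adjusted to match.

```python
import json
from pathlib import Path

# Placeholder rubric: keys are illustrative labels, not MITRE's exact category names.
RUBRIC = {
    "alert": 5,                 # real-time alert generated
    "delayed_detection": 3,     # e.g., managed service or other post-processing
    "enrichment": 3,            # real-time context attached to another alert
    "telemetry": 1,             # data exists for after-the-fact threat hunting
    "configuration_change": 0,  # data only exposed after configuration changes
    "none": 0,                  # no detection
}

def load_procedures(results_path):
    """Pull per-procedure detection types and notes out of a vendor's results file.

    NOTE: "Techniques", "Steps", "DetectionCategories", and "Notes" are assumed
    field names for illustration; the real evaluation JSON has its own schema.
    """
    data = json.loads(Path(results_path).read_text())
    procedures = []
    for technique in data.get("Techniques", []):
        for step in technique.get("Steps", []):
            procedures.append({
                "detection_types": step.get("DetectionCategories", []),
                "notes": step.get("Notes", ""),
            })
    return procedures

def score_detection(detection_type):
    """Map a single reported detection type onto the 0-5 rubric above."""
    return RUBRIC.get(detection_type.strip().lower().replace(" ", "_"), 0)
```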

It should be noted that all vendors were scored against the same number of procedures, but in some cases a single procedure produced multiple detection types and therefore multiple scores. An example of this may be found here, in which Endgame both generated telemetry and alerted on a particular event. Additively applying the quantitative scores above for each reported detection type did not change the ranking of the vendors, but I elected to score the vendors strictly on their highest-scoring detection type to keep each procedure on a 0-to-5 scale.
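
In code, that choice is just a max over the detection types reported for each procedure, summed across the evaluation. The sketch below reuses the placeholder helpers from above, and the example file name is hypothetical.

```python
def score_vendor(procedures):
    """Score a vendor by taking the highest-scoring detection type per procedure,
    then summing across procedures, so each procedure stays on a 0-5 scale."""
    total = 0
    for proc in procedures:
        detection_types = proc.get("detection_types") or ["none"]
        total += max(score_detection(d) for d in detection_types)
    return total

# Example usage (hypothetical file name):
# print(score_vendor(load_procedures("endgame_results.json")))
```

Under this scheme, a vendor that alerted in real time on all 136 procedures would top out at 680, while telemetry-only coverage across the board would land at 136.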

Conclusion

I’m not revealing results in this blog, but I am making the code I used to quantify solution efficacy from the MITRE ATT&CK evaluations available on GitHub so you can run the checks and see the results yourselves. Keep in mind that efficacy testing is a bit of a Holy Grail in that the results only tell you part of the story. For instance, because this evaluation uses positive testing to check for alerts, it favors false-positive-prone solutions that alert more frequently. Look forward to a report in the coming weeks with a detailed analysis of the scoring, other findings, and a deeper dive into the importance of what MITRE has accomplished for the industry with this round of testing.