Function start detection in stripped binaries is an undecidable problem. This should be no surprise. Many problems in the program analysis domain fall into this category. Add to that the numerous types of CPU architectures, compilers, programming languages, application binary interfaces (ABIs), etc. and you’re left with an interesting, multifaceted, hard problem. Accurate detection of both function starts and the low-level basic blocks is often the first step in program analysis. Performing this task accurately is critical. These foundational analysis artifacts are often the starting point for many automated analysis tools.
Being undecidable, there is no single algorithm to identify all function starts across the wide variety of program binaries. Among some of the approaches enlisted to try and solve this problem are machine learning algorithms (e.g. Byteweight and Neural Networks) which use signature based pattern recognition. At first glance the results of these learners look promising but they are often biased to their training set.
Other approaches to function detection (e.g. Nucleus and Function Interface Analysis) are more sensible. Their research is focused on control-flow, data-flow, and function interface analysis rather than signature detection. Yet, these solutions fall short of addressing the entire problem domain across architectures in a generalized way.
For example, the Nucleus approach can be improved upon since it relies on heuristics to handle indirect control flow. What’s wrong with heuristics? They must be updated constantly for the latest compiler constructs, and can frequently produce invalid results, especially in highly optimized binaries. Even today, many tools rely on heuristics. That’s a dilemma. Even though they solve a specific problem really well, they rarely generalize to solve an entire class of problems.
Read more...