Using Static Code Metrics to Model LLM Test Creation Ability
A Project for Software Verification Class done with Hengchen Yuan and Arsh Guntakal
Introduction and Purpose
Companies nowadays are leveraging LLMs to perform all sorts of coding work, such as code generation, documentation, bug annotation, and test creation. This increase in LLM usage can result in thousands or even tens of thousands of LLM runs per day, which can incur significant costs, whether through API calls, cloud-based solutions, or in-house machines. This raises an important question: what size of LLM is necessary for the task? Suppose you want to select a model to generate tests for your code. If the code under test is complex, do you need a model with 8 billion parameters? Can you get away with a 1.5-billion-parameter model if it's not complex? What does "complex" even mean in this scenario? This tradeoff is important to understand because larger models can be prohibitively expensive at scale. For example, 100 runs of a 2,500-word input using GPT-4 can exceed $17, whereas a smaller model like Google Gemini 1.5 could cost around $0.06 [gptforwork.com] (as of the end of 2024).
One thought is whether an adaptive approach could be employed, selecting smaller models for simpler code and reserving larger models for more complex scenarios. If so, how do we meaningfully define "simple" versus "complex" code in a way a computer could differentiate? This project proposes that static code metrics can capture the "complexity" of code and be used to measure the relative performance of different models (or different sizes of the same model). By understanding this relationship, we can gauge how well LLMs might perform in various scenarios, optimizing cost and performance.
Data Collection Setup
Top Level
To understand this relationship, we need to follow a structured approach. First, we obtain the code under test and generate test cases using an LLM. Next, we execute these tests to evaluate their effectiveness and compare their performance against static code metrics. By repeating this process across multiple LLM models, we can analyze how these metrics correlate with different models' test generation capabilities. The ultimate goal is to determine whether static code metrics can reliably indicate the quality of generated tests. If they prove significant, these metrics could enable a higher-level system to dynamically select the most suitable LLM for a given code snippet, optimizing compute resources on the fly.
Figure 1. System Design
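As a rough sketch, the end-to-end loop in Python looks something like the following (the helper functions are hypothetical stand-ins for the components described in the sections below):

```python
# Hypothetical top-level loop tying the pieces together; each helper is a
# stand-in for a component described in the sections below.
def evaluate(models, data_structures, run_static_analysis, generate_tests, run_tests):
    results = []
    for code in data_structures:
        metrics = run_static_analysis(code)            # CK metrics for the class under test
        for model in models:
            suite = generate_tests(model, code)        # LLM-generated JUnit suite
            outcome = run_tests(suite, code)           # did it compile/run? what coverage?
            results.append({"class": code, "model": model, "metrics": metrics,
                            "success": outcome["success"], "coverage": outcome["coverage"]})
    return results  # later correlated, metric by metric, with success and coverage
```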
Code Under Test
The code we used to test our setup and collect data consisted of eleven data structures: Binary Tree, Bin Heap, DAG, Disjset, Doubly Linked List, FibHeap, Heap Array, Red Black Tree, Search Tree, Singly Linked List, and Sorted List. This code was taken "as-is" and wasn't modified to make the LLM perform better. For example, if the model had trouble with private fields, we wouldn't switch them to public, and if the model had difficulty with illegal input, we wouldn't add comments to guide it. This best emulates a real-world use case, such as a CI/CD pipeline where an LLM would constantly be testing changes to a code base that has not been optimized for LLMs.
The code library included a "repOk" method for all data structures, which we used to help the LLMs generate the tests. When making a test, the LLM was instructed to use this method to verify the correctness of the data structure when running the test.
Static Analysis
We utilized the Chidamber & Kemerer object-oriented metrics suite (https://github.com/mauricioaniche/ck) as our static analyzer tool. CK is a static analysis tool that gathers metrics specific to object-oriented programming languages like Java, such as lines of code (LOC), number of fields, and quantity of loops, without explicitly executing any code. This was run once over each of the data structures to get a table of characteristics that we could relate to performance for specific models.
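As a sketch, CK can be invoked from a Python harness roughly like this (the jar path and output directory are placeholders, and the positional CLI arguments follow the CK README, so they may differ between CK versions):

```python
import subprocess
import pandas as pd

CK_JAR = "ck-with-dependencies.jar"   # placeholder path to the CK jar
PROJECT_DIR = "data-structures/"      # placeholder path to the code under test
OUT_DIR = "ck-output/"

# CK README usage: <project dir> <use jars> <max files per partition> <field metrics> <output dir>
subprocess.run(["java", "-jar", CK_JAR, PROJECT_DIR, "false", "0", "true", OUT_DIR], check=True)

# CK writes one row per class; columns include loc, dit, fanout, loopQty,
# assignmentsQty, privateFieldsQty, innerClassesQty, and so on (names vary by version).
class_metrics = pd.read_csv(OUT_DIR + "class.csv")
print(class_metrics[["class", "loc", "dit", "fanout"]].head())
```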
Code Coverage Collection
We had two measures of a model's effectiveness at generating a test suite for some code under test: first, whether it produced a test suite that could compile and run, and second, the test suite's code coverage.
Both dynamic and static coverage metrics were used for the evaluation. We utilized JaCoCo, a widely recognized Java code coverage library for dynamic coverage assessment. JaCoCo effectively instruments bytecode to monitor and measure various coverage criteria, including instruction coverage, branch coverage, and method coverage, during test execution. To incorporate JaCoCo into the target projects, we automatically injected the JaCoCo dependency into the project's pom.xml file. Subsequently, we leveraged Maven plugins to execute the JaCoCo workflow, generate comprehensive coverage reports, and extract metrics.
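A minimal sketch of this step is below; the plugin version, the naive injection approach, and the report path are assumptions rather than our exact setup.

```python
import subprocess
import pandas as pd

JACOCO_PLUGIN = """
<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <version>0.8.11</version>
  <executions>
    <execution><goals><goal>prepare-agent</goal></goals></execution>
    <execution><id>report</id><phase>test</phase>
      <goals><goal>report</goal></goals></execution>
  </executions>
</plugin>
"""

def inject_jacoco(pom_path="pom.xml"):
    # Naive injection: assumes the pom already has a <plugins> section.
    pom = open(pom_path).read()
    if "jacoco-maven-plugin" not in pom:
        pom = pom.replace("</plugins>", JACOCO_PLUGIN + "</plugins>", 1)
        open(pom_path, "w").write(pom)

def branch_coverage(project_dir="."):
    # Run the tests with the JaCoCo agent attached, then read the CSV report.
    subprocess.run(["mvn", "-q", "test"], cwd=project_dir, check=True)
    report = pd.read_csv(f"{project_dir}/target/site/jacoco/jacoco.csv")
    covered = report["BRANCH_COVERED"].sum()
    missed = report["BRANCH_MISSED"].sum()
    return covered / (covered + missed) if covered + missed else 0.0
```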
In addition to dynamic coverage, we considered the time-sensitive nature of specific scenarios by incorporating static coverage metrics into our analysis. Specifically, we adopted the recently proposed Object State Coverage metric to evaluate the extent to which the generated tests explore the state space of the class under test. This static coverage was determined by utilizing JavaParser to perform reachability analysis for the fields of target classes within the generated test source code, thereby calculating the coverage ratio of the target class's fields.
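Our implementation used JavaParser on the generated test source; the sketch below is only a crude, regex-based approximation of the same idea (declared fields of the target class matched by name in the test code), not the actual reachability analysis.

```python
import re

def object_state_coverage(class_source: str, test_source: str) -> float:
    """Approximate Object State Coverage: the fraction of the target class's
    declared fields that are referenced by name somewhere in the test source.
    The real analysis uses JavaParser-based reachability; this regex version
    requires an explicit visibility modifier on field declarations, so it
    misses package-private fields and indirect accesses."""
    decl = re.compile(
        r"^\s*(?:public|private|protected)\s+(?:static\s+|final\s+)*[\w<>\[\], ]+\s+(\w+)\s*(?:=|;)",
        re.MULTILINE,
    )
    fields = set(decl.findall(class_source))
    if not fields:
        return 0.0
    touched = {f for f in fields if re.search(rf"\b{re.escape(f)}\b", test_source)}
    return len(touched) / len(fields)
```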
Prompt Creation
We created three prompts: one for the first pass, which contained most of the initial instructions; a second prompt used to ask the LLM to create additional tests; and a final one, the system prompt, which gives the model the task context.
System Prompt: "You are Qwen, created by Alibaba Cloud. You are a helpful assistant which should return correct Java code. You cannot use try catch statements. You can only use methods declared in the Java code, and declare a single object for each test. You can only use assertTrue with the repOK_complete method."
Initial Prompt: "Generate a couple basic standalone JUnit tests that efficiently cover the following Java code. Each test should be fully encapsulated, meaning it should not depend on any external state, other tests, or outer definitions. Only use public methods that are defined within the Java code itself to create each test. The tests should follow the structure of first declaring the object, second inserting nodes, and last checking the method "repOK_complete" returns true. All "repOK_*" methods must return true. Do not attempt to make a failing case. The top class should be called <test_name>. Here is the Java code: <initial_code>"
Secondary Prompt: "Can you create one or two additional standalone JUnit tests that efficiently cover the following Java code other than the ones previously created. Only using public methods that are defined within the Java code itself to create each test. You cannot use private methods or fields. The tests should follow the structure of first declaring the object, second inserting nodes, and last checking the method "repOK_complete" returns true. All "repOK_*" methods must return true. Do not attempt to make a failing case."
Much of this prompt wording came from trial and error. For example, at first the LLM tried to create too many tests and would never produce a suite that could run, so we asked it to "Generate a couple basic standalone JUnit tests". The smaller models would try to use private methods and directly modify internal fields of the objects, so we asked them to "Only use public methods that are defined within the Java code". These prompts gave the model guard rails to use the defined API of the code, giving the highest likelihood of a working test.
The prompts were formatted into a chat-like data structure that the LLM could understand: an array of messages, each tagged with the role of the entity producing it. The role is system for the system prompt, user for our generated prompts, and assistant for the LLM-generated responses (only used for the multi-pass experiments detailed later). The secondary prompt doesn't need much additional context because the chat object already contains the code and the previously generated tests.
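Concretely, the chat object looked roughly like this (variable names are illustrative):

```python
# One chat "conversation" per test-generation attempt; the assistant entry
# is only present in the multi-pass experiments described later.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": initial_prompt},         # includes the Java code under test
    {"role": "assistant", "content": first_test_suite},  # the LLM's first-pass tests
    {"role": "user", "content": SECONDARY_PROMPT},       # "create one or two additional tests..."
]
```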
LLM Model
We used the Qwen/Qwen2.5-Coder-(7/1.5/0.5)B-Instruct models from Alibaba, hosted on huggingface.com. We kept the model family consistent and varied only the size, to remove the variability of how different models might perform. Future research can determine whether the metrics that predict LLM performance are model-agnostic, showing similar correlations across models. These models are additionally trained on code and instruction-tuned to follow task directions. They have a context window of 32k tokens, which can be expanded using RoPE scaling; however, we had very poor results with this style of context scaling.
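A minimal sketch of loading and querying one of these models through the Hugging Face transformers library (the generation settings are illustrative, not our exact configuration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swap the size suffix (0.5B / 1.5B / 7B) to vary only the model scale.
MODEL_ID = "Qwen/Qwen2.5-Coder-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def generate_reply(messages, max_new_tokens=2048):
    # apply_chat_template converts the role/content list into Qwen's prompt format.
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```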
Experimental Setup
Initially, we let the LLM continuously make tests over multiple iterations; however, we found that it hallucinated to the point of failure when repeatedly asked to make more tests. So we split the experiment into two parts: first, see how well the model does with a single pass at making the test suite, and second, how many iterations the model could make before it created failing tests (and whether it could produce tests with better coverage when asked to improve upon the suite).
Single-Pass Experiments
We performed 50 runs using the initial prompt (detailed above) to generate a test suite for each data structure, collecting each test suite's success and coverage. The LLM was run and the test suite created from Python: infrastructure around the Python script took the LLM response, exported the Java test to a file, and then ran the test to collect results.
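A sketch of that infrastructure is below; the paths and the Maven invocation are simplified assumptions.

````python
import re
import subprocess

def extract_java(response: str) -> str:
    # The model usually wraps its answer in a ```java ... ``` fence;
    # fall back to the raw response if no fence is found.
    match = re.search(r"```(?:java)?\s*(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def save_and_run(response: str, test_name: str, project_dir: str) -> bool:
    code = extract_java(response)
    with open(f"{project_dir}/src/test/java/{test_name}.java", "w") as f:
        f.write(code)
    # Run only the generated suite; success means it compiled and all tests passed.
    result = subprocess.run(["mvn", "-q", "test", f"-Dtest={test_name}"],
                            cwd=project_dir, capture_output=True)
    return result.returncode == 0
````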
Multi-Pass Experiments
We performed 25 runs of the following sequence (sketched in code after the example below):
1. Do a single pass, the same as above.
2. Measure the results of the initial test suite.
3. If successful, run the prompt asking for additional tests.
4. Measure the results of the merged test suite.
5. If successful, go to 3. If not, stop and save the last successful test suite.
For example, if the LLM could perform 3 successful loops and then the test suite produced by the 4th failed, the saved results would be the merged test suite from the 3rd loop.
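As referenced above, a sketch of this loop is below (merge_tests and the cycle cap are hypothetical details; generate_reply and save_and_run come from the earlier sketches):

```python
def multi_pass(system_prompt, initial_prompt, secondary_prompt, project_dir, max_cycles=10):
    """Keep asking for additional tests until a merged suite fails to run;
    return the last merged suite that still succeeded and the cycle count."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": initial_prompt}]
    last_good, cycles = None, 0
    while cycles < max_cycles:
        reply = generate_reply(messages)
        suite = merge_tests(last_good, reply)   # hypothetical: fold new tests into the suite
        if not save_and_run(suite, "GeneratedTest", project_dir):
            break                               # the first failing merged suite ends the run
        last_good, cycles = suite, cycles + 1
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": secondary_prompt}]
    return last_good, cycles
```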
Results
Models' Ability to Create Working Tests
For the first set of results, we looked at each model's ability to create a working test suite for each data structure. These had some pretty straightforward results, with the larger models having a better success rate than the smaller models. The 7B-sized model achieved a high success rate of 70% to 100%, with most being above 80%. The 1.5B model got about a 40% to 70% success rate, and the 0.5B model got a poor success rate of 0% to 20%. The interesting part, however, is the different metrics that correlate with the success rate of each sized model, which is detailed more below.
For the multi-pass runs, only the 7B model averaged about 3 cycles before it started failing (sometimes due to context-window limits), whereas the 0.5B and 1.5B models averaged around 1 cycle or fewer. Because of this, only the 7B model was considered in the multi-pass results.
Next, we took the static metrics of each data structure under test and correlated them with either the success rate (for the 0.5B, 1.5B, and 7B models) or the average number of successful passes (for the 7B multi-run model). For each model, we found 5 or 6 metrics that correlated significantly with the results. Listed below are the five most significant metrics per model (using Pearson correlation); all have a p-value below 0.15, and most are below 0.10. These p-values aren't ideal, but they are promising given that each correlation is computed over only 11 data points per model.
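A minimal sketch of this correlation step, assuming the CK metrics and the per-data-structure success values are already in a pandas DataFrame and Series:

```python
import pandas as pd
from scipy.stats import pearsonr

def top_correlated_metrics(metrics: pd.DataFrame, success: pd.Series, k: int = 5):
    """metrics: one row per data structure (CK metrics); success: that model's
    success rate (or average passes) on each data structure. Returns the k
    metrics with the smallest Pearson p-values as (name, r, p) tuples."""
    numeric = metrics.select_dtypes("number")   # drop class-name style columns
    rows = []
    for col in numeric.columns:
        if numeric[col].nunique() > 1:          # constant columns have undefined correlation
            r, p = pearsonr(numeric[col], success)
            rows.append((col, r, p))
    return sorted(rows, key=lambda t: t[2])[:k]
```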
Below is a table showing the top metrics with the correlation direction in parentheses:
Table 1. Static Metrics With High Correlation to Successful Test Generation (The ck Github has detailed descriptions of the metrics)
Interestingly, the most significant metrics for the smaller models were negatively correlated and mainly dealt with structural characteristics like lines of code, private fields, or assignments, all natural properties of code. As the model size increased, however, more metrics were positively correlated, and these positive metrics could be tied to better OOP practices like Depth of Inheritance Tree (DIT), Fanout, or Inner Classes. It's almost as if the smaller models had a hard time grasping the code structurally, whereas the larger models could take advantage of the good coding practices in place. Analysis like this can also give guidance on how to make code more LLM-friendly.
These metrics will come up again when we build linear estimators to see whether they can model the success rate of the tests. With the linear estimators, we look at these metrics used together rather than as one-to-one correlations.
Models' Ability to Create Tests with Good Coverage
Next, we looked at the models' ability to produce tests with good code coverage. Most of the dynamic coverage metrics followed the same pattern, so to make analysis easier, we focused on branch coverage plus static Object State coverage. These results cover only the tests that ran successfully.
We looked at which model gave the best coverage in three metrics (mean dynamic, max dynamic, and mean static) for each data structure and counted how many times each model won a metric. We did this both including and excluding the multi-run model, specifically to see the multi-run model's effectiveness.
In Table 2, the number to the left of the arrow is how many times each model won with the multi-run model excluded; the number to the right is the count once the multi-run model is included. The change between the two is shown in parentheses.
Table 2. Number of Wins for Each Model in Each Coverage Metric
These results are interesting because the smallest model was able to win in many cases for both mean and max branch coverage. Another interesting result was that the larger models always achieved better static coverage. Results like these support the hypothesis that assuming a larger model will always give the best results isn't just inefficient; it may also be incorrect. They could also point to the prompt being too restrictive regarding test structure: since the larger model follows instructions better, it is more confined to what the prompt asks for, while the smaller models sometimes ignore parts of the prompt and can create better tests through more maverick approaches.
These results also show that LLMs can build upon themselves and create fuller test suites when asked to expand on their already generated test suites. This is seen in the multi-run model winning in many of the branch mean cases and taking some of the branch max cases away from the smaller models. However, multiple runs of the larger models do come with a computation cost.
Next, we looked at the correlation of code metrics with a combined score that gave equal weight to the three coverage metrics above (a sketch of the scoring is shown below). The correlation results for this score are in Table 3.
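One natural reading of that equal weighting, as a sketch (inputs are per-data-structure coverage values):

```python
def combined_score(branch_mean: float, branch_max: float, static_mean: float) -> float:
    # Equal weight to mean dynamic branch coverage, max dynamic branch
    # coverage, and mean static Object State coverage.
    return (branch_mean + branch_max + static_mean) / 3.0
```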
Table 3. Static Metrics With High Correlation to Test Coverage
Overall, we found that many of these metrics significantly correlated with the models' ability to achieve effective code coverage. The results aligned with expectations: as code complexity and size increased, the LLM faced greater challenges in generating tests that provided comprehensive coverage. They also point to the fairly obvious fact that the more there is to cover, the worse the coverage tends to be.
However, these findings can still be leveraged to identify aspects of your code that impact an LLM's ability to produce high-coverage tests. By understanding these factors, developers can better anticipate where LLMs might struggle and adjust their approach accordingly. We will also reuse these metrics when building the linear estimators.
Using These Results
The results above show the one-to-one correlations between code metrics and an LLM's ability to create a test suite that runs successfully or achieves good coverage. The number of metrics that correlated significantly with the results was promising. However, to get an idea of whether one could build a model from these metrics, we perform an OLS regression of the success/coverage rates on all of these metrics together. This shows how well a model using all of these parameters jointly could be expected to perform.
We will primarily look at the Adj-R2 value, which is R2 adjusted for the number of parameters included, and the F-statistic. The F-statistic tells us whether using these parameters is significantly better than an intercept-only model (i.e., at least one of the linear coefficients should be non-zero).
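A minimal sketch of this regression step with statsmodels (the inputs, the selected CK metrics and the target score, are assumptions about data layout):

```python
import statsmodels.api as sm

def fit_ols(selected_metrics, target):
    """selected_metrics: DataFrame of the chosen CK metrics (one row per data
    structure / model run); target: the success rate or coverage score being predicted."""
    X = sm.add_constant(selected_metrics)          # include an intercept term
    model = sm.OLS(target, X).fit()
    print("R2 / Adj-R2:", model.rsquared, model.rsquared_adj)
    print("F-stat:", model.fvalue, "Prob (F-stat):", model.f_pvalue)
    return model
```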
The first OLS model took the top 32 metrics that correlated significantly with coverage ability, along with which LLM was run, and used them to predict how much coverage we should be able to get. The inputs are the code metrics and the model, and the output is a coverage score. Below are the results:
Results 1. OLS of Table 3 Metrics versus Coverage Score
R2/Adj-R2 0.769/0.679
F-Stat 8.585
Prob (F-Stat) 7.3e-07
We see that the model has a strong R2 but a weaker adj-R2, meaning it probably included too many parameters. However, the F-stat is very strong, showing that a model using these parameters should be able to significantly predict the coverage one could expect from these metrics.
The second set of OLS models we made regressed each model's five most significant metrics against its success metric, whether the success rate of the test running or the number of cycles the model could run. Below are the results.
Results 2. OLS of 0.5B Table 1 Metrics versus Test Success
R2/Adj-R2 0.529/0.136
F-Stat 1.348
Prob (F-Stat) 0.359
Results 3. OLS of 7B Table 1 Metrics versus Test Success
R2/Adj-R2 0.958/0.933
F-Stat 39.52
Prob (F-Stat) <0.001
Results 4. OLS of 1.5B Table 1 Metrics versus Test Success
R2/Adj-R2 0.885/0.819
F-Stat 13.48
Prob (F-Stat) 0.002
Results 5. OLS of 7B Multi Table 1 Metrics versus Test Success
R2/Adj-R2 0.886/0.792
F-Stat 9.365
Prob (F-Stat) 0.008
Overall, even these five-parameter models perform quite well at predicting a model's ability to generate tests that run successfully, with the exception of the 0.5B model. This is most likely due to its very low success rate; however, the correlations still give good insight into which static metrics drive its ability to make successful tests.
These OLS regressions demonstrate how identifying and combining correlated code metrics can potentially build an effective model for predicting an LLM's success in generating runnable tests and the coverage those tests achieve. However, estimating the costs of such a system is highly context-dependent, as different implementations may prioritize specific metrics over others. Factors like whether the LLMs are running via API calls, on cloud platforms, or on native machines significantly influence the final design and cost-effectiveness of the system. For example, running multiple smaller models to get better coverage may be more attractive on in-house machines, whereas with a cloud-based solution the more predictable success of a larger model may be the more cost-effective choice.
Additionally, the importance of test coverage varies: is it primarily for sanity checks, or is comprehensive coverage a critical goal? Ultimately, building a practical model requires implementation-specific details, so our analysis focused on demonstrating that these static metrics, when used together, can yield meaningful insights.
Conclusions and Future Work
In conclusion, our project highlights that static code analysis can provide valuable insights into an LLM’s ability to generate effective test cases. We have demonstrated this both in terms of producing functional tests and achieving meaningful code coverage. These findings open up two key areas for further exploration:
1. Developing methodologies to make code more LLM-friendly.
2. Creating models to predict an LLM's success rate and effectiveness in generating test suites.
The next steps would be to extend our analysis beyond model size variations by evaluating different LLM architectures to determine whether these metrics hold consistently. Additionally, testing across a broader range of codebases will provide more data points, further clarifying the relationship between static code metrics and model performance.