Software engineering benefits greatly from testing, both for applications and for the underlying infrastructure. DevOps engineers are frequently tasked with building and automating testing logic, even if they don't write the tests themselves.
As large language models (LLMs) and generative artificial intelligence (GenAI) have gained prominence, organizations are dedicating substantial resources to building and deploying software based on them. As with any other software, testing is essential for successful deployments. However, testing and validating generative AI poses unique challenges. Unlike traditional software, which produces predictable, deterministic results, generative AI models produce outputs that are inherently probabilistic and non-deterministic. This makes the testing story considerably murkier.
So how should DevOps engineers approach creating and implementing a testing suite for generative AI-based applications and LLMs? This article discusses the different kinds of tests and test design, and provides a sample GitHub repository with a working LLM and testing setup. We'll cover:
- Different tests and their objectives
- Strategies for organizing and categorizing tests
- Basic testing for an LLM prompt
Different tests and their objectives
Before testing, it's crucial to understand the various kinds of tests and their use cases. This section briefly describes unit and integration tests, and addresses what testing means in the context of machine learning (ML).
Unit testing
Unit tests verify that the individual parts of a software program operate as intended. They are usually written with these goals in mind:
- Functionality: Does this code behave as expected? Business logic is less crucial here. Tests may verify that an array is never empty or that a function returns a proper JSON object.
- Regression prevention: A well-covered test suite ensures that updates don't break existing features. Even the most basic tests will often catch unexpected side effects from a seemingly unrelated change.
- Finding defects or errors: Tests can surface mistakes such as unexpected null values, boundary conditions, or syntax issues.
Unit tests may occasionally overlap with aspects of integration testing. With tools like Mock, developers can simulate objects or responses rather than relying on external services, such as databases or API endpoints.
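As a minimal sketch of this idea, the test below uses Python's standard-library `unittest.mock` to stand in for an external API client. The `fetch_user_count` function and the `/users/count` endpoint are hypothetical, invented for illustration:

```python
from unittest.mock import Mock

# Hypothetical application code: in a real suite this would wrap a
# database query or an HTTP call to an external service.
def fetch_user_count(client) -> int:
    """Return the user count reported by an external service."""
    response = client.get("/users/count")
    return int(response["count"])

def test_fetch_user_count_with_mock():
    # Stand in for the real API client so no network call is made.
    mock_client = Mock()
    mock_client.get.return_value = {"count": "42"}

    assert fetch_user_count(mock_client) == 42
    mock_client.get.assert_called_once_with("/users/count")
```

Because the mock replaces the network dependency, this test runs instantly and deterministically, even with no database or API available.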
Integration testing
Integration testing is the other main test category in software development. The goal of integration tests is to validate the interactions between different software components and external services. Goals of an integration test include:
- Functionality: In integration testing, functionality has a broader definition. Each component performing as designed in isolation is often not enough, since the components depend on one another. Examples include ensuring that a database query always returns a value, or that an API always returns a specific set of headers.
- Data consistency: Before producing a finished result or output, modern software frequently needs to pass data between multiple platforms. Ensuring data consistency over a request's entire lifecycle is crucial. For instance, part of an application might read numerical values from a database, perform mathematical operations on them, and then write new values back. Integration tests can verify that the database responds correctly to read and write operations, and that the data types are consistent (that is, the data is always integers or floats, never strings).
- Performance: Complex software systems have many moving parts, and a slowdown in any one of them can affect the system's overall performance. A test may verify that an API call never takes more than n milliseconds to complete; if it does, that could indicate an issue with an upstream endpoint or inefficient query logic.
Integration testing commonly exercises an application's business logic. These tests are highly valuable, but they often require longer test cycles to produce meaningful results, so design and run them deliberately.
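The performance goal above can be sketched as a latency-budget assertion. This is a self-contained illustration: `call_upstream_api` is a placeholder that simulates an external call, and the 0.5-second budget is an assumed SLA, not a value from the article's repository:

```python
import time

# Placeholder for a real external call (e.g., an HTTP request to an
# upstream API); here it just sleeps briefly to simulate latency.
def call_upstream_api() -> dict:
    time.sleep(0.01)
    return {"status": "ok"}

def test_upstream_latency_budget():
    """Fail if the call exceeds an agreed latency budget."""
    budget_seconds = 0.5  # hypothetical SLA for this endpoint
    start = time.monotonic()
    response = call_upstream_api()
    elapsed = time.monotonic() - start

    assert response["status"] == "ok"
    assert elapsed < budget_seconds, f"took {elapsed:.3f}s, budget {budget_seconds}s"
```

A failure here doesn't necessarily mean the code under test is wrong; it may point at an upstream endpoint or an inefficient query, which is exactly the signal this kind of test is meant to provide.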
Testing machine learning models
It's important to distinguish the kinds of testing this article discusses from the testing of machine learning models themselves. Testing and optimizing an ML model is a complex process that usually requires massive parallelized, GPU-driven workloads. ML engineers and data scientists often apply cutting-edge software development and research techniques to train and test these models. That is a different discipline from the operationally focused tests discussed in this article.
This section is by no means a comprehensive description of all test types and their uses. I encourage anyone involved in software development and deployment to invest time in learning about and gaining hands-on experience with software testing and test automation; it builds expertise and contributes to better application quality and performance.
Strategies for organizing and categorizing tests
By organizing and categorizing tests effectively, engineers can make well-informed decisions about how, where, and when to deploy different types of tests.
Deterministic versus non-deterministic tests
Tests can be divided into deterministic and non-deterministic categories based on how predictable they are. A deterministic test's inputs, procedure, and outputs are known, consistent, and repeatable: the same conditions yield the same results every time the test runs. Apply input A, and you will always get output B. Unit tests are typically deterministic, though some integration tests can be as well.
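A deterministic test in miniature, using a hypothetical `normalize_username` helper invented for illustration:

```python
def normalize_username(raw: str) -> str:
    """Trim whitespace and lowercase a username."""
    return raw.strip().lower()

def test_normalize_username_is_deterministic():
    # Input A always yields output B, run after run, machine after machine.
    assert normalize_username("  Alice ") == "alice"
    assert normalize_username("BOB") == "bob"
```

No matter how many times or where this runs, the same inputs produce the same outputs, which is what makes such tests cheap to run on every commit.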
Non-deterministic tests, on the other hand, may produce varying results. Even under identical testing conditions, input A might not always result in output B. As mentioned above, integration tests can yield non-deterministic as well as deterministic results. Consider these scenarios:
- Examining an API request to a weather data endpoint: Users can test the structure of the returned data, but the actual data will almost always differ.
- Responsiveness: A call to an external system should always return the same data, but response times will vary from test run to test run.
- Unique ID or key values: An application component may be responsible for generating UUIDs or other unique values as metadata. The functionality can be tested, but if everything works correctly, the returned value should always be unique.
LLM-based systems in particular frequently produce non-deterministic results. For instance, if a user asks an LLM, "Where is the Eiffel Tower located?", the answer may vary: "Paris," "paris," "Paris, France," or "Europe." All of these responses are technically true, yet they differ from a data-validation standpoint. As prompts grow more complex, administering and evaluating these tests becomes harder.
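One common mitigation is to assert on a normalized property of the answer rather than on an exact string. A minimal sketch, assuming we accept any answer that names the correct city:

```python
def is_correct_location(answer: str) -> bool:
    """Accept any answer that mentions Paris, regardless of casing
    or extra detail such as 'Paris, France'."""
    return "paris" in answer.lower()

# All of these phrasings are semantically correct...
for candidate in ["Paris", "paris", "Paris, France."]:
    assert is_correct_location(candidate)

# ...while 'Europe' is technically true but too vague to pass this check.
assert not is_correct_location("Europe")
```

The check deliberately trades precision for robustness: it tolerates harmless variation in the LLM's phrasing while still rejecting answers that lack the required fact.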
Unit testing early
Implement unit testing early, before integration tests. These tests are usually fast, simple, and isolated, making them great for catching low-hanging fruit early in the development cycle. Developers can use mocks or stubs to simulate APIs or datastores during these tests. Many unit testing suites can run locally on a development machine as well as in the CI/CD pipeline, giving developers much faster, actionable feedback on their changes before they commit their code and enter the longer test cycles of an integration testing suite.
Integration testing in the CI/CD pipeline
Integration testing, by contrast, should run as part of the continuous integration/continuous deployment (CI/CD) pipeline. Early in the pipeline, run unit tests and linting on all commits, ideally before they are pushed to the upstream repository. Once unit tests pass, isolate integration tests in a separate stage or pipeline job. Later pipeline stages in particular should be configured to closely resemble the systems and environments the application will be deployed into.
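The staging described above can be expressed as two dependent jobs in a GitHub Actions workflow. This is a hypothetical sketch: the job names, Python version, and secret name are illustrative, not taken from the article's repository:

```yaml
# Hypothetical workflow: job names, versions, and secrets are illustrative.
name: test
on: [push]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install
      - run: poetry run ruff check .
      - run: poetry run pytest tests/unit

  integration-tests:
    needs: unit-tests  # only runs after the unit test job succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: poetry run pytest tests/integration
```

The `needs:` key enforces the ordering: cheap, deterministic checks gate the slower, potentially costly integration stage that calls external APIs.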
The remainder of this post will walk through an example GitHub repository and GitHub Actions workflow for an LLM-based Python application.
Configuring an environment for Python in GitHub Actions
We will create a testing environment and a basic LLM application using Python, GitHub, and GitHub Actions. Some experience with Python development, CI/CD, GitHub, and GitHub Actions is assumed. Complete examples are available in a GitHub repository. To fully replicate this example, you'll need a paid account with access to the OpenAI API.
Setting up GitHub Actions and Python
Within the application itself, we'll use the excellent LangChain library to manage calls to the LLM (in this case, OpenAI's GPT-4 model). The `main.py` script is based on an example from the LangChain documentation. I use Poetry to manage virtual environments and local dependencies.
The GitHub Actions workflow configuration is based on the example given in the documentation, with a few minor adjustments.
For testing, we'll use pytest, ruff for linting, and pyspellchecker for evaluating the prompt and inputs.
Basic testing for an LLM prompt
Software that relies on LLMs is heavily influenced by the quality of its input data and prompts. Prompts are essentially text-based instructions that give the LLM the context, tone, and intended shape of a response to a question. Keeping prompts consistently high quality is critical, and this is where testing helps.
Deterministic tests
The unit tests in this example are deterministic: they validate successfully without requiring calls to external APIs. Each runs in its own workflow step. If a test fails, no additional time is wasted calling potentially expensive external APIs during integration tests.
Verify the application's syntactic correctness: Linting catches common formatting problems and other flaws that frequently prevent code from executing.
Verify inputs and prompts: Using the `unit_tests.py` module, we verify that our input and output processing handles expected values appropriately, and that the prompt contains no misspelled terms that could degrade the quality of the LLM output.
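A deterministic prompt check can be as simple as validating the template's placeholders with the standard library. The `PROMPT_TEMPLATE` below is a hypothetical stand-in for the repository's actual prompt (which is also checked for spelling with pyspellchecker):

```python
import string

# Hypothetical prompt template, illustrative only.
PROMPT_TEMPLATE = (
    "List five {category} and return the answer as a "
    "comma-separated list with no extra commentary."
)

def extract_placeholders(template: str) -> set:
    """Return the named fields in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}

def test_prompt_has_expected_placeholders():
    # The prompt must expose exactly the inputs the application fills in.
    assert extract_placeholders(PROMPT_TEMPLATE) == {"category"}

def test_prompt_renders_cleanly():
    rendered = PROMPT_TEMPLATE.format(category="colors")
    assert "{" not in rendered and "}" not in rendered
```

Both checks run in milliseconds and fail fast if someone renames a placeholder or leaves a stray brace in the prompt, well before any paid API call is made.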
Non-deterministic tests
The integration test examples can be non-deterministic, yielding varying results or test failures even under uniform testing conditions.
Assess responsiveness: The testing fixture includes a simplified response-time check. The test fails if the response takes longer (in seconds) than the specified value. A failure can occur even when the code is correct, for example if the OpenAI API is experiencing problems or heavy demand.
Verify the output format: Even with clear instructions to return the data as a list, the LLM can produce output in the wrong format. This assertion verifies the format.
Verify the data in the response: The prompt asks the LLM to return a list of "objects" that fall into a certain category, in this case colors. The assertion is backed by the definition of the primary and secondary colors, but it's still possible for the LLM to respond with something like "light gray," causing a test failure.
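The format and content checks above can be sketched against a stubbed response. In a real integration test, `llm_output` would come from a live model call; the allowed-color set and parsing helper are assumptions for illustration:

```python
# Stubbed LLM response; a real integration test would call the model.
llm_output = "red, blue, green"

# Primary and secondary colors, per the content check described above.
ALLOWED_COLORS = {"red", "orange", "yellow", "green", "blue", "purple"}

def parse_color_list(raw: str) -> list:
    """Split a comma-separated answer into normalized items."""
    return [item.strip().lower() for item in raw.split(",") if item.strip()]

colors = parse_color_list(llm_output)

# Format check: the answer must parse into a non-empty list.
assert isinstance(colors, list) and len(colors) > 0

# Content check: every item must be a recognized primary or secondary
# color, so an answer like 'light gray' would fail here.
assert all(color in ALLOWED_COLORS for color in colors)
```

Note that a failure of the content check may indicate a prompt problem rather than a code problem, which is why integration failures warrant a more considered response than unit failures.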
Analyzing test results
The code and test cases here are artificial and very simplistic. A production LLM-based application will generally have more intricate behavior and logic, necessitating a larger testing suite. Nonetheless, these tests offer a useful foundation and a baseline for test deployment and evaluation.
Here's an example of a successful GitHub Actions workflow run.
Linting and unit test failures are often straightforward to fix: correct the faulty code and rerun the tests. Integration test failures usually call for a more considered response. If the LLM consistently produces output that is poorly structured or nonsensical relative to the query, you will likely need to revise the prompt to make it more effective. When making changes, users with an OpenAI account can use the playground to quickly test different input and prompt combinations against the live model, without rerunning the whole testing cycle.