This post is part of a larger series discussing a regulatory landscape for AI. You can find the introduction to the series here.
Model Evaluations
Evaluations of potentially dangerous models are one of the most important tools regulators will have at their disposal in limiting AI harms to society. Evaluations allow stakeholders to understand the capabilities of new models, which in turn provides a framework for measuring the risks associated with their training and deployment. The following is a suggested set of mechanisms for using evaluations in regulation.
Accredited Evaluators
In any scheme, it is important for a set of trusted organisations to be able to independently evaluate models for dangerous capabilities. It is suggested that companies can receive accreditation for providing robust, accurate evaluation services to the government, as well as to other firms and individuals. Although I’ve not yet fleshed out the conditions for this accreditation, they will resemble the requirements applied to examination bodies and auditors in other sectors.
Pre-Training Disclosure and Approval Scheme
Before potentially powerful models are trained, it can be useful for authorities to have knowledge and oversight of the process. For firms within the scope of this scheme, the following details must be disclosed to the regulator (a sketch of how such a filing might be structured follows the list):
- A description of the new model
- A description of its uses
- The expected training run length
- A description of the training programme and datasets used
- The amount of compute expected to be used
- The number of parameters the model is expected to have
- A detailed risk and control assessment of the model and its expected capabilities
- Details of the plan for deployment
- Details of any notable funding sources for development of the model
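As a rough illustration only, such a disclosure could be submitted as a structured filing rather than free-form text. The sketch below, in Python, captures the items above as a record; all field names and types are hypothetical assumptions, not a proposed standard.

```python
# A minimal sketch of how a pre-training disclosure might be structured.
# All field names and types are illustrative assumptions, not a proposed standard.
from dataclasses import dataclass, field


@dataclass
class PreTrainingDisclosure:
    model_description: str            # description of the new model
    intended_uses: list[str]          # description of its uses
    expected_training_days: float     # expected training run length
    training_programme: str           # description of the training programme and datasets used
    training_compute_flops: float     # amount of compute expected to be used
    expected_parameter_count: int     # number of parameters the model is expected to have
    risk_and_control_assessment: str  # reference to the detailed risk and control assessment
    deployment_plan: str              # details of the plan for deployment
    funding_sources: list[str] = field(default_factory=list)  # notable funding sources
```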
Training certain models will be classified as potentially high-risk – reasons for this could include:
- Use of a new model architecture
- Use of a sufficiently large:
  - Compute budget,
  - Training dataset, or
  - Training run length.
- Training the model using a new, highly customised dataset
- Plans for public deployment
- Identification of certain potentially powerful capabilities
- Deployment of the model in high-risk applications
For such high-risk training plans, approval must be obtained from the regulator before they can commence; enforcement of this will be discussed in a future post. Firms can also opt to voluntarily submit training plans for approval even if they do not meet the criteria listed above. By doing so, they gain access to expedited procedures for pre-deployment approval, which are discussed in the next section.
For high-risk models, evaluations should be performed both by a company’s internal function, which can utilise direct access to the model, and by an external accredited evaluator. Other categories of risk could be defined that require varying levels of pre-training evaluation. Diverse evaluation methodologies will be encouraged.
Regulators may ask for amendments to the training plan or limitations on a model’s deployment before approving training.
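To make the triage described above more concrete, here is a minimal sketch, in Python, of how the listed triggers could be checked against a disclosed training plan. The thresholds and parameter names are placeholder assumptions chosen purely for illustration, not proposed values.

```python
# A minimal sketch of applying the high-risk triggers to a disclosed training plan.
# The threshold values below are placeholders for illustration, not proposals.
HIGH_RISK_COMPUTE_FLOPS = 1e26   # hypothetical compute threshold
HIGH_RISK_DATASET_TOKENS = 1e13  # hypothetical training-dataset-size threshold
HIGH_RISK_TRAINING_DAYS = 90     # hypothetical training-run-length threshold


def is_high_risk(
    new_architecture: bool,
    training_compute_flops: float,
    dataset_tokens: int,
    training_days: float,
    new_custom_dataset: bool,
    public_deployment_planned: bool,
    powerful_capabilities_identified: bool,
    high_risk_application: bool,
) -> bool:
    """Return True if any high-risk trigger applies, meaning the training plan
    would need regulator approval before the run can commence."""
    return any([
        new_architecture,
        training_compute_flops >= HIGH_RISK_COMPUTE_FLOPS,
        dataset_tokens >= HIGH_RISK_DATASET_TOKENS,
        training_days >= HIGH_RISK_TRAINING_DAYS,
        new_custom_dataset,
        public_deployment_planned,
        powerful_capabilities_identified,
        high_risk_application,
    ])
```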
Pre-Deployment Evaluation
After a model has been trained, a further evaluation step will be required before deployment is possible. This evaluation procedure will be more rigorous than is required pre-training, as there is a chance to test a model’s capabilities more directly. This step also poses a greater risk of harm to society if mistakes are made.
All models designed for commercial deployment will require evaluations, with the level of testing determined by an updated, audited risk assessment of the model’s potential capabilities. An accredited evaluator retains the right to fail any model, as well as to recommend improvements.
Internal approval for model deployment should include the engagement of directors and other key senior staff members – this will be described in more detail in a later post on Internal Governance.
The evaluation could result in one of four potential outcomes, listed in the table below:
| Outcome | Description |
| --- | --- |
| Approval | The assessment has determined that the model is safe for the use cases reported and may be deployed. |
| Approval on a limited set of use cases | The assessment has found the potential for the model to be used dangerously in some applications, but that it is safe for other uses. The model is approved for the use cases where it is safe, but may require more alignment training before full deployment is permitted. In such cases, there will be a ban on releasing the weights of the model, and there may be restrictions on what can be exposed via any API. |
| Further training necessary | The assessment has found that the model displays concerning capabilities and a strong risk of misalignment. Deployment is not permitted from this outcome, though the model may be submitted for reassessment following further training runs designed to improve alignment. |
| Full model ban | The assessment has found that the model is too dangerous to deploy for any application and will need to be erased, i.e. its weights, along with any intermediate snapshots saved on the training infrastructure, must be fully removed from all storage. The regulator may follow up with the company to better understand how the model was trained. Failure to erase the model can result in a firm being blacklisted from conducting future training runs, as well as severe fines for both the firm and key individuals in the decision-making process. |
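As a rough illustration of how an evaluator’s reporting system might record these outcomes, the sketch below defines them as an enum together with a simple deployment-permission mapping. The names and the mapping are assumptions added for illustration, not part of the proposal.

```python
# A minimal sketch of the four pre-deployment outcomes as an evaluator's reporting
# system might record them. Names and the permission mapping are assumptions.
from enum import Enum


class EvaluationOutcome(Enum):
    APPROVED = "approved"                                     # safe for the reported use cases
    APPROVED_LIMITED_USE = "approved_limited_use"             # approved only for specific use cases;
                                                              # weights may not be released
    FURTHER_TRAINING_REQUIRED = "further_training_required"   # no deployment; may be resubmitted
    FULL_BAN = "full_ban"                                     # model must be erased from all storage


# Whether any deployment at all is permitted under each outcome.
DEPLOYMENT_PERMITTED = {
    EvaluationOutcome.APPROVED: True,
    EvaluationOutcome.APPROVED_LIMITED_USE: True,
    EvaluationOutcome.FURTHER_TRAINING_REQUIRED: False,
    EvaluationOutcome.FULL_BAN: False,
}
```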
Post-Deployment Evaluation Refresh Scheme
In addition to pre-training and pre-deployment evaluations, it is important that models are regularly re-evaluated post-deployment to ensure that:
- Updates to the model do not significantly increase its risk profile.
- A deployment to production has not exposed any previously undiscovered emergent capabilities, through sandbagging or otherwise.
- The model is still only being used for the applications it is approved for.
These evaluations will be performed by accredited bodies.
The frequency and depth of these re-evaluations will be based on the risk profile determined at the model’s last evaluation. For example, a low-risk model may only be re-evaluated every 3 years, whereas a high-risk, public-facing model may be re-evaluated every 6 months.
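A minimal sketch of how such a risk-tiered schedule might be expressed is shown below, using the intervals from the example above; the intermediate tier and its interval are assumptions added purely for illustration.

```python
# A minimal sketch of a risk-tiered re-evaluation schedule. The low and high tiers use
# the intervals from the example above; the medium tier is a hypothetical addition.
from datetime import date, timedelta

REEVALUATION_INTERVALS = {
    "low": timedelta(days=3 * 365),  # e.g. re-evaluate every 3 years
    "medium": timedelta(days=365),   # hypothetical intermediate tier
    "high": timedelta(days=182),     # e.g. re-evaluate every 6 months
}


def next_reevaluation_due(last_evaluation: date, risk_tier: str) -> date:
    """Return the date by which the next post-deployment evaluation should be
    completed, based on the risk tier determined at the model's last evaluation."""
    return last_evaluation + REEVALUATION_INTERVALS[risk_tier]
```

For instance, under this sketch a high-risk model last evaluated on 1 January 2024 would be due for re-evaluation by 1 July 2024.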
For models with regulatory restrictions on which of the firm’s customers may access them, an external financial audit may be required as part of the refresh to demonstrate that such restrictions have not been deliberately broken.
In the case that the risk profile of a model has significantly increased, the evaluating body and the AI lab both have a duty to report this to the regulator. Depending on the severity of the risk, the regulator may require the firm to:
- Factor extra safety training into the next release of the model,
- Temporarily remove the model from deployment until a safer version is ready, or
- Permanently remove the model from deployment.
In the case of a particularly severe incident, the regulator may conduct a supervisory examination of the firm’s internal processes, as well as trigger an evaluation refresh for any of its other deployed models.