OpenAI has identified super-intelligence as the most significant and potentially dangerous technology that could be developed, and is emphasizing its alignment with human intent as a vital issue. While current techniques for aligning AI systems, like reinforcement learning from human feedback, rely on human supervision, these techniques are not scalable for superintelligence. OpenAI proposes to develop an automated alignment researcher with human-level capabilities, which can then be scaled up.
Key steps in this process will include developing scalable training methods, validating the model, and stress testing the alignment pipeline. The training will involve AI systems assisting in the evaluation of other AI systems (scalable oversight). For validation, the process includes automated searches for problematic behaviors (robustness) and internals (automated interpretability). Stress testing will involve deliberately training misaligned models and checking the effectiveness of techniques in detecting serious misalignments (adversarial testing).
OpenAI is assembling a team of top machine learning researchers and engineers, dedicating 20% of their compute resources over the next four years to this effort. The goal is to solve the core technical challenges of superintelligence alignment within this time frame. Although the goal is ambitious, there is optimism due to promising preliminary experiments, useful progress metrics, and the ability to empirically study many of these problems using today’s models.
Ilya Sutskever and Jan Leike will co-lead the team, which will include researchers and engineers from across the company. OpenAI also invites outstanding researchers and engineers to join this effort. They plan to share their findings broadly and aim to contribute to the alignment and safety of non-OpenAI models.
The superintelligence alignment project will run alongside OpenAI’s existing work aimed at improving the safety of current models, mitigating AI risks such as misuse, economic disruption, disinformation, bias, and addiction, and engaging with interdisciplinary experts to consider broader human and societal concerns.
Practically
The practical implementation of superalignment as described by OpenAI can be divided into several steps:
1. Assemble a Team: Gather top machine learning researchers and engineers to work on the problem. This includes experts from within OpenAI and new recruits.
2. Resource Allocation: Dedicate a substantial portion of the organization’s computational resources to the problem. In OpenAI’s case, they will commit 20% of their compute resources over the next four years.
3. Develop a Scalable Training Method: The goal is to create a human-level automated alignment researcher. This involves designing a training protocol that can effectively teach an AI system to align itself with human values and intentions.
4. AI Systems Evaluating AI Systems (Scalable Oversight): Leverage the abilities of existing AI systems to evaluate and provide feedback on other AI systems. This will help in providing a training signal for tasks that are hard for humans to evaluate.
5. Model Validation: Once the AI model has been trained, it needs to be validated. This involves automated searches for problematic behaviors (robustness) and problematic internals (automated interpretability) to ensure alignment with human values and intent.
6. Stress Test the Alignment Pipeline (Adversarial Testing): This involves deliberately training misaligned models and checking the effectiveness of the alignment techniques in detecting major misalignments.
7. Iterative Improvement: Use the lessons learned from the model validation and stress testing to improve the training method and alignment techniques.
8. Community Engagement and Transparency: Plan to share their findings with the broader AI and machine learning community, contributing to the safety and alignment of non-OpenAI models.
9. Consideration of Sociotechnical Problems: Alongside the technical work, actively engage with experts in various disciplines to consider broader human and societal concerns related to superintelligent AI.
10. Success Metric: The ultimate aim is to provide evidence and arguments convincing the machine learning and safety community that the problem of superintelligence alignment has been solved. If a high level of confidence in the solution isn’t achieved, the findings should at least allow the community to plan appropriately.