Travis LaCroix, Artificial Intelligence and the Value Alignment Problem

TRAVIS LACROIX
ARTIFICIAL INTELLIGENCE AND THE VALUE ALIGNMENT PROBLEM

REVIEWED BY
Rune Nyrup

Artificial Intelligence and the Value Alignment Problem
Travis LaCroix
Broadview Press, 2025, £32.95
ISBN 9781554816293

Cite as:
Nyrup, R. (2026). ‘Travis LaCroix’s Artificial Intelligence and the Value Alignment Problem’, BJPS Review of Books, 2026,
doi.org/10.59350/a5brt-nfc70

‘Value alignment’ has become an influential concept in the broad, multidisciplinary landscape of AI ethics. Standardly defined, it refers to the challenge of ensuring that AI systems pursue goals or encode values that in some sense match or capture human values and interests. Many researchers, especially (but not exclusively) within computer science and engineering, frame their research as seeking to create value-aligned AI or to solve the value alignment problem.

The value alignment problem is often motivated in terms of risks posed by hypothetical future technologies, such as artificial general intelligence (that is, systems with flexible, domain-general problem-solving capacities at a level similar to humans) or artificial super-intelligence (that is, systems that vastly exceed human performance across all domains). Creating such technologies without being able to ensure they pursue goals that match our values, the argument goes, would be catastrophic.^[1] However, there are deep disagreements as to whether this presents an even remotely plausible or urgent threat. Thus, there are also many in AI ethics who dismiss concerns about artificial general and super-intelligence as irrelevant to (or worse, a distraction from) the significant harms and injustices that are already being perpetuated by and with existing technologies.

Travis LaCroix regards the value alignment problem as ‘one of the most pressing issues in AI ethics’ (p. 6); however, he does not base this on anything to do with artificial general and super-intelligence. Rather, he sees it as a serious, already occurring problem, encompassing many of the harms that AI systems are currently involved in. His book, Artificial Intelligence and the Value Alignment Problem, thus seeks to salvage the value alignment problem from its association with artificial general and super-intelligence, and instead connect it to the present-day concerns that animate the broader field of AI ethics.

To do so, the book promotes a wholesale reconceptualization of the problem. Rejecting the standard definitions, LaCroix instead proposes to model it on the principal–agent problem. In economics, this refers to a class of problems that can arise whenever a given person or entity (the ‘principal’) delegates authority to act on their behalf to another party (the ‘agent’), for example, lawyers acting on behalf of their clients or employees acting on behalf of a company. An important result from economic theory, which LaCroix highlights, is that information asymmetries between agent and principal are usually a key condition for the principal–agent problem to occur, as this asymmetry prevents the principal from negotiating and enforcing contracts effectively. Applying these ideas to AI results in what he calls the structural definition of the value alignment problem. Briefly put, the structural definition defines it as a problem that can arise whenever a human actor (the principal) delegates some tasks to an AI system (the agent). In these cases, value misalignment problems can arise along three ‘axes’ ^[2]:

(1) Objectives: The AI system is based on proxies that mis-specify or poorly track the principal’s true objectives.

(2) Information: The principal lacks access to relevant information about the AI system (for example, its behaviour, architecture, capabilities, or training data).

(3) Principals: There are multiple principals (or other stakeholders), whose objectives or access to relevant information differ.

Notably, this does not directly define what value (mis)alignment is. Instead, it defines the value alignment problem as a broad class of problems, characterized as those that arise from a specific set of structural conditions, namely, ‘the dynamics of multi-agent interactions involving the delegation of tasks’ by a human principal to an AI system (p. 82). LaCroix uses this definition to advance three broad lines of argument.

First, he argues that the structural definition is preferable to standard definitions (chapter 3), since it captures ‘everything that is conceptually appealing about current research on value alignment’ (p. 11), while staying ‘grounded in the actual functioning of real-world systems’ (p. 7). Among other things, he argues that the structural definition avoids abstract or imprecise claims about AI systems ‘pursuing goals’ or ‘encoding values’, or about what counts as the relevant ‘human values’. Instead, by replacing these with claims about task delegation, the structural definition allows us to focus on the concrete proxies and informational affordances of real-world AI systems, and how these impact specific principals and stakeholder groups.

Second, LaCroix seeks to show that on the structural definition, many of the prominent present-day issues that are discussed in broader AI ethics count as instances of the value alignment problem. For example, he construes bias and fairness as part of the objectives axis (chapter 4), transparency as part of the information axis (chapter 5) and ‘myriad social issues arising in AI ethics’ (p. 90), such as privacy, sustainability and accountability, as involving the principals axis (chapter 6).

Third, he uses the structural definition to critically evaluate existing approaches to value alignment, highlighting the limitations of methods from a range of different research fields, including AI safety (chapter 7), machine ethics (chapter 8), benchmarking (chapter 9), and linguistics (chapter 10). Many of these criticisms are based on the claim that there is no plausible way to circumvent the principal-relativity built into the structural definition. Thus, LaCroix emphasizes that value alignment should be seen as a fundamentally social problem (p. 84), rather than a purely normative or technical one. On a more optimistic note, he suggests that there are more promising ways to mitigate the problem involving processes like democratic engagement, participatory research, design justice, and regulation (chapter 11).

Advancing these original, substantive arguments is not the only purpose of the book, however. In fact, it is primarily intended, and written, as a textbook. As the preface acknowledges, the book sits somewhat uneasily between these two aims. I highlight some examples of this tension in the following, which begins with a discussion of the book’s pedagogical merits, and then offers some critical comments on the substantive arguments for the structural definition.

Two features distinguish this book from many other AI ethics textbooks. First, it takes a distinctively philosophy of science perspective, emphasizing epistemic and methodological issues involving proxies, transparency, and value-laden research (rather than, say, topics from moral theory, metaphysics, or philosophy of mind). Second, the book grew out of a mandatory course for computer science undergraduates. It thus explains the philosophical issues with minimal jargon and grounds the discussion in a technically precise understanding of modern machine learning and deep neural networks. I happen to be a philosopher of science who teaches mandatory philosophy and ethics courses to computer science undergraduates, so I found both of these features particularly appealing. The book is also, however, entirely suitable for philosophy students. In addition to a good overview of the history of AI (chapter 1), it contains an admirably clear introduction to the core technical concepts (chapter 2). There are a modest number of equations and some mathematical terminology, but all the important points are illustrated with concrete examples throughout.

Chapter 3 discusses the limitations of the standard definitions and introduces the structural definition. The remaining main chapters (4–11, plus an appendix on the control problem for artificial super-intelligence) are written in the style of concise textbook overviews. Each chapter covers a broad section of the interdisciplinary AI ethics literature, with many short subsections explaining a key concept, argument, or case study. For example, chapter 6 includes subsections on stakeholders, trade-offs, the values encoded in AI research, chatbots and discursive ideals, autonomous vehicles, the ‘AI for social good’ movement, copyright and creativity, privacy, energy and the environment, differential power dynamics, accountability, and human flourishing—all in the span of twenty pages. While this unavoidably sacrifices some depth, every point is explained clearly and aptly. Those new to interdisciplinary AI ethics, whether students or researchers entering the field, will get a broad and well-informed overview of the research literature, with many relevant further references. It also provides a rich and useful teaching resource. I will definitely re-read the relevant subsections next time I’m preparing to teach a topic in AI ethics.

The overall structure of the book closely follows LaCroix’s substantive philosophical arguments. Discussion of the value alignment problem, the structural definition, and the three axes thus frame and weave through many parts of these chapters. For a course designed to follow LaCroix’s take on the field, this is a boon, though it does somewhat constrain the book’s usefulness as a more general textbook. Much of the material covered would be very relevant for a broader or differently framed course in AI ethics. But I would probably not assign a selection of chapters without also including chapter 3, or at least spending a decent amount of time in class explaining LaCroix’s overall project. Chapters 1 and 2 can easily stand on their own, however.

Returning to the substantive argument: how compelling is the structural definition as a reconceptualization of the value alignment problem? With regard to the objectives axis, I am fully on board. Reframing the issue in terms of task delegation, proxy problems, and mis-specified objectives neatly captures what is intuitively compelling about the standard definitions. It replaces loose or abstract talk of ‘values’ and ‘goals’ being ‘aligned’ with a concrete, well-defined class of problems, applicable to both existing and future technologies.

LaCroix motivates the information axis by highlighting results from economics on informational asymmetries in the principal–agent problem, such as moral hazard or adverse selection. Chapter 5 highlights some similarities between these concepts and issues of opacity in AI. There is a sound general point here: One important source of problems involving task delegation arises from the principal lacking access to relevant information about the agent. Thus, we can construe many opacity problems in AI as falling within the broader class of task-delegation problems. However, I’m not convinced there is a deeper analogy with the (human–human) principal–agent problem. For example, in the traditional principal–agent problem, information asymmetries usually involve the agent having access to better information about their own behaviour and capabilities than the principal. But an AI system does not necessarily have access to any information about its own behaviour or capabilities, and may not be able to use such information in anywhere near the same way as a human agent would, so leaning too heavily on these analogies risks introducing unnecessary anthropomorphism.

On the principals axis, some of the examples LaCroix discusses fit the structural definition well. Platform privacy is a clear case: different stakeholders (users and owners) delegate tasks to the same overall system, but their objectives are served in highly asymmetric ways. However, he also includes within the principals axis negative impacts on wider sets of stakeholders who do not necessarily interact with the AI system directly, such as copyright infringement in generative AI or environmental damage. Stretching the structural definition to encompass these cases isn’t compelling. Suppose a child is killed in a mine supplying rare earth minerals for microchips, though she never used or interacted with any of the technologies that are created from these minerals. Clearly, this would be a grave injustice. But do we capture it meaningfully by construing it as a problem involving task delegation? Isn’t a core aspect of the injustice exactly that she is completely excluded from the relevant kinds of task delegation?

Some of these issues seem to arise from the tensions with the pedagogical aims of the book. It is highly relevant for a general textbook in AI ethics—especially one aimed at computer science students—to cover environmental impacts, labour exploitation, unjust artistic appropriation, and so on. Trying to fit this within the project of reconceptualizing the value alignment problem is less promising. More generally, the fact that most chapters focus on textbook overviews means that many of the original arguments end up as interspersed remarks, merely sketched or gestured at.

Despite these limitations, I think the book contains a crucial and potentially quite powerful insight: many issues in AI ethics can be unified and explicated in terms of their arising from the same set of structural conditions, namely, those that involve human principals delegating tasks to AI systems.^[3] It is less clear what is gained conceptually by using ‘the value alignment problem’ as a general label for this class of problems. This seems to stretch the term too far beyond its intuitive, pre-theoretic usage. A better option might be to explicate the value alignment problem as equivalent to the objectives axis, and then highlight that this is just one aspect of the broader class of problems. This might be a terminological quibble, but given LaCroix’s distaste for imprecise language (which I share), I think it matters.

In conclusion, I can warmly recommend this book to anyone who wants a comprehensive overview of interdisciplinary AI ethics, as well as anyone who teaches this topic. There is also a wealth of important and original philosophical insights to be found, although many will require significant reconstruction and further refinement to realize their full potential. LaCroix concludes by stating that ‘This book is intended as the first word on the subject, rather than the last’ (p. 268). I look forward to reading what he has to say next.

Rune Nyrup
Aarhus Universitet
rune.nyrup@css.au.dk

Notes

^[1] Argued very influentially by Bostrom (2014), and more recently Yudkowsky and Soares (2025).

^[2] These are my reconstructions. For example, LaCroix’s official definition (pp. 82, 89) only mentions the AI system’s objective function. However, as chapter 4 discusses, there are many other kinds of proxy problems that can arise in the machine learning development pipeline.

^[3] See also Evans et al. (2025), which arrives at similar conclusions via a different line of argument.

References

Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies, Oxford University Press.

Evans, K. D., Robbins, S. A. and Bryson, J. J. (2025). ‘Do We Collaborate with What We Design?’, Topics in Cognitive Science, 17, pp. 392–411.

Yudkowsky, E. and Soares, N. (2025). If Anyone Builds It, Everyone Dies: The Case Against Superintelligent AI, Little, Brown.
