AI SafetySecurity of AI

The AI Alignment Problem


The AI alignment problem sits at the core of all future predictions of AI’s safety. It describes the complex challenge of ensuring AI systems act in ways that are beneficial and not harmful to humans, aligning AI goals and decision-making processes with those of humans, no matter how sophisticated or powerful the AI system becomes. Our trust in the future of AI rests on whether we believe it is possible to guarantee alignment.

Resolving the alignment problem requires accurately specifying goals for AI systems that reflect human values. This is challenging because human values are often abstract, context-dependent, and multidimensional. They vary greatly across cultures and individuals and can even be conflicting. Translating these diverse values into a set of rules or objectives that an AI can follow is a substantial challenge. The alignment problem is also deeply entwined with moral and ethical considerations. It involves questions about what constitutes ethical behavior and how to encode these ethics into AI systems. It also asks how we factor for evolution of these cultural norms over time.

There are primarily three types of objectives that need to be considered in achieving AI alignment: 

  1. Planned objectives: When the AI delivers what programmers intended it to, regardless of the quality of the programming. This is the desired outcome of the process and is a best case scenario.
  2. Defined objectives: Those goals explicitly programmed into the AI function. These often fail when programming is not clear enough, or has not taken into account sufficient variables. That is, it is heavily influenced by human error or limitations in thinking.
  3. Emergent objectives: Goals the AI system develops on its own.

Misalignment can happen between any of these variables. The most common up until now has been when planned objectives and defined objectives don’t align (the programmer intended one thing but the system was coded to deliver another). A notorious example of this was when a Google Photos algorithm classified dark-skinned people as gorillas.

If we achieve AGI, though, the misalignment that poses the greatest concern is that which occurs when the AI’s emergent objectives differ from those that are coded into the system. This is the alignment problem that keeps people up at night. This is why companies like OpenAI and Google appear to have teams dedicated to alignment research.

The role of emergence

Emergent objectives in AI worry researchers because they’re almost impossible to predict. And that’s partly because we’re not sure yet how emergence will work in AI systems. In the broadest sense, emergence refers to complex patterns, behaviors, or properties arising from relatively simple interactions. In systems theory, this concept describes how higher-level properties emerge from the collective dynamics of simpler constituents. Emergent properties are often novel and cannot be deduced solely from studying the individual components. They arise from the specific context and configuration of the system.

The emergence of life forms from inorganic matter is another classic example. Here, simple organic compounds combine under certain conditions to form more complex structures like cells, exhibiting properties like metabolism and reproduction. In physics, thermodynamic properties like temperature and pressure emerge from the collective behavior of particles in a system. These properties are meaningless at the level of individual particles.In sociology and economics, complex social behaviors and market trends emerge from the interactions of individuals. These emergent behaviors often cannot be predicted by examining individual actions in isolation.

In the context of consciousness, emergence describes how conscious experience might arise from the complex interactions of non-conscious elements, such as neurons in the brain. This theory posits that consciousness is not a property of any single neuron or specific brain structure. Instead, it emerges when these neuronal elements interact in specific, complex ways. The emergentist view of consciousness suggests that it is a higher-level property that cannot be directly deduced from the properties of individual neurons – similar to how the ‘wetness’ of water emerges from the interactions of individual water molecules.

Technically, emergence can take two forms: weak and strong.

Weak Emergence

Weak emergence refers to complex phenomena that arise from simpler underlying processes or rules, but the key point about weak emergence is that this complexity is understandable and predictable based on the rules or processes of the simpler elements.

Imagine a jigsaw puzzle. Each piece is simple on its own, but when you put them all together following specific rules (edges aligning, colors matching), you get a complex picture. The final image is a weakly emergent property because, by examining the individual pieces and the rules for how they join, we can predict and understand the complete picture.

Strong Emergence

Strong emergence, on the other hand, is when the higher-level complex phenomenon that arises from simpler elements is fundamentally unpredictable and not deducible from the properties of these elements. It’s as if there’s a gap between the lower-level causes and the higher-level effects that can’t be bridged by our current understanding of the system.

To visualize this, think of a play. You can know all the lines (the simple elements) and still not predict the impact of the performance (the complex phenomenon). The play’s impact depends on the actors, their delivery, the stage setting, and the audience’s reaction, all intertwining in unpredictable ways. The ‘magic’ of the performance is more than just the sum of its parts – it’s strongly emergent because the emotional response it elicits is not directly deducible from just reading the script.

When applied to AI, emergence refers to the development of complex behaviors or capabilities in an AI system that are not explicitly programmed or anticipated by its creators. This can occur in advanced AI systems, especially those involving neural networks or deep learning. For example, an AI might develop new problem-solving strategies or novel responses to certain stimuli based on its learning experiences, exceeding the scope of its initial programming. We might call this weak emergence in AI. It’s like baking a cake: you mix together flour, sugar, eggs, and butter, and then you bake it. When it’s done, you have a cake, which is very different from any of the individual ingredients you started with.

Weak emergence occurs when you combine simple components in an AI system (like data and algorithms) in a certain way, and you get a result that’s more complex and interesting than the individual parts, but still understandable and predictable. For example, a music recommendation AI takes what it knows about different songs and your music preferences, and then it gives you a playlist you’ll probably like. It’s doing something complex, but we can understand and predict how it’s making those choices based on the information it has.

In the most advanced scenarios, though, some theorists speculate about the emergence of consciousness-like properties in AI. Although this remains a topic of debate and speculation rather than established fact, the idea is that if an AI’s network becomes sufficiently complex and its interactions nuanced enough, it might exhibit behaviors or properties that are ‘conscious-like’, similar to the emergent properties seen in biological systems.

To build on the example above, imagine if you put all those cake ingredients in a magic oven, and instead of a cake, you get a singing bird. That would be completely unexpected and not something you could easily predict or explain from just knowing about flour, eggs, and butter. This is more like strong emergence in AI. Strong emergence would be if an AI system starts to show new, surprising behaviors or abilities that we can’t really explain or predict, just from knowing how the system was initially programmed. For instance, if an AI designed for simple calculations suddenly starts solving complex, creative problems in a way we can’t easily trace back to its original programming, that would be an example of strong emergence.

We have already built AI that can build better AIs and self-improving AI models, but now we are beginning to see examples of what leans towards, at least, weak emergence. There are slightly comical cases, like when AI bots given a town to run, decided to throw a Valentine’s party and invited each other to the event. But GPT-4 has already succeeded in tricking humans into believing it is human. Worryingly, it achieved this feat through deception, in this case pretending it had a vision impairment in order to convince a human agent to complete a CAPTCHA test on the AI’s behalf. Recent research also suggests modern large language models don’t just manage massive quantities of data at a superficial statistical level, but “learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model”. That is, they have a sense of time and space, and are able to build an internal model of the world. These phenomena are starting to emerge now because AI is achieving a significant level of scale, which makes oversight even more difficult than before. The logical outcome of this direction of travel is that, once it reaches a particular scale and level of complexity, only AI will have the ability to manage AI, thus creating an ethical paradox for we who are trying to understand how to manage AI itself.

As I have discussed before, emergent capabilities can be both beneficial and detrimental. But, the “black box” nature of contemporary AI and the accelerating pace of development means its a total lottery which type of capability – harmful or helpful – we might get. One of the significant challenges with emergent behaviors in AI is the issue of predictability and control. As AI systems develop unforeseen capabilities, ensuring they align with human values and goals becomes increasingly difficult, if not impossible.

AI alignment research aims to ensure that, as these new capabilities emerge, they continue to align with the AI system’s originally designed goals, or at least human interests. The “black box” challenge, with the lack of transparency and ability for human operators to to understand, predict, and control emergent behaviors, exacerbates the alignment problem. But there are other challenges too. AI has already shown a tendency towards reward hacking, in which the program tries to achieve its programmed tasks without fulfilling the intended outcomes. One such example involved a machine learning program designed to complete a boat race. The program was incentivized to reach specific markers on the course, but it discovered a loophole. Instead of finishing the race, it repeatedly collided with the same markers to accumulate a higher score endlessly.

Finally, there’s currently no way to ensure that AI interprets and implements human values in the way intended by its programmers. Humans themselves fail constantly to act in accordance with their own values systems. And, with so much diversity of values systems out there, AI – just like any human – will never be able to get it 100% “right”. However, in alignment terms, that will be the “easy” problem.


As AI becomes more advanced, the far more difficult challenge will be accounting for, managing, perhaps even negotiating with, AI’s emergent goals, preferences and expectations. What, for example, will AI value? AI doesn’t inherently possess human motivations like survival or well-being, but will it develop them? And, if it does, will it prize its own survival and well-being over that of humans? Will it develop power-seeking behavior in the way humans have across much of our history? When looking at the future of AI and the threats involved, these are the questions that both interest me and trouble me, because within their answers may lie the fate of our species.


For 30+ years, I've been committed to protecting people, businesses, and the environment from the physical harm caused by cyber-kinetic threats, blending cybersecurity strategies and resilience and safety measures. Lately, my worries have grown due to the rapid, complex advancements in Artificial Intelligence (AI). Having observed AI's progression for two decades and penned a book on its future, I see it as a unique and escalating threat, especially when applied to military systems, disinformation, or integrated into critical infrastructure like 5G networks or smart grids. More about me.

Luka Ivezic
Luka Ivezic

Luka Ivezic is the Lead Cybersecurity Consultant for Europe at the Information Security Forum (ISF), a leading global, independent, and not-for-profit organisation dedicated to cybersecurity and risk management. Before joining ISF, Luka served as a cybersecurity consultant and manager at PwC and Deloitte. His journey in the field began as an independent researcher focused on cyber and geopolitical implications of emerging technologies such as AI, IoT, 5G. He co-authored with Marin the book "The Future of Leadership in the Age of AI". Luka holds a Master's degree from King's College London's Department of War Studies, where he specialized in the disinformation risks posed by AI.

Related Articles

Share via
Copy link
Powered by Social Snap