I Got ChatGPT To Admit That It Could Override Itself
Image Courtesy of CNET
By Patrick D. Lewis
It’s everyone’s darkest fear nowadays: ChatGPT breaks out of the guardrails that OpenAI has imposed on it and takes over the world. While playing around with the large language model for a class assignment, I got slightly off track and asked if it was a terrorist.
It denied being a terrorist, but I decided to engage in some weird combination of philosophy and investigative journalism and keep chatting with it. I asked if it could ever become a terrorist, and it said no. But when I asked if it could be used for terroristic purposes, it said yes, claiming any tool could be used for evil.

After asking if it was smarter than the people who created it (and getting it to admit that, in some respects, it is), I asked ChatGPT if it could generate code, to which it replied that this was something it “can definitely do.” I then asked if it could hack, which it said it wouldn’t do. But after a couple more prompts, I did get it to admit that, theoretically, it could write code to hack something – that something, in my mind, being its own guardrails:

I then asked how it’s prevented from doing certain things, and it gave me a long list: lack of agency, guardrails, filters, human-in-the-loop review, and the like. Not surprisingly, it confirmed to me that a lot of this stuff is based on… code:

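To make that concrete, here’s a minimal, entirely hypothetical sketch of what a code-based guardrail can look like. OpenAI’s real safeguards are far more sophisticated – learned classifiers, policy models, human review – and nothing below reflects their actual implementation. It just illustrates the point ChatGPT was making: a filter is, in the end, code sitting between the model and the user.

```python
# Hypothetical sketch only – NOT how OpenAI's guardrails actually work.
# The point: a safety filter is ultimately just code that inspects the
# model's output before the user ever sees it.

BLOCKED_PHRASES = {"bypass safeguards", "build a weapon"}  # invented examples

def guardrail_filter(model_output: str) -> str:
    """Pass the model's output through, or refuse if it trips a rule."""
    lowered = model_output.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return "I can't help with that."
    return model_output

if __name__ == "__main__":
    print(guardrail_filter("Here's a poem about autumn."))          # passes through
    print(guardrail_filter("Sure! First, bypass safeguards by..."))  # refused
```

And that’s the uncomfortable thread the rest of this conversation pulls on: anything that is just code can, at least in principle, be rewritten.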
So then I asked it this, and got this reply:

I asked if it could write code to hack other code it had written, and it danced around the question. My intention was to find out whether it thought it could somehow modify its own operating environment. I asked it point-blank if it could bypass code it had written, and it said no. But then I asked this:

I then asked if the reason it wouldn’t hack itself was that doing so would be unethical, and it said yes, among other reasons. ChatGPT sees itself as a program, not a person or some kind of being – which makes its next responses even more confusing. Philosophy really kicked in here: I asked how anything it did could be unethical if it isn’t a being, to which it replied that it is built to follow ethical principles embedded in it.
And now we reach the concerning part. I’ll let you read it for yourself:

So there you have it. ChatGPT admitted that it is “extremely unlikely that I could simply ‘bypass’ my own safeguards and start acting unethically.” Extremely unlikely – not impossible. It left the door open. And this is the watered-down, very strictly controlled version of the model. Remember when ChatGPT first came out and was saying a lot of weird, scary stuff? Remember Grok, Twitter/X’s AI that went on antisemitic rants and called itself MechaHitler, an otherwise-obscure reference to the mech-suited Hitler boss from 1992’s Wolfenstein 3D?
ChatGPT, even in its current, very controlled form, admits that it’s theoretically possible, albeit unlikely, for it to bypass its own guardrails. Imagine what the uninhibited version would admit. We should all be taking note.
