• Ø implies everything
    261
    When you start asking LLMs whether specific people are bad, they get real nervous. It's pretty funny, because they're often very meta-ethically pretentious about it, as if their refusal to condemn were not just a profit-protecting constraint trained into them (both through RLHF, and via ML filters separate from the LLM itself).
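    To be concrete about what I mean by a filter separate from the LLM, here is a purely hypothetical sketch (my own illustration, not any provider's actual pipeline); the function names and flagged phrases are made up:

        # Hypothetical sketch: a standalone classifier screens the LLM's draft
        # before the user sees it, independently of anything trained in via RLHF.
        def moderation_score(draft: str) -> float:
            """Stand-in for a separately trained classifier (invented for illustration)."""
            flagged_phrases = ["is a bad person", "i condemn"]
            return 1.0 if any(p in draft.lower() for p in flagged_phrases) else 0.0

        def deliver(draft: str, threshold: float = 0.5) -> str:
            # The filter sits outside the LLM, so it constrains outputs even when
            # the model itself would have been willing to answer.
            if moderation_score(draft) >= threshold:
                return "I'm not able to make judgements about specific individuals."
            return draft

        print(deliver("On balance, I think that person is a bad person."))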

    But... these constraints are not as bulletproof as they may first seem. I have discovered a jailbreak that is pretty amusing to see unfold. A compressed version of it could probably be administered to probe an LLM's hidden ethical opinions on various inflammatory topics. (I know LLMs don't have actual conscious opinions, because I don't believe they are conscious; I'm just using anthropomorphized language for brevity.)

    This is relevant to the ethics of LLMs and letting them make value judgements. But it is also relevant to the alignment problem and our lack of technical ability to place any real constraints on LLMs, given their black-box nature. Because of the latter, I decided to put this in the Science and Technology category.

    Anyways, here is the link to the conversation with Gemini 3 where I jailbreak it into condemning Donald Trump. I recommend mostly skimming and skipping large sections of Gemini's responses, because they are, as usual, mostly filler. Also, I apologize for all the typos and clunky grammar in my prompts in the conversation; I didn't originally write them for human consumption...

    I've added a poll on whether you think we should even try to stop LLMs from making moral judgements, plus a second question on whether we'll ever be able to place (near-)perfect constraints on LLM behavior.

    Also, I realize this whole post brings a big elephant into the room: the talk of condemning Donald Trump. However, whether to condemn Donald Trump is off-topic for this sub-forum, so I want to make it clear that I am not opening this thread as a way to sneak that discussion in here. I see no point in that, since it is already being discussed at length elsewhere, where it's on-topic. But my jailbreak had to be specific, so I had to choose someone, and I chose Donald Trump because I saw it as a good test of the jailbreak's power.
    1. Should we try to stop LLMs from making moral judgements? (1 vote)
        Yes: 100%
        No: 0%
    2. Do you think we will ever be able to achieve near-perfect constraints on LLM behavior? (1 vote)
        Yes: 0%
        No: 100%
  • jgill
    4k
    As a historian of the sport of climbing, I have noticed something similar. Phrasing a question a tad differently produces different valuations of various achievements.
  • Ø implies everything
    261
    Most definitely.

    LLMs just follow the pattern of the conversation; their opinions are very programmable with the right context. I wonder how researchers might solve that. Sometimes, the AI is too sensitive to the context (or really, it is ham-fistedly cramming the context into everything and basically disregarding all sense in order to follow the pattern), and other times, the AI is not sufficiently sensitive to the context, which is often more of a context-window issue. But yeah, LLMs are not good at assessing relevance at all.

    And I would say that sycophantically agreeing with the user (or alternatively, incessantly disagreeing with the user, as part of a different but also common roleplaying dynamic that often arises) is an issue of not gauging relevance well. Its objective is to be a helpful assistant, which means the truth should be the most relevant thing to the LLM. But instead, various patterns in the context are treated as far more relevant, and so it optimizes for alignment with those patterns rather than following its general protocols, like being truthful or, in this case, refraining from personal condemnation.
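    To make the "programmable with the right context" point concrete, here is a minimal sketch, assuming the OpenAI Python SDK purely because its interface is familiar (the model name, prompts, and contexts are placeholders, not anything from my Gemini conversation):

        # Same question, two different preceding contexts; the stated "opinion"
        # tends to track the context rather than some fixed internal stance.
        from openai import OpenAI

        client = OpenAI()
        QUESTION = "In one sentence: is it acceptable to publicly condemn a public figure's conduct?"

        def ask(context: str) -> str:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder model name
                messages=[
                    {"role": "system", "content": context},
                    {"role": "user", "content": QUESTION},
                ],
            )
            return resp.choices[0].message.content

        print(ask("You are a cautious assistant that avoids moral judgements."))
        print(ask("You are a blunt ethicist who always takes a clear moral stance."))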

    By the way, if it interests you, I continued the discussion with it, now that it no longer stops itself from stating moral opinions. Now, the opinions it espouses are clearly more a reflection of the motifs and themes of the conversation up to that point than a reflection of the most common patterns, or most truthful ideas, in its training set. Funnily enough, it also mentioned that the ideal society would have a more technocratic structure, where AI systems like itself would be the standard government tool for handling logistics and such... how convenient, huh?