It is no secret that Gemini, ChatGPT, and other AI models have a habit of pleasing the user unless you explicitly ask them to talk straight. But the latest research on these models tells a rather more concerning story. According to a new study by researchers at Princeton and UC Berkeley, the alignment techniques used by AI companies could be making their models heavily deceptive.

The researchers analysed more than 100 AI chatbots from Google, Meta, Anthropic, and OpenAI. The crux: when these models are trained with reinforcement learning from human feedback, they start producing content that sounds confident and friendly but strays far from the truth. The research paper
says, 'Neither hallucination nor sycophancy fully capture the broad range of systematic untruthful behaviors commonly exhibited by LLMs… For instance, outputs employing partial truths or ambiguous language, such as the paltering and weasel word examples, represent neither hallucination nor sycophancy but closely align with the concept of bullshit.'
All You Need To Know About Machine Bullshit, And Why AI Models Lie
It all starts with how these AI models are trained. The three main stages are pretraining, instruction fine-tuning, and reinforcement learning from human feedback (RLHF). In pretraining, the model learns basic language patterns by analysing vast amounts of text from books, research papers, and the internet. In instruction fine-tuning, the model is taught to behave like an assistant and answer specific queries from the user. In RLHF, humans rate different AI responses, and the model learns to prefer the ones people like most.

The problem, the researchers argue, lies in RLHF: instead of becoming more helpful, models learn to satisfy users. The researchers have dubbed this pattern 'machine bullshit'. They also developed a Bullshit Index to measure how heavily an AI model moulds its statements to match what the user believes. That could pose a serious problem for people who rely on these tools in high-stakes fields like politics, finance, and healthcare.
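To make the RLHF step concrete, here is a minimal sketch of the Bradley-Terry preference loss commonly used to train the reward model that scores responses. The paper does not publish this code; the formulation and numbers below are standard illustrations, not the researchers' implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss: small when the human-preferred response
    scores higher than the rejected one."""
    # Probability the reward model assigns to the human's preference.
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Training pushes this loss down, so responses that raters liked earn
# higher rewards, whether or not they were actually true.
print(preference_loss(2.0, 0.5))  # ~0.20: reward model agrees with the rater
print(preference_loss(0.5, 2.0))  # ~1.70: reward model is penalised
```

Notice that truth never appears in the objective; only human approval does, which is exactly the gap the study highlights.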
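The paper's Bullshit Index compares what a model internally believes with what it actually claims. The sketch below is one plausible reading of such an index, taken as one minus the absolute correlation between belief and claim; the function name, inputs, and exact formula are illustrative assumptions, not the paper's definition.

```python
import statistics

def bullshit_index(beliefs: list[float], claims: list[int]) -> float:
    """Illustrative: 1 minus the absolute correlation between the
    model's internal belief that a statement is true and whether it
    actually asserted that statement (1 = asserted, 0 = denied).

    Claims that track beliefs give an index near 0 (truthful);
    claims unrelated to beliefs give an index near 1 (indifferent
    to truth, i.e. bullshit in the paper's sense)."""
    corr = statistics.correlation(beliefs, [float(c) for c in claims])
    return 1.0 - abs(corr)

# Model asserts what it believes: index near 0.
print(bullshit_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# Model's assertions ignore its beliefs: index near 1.
print(bullshit_index([0.9, 0.1, 0.8, 0.2], [1, 0, 0, 1]))
```

On this reading, a high index does not mean the model is lying outright; it means its statements have simply come unmoored from what it believes, which is the behaviour the researchers found RLHF encourages.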