r/slatestarcodex 2d ago

Most Questionable Details in 'AI 2027' — LessWrong

https://www.lesswrong.com/posts/6Aq2FBZreyjBp6FDt/most-questionable-details-in-ai-2027
u/SoylentRox 1d ago
  • I don't really understand how a local copy of the weights gives the terrorists more practical control over the software's alignment. I don't think it's easy to manually tweak weights for so specific a purpose. Maybe they just mean the API is doing a good job of blocking sketchy requests?

Just one specific criticism of this criticism: open-weight models can fairly easily have their restrictions stripped, especially refusals, when the underlying model is capable of performing the desired task: https://huggingface.co/perplexity-ai/r1-1776

Any model remotely useful for legitimate engineering or bioscience tasks will also be useful for designing bombs, killer drones, and bioweapons, just as competent human engineers in those fields can do those things, and models like r1 will eagerly help to the best of their ability if you say you are red-teaming and want to produce a demo of the attack.

This 'fine-tuning' is effectively a lobotomy of the circuits the model uses to refuse such requests, and like any lobotomy it may have unwanted side effects: https://huggingface.co/perplexity-ai/r1-1776/discussions/254
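The underlying point can be sketched with a deliberately tiny toy model (plain logistic regression, nothing like a real LLM, and every name here is hypothetical): once the weights sit on your own machine, a handful of gradient steps retune the behavior, and no API-side request filtering can intervene.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, w, b):
    # Output near 0 = "refuse", near 1 = "comply".
    return sigmoid(sum(wi * xi for wi, xi in zip(x, w)) + b)

# "Pretrained" local weights with a strong bias toward refusing.
w, b = [0.1, 0.1], -5.0

# Toy feature vectors standing in for prompts the operator wants answered.
prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

before = [predict(x, w, b) for x in prompts]

# "Fine-tuning": ordinary gradient descent pushing every output toward 1.
lr = 1.0
for _ in range(200):
    for x in prompts:
        err = predict(x, w, b) - 1.0   # target = comply
        for i in range(len(w)):
            w[i] -= lr * err * x[i]
        b -= lr * err

after = [predict(x, w, b) for x in prompts]
print("before:", [round(p, 3) for p in before])
print("after: ", [round(p, 3) for p in after])
```

The toy also hints at the "side effects" point: the gradient updates touch every weight the refusal behavior runs through, not a cleanly separable "refusal module".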

u/nexech 1d ago

Thanks for the links; I wasn't very familiar with this exploit.

u/SoylentRox 1d ago

I wouldn't call it an exploit so much as the nature of the tool. Is it an "exploit" that, if you have a shotgun and a bandsaw, you can always shorten the barrel to make a sawed-off?