

Perhaps LLMs will have an easier time forgetting than Jim Carrey. (Image source: https://filmdaze.net/eternal-sunshine-of-the-spotless-mind-15th-anniversary-review/)
New AI developments are constantly blasting at us. Unfortunately, most reported ‘innovations’ hardly meet the term’s definition. There is a lot of hype, and in that sea of information the most interesting research often gets buried. On occasion I’d like to highlight a few developments from the innovation side of AI to show how the science is changing, and how those changes might impact downstream policy.
This week, I’ve selected two new studies. While neither received splashy media attention, both represent changing trends and innovations that policymakers should consider.
Eternal Sunshine of the Spotless LLM
We’ll start with the most interesting. In a small corner of AI research, concept erasure, the ability to select a discrete concept and erase it from an AI system’s knowledge and understanding, has remained a steadfast interest, albeit one computer scientists have never quite gotten to work. On June 6th, however, that changed. In a preprint arXiv study, EleutherAI presented a novel method they call LEAst-squares Concept Erasure (LEACE). This research not only demonstrates that concept erasure is indeed possible, but presents a method that works exceedingly well.
While LEACE may not be the end-all, be-all of concept erasure methods, it requires us to consider how such techniques can be both used and misused. On both fronts, the potential cannot be overstated. Today, artificial intelligence engineering is associated with the unwieldy, unpredictable, and uncertain. While engineers hold some control over algorithmic models and training data, that grip is exceedingly loose. Inevitably this leads to unwanted correlations, behaviors, and biases, and the result is quality degradation. Without a scalpel to excise problematic learned associations, engineers have largely leaned on blunt fine-tuning methods and blacklists of problematic outputs that paper over unwanted behavior rather than stopping it at its source. With concept erasure added to the AI engineering toolbelt, however, engineers may suddenly hold a powerful, targeted antiseptic.
What’s useful for engineers, may also pique Washington’s interest.
For many policy-relevant questions, the value of concept erasure is clear. As a demonstration, the paper shows that LEACE can effectively remove discrete manifestations of algorithmic gender bias (an AI challenge targeted in the White House’s AI Bill of Rights). Using LEACE, the researchers were able to largely remove the outdated correlation between ‘nurses’ and ‘female’ from a large language model.* Crucially, this success came without side effects: the model’s overall performance was unaffected, and even its understanding of directly related concepts, such as what nurses are and what the nursing profession does, remained untouched. Removing the bad was not the enemy of the good. Naturally, deleting this discrete gender bias is not a panacea for the complexities of gender and algorithmic gender bias. Still, mitigation suddenly feels not only tractable but, in discrete cases, relatively straightforward.
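To give a feel for the mechanics, here is a minimal sketch of the core idea behind linear concept erasure: find the direction in a model’s representation space that a least-squares predictor would use to recover the concept, then project it out. This is my own simplified illustration, not EleutherAI’s exact LEACE estimator (which additionally whitens the representations so the edit is as small as possible); the toy data and function names are assumptions for demonstration only.

```python
# Simplified linear concept erasure (illustrative; not the exact LEACE estimator).
import numpy as np

def fit_eraser(X, z):
    """X: (n, d) hidden representations; z: (n,) binary concept labels.
    Returns a function that projects out the direction along which the
    concept is linearly encoded (the X-z cross-covariance direction)."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    zc = z - z.mean()
    direction = Xc.T @ zc                      # (d,) cross-covariance with the concept
    q = direction / np.linalg.norm(direction)  # unit vector carrying the concept
    P = np.eye(X.shape[1]) - np.outer(q, q)    # projector that zeroes that direction

    def erase(x):
        return (x - x_mean) @ P + x_mean       # re-center so everything else is preserved
    return erase

# Toy usage: plant a 'concept' (e.g. gender) along one coordinate and erase it.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 64))
X[:, 3] += 2.0 * z                             # concept-correlated direction

erase = fit_eraser(X, z)
X_clean = erase(X)
# Class-mean gap along the planted coordinate: ~2.0 before erasure, ~0 after.
print(X[z == 1, 3].mean() - X[z == 0, 3].mean())
print(X_clean[z == 1, 3].mean() - X_clean[z == 0, 3].mean())
```

The actual LEACE method goes further, choosing the edit that changes the representations as little as possible while provably preventing any linear classifier from recovering the concept, but the projection above captures the basic intuition.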
This bias mitigation ability may represent only one slice of concept erasure’s overall potential. What other policy-relevant challenges could benefit from information pruning? Cyberthreats come immediately to mind. A common worry in the AI safety community and among regulators is the potential for publicly available LLMs to either write malware or guide otherwise cyber-ignorant average Joes through launching a sophisticated cyberattack. While it’s unclear how substantial or serious this concern may be, some worry is justified. Check Point Research’s monitoring of underground hacking communities has already verified that hackers, including apparent coding novices, are actively using ChatGPT to craft malware. In one instance, a dark web forum poster showed how they used ChatGPT to write a basic info stealer, malware that searches for files and copies them to a separate, hacker-owned server. While concept erasure certainly can’t stop AI systems from producing any and all code that might be dangerous, it could perhaps adjust a model’s knowledge to prevent such code from ever being written. If a model doesn’t understand certain cyberattacks, for instance, it cannot guide users through those attacks or write relevant code. To give this process structure, engineers could look to organizations like the Cybersecurity and Infrastructure Security Agency (CISA), which routinely publishes lists of top cyber vulnerabilities. Using concept erasure and these codified vulnerability resources, perhaps developers could root out this cyber knowledge, promoting a modest cyber baseline.
While all this potential is exciting, it’s always important to stay grounded. At this juncture, these methods are still developing, this research is brand new, and all use cases represent only possibility. That said, policymakers can still start considering concept erasure’s use today. February’s Executive Order 14091 on advancing racial equity in the federal government ordered the Director of the Office of Management and Budget (OMB) to “consider opportunities” to update government guidance and processes related to artificial intelligence and automation in order to “support equitable decision-making.” Given concept erasure’s demonstrated bias-fighting abilities, this tool may represent one such “opportunity.” Pursuant to this order, and especially as the OMB is actively writing forthcoming executive guidance on government AI use, the office should look into this concept, the LEACE paper, and other related methods to verify their potential. If concept erasure methods hold water, the OMB should consider investing in further study and in its use to improve government systems.
Before moving on, I want to stress that concept erasure’s potential is a clear double-edged sword. While a useful mitigant to certain forms of bias or cyber threat, it may also serve to censor and control information. Any application of such methods should be undertaken with restraint and care, and censorship should be avoided at all costs. Policymakers should note that the same tools that can help scrub gender bias from AI can also scrub out mentions of ‘Tiananmen Square.’
Another step policymakers at all levels of government can take today is to consider limits on AI censoring tools such as concept erasure in government applications. Tackling these sticky questions today will ensure that by the time these methods are verified and mature, any needed limits on their use will already have been hashed out. This will avert unnecessary harm and ease adoption for positive use cases.
Un-Hidden Figures
Next, let’s briefly discuss the latest research from OpenAI, “Let’s Verify Step by Step.” Admittedly, the policy implications of this research are not quite as deep. Rather, what makes this paper interesting is the methodological progress it represents. Using a technique called “process supervision,” researchers achieved state-of-the-art mathematical reasoning and created a model with a 78% success rate on complex mathematical problems. On its own, this mathematical success is impressive; math is one area where systems like ChatGPT have traditionally struggled. The technique, however, is more interesting. Unlike more traditional AI training techniques, which guide (or ‘reward’) system improvement by validating whether final outcomes are accurate, this technique rewards systems that follow correct processes. During training, the AI walks its way through a problem, each step of the process is validated, and the system is only ‘rewarded’ if the steps it takes are correct.
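To make the distinction concrete, here is a toy sketch (my own illustration with made-up helper names, not OpenAI’s implementation): outcome supervision scores only the final answer, while process supervision requires every intermediate step to pass a verifier.

```python
# Toy contrast between outcome supervision and process supervision.
from typing import Callable, List

def outcome_reward(final_answer: str, correct_answer: str) -> float:
    # Outcome supervision: only the end result is checked, so a model can be
    # rewarded even when its intermediate reasoning was wrong but lucky.
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(steps: List[str], check_step: Callable[[str], bool]) -> float:
    # Process supervision: every intermediate step must be validated;
    # a single faulty step zeroes out the reward.
    return 1.0 if all(check_step(s) for s in steps) else 0.0

# Hypothetical usage with a trivially checkable arithmetic chain. In the paper,
# the per-step checks come from human labels used to train a process reward model.
steps = ["12 * 4 = 48", "48 + 7 = 55"]

def check_step(step: str) -> bool:
    lhs, rhs = step.split("=")
    return eval(lhs) == int(rhs)  # stand-in for a learned step verifier

print(outcome_reward(final_answer="55", correct_answer="55"))  # 1.0
print(process_reward(steps, check_step))                       # 1.0
```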
Focusing on correct algorithmic processes yields better results. Intriguingly, it also yields better explainability. Because systems are forced to discretely outline their reasoning during training, a positive externality emerges: the model learns to produce step-by-step reasoning that is clear and can be human-validated.
The paper includes an example output. Not only does the system arrive at the correct answer to a (in my view) complicated question, it also walks through the numbers step by step and even manages the complexity of wielding mathematical identities to support its reasoning. Incredible.
Naturally, 78% is not yet good enough for the certainty demands of applied math. That said, these results are not only promising for math; they also show that the wicked problem of AI explainability is finding traction. When the original ChatGPT was released in November, its mathematical success rate was a dismal 40%.** The rate of improvement has not abated, and AI innovation continues its impressive momentum.
*Note that after concept erasure, a weak correlation between ‘nurse’ and ‘female’ still remained. The improvement was dramatic, but perfection remains a work in progress.
**Note this figure is measured on a slightly different math benchmark.
"Using LEACE, the researchers were able to largely remove the outdated correlation between ‘nurses’ and ‘female’ from a large language model." Can you explain why you characterize this correlation as "outdated"? Over 86% of nurses are female in 2023.
Hi Matthew. Great article and thanks for digging in. Curious if you think that concept erasure will work in the long haul when the AI algorithms are trained to mimic and/or otherwise learn from human behavior and statistics (as is the case with nurses being correlated with women). In other words, while the initial erasure “treatment” showed large improvement initially, doesn’t such an algorithm re-learn the biasness as an innate function of the AI coding? And, if the algorithm is to weight input parameters to correct for biasness, then would the algorithm be trustworthy (or would that depend on the weighting and the coding)? With so much to consider, the ongoing evolution of AI/ML will be quite the ride.
- Bob C (your old neighbor)