See inside the black box?
An example of such features might be all conjugations of a particular verb, or any term that suggests “more than.” Working at the level of features lets the researchers identify whole “circuits” of neurons that tend to act together, giving a clearer picture of how the model works.
“Our method decomposes the model, so we get pieces that are new, that aren’t like the original neurons, but there’s pieces, which means we can actually see how different parts play different roles,” Josh Batson [an Anthropic researcher who worked on the project] said. “It also has the advantage of allowing researchers to trace the entire reasoning process through the layers of the network.”
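To make the decomposition idea concrete, here is a minimal sketch of replacing a dense activation vector with a sparser set of learned features. The dimensions, variable names, and the simple encoder/decoder are illustrative assumptions, not Anthropic's actual CLT code.

```python
# Hypothetical illustration: decomposing a model's internal activations into
# sparser, more interpretable "features" via a learned dictionary.
# This is NOT Anthropic's CLT implementation; names and shapes are invented.
import numpy as np

rng = np.random.default_rng(0)

d_model = 512      # width of the (hypothetical) activation vector
n_features = 4096  # number of learned interpretable features

# A learned "dictionary": each row is the direction one feature writes back
# into activation space. In practice these would be trained, not random.
decoder = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
encoder = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
bias = np.zeros(n_features)

def encode(activation: np.ndarray) -> np.ndarray:
    """Map a dense activation vector to non-negative feature activations."""
    return np.maximum(activation @ encoder + bias, 0.0)  # ReLU zeroes most features

def decode(features: np.ndarray) -> np.ndarray:
    """Reconstruct an approximation of the original activation from the features."""
    return features @ decoder

activation = rng.normal(size=d_model)   # stand-in for one token's activation
features = encode(activation)           # which "features" are active here?
reconstruction = decode(features)       # approximation of the original activation

top = np.argsort(features)[-5:][::-1]
print("Most active feature indices:", top)
print("Reconstruction error:", np.linalg.norm(activation - reconstruction))
```

The point of the sketch is only the shape of the computation: a dense, hard-to-read activation is mapped onto a handful of active features, and circuits are then traced between those features rather than between raw neurons.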
Still, Anthropic said the method did have some drawbacks.
It is only an approximation of what is actually happening inside a complex model like Claude. Neurons that fall outside the circuits the CLT method identifies may still play subtle but critical roles in producing some of the model’s outputs.
The CLT technique also doesn’t capture a key part of how LLMs work: attention, the mechanism by which the model places different degrees of importance on different portions of the input prompt as it formulates its output. Those attention weights shift dynamically during generation, and the CLT cannot capture the shifts, which may play a critical role in LLM thinking.
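For readers who want to see what attention means mechanically, here is a minimal toy sketch of scaled dot-product attention with made-up dimensions and random values; it stands in for the general mechanism described above, not for Claude’s internals.

```python
# Toy illustration of attention: each position assigns every input token a
# weight that says how much it matters when producing the output at that
# position. Dimensions and values here are invented for illustration.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n_tokens, d_head = 6, 8                    # a 6-token prompt, tiny head size

queries = rng.normal(size=(n_tokens, d_head))
keys    = rng.normal(size=(n_tokens, d_head))
values  = rng.normal(size=(n_tokens, d_head))

# Each token scores every other token; softmax turns the scores into weights
# that sum to 1 -- the "degree of importance" placed on each input position.
scores = queries @ keys.T / np.sqrt(d_head)
weights = softmax(scores)

# The output for each position is a weighted mix of the value vectors.
output = weights @ values

print("Attention weights for the last token:", np.round(weights[-1], 2))
```

Because these weights are recomputed for every new token the model generates, the importance assigned to each part of the prompt keeps shifting during generation, which is the dynamic behavior the CLT cannot trace.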
Anthropic also said that discerning the network’s circuits, even for prompts that are only tens of words long, takes a human expert several hours.
It said it isn’t clear how the technique could be scaled up to handle much longer prompts.