Limitations of multi-layer transformers


"For any constant L, we prove that any L-layer decoder-only transformer needs a polynomial model dimension (nΩ(1)) to perform sequential composition of L functions over an input of n tokens.

"As a consequence, our results give:
  1. The first depth-width trade-off for multi-layer transformers, exhibiting that the L-step composition task is exponentially harder for L-layer models compared to (L+1)-layer ones;
  2. An unconditional separation between encoders and decoders, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder;
  3. A provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought."
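
The chain-of-thought claim in item 3 is easy to see at this level: if the model is allowed to emit the intermediate values, each decoding step only has to apply one function, rather than compute the whole L-fold composition in a single forward pass. A minimal sketch of that decomposition follows; the per-step lookup merely stands in for whatever a shallow transformer would compute at each generated token.

    # Hedged sketch: with chain-of-thought, the model writes out the chain
    # x, f_1(x), f_2(f_1(x)), ... one value per generated step, so every step
    # is a single one-hop lookup instead of the full L-step composition.
    def compose_with_cot(funcs, start):
        chain = [start]
        for f in funcs:                  # one decoding step per hop
            chain.append(f[chain[-1]])   # each step applies just one function
        return chain                     # chain[-1] is the final answer

    # Tiny hard-coded example: two functions on the domain {0, 1, 2}.
    f1, f2 = [1, 2, 0], [2, 2, 1]
    print(compose_with_cot([f1, f2], start=0))   # [0, 1, 2] -> answer 2
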
"On the technical side, we propose the multi-party autoregressive communication model that captures the computation of a decoder-only Transformer. 

"We also introduce a new proof technique that finds a certain indistinguishable decomposition of all possible inputs iteratively for proving lower bounds in this model."

