Limitations of multi-layer transformers


"For any constant L, we prove that any L-layer decoder-only transformer needs a polynomial model dimension (nΩ(1)) to perform sequential composition of L functions over an input of n tokens.

As a consequence, our results give: 
  1. The first depth-width trade-off for multi-layer transformers, exhibiting that the L-step composition task is exponentially harder for L-layer models compared to (L+1)-layer ones; 
  2. An unconditional separation between encoder and decoder, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder; 
  3. A provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought."
"On the technical side, we propose the multi-party autoregressive communication model that captures the computation of a decoder-only Transformer. 

"We also introduce a new proof technique that finds a certain indistinguishable decomposition of all possible inputs iteratively for proving lower bounds in this model."

