Understanding the main architecture and tasks:

  1. The original Transformer paper, which introduces scaled dot-product attention (https://arxiv.org/abs/1706.03762)

  2. BERT for an encoder-style LLM and masked-language modeling for prediction tasks (https://arxiv.org/abs/1810.04805)

  3. GPT for a decoder-style LLM for generative modeling (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

  4. BART, which combines the encoder and decoder parts again (https://arxiv.org/abs/1910.13461)
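
Scaled dot-product attention, the core operation from the first paper in this list, is small enough to sketch directly. The following is a minimal pure-Python illustration, assuming a single attention head, no masking, and plain lists of lists as matrices; the function names are illustrative and not taken from any of the papers' code.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n x d_k, K is m x d_k, V is m x d_v (lists of lists).
    Returns an n x d_v list of lists.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


# A zero query scores every key equally, so the output is the
# plain average of the value rows.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Real implementations batch this over many heads and sequences with matrix libraries; the loop form here is only meant to make each step visible.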

Efficiency

  1. FlashAttention, an exact attention algorithm that speeds up attention by reducing GPU memory reads and writes (https://arxiv.org/abs/2205.14135)

  2. Cramming, on training a language model on a single GPU in one day (https://arxiv.org/abs/2212.14034)

Alignment

  1. InstructGPT to align LLMs with human feedback (https://arxiv.org/abs/2203.02155)

  2. Constitutional AI for more explicit alignment (https://arxiv.org/abs/2212.08073)

(chat)GPT alternatives

  1. BLOOM, a distributed open-source effort (https://arxiv.org/abs/2211.05100)

  2. Sparrow, DeepMind’s ChatGPT-style dialogue agent (since there is no ChatGPT paper) (https://arxiv.org/abs/2209.14375)

  3. BlenderBot3, Meta’s ChatGPT alternative that can search the internet (https://arxiv.org/abs/2208.03188)

  1. Google’s Attention Is All You Need: the first paper to read, because the Transformer architecture is the basis of essentially all current LLMs
    https://arxiv.org/pdf/1706.03762.pdf
  2. Google’s BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    https://arxiv.org/pdf/1810.04805.pdf
  3. OpenAI’s GPT-3 (Language Models are Few-Shot Learners), the model family that ChatGPT builds on
    https://arxiv.org/pdf/2005.14165.pdf
  4. Google’s LaMDA: Language Models for Dialog Applications
    https://arxiv.org/pdf/2201.08239.pdf
  • https://link.springer.com/article/10.1007/s11023-020-09548-1

  • neurons are not parameters

  • the weights and biases of neurons are one type of parameter

  • other learned internal variables, such as embedding tables and normalization scales, also count as parameters

  • tokens are roughly words; in practice they are often subword pieces, so one word can span several tokens

  • for each token, the model tries to predict the next token
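
The notes above on parameters and next-token prediction can be made concrete with a small sketch. It assumes a plain fully connected layer for the parameter count and made-up integer token ids for the prediction pairs; both function names are hypothetical helpers written for this illustration.

```python
def dense_layer_param_count(n_inputs, n_neurons):
    # Each neuron holds one weight per input plus one bias.
    # The neurons themselves are not counted, only their weights and biases.
    return n_inputs * n_neurons + n_neurons


def next_token_pairs(token_ids):
    # At position i the model sees tokens[0..i] as context
    # and is trained to predict tokens[i + 1].
    return [(token_ids[: i + 1], token_ids[i + 1]) for i in range(len(token_ids) - 1)]


# A layer with 3 neurons and 4 inputs: 3 neurons, but 4 * 3 + 3 parameters.
print(dense_layer_param_count(4, 3))

# Made-up token ids standing in for a short tokenized sentence.
for context, target in next_token_pairs([5, 7, 9, 2]):
    print(context, "->", target)
```

A sequence of n tokens therefore yields n - 1 training examples, one per position that has a following token.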