Understanding the main architecture and tasks:
- The original transformer for the intro to scaled dot-product attention (https://arxiv.org/abs/1706.03762)
- BERT for an encoder-style LLM and masked-language modeling for prediction tasks (https://arxiv.org/abs/1810.04805)
- GPT for a decoder-style LLM for generative modeling (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
- BART to combine both encoder & decoder parts again (https://arxiv.org/abs/1910.13461)
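
As a companion to the first paper in the list above, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The function names and the toy inputs are made up for illustration; real implementations work on batched tensors and add masking and multiple heads.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Each argument is a list of vectors (lists of floats).

    For every query: score it against every key, scale by sqrt(d_k),
    softmax the scores, and return the weighted average of the values.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: a zero query scores every key equally,
# so the output is the plain average of the two value vectors.
print(scaled_dot_product_attention(
    [[0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
))  # [[0.5, 0.5]]
```
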
Efficiency
- FlashAttention to boost efficiency (https://arxiv.org/abs/2205.14135)
- Cramming to train on a single GPU (https://arxiv.org/abs/2212.14034)
Alignment
- InstructGPT to align LLMs (https://arxiv.org/abs/2203.02155)
- Constitutional AI for more explicit alignment (https://arxiv.org/abs/2212.08073)
(chat)GPT alternatives
- BLOOM, a distributed open-source effort (https://arxiv.org/abs/2211.05100)
- Sparrow, DeepMind’s ChatGPT offering (since there is no ChatGPT paper) (https://arxiv.org/abs/2209.14375)
- BlenderBot3, Meta’s ChatGPT alternative that can search the internet (https://arxiv.org/abs/2208.03188)
- Google’s "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf). This is the first paper to read, because the Transformer architecture is the basis of all LLMs.
- Google’s "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/pdf/1810.04805.pdf)
- OpenAI’s GPT-3, the model family that ChatGPT builds on (https://arxiv.org/pdf/2005.14165.pdf)
- Google’s "LaMDA: Language Models for Dialog Applications" (https://arxiv.org/pdf/2201.08239.pdf)
Links
- https://link.springer.com/article/10.1007/s11023-020-09548-1
- Neurons are not parameters.
- The weights and biases of neurons are one type of parameter.
- There are other internal variables that can also be considered parameters.
- Tokens are roughly words; in practice a token is often only a piece of a word.
- For each token, the model tries to predict the next token.
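
The notes above can be made concrete with a small sketch, assuming a plain fully connected network for the parameter count (real LLMs have more kinds of parameters, such as embeddings). The helper names and the token split of "unbelievable" are hypothetical and do not come from any real tokenizer.

```python
def count_neurons(layer_sizes):
    """Neurons are the units in every layer after the input layer."""
    return sum(layer_sizes[1:])


def count_dense_parameters(layer_sizes):
    """Parameters are the weights and biases attached to those neurons."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-to-neuron connection
        total += n_out         # one bias per neuron
    return total


def next_token_pairs(tokens):
    """For each position, pair the tokens seen so far with the token to predict."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# A network with 4 inputs, 3 hidden neurons, and 2 output neurons.
sizes = [4, 3, 2]
print(count_neurons(sizes))           # 5
print(count_dense_parameters(sizes))  # 23 (18 weights + 5 biases)

# Made-up subword split: one word, three tokens.
for context, target in next_token_pairs(["un", "believ", "able"]):
    print(context, "->", target)
```

Even in this tiny network there are far more parameters (23) than neurons (5), which is why the two counts should not be confused.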