Large Language Model

Understanding the main architecture and tasks:

The original Transformer paper, which introduces scaled dot-product attention (https://arxiv.org/abs/1706.03762)
BERT for an encoder-style LLM and masked-language modeling for prediction tasks (https://arxiv.org/abs/1810.04805)
GPT for a decoder-style LLM for generative modeling (https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
BART to combine both encoder & decoder parts again (https://arxiv.org/abs/1910.13461)
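
As a companion to the papers above, here is a minimal sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, in plain Python. It is only an illustration under simplifying assumptions: a single head, no batching, and no learned projections. The function names and the `causal` flag are made up for this note and do not come from any of the papers' code. Setting `causal=True` stops each position from attending to later ones, which is roughly what separates a decoder-style model (GPT) from an encoder-style one (BERT).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Q, K, V are lists of equal-length float lists (one row per position).
    causal=True masks out keys at positions after the query position.
    Both the name and the flag are hypothetical, chosen for this sketch.
    """
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Later positions get -inf, so their softmax weight becomes 0.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(scaled_dot_product_attention(Q, K, V))
    print(scaled_dot_product_attention(Q, K, V, causal=True))
```

With the causal mask on, the first output row equals the first value row, because position 0 can only attend to itself. Real implementations add multiple heads, learned projection matrices, and batched tensor operations on top of this core step.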

Efficiency

Alignment

InstructGPT to align LLMs (https://arxiv.org/abs/2203.02155)
Constitutional AI for more explicit alignment (https://arxiv.org/abs/2212.08073)

(chat)GPT alternatives

BLOOM, a distributed open-source effort (https://arxiv.org/abs/2211.05100)
Sparrow, DeepMind’s ChatGPT counterpart, listed here because there is no ChatGPT paper (https://arxiv.org/abs/2209.14375)
BlenderBot3, Meta’s ChatGPT alternative that can search the internet (https://arxiv.org/abs/2208.03188)

Google: Attention Is All You Need. Read this paper first, because the Transformer architecture is the basis of the LLMs listed here.
https://arxiv.org/pdf/1706.03762.pdf
Google: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
https://arxiv.org/pdf/1810.04805.pdf
OpenAI: Language Models are Few-Shot Learners, the GPT-3 paper; ChatGPT builds on the GPT-3 family of models.
https://arxiv.org/pdf/2005.14165.pdf
Google: LaMDA: Language Models for Dialog Applications
https://arxiv.org/pdf/2201.08239.pdf
