This repo contains a reproducible data recipe for the RedPajama data, with the following token counts:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
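
As a quick sanity check on the figures above, the following minimal sketch sums the per-dataset counts and prints each dataset's share of the total. The variable names (`token_counts_billion`, `shares`) are illustrative and not part of this repo; the numbers are copied from the listed counts, which are themselves rounded, so the shares are approximate.

```python
# Token counts in billions, copied from the figures listed above.
# The dictionary and variable names are illustrative only.
token_counts_billion = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

# Sum of the individual counts (in billions).
total_billion = sum(token_counts_billion.values())

# Fraction of the total contributed by each dataset.
shares = {
    name: count / total_billion
    for name, count in token_counts_billion.items()
}

for name, share in shares.items():
    print(f"{name:<14} {share:6.1%}")
print(f"Total: {total_billion} billion (~{total_billion / 1000:.1f} trillion)")
```

The individual counts add up to 1,210 billion tokens, which matches the stated total of 1.2 trillion after rounding.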