Published in Ai2 Blog

Dolma: 3 Trillion Token Open Corpus for Language Model Pretraining

We released Dolma, OLMo’s pretraining dataset. Dolma is an open dataset of 3 trillion tokens, available on HuggingFace under the ImpACT license.

Aug 18, 2023