MosaicML has unveiled MPT-7B-8K, an open-source large language model (LLM) with 7 billion parameters and a context length of 8K tokens.
According to the company, the model was trained on the MosaicML platform, with pre-training starting from the MPT-7B checkpoint. The pre-training was carried out on Nvidia H100s, with an additional three days of training on 256 H100s incorporating a whopping 500 billion tokens of data.
Previously, MosaicML made waves in the AI community with its release of MPT-30B, an open-source, commercially licensed decoder-based LLM. The company claimed it was more powerful than GPT-3-175B, despite having only 17% of GPT-3's parameters, or about 30 billion.
MPT-30B outperformed GPT-3 on various tasks and proved more efficient to train than similarly sized models. For example, LLaMA-30B required approximately 1.44 times the FLOPs budget of MPT-30B, while Falcon-40B required 1.27 times the FLOPs budget of MPT-30B.
MosaicML says the new MPT-7B-8K model demonstrates exceptional proficiency in document summarization and question answering tasks compared to all previously released models.
The company said the model is specifically optimized for faster training and inference, and allows fine-tuning on domain-specific data within the MosaicML platform.
The company also announced the availability of commercial-use licenses for MPT-7B-8K, highlighting that it was trained on a large dataset of 1.5 trillion tokens and outperforms similar models such as XGen, LLaMA, Pythia, OpenLLaMA and StableLM.
MosaicML claims that through the use of FlashAttention and FasterTransformer, the model excels at fast training and inference, while benefiting from the open-source training code available through the llm-foundry repository.
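The announcement does not spell out how the released checkpoints are consumed in practice, but a minimal loading sketch with Hugging Face transformers might look like the following. The repo id mosaicml/mpt-7b-8k, the max_seq_len config key and the reuse of the GPT-NeoX tokenizer are assumptions carried over from how earlier MPT releases were packaged.

```python
# Minimal sketch (not from the announcement): loading MPT-7B-8K via Hugging Face
# transformers. Repo id, config key and tokenizer choice are assumptions.
import torch
import transformers

name = "mosaicml/mpt-7b-8k"  # assumed Hugging Face repo id

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 8192  # request the full 8K context window (assumed config key)

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # MPT-style checkpoints ship custom modeling code
)

# Earlier MPT releases reused the GPT-NeoX tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```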
The company released the model in three variations:
- MPT-7B-8k: This decoder-style transformer is pre-trained starting from MPT-7B and further optimized with an extended sequence length of 8k. It receives an additional 500 billion tokens of training, bringing its total corpus to 1.5 trillion tokens of text and code.
- MPT-7B-8k-Instruct: This model is designed for long-form instruction-following tasks, including summarization and question answering. It is built by fine-tuning MPT-7B-8k on carefully curated datasets.
- MPT-7B-8k-Chat: This variant works as a chatbot-like model focused on dialogue generation. It is created by fine-tuning MPT-7B-8k on approximately 1.5 billion tokens of chat data.
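As a hedged illustration of the instruct variant in use, summarizing a long document could look like the sketch below; the repo id mosaicml/mpt-7b-8k-instruct and the prompt format are assumptions, not details given in the announcement.

```python
# Hypothetical usage sketch for the instruct variant; repo id and prompt
# format are assumptions, not taken from the announcement.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b-8k-instruct",
    trust_remote_code=True,
)

long_document = "..."  # roughly up to 8K tokens of input fit in the context window
prompt = f"Summarize the following document:\n\n{long_document}\n\nSummary:"

output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])
```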
MosaicML claims that the MPT-7B-8k models perform comparably to or better than other currently available open-source models with an 8k context length, as measured by the company's in-context learning evaluation harness.
The announcement coincides with Meta’s introduction of the LLaMA 2 model, now available on Microsoft Azure. Unlike LLaMA 1, LLaMA 2 offers models of various sizes, with 7, 13 and 70 billion parameters.
Meta says these pre-trained models were trained on a vast dataset of two trillion tokens, 40% larger than that of LLaMA 1, with a context length double that of LLaMA 1. LLaMA 2 outperforms its predecessor on Meta's benchmarks.