NExT-GPT: The Open Source AI That Can Handle Any Modality


NExT-GPT is a new open source artificial intelligence model that can process and generate text, images, audio and video. It is the result of a collaboration between the National University of Singapore (NUS) and Tsinghua University. NExT-GPT aims to provide more natural and flexible interactions than text-only models like ChatGPT.

What is NExT-GPT and how does it work?

NExT-GPT takes its name from the NExT++ research group at NUS that developed it, combined with GPT (Generative Pre-trained Transformer). It is a multimodal large language model (LLM) that its authors describe as "any-to-any": it can handle any combination of text, image, audio and video as input and output. For example, it can take an image as input and generate a caption, or take a text prompt and generate a video.

NExT-GPT consists of several modules that work together to achieve this. The core module is a language model based on Vicuna, an open source chat model fine-tuned from Meta's LLaMA. The language model can process text inputs and outputs, as well as special tokens that indicate the presence of other modalities. For example, <img> is a token that signals an image input or output.
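To make the idea of modality tokens concrete, here is a minimal Python sketch that splits a piece of model output into plain-text segments and modality markers. The <img> marker comes from the description above; the <aud> and <vid> markers and the function name split_by_modality are assumptions made for illustration, not NExT-GPT's actual token vocabulary or API.

```python
import re

# "<img>" is taken from the article; "<aud>" and "<vid>" are hypothetical
# markers assumed by analogy, not confirmed NExT-GPT tokens.
MODALITY_TOKENS = {"<img>": "image", "<aud>": "audio", "<vid>": "video"}

# A capturing group makes re.split keep the markers in the result.
_TOKEN_PATTERN = re.compile(
    "(" + "|".join(re.escape(token) for token in MODALITY_TOKENS) + ")"
)


def split_by_modality(output: str) -> list[tuple[str, str]]:
    """Split model output into (modality, content) segments, in order."""
    segments = []
    for piece in _TOKEN_PATTERN.split(output):
        if piece in MODALITY_TOKENS:
            segments.append((MODALITY_TOKENS[piece], piece))
        elif piece.strip():
            # Anything that is not a marker is ordinary text.
            segments.append(("text", piece.strip()))
    return segments


print(split_by_modality("Here is the sunset you asked for: <img> Enjoy!"))
```

A downstream component could then hand each non-text segment to the matching generator while passing text segments straight through to the user.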

The other modules are encoders and decoders that connect non-text inputs and outputs to the language model. On the input side, ImageBind is an encoder that maps images, audio and video into a shared embedding space, and a projection layer converts those embeddings into text-like representations that the language model can understand. On the output side, diffusion-based decoders turn the language model's signals back into media: Stable Diffusion generates images, AudioLDM generates audio, and Zeroscope generates video.
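The overall flow, with a language model in the middle and modality-specific generators at the output, can be sketched as a simple dispatch table. This is a toy illustration under stated assumptions: the Segment class, the render function and the stand-in decoders are hypothetical names, and the decoders only return placeholder strings rather than calling Stable Diffusion, AudioLDM or Zeroscope.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    modality: str  # "text", "image", "audio" or "video"
    content: str   # plain text, or a prompt-like description for a decoder


# Stand-in decoders (hypothetical). A real system would invoke an image,
# audio or video generator here instead of formatting a string.
def fake_image_decoder(prompt: str) -> str:
    return f"[image generated from: {prompt}]"


def fake_audio_decoder(prompt: str) -> str:
    return f"[audio generated from: {prompt}]"


def fake_video_decoder(prompt: str) -> str:
    return f"[video generated from: {prompt}]"


DECODERS: dict[str, Callable[[str], str]] = {
    "image": fake_image_decoder,
    "audio": fake_audio_decoder,
    "video": fake_video_decoder,
}


def render(segments: list[Segment]) -> list[str]:
    """Pass text through unchanged and route other modalities to a decoder."""
    outputs = []
    for segment in segments:
        if segment.modality == "text":
            outputs.append(segment.content)
            continue
        decoder = DECODERS.get(segment.modality)
        if decoder is None:
            raise ValueError(f"no decoder registered for {segment.modality!r}")
        outputs.append(decoder(segment.content))
    return outputs


reply = [Segment("text", "Here you go:"), Segment("image", "a beach at sunset")]
print(render(reply))
```

Keeping the decoders in a lookup table mirrors the modular design described above: a generator can be swapped or added without touching the language model in the middle.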

NExT-GPT also uses a technique called modality-switching instruction tuning, which improves its ability to reason across different modalities. This technique teaches the model to switch between modalities during conversations, based on the context and the user’s intent.
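As a rough illustration of the kind of data such tuning relies on, the sketch below models a multi-turn dialogue whose turns use different modalities and counts how often the modality changes. The Turn schema, the file names and the count_modality_switches helper are assumptions made for this example; they are not the format of the dataset the NExT-GPT authors used.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str   # "user" or "assistant"
    modality: str  # "text", "image", "audio" or "video"
    content: str   # text, or a placeholder path for non-text content


def count_modality_switches(dialogue: list[Turn]) -> int:
    """Count how often consecutive turns use different modalities."""
    return sum(
        1
        for previous, current in zip(dialogue, dialogue[1:])
        if previous.modality != current.modality
    )


# Hypothetical training dialogue: the assistant answers in the modality
# that the user's request implies.
sample = [
    Turn("user", "text", "Show me a calm beach at sunset."),
    Turn("assistant", "image", "beach_sunset.png"),
    Turn("user", "text", "Now add the sound of waves."),
    Turn("assistant", "audio", "waves.wav"),
]

print(count_modality_switches(sample))
```

A dialogue with no switches would teach the model little about when to change modality, so a filter like this could be one simple way to select useful examples.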

What are the applications and benefits of NExT-GPT?

NExT-GPT is an open source model that users can modify and adapt to suit their specific needs and preferences. This could lead to rapid innovation and improvement beyond the original model, and could democratize access to multimodal AI.

Some of the potential applications and benefits of NExT-GPT are:

  • Enhanced communication and collaboration: NExT-GPT can facilitate more natural and expressive communication and collaboration between humans and machines, as well as among humans. For example, it can enable users to create and share multimedia content, such as memes, podcasts, or videos, with simple text commands. It can also help users to understand content in one modality by translating it into another, such as converting speech to text or text to images.
  • Creative and educational tools: NExT-GPT can provide users with creative and educational tools that can stimulate their imagination and learning. For example, it can generate stories, poems, songs, or artworks based on user prompts. It can also provide feedback, suggestions, or explanations for user inputs, such as correcting grammar, enhancing vocabulary, or summarizing information.
  • Entertainment and gaming: NExT-GPT can offer users entertainment and gaming experiences that are more immersive and interactive. For example, it can create and control characters, scenes, or events in virtual or augmented reality environments. It can also generate personalized and adaptive content, such as jokes, quizzes, or trivia, based on user preferences and feedback.

What are the challenges and limitations of NExT-GPT?

NExT-GPT is a remarkable achievement in multimodal AI, but it is not without challenges and limitations. Some of the issues that need to be addressed are:

  • Data quality and quantity: NExT-GPT relies on large amounts of data to train and fine-tune its modules. However, data quality and quantity can vary across different modalities and domains, which can affect the model’s performance and accuracy. For example, some modalities, such as video, may require more data and computational resources than others, such as text. Moreover, some domains, such as medicine or law, may have stricter and more specific requirements for data quality and privacy than others, such as entertainment or education.
  • Ethical and social implications: NExT-GPT can generate realistic and convincing content across different modalities, which can have ethical and social implications. For example, it can create fake or misleading content, such as deepfakes, that can harm individuals or groups. It can also influence or manipulate user behavior, such as inducing addiction, bias, or deception. Therefore, it is important to ensure that NExT-GPT is used responsibly and transparently, with proper safeguards and regulations in place.

