Microsoft has introduced a new system named Visual ChatGPT, which combines visual foundation models (VFMs) such as Visual Transformers, ControlNet, and Stable Diffusion with ChatGPT. In doing so, the system enables interaction with ChatGPT beyond language.
How does it work?
ChatGPT draws interdisciplinary interest because it provides a language interface with remarkable conversational competence and reasoning ability across many fields. However, because it is trained on language alone, ChatGPT cannot currently process or produce images. Visual foundation models, such as Visual Transformers or Stable Diffusion, show excellent visual comprehension and generation capabilities, but they are only adept at specialised tasks with one-round fixed inputs and outputs.
To this end, Microsoft researchers have developed a system called Visual ChatGPT, which incorporates multiple visual foundation models and enables users to interact with ChatGPT beyond plain text. It is capable of:
1) sending and receiving not only text but also images
2) handling complex visual questions or visual editing instructions that require the collaboration of multiple AI models over multiple steps
3) providing feedback and requesting corrected results
Considering models with many inputs/outputs and models requiring visual feedback, the researchers have created a series of prompts to inject the visual model information into ChatGPT. Tests demonstrate that Visual ChatGPT makes it possible to investigate the visual roles of ChatGPT using visual foundation models.
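As a rough illustration of what such prompt injection could look like, the sketch below turns a handful of hypothetical tool descriptions into plain text for the chat model to read. The tool names, wording, and file-name convention are assumptions made for illustration, not Microsoft's actual prompts.

```python
# Illustrative sketch only: describing visual tools to a language model as text.
# The tool names and descriptions below are hypothetical, not Microsoft's actual prompts.

VISUAL_TOOLS = {
    "image_captioning": "Describe the content of an image. Input: image path. Output: text.",
    "depth_estimation": "Estimate a depth map for an image. Input: image path. Output: depth image path.",
    "depth_to_image": "Generate a new image from a depth map and a text prompt. Output: image path.",
    "style_transfer": "Re-render an image in a given style, e.g. 'cartoon'. Output: image path.",
}

def build_system_prompt(tools: dict) -> str:
    """Turn tool metadata into plain text that a chat model can read."""
    lines = ["You can call the following visual tools by name:"]
    for name, description in tools.items():
        lines.append(f"- {name}: {description}")
    lines.append("Always refer to images by their file names, e.g. image/abc123.png.")
    return "\n".join(lines)

print(build_system_prompt(VISUAL_TOOLS))
```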
What changed?
There's been tremendous progress in the development of large language models (LLMs) like T5, BLOOM, and GPT-3 in recent years. Based on InstructGPT, ChatGPT is trained to retain conversational context, reply appropriately to follow-up questions, and produce accurate responses. However, although ChatGPT is impressive, it has only been trained with a single language modality, restricting its capacity to process visual data.
VFMs have demonstrated tremendous potential in computer vision thanks to their ability to interpret and generate complex images. In human-machine interactions, however, VFMs are less versatile than conversational language models because they are constrained to fixed task specifications and input-output formats.
Training a multimodal conversational model is a logical way to construct a system comparable to ChatGPT, with the capacity to perceive and generate visual information. Yet, creating such a system would require large amounts of data and computing power.
Possible solution?
A new Microsoft study suggests that Visual ChatGPT, which works with vision models through text and prompt chaining, could be used to solve this problem. Instead of training a whole new multimodal ChatGPT from scratch, the researchers built Visual ChatGPT on top of ChatGPT and added several VFMs. They've made a Prompt Manager that connects ChatGPT and these VFMs. It has the following features:
- Sets the input and output formats and lets ChatGPT know what each VFM can do.
- Manages the histories, priorities, and conflicts between the different visual foundation models.
- Converts different kinds of visual information, such as PNG images, depth images, and mask matrices, into language format so that ChatGPT can understand it.
By integrating the Prompt Manager, ChatGPT can call these VFMs repeatedly and learn from their responses until it either meets the user's needs or reaches an end state.
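To make the idea concrete, here is a minimal sketch of how such a dispatch loop might work, assuming a chat model callable that returns either a tool request or a final answer, plus a dictionary of tool functions. The names and the message format are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical dispatch loop in the spirit of the Prompt Manager described above.
# `chat_model`, the reply format, and the tool functions are illustrative stand-ins.

def run_visual_chat(chat_model, tools, user_message, max_steps=10):
    """Let the chat model call visual tools until it produces a final answer."""
    history = [user_message]
    for _ in range(max_steps):
        reply = chat_model(history)               # the model only ever sees text
        if reply.get("tool") is None:             # no tool requested -> final answer
            return reply["text"]
        tool = tools[reply["tool"]]
        result_path = tool(reply["arguments"])    # e.g. a VFM that writes a new image file
        # Feed the result back as text (a file name) so the model can keep reasoning.
        history.append(f"Observation: {reply['tool']} produced {result_path}")
    return "Stopped after reaching the maximum number of steps."
```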
What does it do?
For example, let's say a user uploads a picture of a black elephant with a hard-to-understand instruction like "please make a white African elephant in the picture and then build it step by step like a cartoon."
With the help of the Prompt Manager, Visual ChatGPT starts executing the linked visual foundation models. In particular, it uses a depth estimation model to extract the depth information, a depth-to-image model to turn that depth information into a picture of a white elephant, and a style transfer VFM based on a Stable Diffusion model to make the image look like a cartoon.
In this processing chain, the Prompt Manager acts as a dispatcher for ChatGPT, providing visual representations and keeping track of how the information is transformed. Once Visual ChatGPT receives the "cartoon" hint from the Prompt Manager, for example, it ends the pipeline and shows the final result.
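Written out by hand, that chain might look like the sketch below. In Visual ChatGPT the ordering is decided by the model at run time rather than hard-coded, and the three tool functions here are hypothetical callables that each take and return an image file path.

```python
# A hand-written version of the example chain above. The three tool functions are
# passed in as hypothetical callables; each accepts and returns an image file path.

def elephant_to_cartoon(image_path, estimate_depth, depth_to_image, style_transfer):
    """Depth estimation -> depth-conditioned generation -> cartoon style transfer."""
    depth_map = estimate_depth(image_path)                                   # depth estimation VFM
    white_elephant = depth_to_image(depth_map, "a white African elephant")   # depth-to-image VFM
    return style_transfer(white_elephant, "cartoon")                         # Stable-Diffusion-based style VFM
```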
Challenges
In their work, the researchers note that failures of the VFMs and instability of the prompts are areas of concern, since they can lead to less-than-satisfactory generation results. A self-correction module is therefore required to ensure that execution outputs are consistent with human intentions and to make the necessary corrections. However, such constant course-correction is likely to increase the model's inference time, an issue the team intends to investigate in future work.
Basically, a single image carries a lot of information - most obviously form, colour, and shape - and the system needs to understand both the user's requirement and how to render the image correctly. While visual foundation models have come a long way, it is still early days for asking generative AI to create and customise images from a simple natural-language command. Having said that, Visual ChatGPT could be an exciting test case for it.
Click here to check out the GitHub repository.