OpenAI launched ChatGPT to the public just over two months ago, immediately shoving the AI-powered chatbot into the centre of mainstream discourse, with debates about how it could alter business, education, and more.
Then tech giants Google and China-based Baidu launched their own chatbots to show the public that their so-called "generative AI" (technology that can produce conversational text, graphics, and more) is also ready for prime time.
Now, on the ScienceQA benchmark, Amazon's new language models outperform GPT-3.5 (75.17% accuracy) by 16 percentage points, and even surpass human performance.
The ScienceQA benchmark is a large set of multimodal science questions with annotated answers. It has over 21,000 multimodal multiple-choice questions (MCQs).
Recent advances have enabled large language models (LLMs) to do well on tasks requiring complex reasoning. This is done through chain-of-thought (CoT) prompting, in which the model generates intermediate reasoning steps that show how it arrives at an answer.
But most current work on CoT considers only the language modality. To elicit CoT reasoning across modalities, researchers turn to the Multimodal-CoT paradigm, which draws on multiple inputs such as vision and language.
How does it work?
Multimodal-CoT breaks up problems with more than one step into intermediate reasoning processes that lead to the final answer, even if the inputs come from different modalities like language and vision.
One of the most common ways to do Multimodal-CoT is to combine the information from multiple modalities into a single modality before asking LLMs to do CoT.
But this method has a few problems, one of which is that much information is lost when converting data from one modality to another. An alternative is to fine-tune small language models to do CoT reasoning across modalities by fusing language and vision features.
However, the main issue with this approach is that these language models tend to produce hallucinated reasoning chains that mislead the answer inference.
To reduce the effects of these mistakes, Amazon researchers came up with Multimodal-CoT, which incorporates visual features in a decoupled training framework. The framework breaks the reasoning process into two stages: rationale generation and answer inference. By including vision in both stages, the model generates more convincing rationales, which in turn help it infer more accurate answers. According to the researchers, it is the first work of its kind to study CoT reasoning across different modalities. On the ScienceQA benchmark, the technique demonstrates state-of-the-art performance, outperforming GPT-3.5 accuracy by 16 percentage points and surpassing human performance.
How does it outperform?
The rationale generation and answer inference stages of Multimodal-CoT use the same model architecture but differ in their inputs and outputs. In the rationale generation stage, the model is fed data from both the visual and language domains. Then, once the rationale has been generated, it is appended to the original language input to form the language input for the answer inference stage.
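The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the `Example` fields, the prompt layout, and the two toy "models" are hypothetical stand-ins for real vision-language models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" takes text plus vision features, returns text.
Model = Callable[[str, List[float]], str]


@dataclass
class Example:
    question: str
    context: str
    options: List[str]
    vision_features: List[float]  # stand-in for features from an image encoder


def build_language_input(ex: Example) -> str:
    """Format question, context, and lettered options as one text input."""
    letters = "ABCDEFGH"
    opts = " ".join(f"({letters[i]}) {opt}" for i, opt in enumerate(ex.options))
    return f"Question: {ex.question}\nContext: {ex.context}\nOptions: {opts}"


def two_stage_answer(
    ex: Example, rationale_model: Model, answer_model: Model
) -> Tuple[str, str]:
    text = build_language_input(ex)
    # Stage 1: rationale generation sees both language and vision.
    rationale = rationale_model(text, ex.vision_features)
    # Stage 2: the rationale is appended to the original language input,
    # and the same vision features are supplied again.
    stage_two_text = f"{text}\nRationale: {rationale}"
    answer = answer_model(stage_two_text, ex.vision_features)
    return rationale, answer


# Toy models (assumptions for the demo; real ones would be trained networks).
def toy_rationale_model(text: str, vision: List[float]) -> str:
    return "Opposite poles attract, so the magnets pull together."


def toy_answer_model(text: str, vision: List[float]) -> str:
    # Only answers if the stage-1 rationale actually reached its input.
    return "The answer is (A)." if "Rationale:" in text else "unknown"


example = Example(
    question="Will these magnets attract or repel each other?",
    context="Two magnets are placed with opposite poles facing.",
    options=["attract", "repel"],
    vision_features=[0.2, 0.7, 0.1],
)
rationale, answer = two_stage_answer(example, toy_rationale_model, toy_answer_model)
print(rationale)
print(answer)
```

Both stages call the same kind of function and differ only in what text they receive and what they are expected to return, which mirrors the "same architecture, different inputs and outputs" description.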
Simply put, the language text is fed into a Transformer encoder to produce a textual representation. This textual representation is then fused with the visual representation, and the result is fed into the Transformer decoder.
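The article does not say how the two representations are combined. One plausible option, shown here purely as an assumption, is a gated sum in which a learned gate decides, per dimension, how much of the visual signal to mix into the text vector. The function names and the per-dimension weights below are hypothetical simplifications; a real model would use weight matrices and first align image features to text tokens.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fuse(
    text_vec: Vector, vision_vec: Vector, w_text: Vector, w_vision: Vector
) -> Vector:
    """Blend text and vision vectors with a per-dimension gate in (0, 1).

    gate  = sigmoid(w_text * text + w_vision * vision)
    fused = (1 - gate) * text + gate * vision
    The weights would be learned in practice; here they are plain inputs.
    """
    n = len(text_vec)
    if not (len(vision_vec) == len(w_text) == len(w_vision) == n):
        raise ValueError("all vectors must have the same length")
    fused = []
    for t, v, wt, wv in zip(text_vec, vision_vec, w_text, w_vision):
        gate = sigmoid(wt * t + wv * v)
        fused.append((1.0 - gate) * t + gate * v)
    return fused


# With zero weights the gate is exactly 0.5, so the result is a plain average.
text_repr = [1.0, 3.0]
vision_repr = [3.0, 1.0]
fused_repr = gated_fuse(text_repr, vision_repr, [0.0, 0.0], [0.0, 0.0])
print(fused_repr)
```

The fused vector keeps the same shape as the text representation, so it can be handed to a decoder that was built to consume encoder outputs.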
To evaluate their method, the researchers ran extensive experiments on ScienceQA. They concluded that it beats the previous state of the art, GPT-3.5, by 16 percentage points on the benchmark.
In a nutshell, Amazon researchers investigated and addressed the problem of eliciting Multimodal-CoT reasoning by proposing a two-stage framework that fuses vision and language representations to perform Multimodal-CoT. As a result, the model generates informative rationales that help it infer the final answers.
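The reported gap is measured in percentage points of accuracy, which is not the same as a relative percentage improvement. The short calculation below shows the difference, assuming the 75.17% figure quoted earlier is GPT-3.5's accuracy and treating the rounded 16-point gap as exact; the helper names are illustrative.

```python
def percentage_point_gain(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return new_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, new_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline."""
    return (new_pct - baseline_pct) / baseline_pct * 100.0


baseline = 75.17  # assumed to be GPT-3.5's ScienceQA accuracy, in percent
# Illustrative only: "16" is a rounded figure, so the true score may differ slightly.
improved = baseline + 16.0

print(f"gain in percentage points: {percentage_point_gain(baseline, improved):.2f}")
print(f"relative gain: {relative_gain_pct(baseline, improved):.1f}%")
```

Under these assumptions, a 16-point gain on a 75.17% baseline corresponds to a relative improvement of roughly 21%, which is why the two terms should not be used interchangeably.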
The Amazon researchers demonstrate in their study that using visual features helps develop more effective rationales, which contribute to more accurate answer inference.
Using Multimodal-CoT, they demonstrate that 1B-parameter models outperform GPT-3.5 on the ScienceQA benchmark by 16 percentage points. Their error analysis suggests that future studies could leverage more effective visual features, inject commonsense knowledge, and apply filtering mechanisms to improve CoT reasoning.
Industry giants are already racing to set the standard for chatbot advancement, and Amazon has now entered the fray. Other companies will need to step up; this competition should push the field toward better solutions and products. Let's see what happens.