Beyond Q-Star: PPO, the hidden gem in OpenAI's arsenal

Proximal Policy Optimization, or PPO, helps train computer models to make decisions in complicated or simulated situations

November 24, 2023 / 13:42 IST

The recent boardroom upheaval at ChatGPT creator OpenAI, which sent ripples through the tech industry and beyond, seems to have stemmed from a letter sent by some researchers before the shock ouster of co-founder and CEO Sam Altman.

The letter raised concerns about a potential AI breakthrough that could pose a risk to humanity. The decision to remove Altman, who returned as CEO within five days of his sacking, was reportedly influenced by this letter.

OpenAI is said to be working on a project called Q* (pronounced Q-Star), which has the capability to solve unfamiliar math problems. It is believed that Q* could mark a significant advancement in OpenAI's pursuit of artificial general intelligence (AGI), which is defined as autonomous systems outperforming humans in economically significant tasks.

Q* was reportedly able to solve some math problems, given its enormous computational capacity.


As the world speculates on the impact of this new model, which is possibly based on Q-learning, Peter Welinder, OpenAI's vice president of product, hinted at another project the company has been working on since 2017.

In a recent post on X, Welinder said, "While everyone is diving into Q-learning, just wait until they learn about PPO."


What is PPO?

Proximal Policy Optimization, or PPO, is a reinforcement learning method that helps train computer models to make decisions in complicated or simulated situations.

"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance," according to a research paper by the AI research lab.

PPO is like a personal trainer for computers, especially when they're learning by trial and error. Picture it as a helpful tool to teach machines how to produce text similar to human language.

The term "Proximal" suggests sticking closely to the original style, and "Policy Optimization" is about finding better strategies. By maintaining proximity to the original style, the improvements made by the programme are steady and reliable.

How does it work?

Imagine you're helping a virtual student to write essays. PPO is like the coach that helps this virtual student get better at essay writing step by step.

Instead of making huge changes in one go, PPO encourages small and gradual improvements. This ensures that the virtual student's writing style doesn't drastically change from one essay to the next. It's like refining their skills little by little, without completely changing their writing style.
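
For readers who want to see the "small steps" idea in code, here is a minimal sketch of the clipped objective commonly associated with PPO. It is an illustration under stated assumptions, not OpenAI's actual implementation: the function name clipped_surrogate and its parameters are hypothetical, and the clip value of 0.2 is simply a commonly used default. The ratio compares how likely the updated model is to take an action against the old model, and the clip stops the update from being rewarded for moving far beyond that range.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample clipped objective (higher is better).

    new_logp / old_logp: log-probability of the same action under the
    updated and the previous policy (hypothetical inputs for illustration).
    advantage: how much better the action was than expected.
    clip_eps: assumed clip range; 0.2 is a common default, not a fixed rule.
    """
    # How much more (or less) likely the new policy makes this action.
    ratio = math.exp(new_logp - old_logp)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the two estimates, so large jumps
    # away from the old policy earn no extra credit.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # The new policy is 1.5x as likely to take a good action,
    # but the objective only credits it up to 1.2x.
    print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

In the essay analogy, the ratio measures how far the student's new draft has drifted from the previous one, and the clip is the coach declining to give extra marks for changing too much at once.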

OpenAI uses PPO in different situations, like teaching computer programmes in simulated environments or getting better at challenging games.

PPO is great at these tasks because it's flexible and can handle situations where a programme needs to learn a series of actions to reach a goal. This makes it useful in areas like robotics and algorithmic trading.


Arun Padmanabhan
first published: Nov 24, 2023 12:48 pm
