How the robot learns

Reinforcement Learning (RL) is a branch of machine learning that addresses the problem of sequential decision making. In its most general form, it studies an agent interacting with the outside world (the environment) by taking an action at each step. The chosen action has two consequences: first, it moves the agent into a new state; second, the agent receives a reward signal from the environment, telling it how good or bad the action was. The aim of the agent is to figure out how to behave, that is, what the best action is in every state, such that in the long run it collects the greatest possible cumulative reward.
The keyword here is 'long-term': the action that yields immediate reward is not always good for long-term success. That is part of the difficulty RL algorithms attempt to address: initially the agent has no idea how good or bad an action is, nor what next states it will produce. The agent needs to, in a balanced manner, explore the action space to experience the consequences of every action, and at the same time figure out the optimal action policy, the one that leads to the greatest possible long-term reward. Learning this optimal policy from a training set of such agent experiences is the ultimate aim of Reinforcement Learning.
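To make these ideas concrete, here is a minimal sketch of tabular Q-learning with epsilon-greedy exploration on a toy environment. Everything in it (the environment, state and action encoding, hyperparameters) is illustrative, chosen to show the explore/exploit balance and the long-term trade-off described above; it is not the method used in any product mentioned in this post.

```python
import random
from collections import defaultdict

# Toy environment: states 0..4 on a line; action 0 = left, 1 = right.
# Reaching state 4 ends the episode with reward +1. Moving left pays a
# tiny immediate reward -- the short-term temptation from the text --
# but the long-term payoff of reaching the goal is far larger.
N_STATES, GOAL = 5, 4

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    if next_state == GOAL:
        return next_state, 1.0, True      # long-term payoff
    reward = 0.01 if action == 0 else 0.0  # small immediate temptation
    return next_state, reward, False

# Tabular Q-learning with epsilon-greedy exploration: balance exploring
# the action space against exploiting what we already know.
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.95, 0.2

for episode in range(500):
    state = 0
    for t in range(100):                  # cap episode length
        if random.random() < epsilon:     # explore
            action = random.choice([0, 1])
        else:                             # exploit
            action = max((0, 1), key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in (0, 1))
        # Update toward immediate reward plus discounted future value.
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
        if done:
            break

policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)  # typically learns to choose "right" (1) in every state
```

Despite the per-step bonus for moving left, the learned policy heads right, because Q-learning propagates the discounted goal reward back through the states; that is exactly the long-term-over-immediate behavior discussed above.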
Think about a self-driving car, for example: while driving, at each moment it must decide what action to take (should I brake? Should I turn right? Should I keep going straight? Or simply accelerate?) and then assess the outcome (am I safe enough? Did I manage to stay away from the curb while parking?). RL algorithms can teach the car how to maximize its long-term reward by taking optimal actions (that is, how to drive). The core notions of RL were developed over the past 30 years, but given its complexity, it could long be applied only to problems with rather small state and action spaces. Incorporating deep learning into RL opened the door to solving real-world problems, where state and action spaces can be very large.
DeepMind was the first group to show the power of deep RL, when in 2016 the game-playing agent they trained beat the world champion at the game of Go. There is also a vast collection of use cases for RL across industries such as finance, healthcare, and digital advertising. Real-time bidding (RTB) is a mechanism for connecting advertisers with online publishers. The objective of publishers is to monetize the content they create; the objective of advertisers is to spend their budgets optimally, so that some pre-specified goals are attained. How advertising budgets are allocated is determined at a highly granular level, impression by impression, in real-time auctions that take place countless times every day.
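As a rough picture of those per-impression auctions, here is a sketch of the classic second-price mechanics that RTB exchanges have historically used (many have since moved to first-price auctions); the class and function names are illustrative, not any exchange's actual API.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    advertiser: str
    amount: float  # CPM bid in dollars

def run_auction(bids, floor=0.0):
    """Second-price auction: the highest bidder wins the impression
    but pays the second-highest bid (or the floor price)."""
    eligible = sorted((b for b in bids if b.amount >= floor),
                      key=lambda b: b.amount, reverse=True)
    if not eligible:
        return None, 0.0
    winner = eligible[0]
    price = eligible[1].amount if len(eligible) > 1 else floor
    return winner, price

winner, price = run_auction(
    [Bid("A", 2.50), Bid("B", 1.75), Bid("C", 3.10)], floor=1.0)
print(winner.advertiser, price)  # C wins the impression, pays 2.50
```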
An advertising campaign’s bidding strategy decides, in real time, how much to bid for an opportunity to show an advertiser’s message to a specific user. This bid has to be decided based on all sorts of features of the ad opportunity, e.g.: what is the web page or app? What is the geographic location? What is the time of day or day of week? And as if that weren’t a complicated enough problem, the bidding strategy must also meet pre-defined Key Performance Indicator (KPI) goals (think of total budgets, or performance targets expressed as CPA, CPC, etc.) set on behalf of the advertiser.
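A hypothetical sketch of how such a strategy might combine those features with a KPI-driven control parameter is below; the feature multipliers and the `pacing` parameter are invented for illustration, and a real system would replace the lookup tables with a learned model of impression value.

```python
def compute_bid(features, base_bid, pacing):
    """Hypothetical bid calculation: scale a base bid by an estimated
    value for this impression, then by a pacing parameter that a
    controller (or an RL agent) adjusts to hit KPI goals."""
    # Illustrative multipliers for the features named in the text:
    # page/app, geography, and time of day.
    site_mult = {"news.example.com": 1.2, "games.example.com": 0.8}
    geo_mult = {"US": 1.0, "DE": 0.9}
    hour_mult = 1.1 if 18 <= features["hour"] < 23 else 1.0  # prime time

    value = (site_mult.get(features["site"], 1.0)
             * geo_mult.get(features["geo"], 1.0)
             * hour_mult)
    return base_bid * value * pacing

bid = compute_bid({"site": "news.example.com", "geo": "US", "hour": 20},
                  base_bid=2.0, pacing=0.85)
print(round(bid, 3))  # 2.0 * 1.2 * 1.0 * 1.1 * 0.85 = 2.244
```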
We at the Copilot group at Xaxis use machine learning/AI at the core of our bidding strategies: it helps us learn from historical data how best to set bid values. Our view is that the goal of an optimal bidding strategy should be to get as close as possible to the advertiser’s pre-defined goals. To do this, we need to dynamically adjust the parameters of the bidding strategy so that it keeps moving the KPIs in the right direction. It turns out that RL is an ideal tool for the challenge of dynamically managing an advertising campaign.
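To illustrate what "dynamically adjusting the parameters" can mean, here is a simple feedback rule that nudges the hypothetical `pacing` parameter from the previous sketch so that spend tracks the budget. This fixed rule stands in for the adjustment policy that an RL agent would instead learn from reward signals; it is an assumption-laden sketch, not Copilot's actual method.

```python
def update_pacing(pacing, spent, budget, elapsed_frac,
                  step=0.05, lo=0.1, hi=2.0):
    """Nudge the pacing parameter so delivery tracks the budget:
    if we've spent less than the elapsed fraction of the budget,
    bid more aggressively; if more, throttle back. (Illustrative
    hand-tuned rule; an RL agent would learn this policy.)"""
    target_spend = budget * elapsed_frac
    if spent < target_spend:
        pacing += step      # under-delivering: raise bids
    elif spent > target_spend:
        pacing -= step      # over-delivering: lower bids
    return max(lo, min(hi, pacing))

# Example: halfway through the day, only 30% of the budget is spent,
# so the controller bids slightly more aggressively.
print(update_pacing(pacing=0.85, spent=300, budget=1000, elapsed_frac=0.5))
# -> 0.9
```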