A lot of people might think the answer to this question is obvious: it's just a collection of methods for optimizing an agent's behavior in some environment based on reward signals.
- But what's its history?
- How does it relate to optimization problems or traditional optimal control problems?
- How is it different from supervised or unsupervised learning?
- Why is it named "reinforcement learning" instead of "optimal behavior learning"?
- Does the word "reinforcement" suggest something special?
Can you answer these questions?
Approximate dynamic programming (ADP) has emerged as a powerful tool for tackling a diverse collection of stochastic optimization problems. Reflecting the wide diversity of problems, ADP (including research under names such as reinforcement learning, adaptive dynamic programming and neuro-dynamic programming) has become an umbrella for a wide range of algorithmic strategies. Most of these involve learning functions of some form using Monte Carlo sampling. A recurring theme in these algorithms involves the need to not just learn policies, but to learn them quickly and effectively. Learning arises in both offline settings (training an algorithm within the computer) and online settings (where we have to learn as we go). Learning also arises in different ways within algorithms, including learning the parameters of a policy, learning a value function and learning how to expand the branches of a tree.
Approximate Dynamic Programming, First edition. By Frank Lewis and Derong Liu
This book seems to treat "reinforcement learning" as an alias for approximate dynamic programming, which is used for solving stochastic optimization problems. So:
Reinforcement Learning = Approximate Dynamic Programming?
Also:
The term “ADP” can be interpreted either as “Adaptive Dynamic Programming” (with apologies to Warren Powell) or as “Approximate Dynamic Programming” (as in much of my own earlier work). The long-term goal is to build systems which include both capabilities; therefore, I will simply use the acronym “ADP” itself. Various strands of the field have sometimes been called “reinforcement learning” or “adaptive critics” or “neurodynamic programming,” but the term “reinforcement learning” has had many different meanings to many different people.
Learning And Approximate Dynamic Programming. By Jennie Si, Andy Barto, Warren Powell, and Donald Wunsch
By the way, this book or this paper draws a comparison between supervised learning and reinforcement learning. You can also find that comparison in Sutton and Barto's Reinforcement Learning book.
Why is it called reinforcement learning?
The term reinforcement comes from studies of animal learning in experimental psychology, where it refers to the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation. The simplest reinforcement learning algorithms make use of the commonsense idea that if an action is followed by a satisfactory state of affairs, or an improvement in the state of affairs, then the tendency to produce that action is strengthened, i.e., reinforced. This is the principle articulated by Thorndike in his famous “Law of Effect” (Thorndike, 1911). Instead of the term reinforcement learning, however, psychologists use the terms instrumental conditioning, or operant conditioning, to refer to experimental situations in which what an animal actually does is a critical factor in determining the occurrence of subsequent events. These situations are said to include response contingencies, in contrast to Pavlovian, or classical, conditioning situations in which the animal’s responses do not influence subsequent events, at least not those controlled by the experimenter. There are very many accounts of instrumental and classical conditioning in the literature, and the details of animal behavior in these experiments are surprisingly complex. See, for example, Hergenhahn & Olson, 2001. The basic principles of learning via reinforcement have had an influence on engineering for many decades (e.g., Mendel & McClaren, 1970) and on Artificial Intelligence since its very earliest days (Minsky, 1954, 1961; Samuel 1959; Turing, 1950). It was in these early studies of artificial learning systems that the term reinforcement learning seems to have originated. Sutton and Barto (1998) provide an account of the history of reinforcement learning in Artificial Intelligence.
But the connection between reinforcement learning as developed in engineering and Artificial Intelligence and the actual details of animal learning behavior is far from straightforward. In prefacing an account of research attempting to capture more of the details of animal behavior in a computational model, Dayan (2002) stated that “Reinforcement learning bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” This is certainly true, as those interested in constructing artificial learning systems are motivated more by computational possibilities than by a desire to emulate the details of animal learning. This is evident in the view of reinforcement learning as a combination of search and long-term memory discussed above, which is an abstract computational view that does not attempt to do justice to all the subtleties of real animal learning.
For our mobile phone example, the principle of learning by reinforcement is involved in several different ways depending on what grain size of behavior we consider. We could think of a move in a particular direction as a unit of behavior, being reinforced when reception improved, in which case we would tend to continue to move in the same direction. Another view, one that includes long-term memory, is that the tendency to make a call from a particular place is reinforced when a call from that place is successful, thus leading us to increase the probability that we will make a call from that place in the future. Here we see the reinforcement process manifested as the storing in long-term memory of the results of a successful search. Note that the principle of learning via reinforcement does not imply that only gradual or incremental changes in behavior are produced. It is possible for complete learning to occur on a single trial, although gradual changes in behavior make more sense when the contingencies are stochastic.
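To make the "Law of Effect" idea from the quote a bit more concrete, here is a minimal Python sketch of my own, not taken from either book. It assumes a toy setting with a fixed set of actions, each succeeding with some fixed probability. The function name `law_of_effect` and the parameters `success_prob`, `trials`, `step`, and `seed` are all hypothetical names chosen for illustration. Each action has a "tendency"; whenever an action is followed by success, its tendency is strengthened, so it becomes more likely to be chosen again.

```python
import random


def law_of_effect(success_prob, trials=2000, step=1.0, seed=0):
    """Toy Law-of-Effect learner (illustrative assumption, not a standard API).

    success_prob[i] is the assumed chance that action i is followed by a
    satisfactory outcome. Returns the final probability of choosing each action.
    """
    rng = random.Random(seed)
    # One positive tendency per action; start with no preference.
    tendency = [1.0] * len(success_prob)
    for _ in range(trials):
        # Choose an action with probability proportional to its tendency.
        action = rng.choices(range(len(tendency)), weights=tendency)[0]
        # Stochastic contingency: the action only succeeds some of the time.
        if rng.random() < success_prob[action]:
            # Reinforce: strengthen the tendency to repeat this action.
            tendency[action] += step
    total = sum(tendency)
    return [t / total for t in tendency]


if __name__ == "__main__":
    # Two actions; the second one is followed by success more often.
    print(law_of_effect([0.2, 0.8]))
```

With a small `step`, behavior changes gradually, which matches the quote's remark that gradual change makes sense when contingencies are stochastic. A very large `step` makes a single success almost fully determine future choices, which is roughly the "complete learning on a single trial" case.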