Sunday, June 30, 2019

Reward Prediction Error At The Gas Pump

A couple of years ago I was driving up to Minnesota resort country when I noticed that something had changed at gas stations. For decades, gas pump choices were arranged linearly with the lowest octane fuel on the left and the highest on the right. The only difference was the occasional pump with diesel or racing fuel options, and they were typically on the far right. To illustrate, I took this photo of a gas pump display on the way home from work yesterday. Three years ago the display would have been (left to right) 87-88-91. In this case, in addition to the octane shuffle there is a price dissociation. Motorists have been trained for years to expect 87 octane to be the least expensive and now it is not - 88 octane is the least expensive. These changes at the pump reminded me of a paper I had read a few years ago by Wolfram Schultz, one of the world's experts in reward prediction error.

Reward prediction error is basically the difference between a prediction about the nature of the reward and what really happens. See all of the definitions in the diagram below that are taken directly from Schultz. Reward is clearly defined, as are prediction errors to both the positive and negative sides.

In the paper of interest Dr. Schultz introduces the concept by describing being confronted with a vending machine where he cannot read the language (Japanese). There are 6 choices and his expectation is low that he will get his preferred choice, so he pushes the second button. He is surprised that it delivers exactly what he wants - blackcurrant juice. He points out that this experience will keep him pressing this second button until the machine is loaded differently and he does not get his expected juice. He uses this as an example of positive reward prediction error (RPE), or the difference between his low expectation and the ideal outcome that resulted. RPE systems are set up to optimize positive reward prediction error. When the eventual bad outcome produces a negative prediction error, decision making will change in order to return to a positive RPE scenario and then, once the new prediction has been learned, to a scenario with no reward prediction error.
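The vending machine story can be sketched as a simple error-driven update. This is a minimal illustration, not a model taken from Schultz: the function names, the learning rate alpha, the starting prediction of 0.1, and the reward values of 1.0 and 0.0 are all assumptions chosen for the example.

```python
# Hypothetical illustration: a prediction is nudged toward each outcome
# by a fraction (alpha) of the reward prediction error.

def update(prediction, reward, alpha=0.5):
    """Return (rpe, new_prediction) after one outcome."""
    rpe = reward - prediction          # positive means better than expected
    return rpe, prediction + alpha * rpe

def run(outcomes, prediction=0.0, alpha=0.5):
    """Apply update() to a sequence of rewards and record each step."""
    history = []
    for reward in outcomes:
        rpe, prediction = update(prediction, reward, alpha)
        history.append((rpe, prediction))
    return history

# Five presses deliver the preferred juice (reward 1.0); then the machine
# is reloaded and the same button delivers something else (reward 0.0).
history = run([1.0] * 5 + [0.0] * 2, prediction=0.1)
for press, (rpe, prediction) in enumerate(history, start=1):
    print(f"press {press}: RPE {rpe:+.3f}, prediction now {prediction:.3f}")
```

In this sketch the first press gives the largest positive RPE, the error shrinks toward zero as the second button becomes a fully predicted source of juice, and the first press after the machine is reloaded gives a negative RPE.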

Getting back to the gas pumps. They appear to have been designed to defeat RPE at two levels. The first is the change in the position of the buttons after everyone was conditioned to push the one at the far left. The second is the dissociation of gas price from button position. There is no longer a linear correlation between octane and price. I am not sure if both of these trends have been occurring over time or just recently. I do know from studying various gas pumps that the button positions do not necessarily reflect linear price or octane changes, but in some cases they still do. The good news is that both can be overcome by carefully studying the octane button position and posted per gallon price rather than depending on the old learned patterns. The other interesting aspect of the gas pump problem is that there don't have to be a lot of predictions. The prediction error occurs only if the purchaser depends on old patterns without paying close attention to details.

The neurobiology of RPE is more fascinating than the descriptive aspects. We know that the neurobiology of reward in the human brain is heavily dependent on dopaminergic systems in the ventral striatum. Dopaminergic neurons code reward in the form of prediction error even in complex tasks. What is even more interesting is that the coding is not in terms of quantifiable measures but subjective ones. Utility functions incorporate subjective measures and can be used to determine the potential values of the reward. Dopaminergic neurons code the utility of the received reward minus the utility of the predicted reward. Looking at a computational model of the addictive process, Redish (7) discusses a value function V[s(t)] dependent on the state of the world s(t) and presents it as the calculation of expected future reward discounted by the expected time to that reward.
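A value function of this kind is commonly written in the form below. This is a sketch of the standard discounted form, not a quotation from the paper; the symbols γ (a discount factor between 0 and 1), τ (a future time), and E[R(τ)] (the expected reward at time τ) are assumptions introduced here.

```latex
V[s(t)] = \int_{t}^{\infty} \gamma^{\,\tau - t} \, E[R(\tau)] \, d\tau
```

Because γ is less than 1, the weight γ^(τ - t) shrinks as the delay τ - t grows, so the same reward contributes less to the value the further away it is in time.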

As noted, the value of delayed rewards is reduced, and the actual discounting applied is based on empirical work on discounting in human and animal models. The author goes on to develop a computational model of addiction that is based on reward prediction error and the fact that cocaine produces direct phasic increases in dopamine (DA). The model is termed a temporal-difference reinforcement learning (TDRL) model. According to RPE, increases in DA occur after unexpected natural rewards. Over time the DA release at the reward decreases, learning stops, and the DA response is instead paired with the cues for the reward. That does not occur with a pharmacological reaction at the level of the dopaminergic neurons. In that case, a drug like cocaine will release DA independent of the expected reward. That produces a final state where an unexpected natural reward, a cue for a learned reward, and cocaine will all produce DA.
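The contrast between a natural reward and a drug can be sketched in a few lines. This is a deliberately simplified illustration with a single cue and a fixed reward, not Redish's published model; the function name, the parameters, and the rule that the drug adds a dopamine term the learned prediction cannot cancel are assumptions made for the example.

```python
def learn_cue_value(reward, drug_dopamine=0.0, trials=200, alpha=0.1):
    """Learn the value of a cue that is always followed by `reward`.

    drug_dopamine is a hypothetical pharmacological dopamine surge that the
    learned prediction cannot cancel (zero for a natural reward).
    Returns (final_value, last_delta).
    """
    value = 0.0
    delta = 0.0
    for _ in range(trials):
        delta = reward - value              # ordinary prediction error
        if drug_dopamine > 0.0:
            # The drug term keeps the signal at or above drug_dopamine.
            delta = max(delta + drug_dopamine, drug_dopamine)
        value += alpha * delta              # error-driven learning step
    return value, delta

natural_value, natural_delta = learn_cue_value(reward=1.0)
drug_value, drug_delta = learn_cue_value(reward=1.0, drug_dopamine=0.5)

print(f"natural: value {natural_value:.3f}, last error {natural_delta:.3f}")
print(f"drug:    value {drug_value:.3f}, last error {drug_delta:.3f}")
```

With the natural reward the error falls toward zero and the learned value settles at the size of the reward. With the drug term the error never drops below the drug's dopamine contribution, so in this sketch the learned value keeps climbing for as long as the trials continue.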

The Redish paper also looks in detail at a couple of associated issues. The first is rational addiction theory, defined as the user maximizing value or utility over time. Long term rewards for quitting are discounted more than the short term penalties and therefore the user remains addicted. In the author's model "the maximized function entails remaining addicted." (p 1946). TDRL theory suggests that addiction is always irrational because the pharmacological effects of cocaine (in this case) always outweigh the associated DA surges from the universe of value functions available in the real world. Addictive drugs will produce an increase in DA, so that the user will not be able to encounter and learn a value function that is associated with an equal or greater DA surge than is produced by the drug. Therefore the user remains addicted. This has been taught in addiction seminars for years as the Hijacked Brain Hypothesis - meaning that the dopamine signal produced by addictive drugs overwhelms the dopamine signal produced by natural stimuli like eating, drinking water, sexual behavior, and social affiliation. Both RPE and TDRL theory offer more explanatory power than the Hijacked Brain Hypothesis.

As I wrapped up this post, I pulled down the latest editions of the two major addiction texts to see what they had to say about reward prediction error, computational models of addiction, and some of the authors referenced in this post, especially Wolfram Schultz. There were no references at all, and the sections on the actual function of DA neurons in addiction were surprisingly thin. On the other hand, a lot of concepts used in the field, like salience, are the direct product of these systems. In order to produce a more coherent picture of the neurobiology of addiction it is important to outline these DA systems and how they work normally and in addictive states.

I am hoping that addiction texts for clinicians will contain some of this information in the near future, and ideally the chapters will be written by the scientists who have been studying these processes, in some cases for decades.

George Dawson, MD, DFAPA

Supplementary 1:

Getting back to the gas pump example, considering the three octane ratings and the three prices that may or may not correspond to the octane ratings means that there are 6 possible combinations at any pump that need to be considered. Any real world actor at the pump needs to consider this carefully when the gas pump has undergone a transition from the expected correlation between increasing octane ratings and price to one where this relationship does not exist. The advantage to the actor in this case is that all of this information is explicit, and that behavior is more likely to be affected by negative prediction error when automatic selection behavior results in the wrong octane or fuel cost being selected. That is unlike Dr. Schultz's example in the Japanese airport, where he randomly chose a beverage and was unexpectedly rewarded.
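The count of 6 comes from the number of ways three prices can be assigned to three octane ratings. A short sketch makes the enumeration explicit; the dollar amounts are hypothetical and only their ordering matters.

```python
from itertools import permutations

octanes = (87, 88, 91)
prices = (2.59, 2.69, 2.99)   # hypothetical per-gallon prices, ascending

# Every way of assigning the three prices to the three octane ratings.
assignments = [dict(zip(octanes, p)) for p in permutations(prices)]

# The old learned pattern: price rises with octane.
expected = dict(zip(octanes, prices))

def cheapest(assignment):
    """Return the octane rating that carries the lowest price."""
    return min(assignment, key=assignment.get)

# Assignments in which 87 octane is no longer the cheapest choice.
surprises = [a for a in assignments if cheapest(a) != 87]

print(len(assignments), "possible assignments")
print(assignments.count(expected), "matches the old pattern")
print(len(surprises), "make 87 octane something other than the cheapest")
```

Only one of the 6 assignments matches the old pattern, and in 4 of them a driver who automatically picks 87 octane to save money would be making a prediction error.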

Supplementary 2:

I can't say enough about the writings of Wolfram Schultz.  They are only peripherally mentioned in the addiction literature and yet his theories and experiments are some of the more important that I have read with regard to the neurobiological theories of addiction.

Papers of Wolfram Schultz - Journal of Neurophysiology page

Home Page of Wolfram Schultz - contains some of the best PowerPoint slides that I have ever seen.


1: Schultz W. Dopamine reward prediction error coding. Dialogues Clin Neurosci. 2016 Mar;18(1):23-32. Review. PubMed PMID: 27069377.

2: Schultz W. Reward prediction error. Curr Biol. 2017 May 22;27(10):R369-R371. doi: 10.1016/j.cub.2017.02.064. PubMed PMID: 28535383.

3: Stauffer WR. The biological and behavioral computations that influence dopamine responses. Curr Opin Neurobiol. 2018 Apr;49:123-131. doi: 10.1016/j.conb.2018.02.005. Epub 2018 Mar 2. Review. PubMed PMID: 29505948.

4: Takahashi YK, Batchelor HM, Liu B, Khanna A, Morales M, Schoenbaum G. Dopamine Neurons Respond to Errors in the Prediction of Sensory Features of Expected Rewards. Neuron. 2017 Sep 13;95(6):1395-1405.e3. doi: 10.1016/j.neuron.2017.08.025. PubMed PMID: 28910622.

5: Keiflin R, Pribut HJ, Shah NB, Janak PH. Ventral Tegmental Dopamine Neurons Participate in Reward Identity Predictions. Curr Biol. 2019 Jan 7;29(1):93-103.e3. doi: 10.1016/j.cub.2018.11.050. Epub 2018 Dec 20. PubMed PMID: 30581025.

6: Tobler PN, Fiorillo CD, Schultz W. Adaptive coding of reward value by dopamine neurons. Science. 2005 Mar 11;307(5715):1642-5. PubMed PMID: 15761155.

7: Redish AD. Addiction as a computational process gone awry. Science. 2004 Dec 10;306(5703):1944-7. PubMed PMID: 15591205.

8: Sweis BM, Thomas MJ, Redish AD. Beyond simple tests of value: measuring addiction as a heterogeneous disease of computation-specific valuation processes. Learn Mem. 2018 Aug 16;25(9):501-512. doi: 10.1101/lm.047795.118. Print 2018 Sep. PubMed PMID: 30115772.  
