Mired in Myopia?

Reinforcement Learning (RL) deals with multi-step decision processes. One strategy for making decisions in a multi-step environment is to always choose the option that maximizes your immediate payoff. In RL, this strategy is called "myopic" because it never looks beyond the payoff for the current action.

The problem is that this strategy might produce a smaller total payoff at the end of the process. If we look at the process as a whole, we may identify a sequence of actions that produces a higher overall reward even though it does not maximize the reward for each individual action.
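A toy example makes the gap concrete. The sketch below is not drawn from the experiment: the states, actions, and rewards in TRANSITIONS are made-up numbers chosen only to show a myopic plan losing to a plan that considers the whole process.

```python
from itertools import product

# Hypothetical two-step process, for illustration only.
# (state, action) -> (immediate reward, next state)
TRANSITIONS = {
    ("start", "a"): (5, "low"),
    ("start", "b"): (3, "high"),
    ("low", "a"): (1, "end"),
    ("low", "b"): (2, "end"),
    ("high", "a"): (10, "end"),
    ("high", "b"): (4, "end"),
}
ACTIONS = ("a", "b")


def rollout(actions, state="start"):
    """Total reward from following a fixed sequence of actions."""
    total = 0
    for action in actions:
        reward, state = TRANSITIONS[(state, action)]
        total += reward
    return total


def myopic_plan(state="start", horizon=2):
    """At each step, pick the action with the largest immediate reward."""
    plan = []
    for _ in range(horizon):
        action = max(ACTIONS, key=lambda a: TRANSITIONS[(state, a)][0])
        plan.append(action)
        state = TRANSITIONS[(state, action)][1]
    return tuple(plan)


def best_plan(horizon=2):
    """Search every action sequence and keep the one with the largest total."""
    return max(product(ACTIONS, repeat=horizon), key=rollout)


if __name__ == "__main__":
    for name, plan in (("myopic", myopic_plan()), ("best", best_plan())):
        print(name, plan, rollout(plan))
```

In this toy setup the myopic plan takes the larger first-step reward and ends up in the weaker second state, while the exhaustive search accepts a smaller first reward to reach the better one. Exhaustive search is only workable here because the example is tiny.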

This all relates to an experiment that I'm running on contact strategies. The experiment controls all calls other than appointments and refusal conversion attempts. The overall contact rate was 11.6% for the experimental protocol, and 9.0% for the control group. The difference is statistically significant.
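The post does not say which test was used or how large the two groups were. One common way to compare two contact rates is a pooled two-proportion z-test, sketched below under that assumption. The function name and its parameters are hypothetical, and no counts from the experiment are filled in.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumes independent observations and samples large enough for the
    normal approximation. This is one common choice, not necessarily
    the test used in the experiment.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

To apply it, pass the number of contacts and the number of call attempts for each arm. Calls made to the same case are probably not independent, so a test that accounts for clustering may be more appropriate in practice.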

But establishing contact is only an intermediate outcome. The final outcome of this multi-step process is completing an interview. It appears that the control and experimental groups produce interviews at about the same rate (147 control vs. 153 experimental). Timestamps on the calls indicate that the experimental contact strategy reduces the time spent achieving the first contact by about 10%, but overall savings are only about 2% (i.e., not really different).

This looks like a case where a myopic policy (i.e. improving only the contact stage) does not lead to overall gains. The current experimental protocol will need to be expanded to all stages somehow. That's a big step...


