Deep-RL: DQN — Regression or Classification?

Setup: We assume a Deep-RL example of a DQN with 4 discrete available actions, and we want to select one of them depending on our current state.


When you first start working with DQNs (Deep Q-Networks), you are introduced to some challenging techniques for applying deep neural networks to Reinforcement Learning problems.


First, let us talk about Q-Learning itself, so we can understand how Deep Q-Learning works.

Monte-Carlo Methods

Much of the content covered in the Basic Understanding in Reinforcement Learning article follows the Monte-Carlo Methods approach, which requires a complete episode before we can update our value estimates.

Temporal Difference

Temporal Difference comes to the rescue: we no longer need a complete episode to evaluate the action-value function for our states. Instead, we can update the action-value estimate at each consecutive step we take during the episode.


It works by looking at the current and the next action, comparing the value of the current state/action pair to that of the next one. If the next state/action pair is valued higher, then our existing state/action estimate should increase a bit; otherwise it should decrease a bit.
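The increase/decrease logic above can be sketched in a few lines. This is a minimal illustration, not the full algorithm; the names alpha (learning rate) and gamma (discount factor) are the conventional ones, assumed here:

```python
# Minimal sketch of the temporal-difference idea: nudge the current
# estimate toward the bootstrapped target (reward + discounted next value).
def td_update(q_current, reward, q_next, alpha=0.1, gamma=0.99):
    target = reward + gamma * q_next                 # what the value "should" be
    return q_current + alpha * (target - q_current)  # move a bit toward it
```

If the next pair is valued higher than the current one, the returned estimate goes up; if it is valued lower, the estimate goes down, exactly as described above.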

SarsaMax (Q-Learning)

SarsaMax is a variant of the Sarsa approach in which we don't need the next action to actually be taken before updating our existing q-table state/action pair. Instead, we consider as the next action the one dictated by our greedy policy, so we compare our existing estimate against the best-scored next action in our q-table.

Updating Q-Table using Sarsa

We already mentioned it earlier, but we will make it clear now…
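For contrast with SarsaMax, here is the plain Sarsa update sketched in the same style: the target uses the next action that was actually taken, not the maximum over the q-table. The table layout and parameter names are the same assumptions as before:

```python
# Sarsa: the bootstrapped target uses the next action actually taken
# (next_action), so the update depends on the behavior policy itself.
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] += alpha * (target - q_table[state][action])
```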

Deep Q Networks

Now imagine that we want to transfer this logic to Deep Learning. In fact, what we just saw above is exactly what a DQN does: it approaches temporal difference with SarsaMax via a deep neural network!

In the Neural Network, we have "Input = State" and "Output = one Q-value estimate per action".
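This "state in, one Q-value per action out" mapping can be sketched with a tiny two-layer network in plain NumPy. The sizes (4 state features, 16 hidden units, 4 actions) and the random weights are assumptions for illustration only:

```python
import numpy as np

# Toy Q-network: maps a state vector to one Q-value per action.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)   # input -> hidden
W2, b2 = rng.normal(size=(16, 4)), np.zeros(4)    # hidden -> 4 action values

def q_values(state):
    """Forward pass: state in, estimated total reward per action out."""
    hidden = np.maximum(0.0, state @ W1 + b1)     # ReLU hidden layer
    return hidden @ W2 + b2                       # 4 outputs, one per action

state = np.array([0.1, -0.2, 0.3, 0.0])
estimates = q_values(state)                       # shape (4,)
```

In a real DQN these weights would be trained with the SarsaMax target from above as the regression label; here we only show the shape of the mapping.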

Classification or Regression?

So we clearly train using a regression model, yet our problem is in fact classification!

The important note here is that each of these outputs (one for each possible action) measures the same thing: the estimated total reward!

So step back and think about it…

They all measure reward scores! And since we measure the same thing across multiple outputs, the output with the highest estimated reward is the topmost candidate.

So even though the network produces continuous output values for our expected total rewards (as typical regression models do), because each value represents a different quantity of the same currency (our total estimated reward), picking the highest one effectively turns the output into a classification decision.
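That final step, regression outputs used as a classifier, is just an argmax over the action values. The four Q-value estimates below are made up for illustration:

```python
# Regression outputs doubling as a classification decision:
# the action whose estimated total reward is highest wins.
q_estimates = [1.2, 3.4, 0.7, 2.1]   # one estimate per action (made-up values)
best_action = max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```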



Ioannis Anifantakis

MSc Computer Science. — Software engineer and programming instructor. Actively involved in Android Development and Deep Learning.