DeepMind, a subsidiary of Google, published an extended version of their 2013 paper in Nature, along with the source code. Nature Paper. Code. The big picture: a deep network is trained with a loss function that pushes it toward actions that maximize the game score. The input is the pixels of the screen (the state), and the output is a predicted score for each possible action.
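To make the input/output idea concrete, here is a minimal sketch of how an action could be picked once the network has produced one score per action. The function name `greedy_action` and the example scores are hypothetical, made up for illustration. The sketch always takes the best-scoring action, while the actual agent also takes random exploratory actions some of the time, which is left out here.

```python
def greedy_action(scores):
    """Return (action index, score) for the highest-scoring action."""
    best = max(range(len(scores)), key=lambda a: scores[a])
    return best, scores[best]


# Hypothetical network outputs for a game with 4 valid actions.
example_scores = [0.1, 1.7, -0.3, 0.9]
action, score = greedy_action(example_scores)
print(action, score)  # 1 1.7
```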
The exact architecture is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity [31, 32]. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
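The layer sizes above can be checked with a little arithmetic. The sketch below traces the output shape of each layer and counts the trainable weights. It assumes the convolutions use no padding and that every layer has a bias term, neither of which is stated in the excerpt. The helper names `conv_out` and `dqn_shapes` are hypothetical and are not taken from the released code.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution, assuming no padding."""
    return (size - kernel) // stride + 1


def dqn_shapes(n_actions, size=84, channels=4):
    """Trace layer output shapes and count weights for the described network."""
    conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    shapes, params = [], 0
    for filters, kernel, stride in conv_layers:
        size = conv_out(size, kernel, stride)
        # Each filter has channels * kernel * kernel weights plus one bias (assumed).
        params += filters * (channels * kernel * kernel + 1)
        channels = filters
        shapes.append((size, size, channels))
    flat = size * size * channels            # inputs to the fully-connected layer
    params += 512 * (flat + 1)               # fully-connected hidden layer
    params += n_actions * (512 + 1)          # linear output layer
    shapes += [(512,), (n_actions,)]
    return shapes, flat, params


shapes, flat, params = dqn_shapes(n_actions=4)
print(shapes)  # [(20, 20, 32), (9, 9, 64), (7, 7, 64), (512,), (4,)]
print(flat)    # 3136
print(params)  # 1686180
```

Under these assumptions, the 84 × 84 input shrinks to 20 × 20, then 9 × 9, then 7 × 7, so the 512-unit layer sees 7 × 7 × 64 = 3136 inputs and holds the large majority of the weights.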