RLogue is a 'game' made by Michele Pirovano for the Reinforcement Learning PhD course at Politecnico di Milano, Italy.

################################################# The Simulation #######################

The simulation sees agents navigating on a grid of cells. Each cell has a reward that is given to the agent when it is stepped on.
Agents can walk from one cell to the next using the following actions:
- Go up
- Go down
- Go left
- Go right
- Wait
After an action, the agent earns the reward corresponding to the cell it ends up in.
The standard reward in each cell is -1, to symbolize the cost of movement. Some cells may be traps and will instead have a reward of -20. The goal cell will earn a reward of 100.
When an agent reaches the GOAL cell (the white circle), it stops there and waits for the other agents. When all agents reach the GOAL cell, the episode ends and the simulation is restarted. The agents retain what they have learnt.

The simulation can be started in two modes: SOLO or GAME.
- SOLO mode: The AI agent is alone and its behaviour can be observed. The grid can be seen in its entirety.
- GAME mode: The human and the AI agent race to reach the goal while earning the best reward. The grid is not entirely visible until the human player explores it, simulating the limited view of the AI, so that the game is not too easy for the player.
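As a rough illustration of the step-and-reward logic described above, here is a minimal Python sketch. The class and function names (GridWorld, step), the coordinate convention and the default grid layout are assumptions made for this example; they are not the game's actual code.

    # Minimal grid-world sketch (illustrative only, not RLogue's actual implementation).
    # Rewards follow the description above: -1 per step, -20 on traps, +100 on the goal.

    ACTIONS = {
        "UP":    (0, 1),
        "DOWN":  (0, -1),
        "LEFT":  (-1, 0),
        "RIGHT": (1, 0),
        "WAIT":  (0, 0),
    }

    class GridWorld:
        def __init__(self, size=4, traps=(), goal=(3, 3), blocked=()):
            self.size = size
            self.traps = set(traps)
            self.goal = goal
            self.blocked = set(blocked)

        def reward(self, cell):
            if cell == self.goal:
                return 100
            if cell in self.traps:
                return -20
            return -1  # standard cost of movement

        def step(self, cell, action):
            dx, dy = ACTIONS[action]
            nxt = (cell[0] + dx, cell[1] + dy)
            # Out-of-bounds or blocked moves leave the agent where it is.
            if not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size) or nxt in self.blocked:
                nxt = cell
            return nxt, self.reward(nxt)

    # Example: one step on a 4x4 grid with a trap at (1, 0).
    world = GridWorld(size=4, traps=[(1, 0)], goal=(3, 3))
    print(world.step((0, 0), "RIGHT"))   # -> ((1, 0), -20)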
################################################# Player Controls #######################

During the GAME mode, the player can move using the keyboard:
- W -> Go Up
- D -> Go Right
- S -> Go Down
- A -> Go Left
- Spacebar -> Wait
Note that the simulation will not advance until the player performs an action.

################################################# Menu Controls #######################

In the game screen, there are a few controls, depending on the chosen mode.

In the GAME mode, we have the following controls:
- RESET
  Retry the game with the same settings.
- EXIT
  Go back to the main menu.

In the SOLO mode, we have additional controls:
- PLAY
  Start the simulation.
- PAUSE
  Pause the simulation.

In the lower row, there are feedback controls:
- REWARD
  Show the rewards on the grid, color coded (green = high reward, red = low reward).
  Note: Only for the WAIT action, so that the reward corresponding to a given cell can be seen.
- VALUE
  Show the value function, as computed by the current AI agent, on the grid. Color coded (green = high value, red = low value).
- Q-VALUE
  Show the action-value function, as computed by the current AI agent, on the grid. Color coded (green = high Q, red = low Q).
  Note: the Q-value also depends on the action taken, so only the Q-value for the action GO_DOWN is shown.
  Note: try enabling the Q-value feedback in the Risky GridWorld example and see how the second row from the bottom has low Q-values for GO_DOWN, but good values for the value function!
- PATH
  Show the path taken by the agent. Color coded (light blue = recently visited, dark blue = visited longer ago).
- BEST ACTION
  Show arrows pointing towards the best direction to take in each cell, as estimated by the algorithm. The action WAIT is shown as a pink cube.
  Note: To remove the arrows, click RESET.

If Model-Based Interval Estimation is chosen, the following are also present:
- EST. REWARD
  Show the reward of each cell as estimated by the MBIE algorithm.
  Note: Only for the WAIT action.
- OCCUPANCY
  Show the occupancy count of each cell, i.e. the number of times a given state-action pair has been encountered.
  Note: Only for the WAIT action.

Note: The white line that appears in the middle after the first episode ends is a graph of the cumulative reward obtained in each run. It can be seen to converge to a fixed value when the algorithm converges.

################################################# The Agents #######################

The AI is controlled by Reinforcement Learning and, specifically, by one of the following methods (in parentheses, the names of the corresponding agents):

- Value Iteration (VIter)
  Value iteration is performed off-line, at the start of the game. Given the whole model, the agent computes the value function by iterating until convergence. It then follows the best path by taking the greedy action at each state.
  Note: you cannot beat the VIter AI in a static condition, as it computes the best possible route and always follows it, thanks to its complete knowledge of the model.
  Note: Why was this implemented? Because MBIE needed it anyway, so to test whether it worked or not I added this too.

- Q-learning (QGreedy, QSofty)
  The agent uses on-line Q-learning. Q-learning is a TD-learning algorithm and, as such, approximates the value of the rest of the path from experience. Because of this, it does not require complete episodes and can thus be used on-line. Q-learning is an off-policy algorithm: it learns about the greedy policy while following a different, exploratory one.
  QGreedy uses an epsilon-greedy action-selection policy.
  QSofty uses a softmax action-selection policy.

- Sarsa (SGreedy, SSofty)
  The agent uses on-line SARSA (TD-based) learning. SARSA is an on-policy algorithm, as it updates its estimates based on the actions it actually takes.
  SGreedy uses an epsilon-greedy action-selection policy.
  SSofty uses a softmax action-selection policy.
  (A sketch of the Q-learning and SARSA update rules is given right after this section.)

- Model-Based Interval Estimation (Mbie)
  The agent uses MBIE. It is an on-line, model-based method that estimates a model of the underlying MDP from experience and builds an internal MDP with confidence intervals. At each step it performs a modified version of Value Iteration on the internal MDP, then uses a greedy policy to select the action.
  This version uses an Exploration Bonus, which computes the estimated Q by giving each state-action pair a bonus inversely proportional to its occupancy count. This method is A LOT faster (hence why I chose it, since the other one blocked), but it still gets slow for a high number of cells (8x8, 16x16), because it updates Q for every state-action pair at each step, and thus it is not suitable for real-time learning without further optimizations. However, it can be seen that the method converges to the optimal path faster than Q-learning.
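To make the off-policy/on-policy distinction above concrete, here is a minimal Python sketch of the standard one-step Q-learning and SARSA updates on a tabular Q function. The function names and the dictionary representation are assumptions made for this example, not the agents' actual code; alpha and gamma are the learning rate and discount rate described in the Sliders section below.

    # Tabular one-step TD updates (illustrative sketch, not RLogue's actual implementation).
    # Q is a dict mapping (state, action) -> estimated value; missing entries count as 0.

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        # Off-policy: bootstrap from the greedy action in s_next,
        # regardless of the action the agent will actually take there.
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        td_error = r + gamma * best_next - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        # On-policy: bootstrap from the action a_next actually selected in s_next
        # by the agent's own policy (epsilon-greedy or softmax).
        td_error = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

Because SARSA's target includes the exploratory actions it actually takes, it tends to learn more cautious Q-values near the trap row of the Risky GridWorld than Q-learning does, which is the difference that example is meant to show.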
################################################# Sliders #######################

In the game screen, there may be one or more sliders. They can be used to control a few variables.

- Turn Step [0,1]
  Controls the speed of the simulation (when not in manual mode, i.e. when not controlling the steps through the player inputs).
- Discount Rate (gamma) [0,1]
  The discount factor determines the importance of future rewards. If close to zero, the agent favours immediate rewards; if close to one, the agent gives the same importance to immediate and delayed rewards.
  Note: This slider has no impact on the VIter agent.
- Learning Rate (alpha) [0,1]
  The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information.
  Note: This slider has no impact on the VIter agent.
- SoftMax T exp [1,6]
  The temperature of the SoftMax algorithm is 10^(texp): it is 10 when this value is 1 and grows to 10^6 when this value is 6. We do not want to go below 1, as the SoftMax computation exp(Q/T) easily overflows to infinity, breaking the algorithm; at 1 and above it works correctly.
  It is suggested to keep it high (6) at the start for exploration and to lower it later (1) for exploitation. (A sketch of the softmax and epsilon-greedy selection rules is given at the end of this document.)
- Epsilon Greedy [0,1]
  The epsilon value of the epsilon-greedy policy. Epsilon close to zero favours exploitation; epsilon close to one favours exploration.

################################################# Options #######################

The main menu has options that can be toggled on or off.

- Music
  Set the music on or off.
- Risky GridWorld
  If on, the grid will be created following the "risky gridworld" example: there won't be any blocked cells, the goal will yield a reward of 0, the player will start at (0,0) and the goal will be in the bottom-right corner. In addition, the bottom row will be full of -100 traps.
  This example is useful to see the difference between SARSA and Q-learning.
- Varying rewards
  If on, the rewards will vary with time following a sinusoidal function between -50 and 50. This means that there are no traps, but the system is dynamic and thus harder to predict. The goal will always yield a reward of 100.
  Note: Off-line methods (such as Value Iteration) will fail with varying rewards, since they compute their policy considering only the static rewards. On-line methods, instead, will learn to take advantage of the waves of positive rewards before advancing to the end.
- Random Grid
  If on, the grid will be randomized at each restart. Some cells will be blocked, some won't. The traps will also be randomized.
  Note: There is no guarantee that an exit can be reached.
  If off, the standard sample grid will always be used.
- Random Start
  If on, at each episode the agents will restart at a random position. If off, the agents will always start at position (0,0) (if it is not blocked).
- Grid Size
  The size of the grid: 4x4, 8x8 or 16x16. Note that if 16x16 is selected, the simulation will take a while to start.

################################################# Notes #######################

- When the goal is reached, we make sure that the agent performs a WAIT step in order to acknowledge the goal's reward. This is needed for MBIE: without it, the goal's reward would not be registered correctly and MBIE would always assume the goal has maximum reward.
- When multiple actions are tied for the best estimated value in a cell, we choose one of them at random.
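As a rough illustration of the two action-selection policies controlled by the sliders above, here is a minimal Python sketch of epsilon-greedy selection (with the random tie-breaking mentioned in the Notes) and of softmax selection with temperature T = 10^(texp). The function names and the tabular Q representation are assumptions made for this example, not the game's actual code.

    import math
    import random

    # Illustrative action-selection sketch (not RLogue's actual implementation).
    # Q maps (state, action) -> estimated value; missing entries count as 0.

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # With probability epsilon, explore with a random action; otherwise pick
        # a best action, breaking ties between equally good actions at random.
        if random.random() < epsilon:
            return random.choice(actions)
        values = [Q.get((s, a), 0.0) for a in actions]
        best = max(values)
        return random.choice([a for a, v in zip(actions, values) if v == best])

    def softmax(Q, s, actions, t_exp=6):
        # Temperature T = 10^(texp), as set by the "SoftMax T exp" slider.
        # With T >= 10 the exponentials stay small; much lower temperatures would
        # overflow exp(Q/T) for the reward scale used in this game.
        T = 10.0 ** t_exp
        prefs = [math.exp(Q.get((s, a), 0.0) / T) for a in actions]
        total = sum(prefs)
        return random.choices(actions, weights=[p / total for p in prefs])[0]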