Antoine Broyelle

Learning to Land on Mars with Reinforcement Learning

February 06, 2022

Once in a while, a friend who is either learning to code or tackling complex problems drags me back to CodinGame. However, I tend to put one type of challenge aside: optimal control systems. This kind of problem often requires hand-crafting a good cost function and modeling the transition dynamics. But what if we could solve the challenge without coding a control policy? This is the story of how I landed a rover on Mars using reinforcement learning.

CodinGame Mars Lander

This game’s goal is to land the spacecraft while using as little propellant as possible. The mission is successful only if the rover reaches flat ground, at a low speed, without any tilt.

For each environment, we’re given Mars’s surface as pairs of coordinates, as well as the lander’s current state. That state includes position, speed, angle, current engine thrust, and remaining fuel volume. At each iteration, the program has less than 100ms to output the desired rotation and thrust power.

One test case looks like this:

Vid. 1 — CodinGame Mars Lander

Coding the Game

The CodinGame platform is not designed to gather feedback on millions of simulations and improve on them. The only way to circumvent this limitation is by reimplementing the game.

The Interface

Because I wanted to do some reinforcement learning, I decided to follow the Gym package’s Environment class interface. Gym is a collection of test environments used to benchmark reinforcement learning algorithms. By complying with this interface, I could use many algorithms out of the box (or so I thought).

All Gym Environments follow the same interface, with two fields and three methods:

  • action_space
  • observation_space
  • reset(): called to generate a new game
  • step(action): returns the new observations of the environment given a specific action
  • render(): visualizes the agent in its environment

At first, I thought I could go without implementing render, but I was wrong. As with any other machine learning task, visualization is of the utmost importance for debugging.
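
For reference, a bare-bones environment following this interface might look like the sketch below. The class name and the exact space definitions are illustrative, not necessarily what the repo contains; both spaces are detailed in the next two sections.

import gym
import numpy as np
from gym import spaces

class MarsLanderEnv(gym.Env):
    """Skeleton of a Gym-compatible Mars Lander environment (illustrative only)."""

    def __init__(self):
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(71,), dtype=np.float32)

    def reset(self):
        # Pick a test case, apply random augmentations, return the initial observation.
        ...

    def step(self, action):
        # Advance the physics by one game turn.
        # Returns (observation, reward, done, info).
        ...

    def render(self, mode="human"):
        # Draw the surface, the trajectory, and the rover state -- invaluable for debugging.
        ...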

Action Space

The algorithm controls the spacecraft’s thrust and orientation. The thrust has 5 levels between 0 and 4, and the angle is expressed in degrees between -90 and 90.

Rather than working in absolute values, I decided that the action space would be a relative change in thrust and angle. The engine supports only a +/-1 change in thrust, and the orientation cannot change by more than 15 degrees in absolute value.

Thrust and angle can be represented with categorical variables, with 3 and 31 mutually exclusive values respectively. Another possible characterization, which I decided to use, is to represent the action space as two continuous variables. For stability during training, I normalized these values between -1 and 1.
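
As a concrete sketch (the helper name and the exact rounding are mine, not necessarily what the repo does), the normalized actions can be mapped back to game commands like this:

import numpy as np
from gym import spaces

# Two continuous actions in [-1, 1]: relative thrust change and relative angle change.
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

def to_game_command(action, current_angle, current_thrust):
    """Denormalize a [-1, 1] action into the (rotation, thrust) pair expected by the game."""
    d_thrust = int(round(float(action[0])))          # at most +/-1 change in thrust
    d_angle = int(round(float(action[1]) * 15))      # at most +/-15 degrees change
    thrust = int(np.clip(current_thrust + d_thrust, 0, 4))
    angle = int(np.clip(current_angle + d_angle, -90, 90))
    return angle, thrust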

Observation Space and Policy Network

Defining the observation space is a bit more challenging than it is for the action space. First, the agent expects a fixed-length input, but the ground is provided as a broken line with up to 30 points. To fulfill this requirement, I iteratively broke the longest segment into two by adding an intermediate point, until I reached thirty 2D points, for a total of 60 elements.
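
A sketch of that resampling step (the function name is mine; the implementation in the repo may differ):

import numpy as np

def resample_surface(points, target=30):
    """Repeatedly split the longest segment in half until the surface has `target` points."""
    points = [np.asarray(p, dtype=np.float32) for p in points]
    while len(points) < target:
        lengths = [np.linalg.norm(b - a) for a, b in zip(points, points[1:])]
        i = int(np.argmax(lengths))
        points.insert(i + 1, (points[i] + points[i + 1]) / 2)
    return np.stack(points)  # shape (target, 2), i.e. 60 values once flattened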

I then concatenated to these 60 elements the seven values that fully characterize the rover’s dynamics: x and y position, horizontal and vertical speed, angle, current thrust, and remaining propellant. I normalized these values between 0 and 1, or between -1 and 1 when negative values are allowed.

In the early experiments, the fully connected policy was clearly struggling to identify the landing pad. The rover exhibited completely different behavior when the ground was translated horizontally because a slight offset creates a completely different representation of the ground once the surface is broken down into 30 points.

Changing the policy to a convolutional neural network (CNN) could have helped identify the landing area. By design, CNNs (unlike MLPs) are translation equivariant. In addition, CNNs would also have eliminated the fixed-length input problem I addressed above, since their number of parameters is independent of the input size.

After a few trials, it became clear that this approach would require a lot more effort to achieve success. When using a CNN to extract an abstract ground representation, these features need to be merged with the rover state at some point. When should they merge? What should the CNN’s capacity be compared to that of the MLP? How should both networks be initialized? Would they work with the same learning rate? I ran a few experiments, but none of them was conclusive.

In the end, to avoid going mad, I decided to use an MLP-based policy but to help the agent by providing a bit more information. The trick was to add the x and y coordinates of the two extremities of the flat section. This extra information can easily be computed by hand, so why not feed it to the model?
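
Extracting those extremities from the original surface definition is a short helper (the name is hypothetical):

def landing_pad(surface_points):
    """Return (x1, y, x2, y): the extremities of the flat section of the original surface."""
    for (x1, y1), (x2, y2) in zip(surface_points, surface_points[1:]):
        if y1 == y2:
            return x1, y1, x2, y2
    raise ValueError("no flat section found")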

Simulation

When implementing the reset() function, I wanted to train on as many different environments as possible. I initially thought that generating a random ground surface and a random rover state would do the job.

However, it turned out that not all these environments could be solved. For example, the rover might exit the frame before compensating for the initial speed, or the solution might consume an extremely high propellant volume. Finding a valid initial state might be as hard as solving the problem itself.

Clearly, these unsolvable cases were penalising the training. This comes as no surprise; the same rule applies to any other machine learning task or algorithm. In the end, I decided to start from the five test cases CodinGame provided and to apply some random augmentations.
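
The exact augmentations live in the repo; purely as an illustration (the mirror and jitter below are my guesses, and 7000 m is the width of the playing field), one augmentation pass could look like:

import random
import numpy as np

WIDTH = 7000  # the playing field is 7000 m wide

def augment(surface, state):
    """Illustrative augmentation: random horizontal mirror plus a small jitter of the start state."""
    surface = np.array(surface, dtype=np.float32)
    x, y, vx, vy, angle, thrust, fuel = state
    if random.random() < 0.5:                       # mirror the whole scene horizontally
        surface[:, 0] = WIDTH - surface[:, 0]
        surface = surface[::-1].copy()              # keep the points ordered left to right
        x, vx, angle = WIDTH - x, -vx, -angle
    x += random.uniform(-100, 100)                  # small jitter on the starting position
    fuel = int(fuel * random.uniform(0.9, 1.1))     # and on the propellant volume
    return surface, (x, y, vx, vy, angle, thrust, fuel)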

Reinforcement Learning

Policy Gradient

In reinforcement learning, an agent interacts with the environment via its actions at each time step. In return, the agent is granted a reward and is placed in a new state. The main assumption is that the future state depends only on the current state and the action taken. The objective is to maximise the rewards accumulated over the entire sequence.

Two classes of optimization algorithms are popular: Q-learning and policy gradient methods. The former aims to approximate the best transition function from one step to another. The latter directly optimizes for the best action. Despite being less popular, policy gradient methods have the advantage of supporting continuous action spaces and tend to converge faster.

For this task, I used a policy gradient method known as Proximal Policy Optimization (a.k.a. PPO). In its most common, clipped form, PPO introduces a simple twist to the vanilla policy gradient method: clipping the update magnitude. By reducing the variance and taking smaller optimization steps, the learning becomes more stable and requires fewer parameter tweaks.

If you want to learn more about PPO and its siblings, I recommend the article Policy Gradient Algorithms by Lilian Weng, Applied AI Research Lead at OpenAI.

Action Noise

In the first experiments, the rover was very shaky, as its tilt oscillated around the optimal value. To eliminate this jerky motion, I used gSDE, which stands for generalized State-Dependent Exploration.

In reinforcement learning, it’s important to balance exploitation with exploration. Without exploring the action space, the agent has no way to find potential improvements. The exploration is often achieved by adding independent Gaussian noise to the action distribution.

The gSDE authors propose a state-dependent noise. That way, during one episode, the action remains the same for a given state rather than oscillating around a mean value. This exploration technique leads to smoother trajectories.

Vid. 2 — Jerky trajectory without gSDE
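
Assuming Stable-Baselines3, which implements both PPO and gSDE, a rough training sketch looks like the following. MarsLanderEnv is the environment skeleton from earlier, and the hyperparameter values are placeholders, not the ones I ended up with.

from stable_baselines3 import PPO

env = MarsLanderEnv()

model = PPO(
    "MlpPolicy",          # fully connected policy
    env,
    use_sde=True,         # generalized State-Dependent Exploration for smoother actions
    learning_rate=3e-4,   # placeholder values, see the hyper-parameters section below
    gamma=0.999,
    verbose=1,
)
model.learn(total_timesteps=5_000_000)

obs = env.reset()
action, _ = model.predict(obs, deterministic=True)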

Reward Shaping

The reward is an incentive mechanism that tells the agent how well it’s performing. Crafting this function correctly is a big deal, given that the goal is to maximise the cumulative rewards.

Reward Sparsity

The first difficulty is reward sparsity. Let’s say we want to train a model to solve a Rubik’s cube. What would be a good reward function, knowing that there is only one good solution among 43,252,003,274,489,856,000 (~43 quintillion) possible states? It would take years to solve if we relied only on luck. If you’re interested in this problem, have a look at Solving the Rubik’s Cube Without Human Knowledge.

In our case, the rover has landed correctly if it touches down on a flat surface with a tilt angle of exactly 0° and vertical and horizontal speeds lower than 40 m/s and 20 m/s respectively. During training, I decided to loosen the angle restriction to anything between -15° and 15°, which increased the chances of reaching a valid terminal state. At inference, some post-processing code compensates for the rotation when the rover is about to land.

Reward Scale

If the rover runs out of fuel, or if it leaves the frame, it receives a negative reward of -150, and the episode ends. A valid terminal state yields a reward equal to the amount of remaining propellant.

By default, if the rover is still flying, it earns a reward of +1. In general, positive rewards encourage longer episodes, as the agent continues accumulating. On the other hand, negative rewards urge the agent to reach a terminal state as soon as possible to avoid penalties.

For this problem, shorter episodes should ideally consume less fuel. However, using the quantity of remaining fuel as a terminal reward creates a massive step function that encourages early mission completion. By maintaining a small positive reward at each step, the rover quickly learns to hover.

Vid. 3 — Learning to hover

For the rest, not all mistakes are created equal. I decided to give a reward of -75 if the speed and angle were correct when touching non-flat ground, and a reward of -50 if the spacecraft crashed on the landing area. Without more experiments, it’s unclear whether distinguishing between these collision types brings any advantage.
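
Putting the numbers above together, the reward logic can be sketched roughly as follows. The `outcome` flags are hypothetical names, and the value for the remaining crash cases is a placeholder the post does not specify.

def compute_reward(done, outcome, remaining_fuel):
    """Sketch of the reward shaping described above."""
    if not done:
        return 1.0                       # small positive reward while still flying
    if outcome == "landed":
        return remaining_fuel            # terminal reward: the propellant left in the tank
    if outcome in ("out_of_frame", "out_of_fuel"):
        return -150.0
    if outcome == "good_speed_and_angle_on_rough_ground":
        return -75.0
    if outcome == "crashed_on_landing_area":
        return -50.0
    return -100.0                        # any other crash (placeholder, not from the post)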

Not Taking Shortcuts

Previously, I talked about helping the model identify the landing site. One idea was to change the policy structure by replacing the MLP with a CNN. The solution I went with was to add this information to the input vector. A third possibility was to change the reward function to incorporate a notion of distance to the landing area.

I ran a few experiments in which the reward was the negative Euclidean distance between the landing site and the crash site. As it turned out, this strategy instructs the agent to move in a straight line toward the target.

If the starting position and the landing site have no direct path between them, the agent would have to undergo a long sequence of decreasing rewards to reach the desired destination. Despite being an intuitive solution, using the Euclidean distance as a negative reward is a strong inductive bias that reduces the agent’s exploration capabilities.

Vid. 4 — Suboptimal strategy when minimizing distance to landing site

Hyper-Parameters

In computer vision or natural language processing, starting the training from a pretrained model is good practice: it speeds up training and tends to improve performance. However, this project has no available pretrained model.

An even more annoying issue is the absence of known-good hyperparameters. At first, I used the default values, but the training suffered from catastrophic forgetting. As you can see in the graph below, the mean reward drops sharply and struggles to recover.

Fig. 1 — Mean episode reward: catastrophic forgetting.

Rather than starting an extensive hyperparameter search, I took inspiration from the RL Baselines3 Zoo configurations. After a few tweaks, I had a great set of values. Still, the policy could doubtless be further improved with hyperparameter optimization.

Exporting the Policy Network

Training a good policy is only half the work. The final objective is to submit a solution on CodinGame. Two hurdles made it non-trivial: PyTorch is not supported, and the submission must be shorter than 100k characters.

PyTorch Modules without PyTorch

Since v1.7.0, PyTorch has the fx subpackage, which contains three components: a symbolic tracer, an intermediate representation, and Python code generation. By employing symbolic tracing, you obtain a graph that can be transformed. Finally, the code generation step produces valid Python code that matches the graph’s semantics.
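
For instance, tracing a toy network shaped like the policy prints back the generated forward() source. This is standard torch.fx usage, not the exact export script from the repo.

from torch import nn
from torch.fx import symbolic_trace

# Toy stand-in for the extracted policy network.
net = nn.Sequential(nn.Linear(71, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU())

traced = symbolic_trace(net)   # symbolic tracer -> intermediate representation (a Graph)
print(traced.graph)            # inspect the captured graph
print(traced.code)             # Python source generated from the graph (the forward() body)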

Unfortunately, the code generator covers only the module’s forward() method. For __init__(), I wrote code generation that traverses all modules and prints all parameter weights. Finally, since PyTorch is unavailable in the CodinGame environment, I had to implement three PyTorch modules in pure numpy: Sequential, Linear, and ReLU.
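
A minimal numpy version of those three modules might look like this (the real implementation in the repo may differ slightly). Sequential stores its children under string indices so that the generated getattr(self.policy_net, "0") calls keep working.

import numpy as np

class Linear:
    """numpy stand-in for torch.nn.Linear (weight has shape (out_features, in_features))."""
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, x):
        return x @ self.weight.T + self.bias

class ReLU:
    def __call__(self, x):
        return np.maximum(x, 0.0)

class Sequential:
    """Stores submodules under "0", "1", ... like torch.nn.Sequential and applies them in order."""
    def __init__(self, *modules):
        self._modules = modules
        for i, module in enumerate(modules):
            setattr(self, str(i), module)

    def __call__(self, x):
        for module in self._modules:
            x = module(x)
        return x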

The exported module is self-contained and combines both parameter weights and a computation graph. The result looks something like this:

import numpy as np
from numpy import float32

class MarsLanderPolicy:
    def __init__(self):
        self.policy_net = Sequential(
            Linear(
                weight=np.array([[-6.22188151e-02, -9.03800875e-02,  ...], ...], dtype=float32),
                bias=np.array([-0.03082988,  0.05894076, ...], dtype=float32),
            ),
            ReLU(),
            Linear(
                weight=np.array([[ 6.34499416e-02, -1.32812252e-02,  ...], ...], dtype=float32),
                bias=np.array([-0.0534077 , -0.02541942, ...], dtype=float32),
            ),
            ReLU(),
        )
        self.action_net = Linear(
            weight=np.array([[-0.07862598, -0.01890517, ...], ...], dtype=float32),
            bias=np.array([0.08827707, 0.10649449], dtype=float32),
        )

    def forward(self, observation):
        policy_net_0 = getattr(self.policy_net, "0")(observation);  observation = None
        policy_net_1 = getattr(self.policy_net, "1")(policy_net_0);  policy_net_0 = None
        policy_net_2 = getattr(self.policy_net, "2")(policy_net_1);  policy_net_1 = None
        policy_net_3 = getattr(self.policy_net, "3")(policy_net_2);  policy_net_2 = None
        action_net = self.action_net(policy_net_3);  policy_net_3 = None
        return action_net

Encoding for Shortening

As good as it looks, the exported solution is way too long at 440k characters.

Fig. 2 — CodinGame error: submitted code is too long.

The model has 71 input values, two hidden layers with 128 activations each, and two output nodes, which amounts to 25,858 free parameters. We could train a shallower network, but let’s see if we can find a way to shrink the generated code by at least 78%.

Each parameter is a 32-bit float, which takes, on average, 16 characters in plain text. Even when truncating to float16, the exported module is only ~100k chars shorter.

Clearly, printing every decimal of every floating-point number is expensive. A different representation must be used! The shortest solution I found was to base64-encode the buffers of half-precision floats. The submission is now only 75k characters long.
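
In numpy terms, the encoding and decoding steps are short (the helper names are mine):

import base64
import numpy as np

def encode_weights(weights: np.ndarray) -> str:
    """Truncate to half precision and encode the raw buffer as base64 text."""
    return base64.b64encode(weights.astype(np.float16).tobytes()).decode("ascii")

def decode_weights(encoded: str, shape) -> np.ndarray:
    """Inverse operation, executed inside the submitted code."""
    raw = base64.b64decode(encoded)
    return np.frombuffer(raw, dtype=np.float16).astype(np.float32).reshape(shape)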

I had one more trick up my sleeve in case base64 had been insufficient. So far, I had only been using characters from the ASCII table. The hack is to pack two consecutive ASCII chars into one UTF-16 char. It’s way less readable and practical, but it yields another 50% reduction! Look at this monstrosity:

exec(bytes('浩潰瑲猠獹椊灭牯⁴慭桴䜊㴠㌠㜮ㄱ圊䑉䡔㴠㜠〰ਰ䕈䝉呈㴠㌠〰ਰ⸮ ... ⸮ਮ牰湩⡴≦牻畯摮愨杮敬紩笠潲湵⡤桴畲瑳紩⤢','u16')[2:])
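
The packing side of the trick is only a few lines; the sketch below assumes a little-endian platform so that the 'u16' codec in the one-liner above round-trips the bytes.

def pack_utf16(source: str) -> str:
    """Pack two consecutive ASCII characters into one UTF-16 character."""
    data = source.encode("ascii")
    if len(data) % 2:               # pad to an even number of bytes
        data += b"\n"
    packed = data.decode("utf-16-le")
    # The submitted file is then: exec(bytes('<packed>', 'u16')[2:])
    # 'u16' re-encodes to UTF-16 with a 2-byte BOM, which the [2:] slice strips.
    return packed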

With these three techniques combined (numerical approximation, buffer encoding, and text encoding), it would have been possible to accommodate a model with 2.75 times more learnable parameters.

Conclusion

At the time of writing, with a cumulative 2,257 liters of fuel left, this solution puts me in 225th place out of 4,986 contestants. Evaluating the policy takes only 8 to 10 ms.

This project was quite fun and a great opportunity for some hands-on RL practice. For me, the main takeaway is that RL uses the same nuts and bolts as other ML projects. Simplify the problem, always use visualization, provide good input, and don’t expect default hyperparameters to work on your task.

Have a look at the repo (MIT license) antoinebrl/rl-mars-lander if you want to play the game, use the environment, or run the trained agent.