Damian Sue 13900156
Morne Kruger 13926757
Yara Fakoua 13914510
Iraklis Roussos 13699956
Our problem scenario is defined by three parts: the challenge, the objective, and the approach. The challenge is that robots increasingly operate in dynamic environments and must avoid collisions that could cause injury or damage equipment. Our objective, therefore, is to develop a collision avoidance system that uses reinforcement learning to train a robot arm to navigate safely in these dynamic environments. The approach involves using an RGB and depth camera to provide real-time input of thrown tennis balls, enabling the robot arm to avoid these obstacles.
The flow diagram illustrates the solution to our problem scenario, where an RGB and depth camera provide real-time visual input of tennis balls being thrown at a robotic arm. Using the RGB camera, we employed YOLO for object detection, training it on images with varied colouring so it could recognise incoming obstacles and predict their bounding boxes. The depth camera served as input for the deep Q-learning model, which used the predicted bounding boxes to enable the robot arm to learn the optimal positions to avoid the incoming tennis balls.
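At a high level, these components run together in a single per-frame loop. The sketch below illustrates that loop; the helper functions and object interfaces (estimate_3d_position, build_state, the robot and agent objects) are hypothetical stand-ins rather than the project's actual code.

```python
# Illustrative per-frame loop; helper functions and object interfaces are hypothetical.
def control_step(rgb_frame, depth_frame, yolo_model, dqn_agent, robot):
    # 1. Detect tennis balls in the RGB frame with YOLO.
    results = yolo_model(rgb_frame)
    boxes = results[0].boxes.xyxy.cpu().numpy()           # predicted bounding boxes

    # 2. Combine the boxes with the depth image to estimate obstacle positions.
    obstacles = [estimate_3d_position(box, depth_frame) for box in boxes]

    # 3. Build the DQN observation and choose an end-effector movement command.
    state = build_state(robot.end_effector_position(), obstacles)
    action = dqn_agent.select_action(state)

    # 4. Execute the command through inverse kinematics.
    robot.move_end_effector(action)
```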
The environment for our mission consists of a 7-degree-of-freedom KUKA LBR iiwa robotic arm mounted on a table within a PyBullet simulation. Seven-times-upscaled tennis balls move towards the robot's workspace from random positions. A synthetic RGB-D camera is positioned to view the workspace and the incoming tennis balls, allowing the arm to detect and dodge these objects. Robot control was achieved using the pybullet-robot-envs repository from GitHub.
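A minimal sketch of such a simulation setup is shown below, assuming only the standard PyBullet assets (the actual project also used pybullet-object-models for the tennis ball and pybullet-robot-envs for control); positions, velocities, and camera parameters are illustrative.

```python
# Sketch of the simulation setup using only the default PyBullet assets.
import pybullet as p
import pybullet_data

p.connect(p.GUI)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)

table = p.loadURDF("table/table.urdf", basePosition=[0.5, 0, 0])
kuka = p.loadURDF("kuka_iiwa/model.urdf", basePosition=[0, 0, 0.62], useFixedBase=True)

# Stand-in obstacle: a small sphere scaled up, as the tennis balls were.
ball = p.loadURDF("sphere_small.urdf", basePosition=[2.0, 0.0, 1.0], globalScaling=7.0)
p.resetBaseVelocity(ball, linearVelocity=[-3.0, 0.0, 0.5])  # throw it towards the arm

# Synthetic RGB-D camera looking at the workspace.
view = p.computeViewMatrix(cameraEyePosition=[1.5, -1.5, 1.5],
                           cameraTargetPosition=[0.3, 0.0, 0.7],
                           cameraUpVector=[0, 0, 1])
proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.1, farVal=5.0)
width, height, rgb, depth, seg = p.getCameraImage(320, 320, view, proj)
```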
YOLO version 8 (You Only Look Once) is a computer vision model developed by Ultralytics for real-time image and video processing. It offers low-latency detection, classification, and segmentation, which made it a good candidate for tracking projectiles in real time in our project.
Utilising the Open Images v7 Dataset, which contains approximately 16 million bounding boxes across 600 classes, we downloaded around 400 annotated images of tennis balls and organised them into training, validation, and test sets using a script.
We began with a detection model pre-trained on the entire Open Images v7 Dataset. During training, the loss for both boxes and classes consistently trended downwards, though the validation loss did not fall as cleanly as the training loss.
Our primary focus was on accurately detecting tennis balls, and a confusion matrix was used to assess the model's performance on this specific task.
The model reliably predicts tennis balls correctly, while the other classes, which are not of interest, are mostly dismissed as background and do not affect our use of the model. This accuracy was visually confirmed on the test dataset and within the PyBullet environment.
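As a rough illustration of this check, the sketch below runs the fine-tuned detector on a frame captured from the simulation camera; the weights filename is an assumption, and rgb, width, and height are taken from p.getCameraImage as in the setup sketch above.

```python
# Sketch: run the fine-tuned detector on a simulation camera frame.
import numpy as np
from ultralytics import YOLO

model = YOLO("best.pt")  # assumed path to the fine-tuned weights

# rgb, width, height come from p.getCameraImage; the image is RGBA, so drop alpha
# and flip to BGR channel order, which is what Ultralytics expects for numpy input.
frame = np.reshape(rgb, (height, width, 4))[:, :, :3].astype(np.uint8)[:, :, ::-1]

results = model(frame)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()   # bounding box corners in pixels
    confidence = float(box.conf[0])         # detection confidence
    print(f"tennis ball at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), conf={confidence:.2f}")
```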
In our project, we utilise a Deep Q-Network (DQN) to enable the robotic arm to make informed movement decisions, allowing it to learn and improve over time through interaction with the environment. The inputs to the DQN are the end effector's position and the distance to the nearest tennis ball, which describe the robot's current state and the proximity of obstacles. The DQN architecture comprises two hidden layers with 256 nodes each, enabling the model to process these inputs and learn the optimal actions to avoid collisions. The output of the DQN is an end effector movement command, which is then combined with inverse kinematics to calculate the joint angles the robot needs to avoid obstacles. Essentially, the DQN model functions as the brain of our collision avoidance system, guiding the robot to navigate its environment safely in real time.
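A minimal PyTorch sketch of such a network is shown below; the exact state dimension and the number of discrete movement commands are assumptions (three position coordinates plus one distance, and seven actions).

```python
# Sketch of the DQN described above: two hidden layers of 256 nodes,
# a small state vector, and one Q-value per discrete end-effector move.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim: int = 4, n_actions: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),  # one Q-value per movement command
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def select_action(model: DQN, state: torch.Tensor) -> int:
    # Greedy selection: pick the movement command with the highest Q-value.
    with torch.no_grad():
        return int(model(state).argmax().item())
```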
The graphs shown illustrate the results of various training runs, where the x-axis represents the episodes and the y-axis represents the reward obtained. On the left is the initial run, and on the right is our most recent one. In the initial version, the DQN learned that placing the end effector straight down on the table maximised the reward, which was effective for avoiding projectiles but not practical.
In subsequent versions, we refined the observations and introduced penalties and rewards for specific actions. This led to the DQN learning more balanced strategies for avoiding projectiles while maintaining an optimal position.
The improvements can be seen in the increased number of positive reward episodes, indicating a more effective collision avoidance strategy.
Some compromises were made to simplify the environment when compared with a theoretical real-world counterpart:
The KUKA and table models were taken from the PyBullet library.
Obstacle models were taken from pybullet-object-models on GitHub.
Control of the KUKA arm was achieved using code from pybullet-robot-envs on GitHub (a simplified stand-in is sketched after this list).
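As a simplified stand-in for that control code, the end effector can be moved with PyBullet's built-in inverse kinematics; the link index below is the usual end-effector link of the bundled kuka_iiwa model.

```python
# Simplified stand-in for the pybullet-robot-envs control step: move the KUKA's
# end effector to a target point using PyBullet's built-in inverse kinematics.
import pybullet as p

EE_LINK_INDEX = 6  # end-effector link of the 7-DOF kuka_iiwa model

def move_end_effector(kuka_id: int, target_position) -> None:
    joint_angles = p.calculateInverseKinematics(kuka_id, EE_LINK_INDEX, target_position)
    for joint_index, angle in enumerate(joint_angles[:7]):
        p.setJointMotorControl2(kuka_id, joint_index, p.POSITION_CONTROL, targetPosition=angle)
```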
After selecting YOLO for the project, we built our dataset of tennis ball images and found a model pre-trained on a larger dataset which we could fine-tune.
Annotated images of tennis balls were taken from the Open Images Dataset
v7. An example of these can be found at this link: https://storage.googleapis.com/openimages/web/visualizer/index.html?type=detection&set=train&c=%2Fm%2F05ctyq
The FiftyOne library was used to download and format the desired annotated images into training, test, and validation sets. A .yaml file was also generated for YOLO to read the dataset. A guide for this can be found here:
https://docs.voxel51.com/user_guide/export_datasets.html#yolov5dataset
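A sketch of such a script, following the linked guide, might look as follows; the class name, sample count, label field, and export path are assumptions.

```python
# Sketch: download annotated tennis ball images and export them in YOLO format.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v7",
    split="train",
    label_types=["detections"],
    classes=["Tennis ball"],
    max_samples=400,
)

# Export in a YOLO-compatible layout; repeat per split to build train/val/test sets.
dataset.export(
    export_dir="datasets/tennis_balls",
    dataset_type=fo.types.YOLOv5Dataset,
    label_field="detections",
    classes=["Tennis ball"],
)
```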
Pretrained models on the entire Open Images V7 dataset can be found here: https://docs.ultralytics.com/datasets/detect/open-images-v7/. The basic YOLOv8n
was used.
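Fine-tuning that model on the tennis ball dataset can be done with the Ultralytics API; the dataset path and training settings below are assumptions.

```python
# Sketch: fine-tune the Open Images V7 pre-trained YOLOv8n on the tennis ball dataset.
from ultralytics import YOLO

model = YOLO("yolov8n-oiv7.pt")  # YOLOv8n weights pre-trained on Open Images V7

model.train(
    data="datasets/tennis_balls/dataset.yaml",  # assumed path to the generated .yaml
    epochs=50,
    imgsz=640,
)

metrics = model.val()  # evaluate on the validation split, including the confusion matrix
```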
Several rounds of training in the PyBullet simulation were performed to iteratively refine the hyperparameters and the reward scheme for the robot. During development, there were three major changes to how rewards were given to the DQN.
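The sketch below illustrates the general shape of such a reward scheme; the terms and values are illustrative only and do not reproduce any particular revision.

```python
# Illustrative reward scheme: reward keeping clear of the projectile,
# penalise collisions and drifting too far from a useful posture.
def compute_reward(distance_to_ball: float, distance_from_home: float, collided: bool) -> float:
    if collided:
        return -10.0                      # large penalty for being hit
    reward = min(distance_to_ball, 1.0)   # reward for staying clear of the projectile
    reward -= 0.5 * distance_from_home    # penalty for leaving a practical position
    return reward
```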
DQN is not an entirely appropriate solution for controlling the robot, but it is the limit of our current knowledge. Its main limitation is the requirement for a discrete action space, which in this project meant that commands to move the end effector could only be given at predetermined speeds. Should there be a need to respond to projectiles moving at different speeds, this would not be an adequate control method. A cursory search online shows that Deep Deterministic Policy Gradient (DDPG) uses a similar Q-function structure to DQN but allows for a continuous action space.
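To illustrate the limitation, a discrete action set of the kind DQN requires might look like the following, where each command moves the end effector by a fixed step regardless of how fast the projectile is travelling (the step size and action count are assumptions):

```python
# Illustrative discrete action space: fixed end-effector steps along each axis.
STEP = 0.05  # metres per control step (an assumed value)

ACTIONS = {
    0: ( STEP, 0.0, 0.0), 1: (-STEP, 0.0, 0.0),
    2: (0.0,  STEP, 0.0), 3: (0.0, -STEP, 0.0),
    4: (0.0, 0.0,  STEP), 5: (0.0, 0.0, -STEP),
    6: (0.0, 0.0, 0.0),   # stay still
}
```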
Increasing the complexity of the observation space to track more than one projectile could also improve the robot's ability to avoid several tennis balls at once. Additionally, changing the observations from the end effector's distance to the nearest projectile to the distance from each joint to the projectile could help the robot sense danger sooner, although this would add considerably more complexity to the problem.