Aggregate View Object Detection (AVOD) for Sensor Fusion of LiDAR and Camera in Autonomous Driving.

Gandham Vignesh Babu
6 min read · May 22, 2021

AVOD proposes a neural network architecture that fuses LiDAR point clouds and RGB images.

The network is similar to the Faster R-CNN 2D object detection network.

It consists of two subnetworks:

  1. A region proposal network (RPN).
  2. A second-stage detector network.

Why do we use region proposals?

  • Region proposals are used to predict the extents, orientation, and class of objects in 3D space.

Why does 3D object detection not perform as well as 2D object detection?

  1. Estimating the third dimension is inherently difficult.
  2. 3D input data has low resolution.
  3. Data quality deteriorates as a function of distance.
  4. Unlike 2D object detection, 3D object detection must also estimate the orientation of the bounding boxes.

Do we use 2D region proposals or 3D region proposals?

  1. 3D object detection relies on 3D region proposal generation to reduce the 3D search space.
  2. Since we use 3D proposal generation, we generate 3D anchors.

In AVOD, region proposals let the computationally expensive processing at later stages focus on a small set of high-quality candidate detections.

Achieving high recall at the region proposal generation stage is crucial for good performance.

What are the advantages of AVOD?

  1. Inspired by feature pyramid networks for 2D object detection, AVOD proposes a novel feature extractor that produces high-resolution feature maps from LiDAR point clouds and RGB images.
  2. It proposes a feature-fusion region proposal network that utilizes multiple modalities to produce high-recall region proposals for small object classes.
  3. It proposes a 3D bounding box encoding that conforms to geometric box constraints, allowing higher 3D localization accuracy.
  4. The proposed network exploits 1×1 convolutions at the RPN stage, allowing higher computational speed and a lower memory footprint.

AVOD Architecture:

1. Feature maps from the BEV image and the RGB image.

  • Both feature maps are used by the RPN to generate non-oriented region proposals.
  • These region proposals are then passed through the network for dimension refinement, orientation estimation, and category classification.

2. Generating the feature maps from the point clouds and images.

  • A six-channel BEV map is generated from a voxel-grid representation of the point cloud at 0.1 m resolution.
  • What is a voxel grid? A voxel grid is a geometry type defined on a regular 3D grid; a voxel can be thought of as the 3D counterpart of the 2D pixel.
  • A voxel grid can also be created from a point cloud, e.g. with Open3D's create_from_point_cloud method. A voxel is occupied if at least one point of the point cloud falls within it, its color is the average of the points inside it, and the voxel_size argument defines the resolution of the grid.

Example of voxelization of a point cloud.

  • The point cloud is cropped to [−40, 40] × [0, 70] meters so that only the points within the camera's field of view remain; this camera-visible part of the point cloud is used for BEV map generation.
  • The first five channels of the BEV map encode the maximum height of the points in each grid cell, computed over five equal slices between [0, 2.5] meters along the Z-axis.
  • The sixth channel contains point density information, computed per cell as min(1.0, log(N + 1)/log 16), where N is the number of points in the cell (see the sketch below).
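To make the encoding concrete, here is a minimal NumPy sketch of the six-channel BEV map described above. The crop range, slice count, and density formula follow the text; the function name, the axis convention (z taken as up), and the details of per-slice handling are simplifying assumptions of mine.

```python
import numpy as np

def point_cloud_to_bev(points, resolution=0.1):
    """Encode a LiDAR point cloud as a 6-channel BEV map.

    points: (N, 3) array of (x, y, z), already in the camera's view.
    Channels 0-4: max point height per cell over 5 slices of [0, 2.5] m.
    Channel 5:    density min(1, log(N + 1) / log 16) per cell.
    """
    # Crop to [-40, 40] x [0, 70] m laterally, [0, 2.5) m in height.
    keep = ((points[:, 0] >= -40) & (points[:, 0] < 40) &
            (points[:, 1] >= 0) & (points[:, 1] < 70) &
            (points[:, 2] >= 0) & (points[:, 2] < 2.5))
    pts = points[keep]

    # Discretize x/y into 0.1 m cells and z into 5 slices of 0.5 m.
    xi = ((pts[:, 0] + 40) / resolution).astype(int)
    yi = (pts[:, 1] / resolution).astype(int)
    zi = (pts[:, 2] / 0.5).astype(int)

    H, W = int(80 / resolution), int(70 / resolution)
    bev = np.zeros((H, W, 6), dtype=np.float32)

    # Channels 0-4: maximum point height per (cell, slice).
    np.maximum.at(bev, (xi, yi, zi), pts[:, 2])

    # Channel 5: normalized point density per cell.
    counts = np.zeros((H, W), dtype=np.float32)
    np.add.at(counts, (xi, yi), 1.0)
    bev[:, :, 5] = np.minimum(1.0, np.log(counts + 1) / np.log(16))
    return bev
```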

3. Feature Extractor

  • The proposed architecture has two identical feature extractors, one for each view.
  • Each extractor is composed of two segments: an encoder and a decoder.
  • Encoder: this is modelled after VGG-16, mainly with the number of channels reduced by half and the network cut at the conv-4 layer.
  • It takes as input an M × N × D image or feature map.
  • It produces an M/8 × N/8 × D feature map.
  • This feature map has higher representational power, but 8× lower resolution than the input.
  • Downsampling by 8× means that small objects can end up occupying less than one cell of the feature map.
  • To overcome this problem, AVOD adds a decoder network, inspired by the feature pyramid network, that upsamples the feature map.
  • AVOD proposes a bottom-up decoder that learns to upsample the feature map back to the input size.
  • The distinctive feature of AVOD's decoder is that each step concatenates the upsampled feature map with the corresponding encoder feature map and follows it with a 3×3 convolution (see the sketch below).
Stage-1 of AVOD
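Below is a rough PyTorch sketch of one such encoder-decoder feature extractor. The halved-channel, VGG-16-style encoder cut at conv-4 and the concatenate-then-3×3-convolve decoder follow the description above; the exact channel counts are illustrative, and plain interpolation stands in for learned upsampling to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vgg_block(c_in, c_out, n_convs):
    """A VGG-style block of n_convs 3x3 conv + ReLU layers."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class FeatureExtractor(nn.Module):
    """VGG-16-like encoder (channels halved, cut at conv-4) plus an
    FPN-inspired decoder that upsamples back to input resolution."""
    def __init__(self, in_channels):
        super().__init__()
        self.enc1 = vgg_block(in_channels, 32, 2)  # full resolution
        self.enc2 = vgg_block(32, 64, 2)           # 1/2
        self.enc3 = vgg_block(64, 128, 3)          # 1/4
        self.enc4 = vgg_block(128, 256, 3)         # 1/8 (conv-4)
        self.pool = nn.MaxPool2d(2)
        # Decoder: upsample, concatenate the encoder map, fuse with 3x3 conv.
        self.dec3 = nn.Conv2d(256 + 128, 128, 3, padding=1)
        self.dec2 = nn.Conv2d(128 + 64, 64, 3, padding=1)
        self.dec1 = nn.Conv2d(64 + 32, 32, 3, padding=1)

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(self.pool(f1))
        f3 = self.enc3(self.pool(f2))
        f4 = self.enc4(self.pool(f3))   # M/8 x N/8, high-level features
        # Each decoder step: 2x upsample, concat the skip, 3x3 conv.
        d3 = F.relu(self.dec3(torch.cat([F.interpolate(f4, scale_factor=2), f3], 1)))
        d2 = F.relu(self.dec2(torch.cat([F.interpolate(d3, scale_factor=2), f2], 1)))
        d1 = F.relu(self.dec1(torch.cat([F.interpolate(d2, scale_factor=2), f1], 1)))
        return d1  # full-resolution feature map
```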

4. Multi-Modal Fusion RPN

  • Similar to 2D two-stage detectors, the proposed RPN regresses the difference between a set of prior 3D boxes and the ground truth.
  • These prior boxes are the anchors, and they are encoded as axis-aligned bounding boxes.
  • The RPN generates non-oriented region proposals, so we do not get orientation information from the RPN itself.
  • However, while generating the anchors, we generate anchor boxes with different orientations.
  • Anchor boxes are parameterized by the centroid (tₓ, tᵧ, tz) and the axis-aligned dimensions (dₓ, dᵧ, dz).
  • To generate the 3D anchor grid from the six-channel BEV map, (tₓ, tᵧ) pairs are sampled at 0.5 m intervals in BEV, while tz is obtained from the sensor's height above the ground plane.
  • Anchor dimensions are obtained by clustering the training samples.
  • Anchors that contain no 3D points are removed efficiently using integral images (see the sketch below).
  • This results in 80–100K non-empty anchors per frame.
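The integral-image trick can be sketched as follows: compute a 2D prefix sum of the BEV occupancy grid once, after which the number of points inside any anchor's footprint is just four array lookups. The variable names and the (x1, y1, x2, y2) anchor format are assumptions of mine.

```python
import numpy as np

def filter_empty_anchors(occupancy, anchors_bev):
    """Keep only anchors whose BEV footprint contains LiDAR points.

    occupancy:   (H, W) grid; each cell holds its point count (or 0/1).
    anchors_bev: (A, 4) integer cell bounds (x1, y1, x2, y2), exclusive
                 on the upper edge.
    """
    # Integral image with a zero border: ii[i, j] = sum(occupancy[:i, :j]).
    ii = np.zeros((occupancy.shape[0] + 1, occupancy.shape[1] + 1))
    ii[1:, 1:] = occupancy.cumsum(axis=0).cumsum(axis=1)

    x1, y1, x2, y2 = anchors_bev.T
    # O(1) box sum per anchor via four integral-image lookups.
    counts = ii[x2, y2] - ii[x1, y2] - ii[x2, y1] + ii[x1, y1]
    return anchors_bev[counts > 0]
```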

5. Extracting Feature Crops via Multi-View Crop-and-Resize Operations:

  • To extract a feature crop for every anchor, we use a crop-and-resize operation.
  • Each generated anchor is a 3D box; two regions of interest are obtained by projecting the anchor onto the BEV and image feature maps.
  • The corresponding regions are bilinearly resized to 3×3 to obtain equal-length feature vectors (see the sketch below).
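A minimal sketch of this multi-view crop-and-resize step, using torchvision's roi_align as a stand-in for the bilinear crop-and-resize operation. It assumes the 3D anchors have already been projected to 2D boxes in each view; the shapes and names are illustrative.

```python
import torch
from torchvision.ops import roi_align

def multiview_feature_crops(bev_feat, img_feat, rois_bev, rois_img):
    """Crop equal-size feature vectors for each anchor from both views.

    bev_feat, img_feat: (1, C, H, W) feature maps from the two extractors.
    rois_bev, rois_img: (A, 4) float boxes (x1, y1, x2, y2) obtained by
                        projecting each 3D anchor onto the two views.
    Returns two (A, C, 3, 3) tensors of bilinearly resized crops.
    """
    batch_idx = torch.zeros(rois_bev.shape[0], 1)  # single image in batch
    bev_crops = roi_align(bev_feat, torch.cat([batch_idx, rois_bev], dim=1),
                          output_size=(3, 3))
    img_crops = roi_align(img_feat, torch.cat([batch_idx, rois_img], dim=1),
                          output_size=(3, 3))
    return bev_crops, img_crops
```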

6. Dimensionality Reduction via 1×1 Convolution:

  • In some scenarios, the RPN is required to keep the feature crops for 100K anchors in GPU memory.
  • For example, 100K anchor crops from a 256-dimensional feature map require around 5 GB of memory. AVOD therefore proposes a 1×1 convolution on the output feature maps as an efficient dimensionality-reduction mechanism, which greatly reduces memory usage (see the sketch below).
  • Fully connected layers of size 256 use the feature crops to regress axis-aligned object proposal boxes and to output an objectness score; 3D box regression is performed over (tₓ, tᵧ, tz, dₓ, dᵧ, dz).
  • A smooth L1 loss is used for 3D box regression and a cross-entropy loss for objectness.
  • Background anchors are ignored when computing the regression loss.
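A sketch of the 1×1 bottleneck; the reduction from 256 down to 32 channels is an illustrative choice of mine, not a value from the text.

```python
import torch
import torch.nn as nn

# A 1x1 convolution leaves the spatial size unchanged and only shrinks
# the channel depth, so every downstream feature crop becomes cheaper.
reduce_dim = nn.Conv2d(256, 32, kernel_size=1)

feat = torch.randn(1, 256, 80, 70)   # e.g. a small BEV feature map
small = reduce_dim(feat)             # -> (1, 32, 80, 70)

# Crop memory scales linearly with depth: A anchors with 3x3 float32
# crops cost A * 3 * 3 * D * 4 bytes, so 256 -> 32 is an 8x saving.
```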

SECOND STAGE DETECTION NETWORK:

1. 3D Bounding Box Encoding:

  • In AVOD, a bounding box is encoded with four corner coordinates plus two height values, the top and bottom corner offsets from the ground plane.
  • The proposed encoding reduces the box representation from a 24-dimensional vector (8 corners × 3 coordinates) to a 10-dimensional vector (4 BEV corners × 2 coordinates + 2 heights); see the sketch below.
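A minimal sketch of the encoding, assuming boxes whose top and bottom faces are parallel to a flat ground plane and whose eight corners arrive bottom-face-first; the function name and layout are mine.

```python
import numpy as np

def encode_box(corners_3d, ground_height=0.0):
    """Encode an 8-corner 3D box (24 values) as a 10-dim vector.

    corners_3d: (8, 3) array; rows 0-3 are the bottom face, 4-7 the top.
    Returns [x1..x4, y1..y4, h1, h2]: four BEV corners plus the bottom
    and top face offsets from the ground plane.
    """
    bev_corners = corners_3d[:4, :2].reshape(-1)    # 4 x 2 = 8 values
    h1 = corners_3d[:4, 2].mean() - ground_height   # bottom offset
    h2 = corners_3d[4:, 2].mean() - ground_height   # top offset
    return np.concatenate([bev_corners, [h1, h2]])  # 10 values total
```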

2. Explicit Orientation Vector Regression:

  • AVOD regresses an orientation vector, (cos θ, sin θ), that implicitly handles the angle wrapping between −π and π.
  • The regressed orientation vector is used to resolve the ambiguity in orientation estimation left by the corner-based box encoding, which cannot distinguish headings that differ by π (see the sketch below).
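One way to apply the idea in code: recover the angle from the regressed (cos θ, sin θ) with atan2, which wraps cleanly into (−π, π], then use it to pick between the two headings the corner encoding cannot tell apart. This helper is a hypothetical sketch, not the paper's exact procedure.

```python
import numpy as np

def resolve_orientation(cos_t, sin_t, box_heading):
    """Disambiguate a corner-derived heading with the orientation vector.

    The corner encoding fixes orientation only up to +/- pi; the regressed
    unit vector (cos_t, sin_t) says which way the object actually faces.
    """
    theta = np.arctan2(sin_t, cos_t)   # angle wrapping handled implicitly
    # If the two estimates point into opposite half-planes, flip by pi.
    if np.cos(theta - box_heading) < 0:
        box_heading += np.pi
    # Re-wrap the result into (-pi, pi].
    return np.arctan2(np.sin(box_heading), np.cos(box_heading))
```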

3. Generating Final Detections:

  • Feature crops from both input views are resized to 7×7 and then fused with an element-wise operation.
  • A single set of three fully connected layers of size 2048 processes the fused feature crops for the output box regression.
  • The final detection loss is similar to the RPN loss: two smooth L1 losses, for the bounding box and orientation vector regressions, plus a cross-entropy loss for the classification task (see the sketch below).
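A compact sketch of the combined second-stage loss as described above; any relative weighting between the terms, which the article does not specify, is omitted.

```python
import torch
import torch.nn.functional as F

def second_stage_loss(box_pred, box_gt, ang_pred, ang_gt,
                      cls_logits, cls_gt, positive_mask):
    """Two smooth L1 terms plus cross entropy, as in the list above.

    box_*: (A, 10) predicted / ground-truth box encodings.
    ang_*: (A, 2) predicted / ground-truth (cos, sin) orientation vectors.
    cls_*: (A, K) logits and (A,) integer class labels.
    positive_mask: regression terms only count for foreground boxes.
    """
    pos = positive_mask
    loss_box = F.smooth_l1_loss(box_pred[pos], box_gt[pos])
    loss_ang = F.smooth_l1_loss(ang_pred[pos], ang_gt[pos])
    loss_cls = F.cross_entropy(cls_logits, cls_gt)
    return loss_box + loss_ang + loss_cls
```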

Training Parameters:

  1. Optimizer: Adam
  2. Learning rate: 0.0001
  3. Decay factor: 0.8
