mmWave Radar Fusion Paper Summaries

ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions

  • 2022, arXiv
  • Question: How to merge mmWave radar with RGB frames to do 3D human mesh reconstruction?
  • Spec: Single person, 3D mesh, RGB + mmWave Radar.
  • Features: Robust in extreme weather/conditions like rain, smoke, low light, and occlusion.


  • Merging scheme: three branches are used: an image branch, a radar point-cloud branch, and a fusion branch. All three are concatenated and sent to a transformer. A human template is also concatenated as a “positional encoding” (in practice it serves more as a prior-knowledge encoding).
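As a sketch, the token-concatenation scheme above might look like the following; all dimensions (token counts, embedding size, joint/vertex counts, the projection `W_prior`) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64                                        # hypothetical shared token embedding size
img_tokens = rng.standard_normal((49, D))     # image-branch tokens (e.g., a 7x7 grid)
pts_tokens = rng.standard_normal((32, D))     # radar point-cloud branch tokens
fuse_tokens = rng.standard_normal((16, D))    # fusion-branch tokens

# Human template prior: 3D coordinates of J joints + V vertices,
# projected into the token space (random projection stands in for a learned one)
J, V = 24, 431
template_xyz = rng.standard_normal((J + V, 3))
W_prior = rng.standard_normal((3, D)) * 0.01
prior_tokens = template_xyz @ W_prior         # acts as prior-knowledge encoding

# All branches plus the template prior form one token sequence for the transformer
tokens = np.concatenate([img_tokens, pts_tokens, fuse_tokens, prior_tokens], axis=0)
# tokens.shape == (49 + 32 + 16 + 455, 64)
```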


  • Previous fusion methods:

    1. Point-level fusion method: Concatenate image features or projected RGB pixels to the point clouds as extended features of the point-based model.
      • This fusion strategy is not well suited to mmWave-RGB fusion because radar points are sparse and noisy.
      • Issues such as randomly missing points and temporal flickering lead to fetching fewer, or even wrong, image features.
    2. DeepFusion: Treat image features as keys (K) and values (V) that answer queries (Q) from the mmWave point-cloud features.
    3. TokenFusion: Do feature extraction and go through the transformer separately for both image and radar features. Fusion happens in the last stage.
  • They perform positional encoding by attaching the 3D coordinates of each joint and vertex in a human template mesh to the global vector.
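The DeepFusion-style scheme in item 2 is essentially cross-attention; a minimal NumPy sketch, where the single-head, projection-free form and all dimensions are illustrative assumptions:

```python
import numpy as np

def cross_attention(q_feats, kv_feats, d_k):
    """Radar features act as queries over image features (keys and values)."""
    scores = q_feats @ kv_feats.T / np.sqrt(d_k)              # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # row-wise softmax
    return weights @ kv_feats                                 # (Nq, d_k)

rng = np.random.default_rng(1)
radar = rng.standard_normal((10, 16))    # hypothetical radar point features
image = rng.standard_normal((49, 16))    # hypothetical image patch features
fused = cross_attention(radar, image, 16)
```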

  • The raw radar input has a shape of 1024 × 3, i.e., 1024 mmWave radar points in the cropped body region. It is not entirely clear how this dimension stays fixed across samples, but there should be a sampling mechanism that always draws exactly 1024 points.
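One plausible mechanism for always obtaining exactly 1024 points (the paper does not spell this out; `sample_fixed` is a hypothetical helper): subsample without replacement when there are too many raw returns, and resample with replacement when there are too few.

```python
import numpy as np

def sample_fixed(points, n=1024, rng=None):
    """Return exactly n points from an (m, 3) array."""
    if rng is None:
        rng = np.random.default_rng()
    m = points.shape[0]
    if m >= n:
        idx = rng.choice(m, size=n, replace=False)   # subsample
    else:
        idx = rng.choice(m, size=n, replace=True)    # pad by resampling
    return points[idx]

rng = np.random.default_rng(2)
sparse = rng.standard_normal((300, 3))    # fewer raw returns than 1024
dense = rng.standard_normal((5000, 3))    # more raw returns than 1024
```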

  • For the image features, they use HRNet to extract features from the 224 × 224 cropped body region.

  • They use PointNet++ to process the raw radar point clouds; the length L of the resulting feature map is the number of seed points selected by Farthest Point Sampling (FPS).
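FPS can be sketched as a greedy loop that repeatedly picks the point farthest from everything chosen so far (a plain NumPy version; PointNet++ implementations do the same thing on the GPU):

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedily select k well-spread seed indices from an (n, 3) cloud."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]          # random starting point
    dist = np.full(n, np.inf)                # distance to nearest chosen point
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        dist = np.minimum(dist, d)
        chosen.append(int(np.argmax(dist)))  # farthest remaining point
    return np.asarray(chosen)

pts = np.random.default_rng(3).standard_normal((1024, 3))
seeds = farthest_point_sampling(pts, 64)     # L = 64 seed points
```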

  • Results:


MmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review

  • 2022, Sensors (MDPI)
  • This paper surveys the various fusion methods used by previous mmWave + vision papers.

Fusion Methods

  1. Data Level

    • Mature, but not the mainstream method.
    • Basically, radar detects an ROI, and the visual frame is cropped accordingly.
    • The size of the initial ROI is determined by the distance between the obstacle and the mmWave radar.
    • The information loss on the radar side is significant.
  2. Decision Level

    • This is the mainstream fusion scheme at present.
    • The basic guideline is to process radar and vision data separately and in parallel, have each branch make its own prediction, and fuse only the predicted results.
    • Radar detection results generate a list of objects and contain information such as the distance, azimuth angle, and relative velocity of the detected objects.
    • The fusion methods can be divided into Bayesian-theory-based, Kalman-filter-based, Dempster-Shafer-theory-based, and radar-validation-based approaches.
  3. Feature Level

    • This is a relatively new strategy.
    • The core idea is to extract the feature from both sensors, fuse them, and do the prediction.
    • The fusion methods are usually concatenation, point-wise addition, or spatial attention fusion.
    • Note that the goal of fusion here is to compile an RGB-image-like feature map, so the object detection module can be any traditional CV object detection algorithm.
    • Radar feature extraction mostly converts radar points to the image plane to generate a radar image, i.e., it transforms radar information into image-like matrix information. Each channel of the radar-generated feature map represents a physical quantity such as distance, longitudinal speed, or lateral speed.
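Such a radar-image conversion can be sketched as rasterizing each projected point into a multi-channel map, one channel per physical quantity; the resolution, channel order, and `radar_to_feature_map` helper are illustrative assumptions:

```python
import numpy as np

def radar_to_feature_map(uv, depth, vx, vy, h=64, w=64):
    """Rasterize projected radar points into an image-like map whose
    channels hold distance, longitudinal speed, and lateral speed."""
    fmap = np.zeros((3, h, w), dtype=np.float32)
    for (u, v), d, lx, ly in zip(uv, depth, vx, vy):
        if 0 <= v < h and 0 <= u < w:        # keep points inside the image
            fmap[:, v, u] = (d, lx, ly)
    return fmap

uv = np.array([[10, 20], [30, 40]])          # projected pixel coordinates (u, v)
fmap = radar_to_feature_map(uv, depth=[5.0, 12.0], vx=[1.0, -0.5], vy=[0.0, 0.2])
```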

Future Trend

  • 3D object detection. Existing radar-vision fusion works mainly perform 2D object detection, and their 3D detection results are far worse.
  • Integrate new sensors.
  • Better ways of sensing information fusion (multi-modal fusion).
    • Better ways to deal with sparseness brought by radar.
    • More efficient multi-sensor fusing methods.


  • Coordinate transformation method: Radar and camera are placed in the same coordinate system. Using the camera's and radar's absolute coordinates, a linear transformation is applied to calibrate them.
  • Sensor verification method: Radar first proposes a target list, which is then verified and matched using vision information.
  • Vision-based method: The camera proposes candidate areas for moving targets, and radar results are matched to them.
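The coordinate transformation method reduces to a rigid (linear) transform from the radar frame to the camera frame; the rotation and translation below are toy values for illustration, not real calibration results:

```python
import numpy as np

def radar_to_camera(points_radar, R, t):
    """Map (n, 3) radar-frame points into the camera frame: x_cam = R x_radar + t."""
    return points_radar @ R.T + t

# Toy extrinsics: a 90-degree yaw plus a small translation (illustrative values)
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.1, 0.0, 0.5])
p_cam = radar_to_camera(np.array([[1.0, 2.0, 3.0]]), R, t)
```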

Spatial Attention Fusion for Obstacle Detection Using MmWave Radar and Vision Sensor

  • 2020, Sensors (MDPI)
  • Core Contribution: Proposed a novel attention-based radar-vision fusion mechanism to do obstacle detection.
  • Taxonomy: Feature level fusion. (Attached is another taxonomy.)


  • Network structure: The feature extractor is nothing new, a combination of a modified ResNet and RetinaNet; the fusion mechanism is where the method's advantage lies.

  • The core mechanism, SAF, predicts an attention map from the radar feature and point-wise multiplies this attention map with the feature map extracted from the vision branch.
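A minimal sketch of SAF's point-wise multiplication, assuming a 1×1-convolution-like projection plus sigmoid turns the radar feature into a single-channel attention map (the paper's actual attention predictor is a learned network; the shapes here are illustrative):

```python
import numpy as np

def spatial_attention_fusion(vision_feat, radar_feat, w, b):
    """Predict a spatial attention map from radar_feat (Cr, H, W) and
    multiply it point-wise into vision_feat (Cv, H, W)."""
    logits = np.tensordot(w, radar_feat, axes=([0], [0])) + b  # (H, W)
    attn = 1.0 / (1.0 + np.exp(-logits))                       # sigmoid in (0, 1)
    return vision_feat * attn[None]        # broadcast over vision channels

rng = np.random.default_rng(4)
vision = rng.standard_normal((8, 16, 16))  # hypothetical vision feature map
radar = rng.standard_normal((3, 16, 16))   # hypothetical radar feature map
w = rng.standard_normal(3) * 0.1           # stands in for learned 1x1-conv weights
fused = spatial_attention_fusion(vision, radar, w, b=0.0)
```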

  • The “radar image” they used is transformed from radar points, which is shown as follows:

    • They calculate the extrinsic matrices of the radar and the front camera separately, transforming the radar point cloud into camera coordinates.
    • They calculate the pixel value of the radar image from the depth d, longitudinal velocity v_x, and lateral velocity v_y. The f here is a linear transformation of these three variables.
    • Lastly, since each pixel covers too small an area of the image, they render a solid circle around each radar point. The value within the circle is constant, and the circle's radius is r.
    • If there are two radar points whose distance is less than 2r, the intersection area takes the value of the nearer point (the one with the smaller d).
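The circle-rendering rule above, including the nearer-point tie-break in overlapping regions, can be sketched with a simple depth buffer (image size, radius, and pixel values are illustrative):

```python
import numpy as np

def render_radar_image(uv, depth, value, h=64, w=64, r=3):
    """Render a solid circle of radius r around each projected radar point;
    where circles overlap, the nearer point (smaller depth) wins."""
    img = np.zeros((h, w), dtype=np.float32)
    zbuf = np.full((h, w), np.inf)           # nearest depth seen per pixel
    ys, xs = np.mgrid[0:h, 0:w]
    for (u, v), d, val in zip(uv, depth, value):
        mask = (xs - u) ** 2 + (ys - v) ** 2 <= r * r
        win = mask & (d < zbuf)              # depth test inside the circle
        img[win] = val
        zbuf[win] = d
    return img

uv = [(10, 10), (13, 10)]                    # centers closer than 2r -> overlap
img = render_radar_image(uv, depth=[5.0, 2.0], value=[1.0, 2.0])
```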