Home Action Genome

A large-scale multi-view video dataset of daily activities at home

CVPR 2021 Competition

The Home Action Genome competition consists of three challenge tasks.

Challenge #1: Atomic Action Localization

About

In our dataset, we annotated all the atomic action segments performed during the activities. For this track, participants will use these carefully annotated temporal action segments for the atomic action localization task. Each sample can contain multiple action segments. The task is to localize these atomic action segments by predicting the start and end times of each atomic action as well as its action label. Participants are also allowed to leverage audio information. External datasets for pre-training are allowed, but their use must be clearly documented.


Evaluation Metric

For evaluation, this task uses Second-mAP, which measures the area under the precision-recall curve of detections evaluated each second. To determine whether a prediction is a true positive, we compute its temporal intersection over union (IoU) with the ground-truth segments; if the temporal IoU exceeds a threshold (e.g., IoU > 0.5), the detection is counted as a true positive. This metric is similar to the one used in ActivityNet.
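As an illustration, the snippet below shows how temporal IoU between a predicted and a ground-truth segment could be computed and thresholded. This is only a minimal sketch of the matching rule described above; the function names, label handling, and assignment of predictions to ground truth are assumptions, not the official evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_seg, gt_seg, pred_label, gt_label, iou_threshold=0.5):
    """A detection counts as a true positive if its action label matches and
    its temporal IoU with the ground-truth segment exceeds the threshold."""
    return pred_label == gt_label and temporal_iou(pred_seg, gt_seg) > iou_threshold
```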

Baseline

Pre-trained models and pre-computed features will be available soon.

Submission Format

Details of the submission format will be released soon.

Challenge #2: Scene-graph Generation

About

We use scene graphs to describe the relationships between a person and the objects used during the execution of an action. In this track, algorithms need to predict per-frame scene graphs, including how they change as the video progresses. Participants are also allowed to leverage audio information. External datasets for pre-training are allowed, but their use must be clearly documented. Since there can be multiple relationships between each human-object pair, there is no graph constraint (i.e., no single-relationship constraint).
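To make the prediction target concrete, a per-frame scene graph can be viewed as a set of (subject, predicate, object) triplets with bounding boxes and confidence scores. The structure below is a hypothetical illustration of such an output, not the official annotation or submission schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RelationTriplet:
    subject: str        # e.g. "person"
    predicate: str      # e.g. "holding"
    obj: str            # e.g. "cup"
    subject_box: Box
    object_box: Box
    score: float        # model confidence

# A frame's scene graph is a list of triplets; with no graph constraint,
# the same (person, object) pair may appear with several predicates.
frame_graph: List[RelationTriplet] = [
    RelationTriplet("person", "holding", "cup", (10, 20, 110, 320), (60, 40, 95, 80), 0.91),
    RelationTriplet("person", "looking_at", "cup", (10, 20, 110, 320), (60, 40, 95, 80), 0.84),
]
```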


Evaluation Metric

For evaluation of scene graph prediction, we use three standard evaluation metrics:

(1) Scene graph detection (SGDET): detect a set of objects (their bounding boxes and object categories) and the predicate labels between the person and each of these objects. An object is considered correctly detected if it has at least 0.5 IoU with the ground-truth bounding box. Inputs are video and other modalities only.

(2) Scene graph classification (SGCLS): predict object categories and the predicate labels between the person and each object. Inputs are video, other modalities, and ground-truth bounding boxes.

(3) Predicate classification (PREDCLS): predict predicate labels only. This evaluation examines a model's performance on predicate classification in isolation from other factors. Inputs are video, other modalities, and ground-truth bounding boxes with object categories.

The evaluation metric is Recall@K: we compute the fraction of ground-truth relationship triplets that appear among the top K most confident triplet predictions in each test frame. We will use K = 20 and K = 50.
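As a rough sketch of this metric, Recall@K for a single frame could be computed as below. The triplet-matching rule here is simplified to exact label matching; the official scorer additionally matches boxes by IoU for SGDET and may break ties differently.

```python
def recall_at_k(pred_triplets, pred_scores, gt_triplets, k):
    """Fraction of ground-truth (subject, predicate, object) triplets that
    appear among the top-k most confident predictions for one frame.

    pred_triplets: list of hashable triplets, e.g. ("person", "holding", "cup")
    pred_scores:   confidence score for each predicted triplet
    gt_triplets:   ground-truth triplets for the frame
    """
    order = sorted(range(len(pred_triplets)), key=lambda i: -pred_scores[i])
    top_k = {pred_triplets[i] for i in order[:k]}
    hits = sum(1 for t in gt_triplets if t in top_k)
    # Define recall as 1.0 for frames with no ground-truth triplets.
    return hits / len(gt_triplets) if gt_triplets else 1.0
```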

Baseline

Pre-trained models and pre-computed features will be available soon.

Submission Format

Details of the submission format will be released soon.

Challenge #3: Privacy Concerned Activity Recognition

About

Privacy-sensitive recognition methods are important for practical applications. This task is video-level activity recognition, but with a twist: the input videos are blurred, as if recorded with a multi-pinhole camera. During training, participants can use unblurred images, but the test data contains only blurred images. External datasets for pre-training are allowed, but their use must be clearly documented.


Evaluation Metric

This task is video understanding without clear video frames. Participants predict k activity labels for each video, and each predicted candidate label is checked for consistency with the ground-truth activity label to compute a score. We will use k = 1 and k = 5, and evaluate recognition methods using the average of these two scores.
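For illustration, a simple version of this scoring could look as follows; the exact label format and any tie-breaking in the official evaluation may differ.

```python
def top_k_accuracy(predictions, ground_truth, k):
    """predictions: per-video lists of labels sorted by descending confidence;
    ground_truth: one activity label per video."""
    correct = sum(1 for preds, gt in zip(predictions, ground_truth) if gt in preds[:k])
    return correct / len(ground_truth)


def challenge3_score(predictions, ground_truth):
    # Average of top-1 and top-5 accuracy, as described above.
    return 0.5 * (top_k_accuracy(predictions, ground_truth, 1)
                  + top_k_accuracy(predictions, ground_truth, 5))
```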

Baseline

Pre-trained models and pre-computed features will be available soon.

Submission Format

Details of the submission format will be released soon.