Software & Data Downloads

AssemblyBench — Physics-Aware Assembly of Complex Industrial Objects

Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, often overlooking shape complexity and assembly trajectories in industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and physically plausible 6-DoF part assembly trajectories. We also propose AssemblyDyno, a transformer-based model that uses the instruction manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior works . . .

LLMPhy — Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

Most learning-based approaches to complex physical reasoning overlook the crucial challenge of parameter identification, such as estimating mass and friction, that governs scene dynamics—despite its importance in real-world applications including collision avoidance and robotic manipulation. We present LLMPhy, a black-box optimization framework that integrates large language models (LLMs) with physics simulators for physical reasoning. LLMPhy bridges the textbook physical knowledge embedded in LLMs with world models implemented in modern physics engines, enabling the construction of digital twins of input scenes through the estimation of latent physical parameters. We are publicly releasing our implementation of the core functionalities . . .

SRP — SimpleRefrigerantProperties.jl

SimpleRefrigerantProperties.jl provides lightweight models for evaluating thermodynamic properties of refrigerant fluids in Julia. The package is designed for quick integration into larger thermodynamic or system models where simple, fast property evaluations are sufficient. Thermodynamic properties are parameterized using pressure (Pa) and specific enthalpy (J/kg). The models are constructed using 2-D interpolation over precomputed reference datasets. Property values are evaluated using bilinear interpolation over trapezoidal cells in the pressure–enthalpy (P–h) space. All interpolation routines are implemented directly within the package to minimize dependencies on external libraries.
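The bilinear interpolation mentioned above can be illustrated with a minimal Python sketch. It is not the package's Julia API: it assumes a plain rectangular grid rather than the trapezoidal cells the package uses, and the function name `bilinear_interpolate` and the table values are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right


def bilinear_interpolate(p_grid, h_grid, values, p, h):
    """Interpolate values[i][j], tabulated at (p_grid[i], h_grid[j]), at (p, h).

    Assumes both grids are sorted in increasing order and form a rectangular
    mesh (a simplification of the trapezoidal cells described in the text).
    """
    if not (p_grid[0] <= p <= p_grid[-1] and h_grid[0] <= h <= h_grid[-1]):
        raise ValueError("query point lies outside the tabulated range")
    # Find the cell containing the query; clamp so the upper edge stays valid.
    i = min(bisect_right(p_grid, p) - 1, len(p_grid) - 2)
    j = min(bisect_right(h_grid, h) - 1, len(h_grid) - 2)
    # Normalized position of the query inside the cell, each in [0, 1].
    tp = (p - p_grid[i]) / (p_grid[i + 1] - p_grid[i])
    th = (h - h_grid[j]) / (h_grid[j + 1] - h_grid[j])
    # Weighted average of the four corner values.
    return ((1 - tp) * (1 - th) * values[i][j]
            + tp * (1 - th) * values[i + 1][j]
            + (1 - tp) * th * values[i][j + 1]
            + tp * th * values[i + 1][j + 1])


# Made-up table: pressure in Pa, specific enthalpy in J/kg, arbitrary property.
p_grid = [1.0e5, 2.0e5, 3.0e5]
h_grid = [2.0e5, 3.0e5]
values = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]

mid_cell = bilinear_interpolate(p_grid, h_grid, values, 1.5e5, 2.5e5)
```

At the center of a cell the result is the mean of the four corner values, and at a grid node it reproduces the tabulated value.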

The dataset provides thermodynamic . . .

embracing-cacophony — Embracing Cacophony

This repository contains the PyTorch data loader code with EQ augmentation and multiple sampling of instrumental stems for music source separation. The inference code and pre-trained weights of a TFC-TDF-UNet v3 model trained with the proposed data augmentation methods are also included.

REXO — multi-view Radar object dEtection with 3D bounding boX diffusiOn

This software contains the PyTorch implementation of REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), a radar-based pipeline that takes multi-view radar heatmaps as input and estimates 3D bounding boxes (BBox) of human objects.

REXO operates a BBox diffusion process directly in the 3D radar space and utilizes these noisy 3D BBoxes to guide an explicit cross-view radar feature association. At each diffusion timestep, these noisy 3D BBoxes are projected into every radar view, where RoI-aligned feature cropping extracts view-specific radar features. These multi-view-associated radar features are then aggregated to condition the 3D BBox denoising process. The denoised 3D BBoxes are transformed into the 3D . . .

RAPTR — Radar-based 3D Pose Estimation using Transformer

Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer), which uses only 3D BBox and 2D keypoint labels that are considerably easier and more scalable to collect. RAPTR is characterized by a two-stage pose decoder architecture with a new pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and the refined joint decoder finalizes pose . . .

MMHOI — MMHOI Dataset: Modeling Complex 3D Multi-Human Multi-Object Interactions

The MMHOI dataset addresses the challenge of modeling real-world scenes with multiple humans and objects interacting in causal, goal-oriented, or cooperative ways. Existing 3D human-object interaction (HOI) benchmarks capture only a subset of these complex interactions. MMHOI closes this gap by providing a large-scale dataset with comprehensive 3D shape and pose annotations for every person and object, covering 78 action categories and 14 interaction-specific body parts across 12 everyday scenarios.

MMHOI is designed to foster research in multi-human multi-object interactions and serves as a benchmark for next-generation HOI algorithms. All necessary data for training, validation, and testing are publicly released.

SuDaField — Subject- and Dataset-Aware Neural Field for HRTF Modeling

PyTorch implementation for training and evaluating models related to Subject- and Dataset-Aware Neural Field (SuDaField) for HRTF modeling.

Open Vocabulary Attribute Detection Dataset

Current detection datasets usually contain a variety of object annotations. By comparison, few detection datasets contain attribute annotations, which are also important for the detection task. To address this gap, we propose a novel attribute dataset, OVAD, to support comprehensive training and testing of attribute detection. OVAD builds on the [nuScenes](https://www.nuscenes.org/nuscenes) dataset (license: CC BY-NC-SA 4.0), supplementing it with detailed attribute annotations capturing spatial relationships, motion states, and interactions between objects. It is useful for developing and evaluating systems that need to understand complex scene dynamics.

To encourage more follow up works on Open Vocabulary Attribute . . .

TUSS — Task-Aware Unified Source Separation

PyTorch code for task-aware unified source separation (TUSS), a separation model that uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks.

anomaly-score-normalization — Local Density-Based Anomaly Score Normalization for Domain Generalization

PyTorch code for local density-based anomaly score normalization for domain generalization applied to unsupervised anomalous sound detection using the DCASE 2020, 2023, and 2024 task 2 datasets.

LTOAD — Long-Tailed Online Anomaly Detection dataset

Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. We expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD).

G-RepsNets — Group Representation Networks

This is the code for the TMLR 2025 publication G-RepsNet: A Fast and General Construction of Equivariant Networks for Arbitrary Matrix Groups. Given input representations and the group of interest, the work presents a general and efficient construction of equivariant neural networks using equivariant tensor polynomials and tensor mixing.

SMITIN — Self-Monitored Inference-Time INtervention for Generative Music Transformers

PyTorch code for Self-Monitored Inference-Time INtervention (SMITIN), an approach for controlling an autoregressive generative music transformer using classifier probes.

ranf-hrtf — Retrieval-Augmented Neural Field for HRTF Upsampling and Personalization

PyTorch implementation for training and evaluating models related to Retrieval-Augmented Neural Field (RANF) for HRTF upsampling and personalization. The model will be trained and evaluated on the SONICOM dataset.

RETR — Radar dEtection TRansformer

PyTorch training and evaluation code for RETR (Radar dEtection TRansformer). RETR inherits the advantages of DETR, eliminating the need for hand-crafted components for object detection and segmentation in the image plane. To account for the unique multi-view radar setting, RETR incorporates carefully designed modifications: 1) depth-prioritized feature similarity via a tunable positional encoding (TPE); 2) a tri-plane loss from both radar and camera coordinates; and 3) a calibrated or learnable radar-to-camera transformation via reparameterization.

MetaLIC — Meta-Learning State Space Models

PyTorch implementation with Bouc-Wen nonlinear system identification benchmark using meta-learned neural state-space models for rapid adaptation.

LIP4RobotInverseDynamics — Lagrangian Inspired Polynomial for Robot Inverse Dynamics

Learning the inverse dynamics of robots directly from data, adopting a black-box approach, is interesting for several real-world scenarios where limited knowledge about the system is available. This code repository proposes a black-box model, based on Gaussian Process (GP) Regression for the identification of the inverse dynamics of robotic manipulators. The proposed model relies on a novel multidimensional kernel, called Lagrangian Inspired Polynomial (LIP) kernel. The LIP kernel is based on two main ideas. First, instead of directly modeling the inverse dynamics components, we model as GPs the kinetic and potential energy of the system. The GP prior on the inverse dynamics components is derived from those on the energies by applying the . . .

melpets-llmpc2024-red-team — MEL-PETs Joint-Context Attack for LLM Privacy Challenge

Code that we submitted (as the MEL-PETs team) for the Red Team Track of the NeurIPS 2024 LLM Privacy Challenge, where we won the Special Award for Practical Attack.

melpets-llmpc2024-blue-team — MEL-PETs Defense for LLM Privacy Challenge

Code that we submitted (as the MEL-PETs team) for the Blue Team Track of the NeurIPS 2024 LLM Privacy Challenge, where we won the Third Place Award.

EVAL — Explainable Video Anomaly Localization

We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable. Our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech and UCSD Ped1, . . .

GRAM — Generalization in Deep RL with a Robust Adaptation Module

This repository contains the official implementation for the paper "GRAM: Generalization in Deep RL with a Robust Adaptation Module." GRAM is a deep RL framework that can generalize to both in-distribution and out-of-distribution scenarios at deployment time within a single unified architecture.

OptimalRML — Optimal Recursive McCormick Linearization of MultiLinear Programs

OptimalRML provides Python code for computing the Optimal Recursive McCormick Linearization (RML) of MultiLinear Programs (MLPs). Given an MLP, the code can compute: (i) a minimum-sized RML and (ii) a size-constrained best-bound RML. The RMLs define relaxations of MLPs that can be used in global optimization.
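For readers unfamiliar with McCormick relaxations, the following Python sketch shows the standard McCormick envelope for a single bilinear term w = x*y with box bounds on x and y. It is only an illustration of the building block, not the OptimalRML code; the function name `mccormick_bounds` is hypothetical. In a recursive scheme, a product of more than two variables is typically handled by introducing an auxiliary variable for one pairwise product and applying the same envelope again.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Return (lower, upper) bounds on w = x * y implied by the McCormick
    envelope, given x in [x_lo, x_hi] and y in [y_lo, y_hi]."""
    # Two linear under-estimators of x * y.
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,
                x_hi * y + x * y_hi - x_hi * y_hi)
    # Two linear over-estimators of x * y.
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,
                x_lo * y + x * y_hi - x_lo * y_hi)
    return lower, upper


# Inside the unit box the envelope is a relaxation: it contains x * y
# but is not tight in the interior.
lo, hi = mccormick_bounds(0.5, 0.5, 0.0, 1.0, 0.0, 1.0)

# At a corner of the box the envelope collapses to the exact product.
corner_lo, corner_hi = mccormick_bounds(1.0, 1.0, 0.0, 1.0, 0.0, 1.0)
```

The gap between the lower and upper bounds at interior points is what makes the choice of linearization matter for bound quality.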

eeg-subject-transfer — Stabilizing Subject Transfer in EEG Classification with Divergence Estimation

The code for our Journal of Neural Engineering submission / arXiv draft "Stabilizing Subject Transfer in EEG Classification with Divergence Estimation".

pycvxset — Convex sets in Python

This paper introduces pycvxset, a new Python package to manipulate and visualize convex sets. We support polytopes and ellipsoids, and provide user-friendly methods to perform a variety of set operations. For polytopes, pycvxset supports the standard halfspace/vertex representation as well as the constrained zonotope representation. The main advantage of constrained zonotope representations over standard halfspace/vertex representations is that constrained zonotopes admit closed-form expressions for several set operations. pycvxset uses CVXPY to solve various convex programs arising in set operations, and uses pycddlib to perform vertex-halfspace enumeration. We demonstrate the use of pycvxset in analyzing and controlling dynamical systems . . .

Gear-NeRF — Gear Extensions of Neural Radiance Fields

This repository contains the implementation of Gear-NeRF, an approach for novel-view synthesis, as well as tracking of any object in the scene in the novel view using prompts such as mouse clicks, described in the paper:

Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee, "Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling", appeared in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024 (Highlight).

MMVR — Millimeter-wave Multi-View Radar Dataset

Compared with the extensive range of automotive radar datasets that support autonomous driving, indoor radar datasets are scarce and much smaller in scale, typically take the form of low-resolution radar point clouds, and are usually collected in an open-space, single-room setting. In this paper, we aim to scale up indoor radar data collection with large-scale, multi-view, high-resolution heatmaps in a multi-day, multi-room, and multi-subject setting. The resulting millimeter-wave multi-view radar (MMVR) dataset consists of 345K multi-view radar heatmap frames collected from 22 human subjects over 6 different rooms (e.g., open/cluttered offices and meeting rooms). Each pair of horizontal and vertical radar frames is synchronized with RGB image-plane . . .

TF-Locoformer — Transformer-based model with LOcal-modeling by COnvolution

This code implements TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution for speech enhancement and audio source separation, presented in our Interspeech 2024 paper. Training and inference scripts are provided, as well as pretrained models for the WSJ0-2mix, Libri2mix, WHAMR!, and DNS-Interspeech2020 datasets.

ERAS — Enhanced Reverberation as Supervision

This code implements the Enhanced Reverberation as Supervision (ERAS) framework for fully unsupervised training of 2-source separation using stereo data.

TS-SEP — Target-Speaker SEParation

Minimal PyTorch code for testing the network architectures proposed in our IEEE TASLP paper "TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings." We include both target-speaker voice activity detection (TS-VAD) as a first-stage training process and target-speaker separation (TS-SEP) as a second-stage training process.

TI2V-Zero — Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

This is the code for the CVPR 2024 publication TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models. It allows users to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water") based on a pretrained text-to-video (T2V) diffusion model, without any additional training or fine-tuning.

ComplexVAD — ComplexVAD Dataset

This is a dataset for video anomaly detection collected at the University of South Florida. The dataset consists of various video clips of a single scene on the campus of USF showing a road and a pedestrian crosswalk. The anomalies in the dataset mainly consist of anomalous interactions between two people or objects. Examples include two people running into each other, a person trying to break into a car, and a person leaving a package on the ground.

SEBBs — Sound Event Bounding Boxes

Python implementation for the prediction of sound event bounding boxes (SEBBs). SEBBs are one-dimensional bounding boxes defined by event onset time, event offset time, sound class, and a confidence. They represent sound event candidates, each with a scalar confidence score assigned to it. We call them (1D) bounding boxes to highlight the similarity to the (2D) bounding boxes typically used for object detection in computer vision.

With SEBBs, the sensitivity of a system can be controlled without affecting the detection of an event's onset and offset times, a problem from which previous frame-level thresholding approaches suffer.
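The idea can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the class name `SoundEventBox`, the function `detect`, and the example values are all hypothetical. Each candidate carries its own onset, offset, class, and confidence, so changing the decision threshold only selects which candidates are kept and never alters their boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundEventBox:
    """A one-dimensional bounding box for a sound event candidate."""
    onset: float        # event onset time in seconds
    offset: float       # event offset time in seconds
    sound_class: str    # predicted sound class
    confidence: float   # scalar confidence score for the whole event


def detect(candidates, threshold):
    """Keep candidates whose confidence reaches the threshold.

    The onset and offset of each kept candidate are left untouched.
    """
    return [c for c in candidates if c.confidence >= threshold]


# Made-up candidates for illustration only.
candidates = [
    SoundEventBox(0.5, 2.0, "speech", 0.9),
    SoundEventBox(3.0, 3.4, "dog_bark", 0.4),
    SoundEventBox(5.1, 7.8, "speech", 0.6),
]

strict = detect(candidates, 0.8)    # high threshold: fewer detections
lenient = detect(candidates, 0.3)   # low threshold: more detections
```

The first candidate appears in both results with identical onset and offset, which is the property that frame-level thresholding does not provide.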

SteeredDiffusion — Steered Diffusion

This is the code for the ICCV 2023 publication Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Face Synthesis. It allows users to modify outputs of pretrained diffusion models using additional steering functions without any need for fine-tuning. The code shows examples of several types of tasks like image restoration and editing using Steered Diffusion.

robust-rotation-estimation — Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes

We present a novel approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation (and more general motion estimation) is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously tested ground truth on 17 video sequences. Our method uses a novel generalization of the Hough transform on SO(3) to efficiently find the camera rotation most compatible with the optical flow. Methods developed for wide baseline stereo (e.g., 5-point methods) do not do well with the small baseline implicit in . . .

MOST-GAN — 3D MOrphable STyleGAN

Recent advances in generative adversarial networks (GANs) have led to remarkable achievements in face image synthesis. While methods that use style-based GANs can generate strikingly photorealistic face images, it is often difficult to control the characteristics of the generated faces in a meaningful and disentangled way. Prior approaches aim to achieve such semantic control and disentanglement within the latent space of a previously trained GAN. In contrast, we propose a framework that a priori models physical attributes of the face such as 3D shape, albedo, pose, and lighting explicitly, thus providing disentanglement by design. Our method, MOST-GAN, integrates the expressive power and photorealism of style-based GANs with the physical . . .

LTAD — Long-Tailed Anomaly Detection Dataset

Anomaly detection (AD) aims to identify defective images and localize their defects (if any). Ideally, AD models should be able to: detect defects over many image classes; not rely on hard-coded class names that can be uninformative or inconsistent across datasets; learn without anomaly supervision; and be robust to the long-tailed distributions of real-world applications. To address these challenges, we formulate the problem of long-tailed AD by introducing several datasets with different levels of class imbalance for performance evaluation.

NIIRF — Neural IIR Filter Field for HRTF Upsampling and Personalization

PyTorch implementation for training and evaluating models proposed in our ICASSP 2024 paper, “NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization.” Single-subject and multi-subject training and inference code is included for use with the CIPIC and HUTUBS datasets, respectively.

PixPNet — Pixel-Grounded Prototypical Part Networks

This repository contains the code for the paper, Pixel-Grounded Prototypical Part Networks by Zachariah Carmichael, Suhas Lohit, Anoop Cherian, Michael Jones, and Walter J Scheirer. PixPNet (Pixel-Grounded Prototypical Part Network) is an improvement upon existing prototypical part neural networks (ProtoPartNNs): PixPNet truly localizes to object parts (unlike other approaches, including ProtoPNet), has quantitatively better interpretability, and is competitive on image classification benchmarks.

Prototypical part neural networks (ProtoPartNNs), namely ProtoPNet and its derivatives, are an intrinsically interpretable approach to machine learning. Their prototype learning scheme enables intuitive explanations of the form, this . . .

BANSAC — BAyesian Network for adaptive SAmple Consensus

RANSAC-based algorithms are the standard techniques for robust estimation in computer vision. These algorithms are iterative and computationally expensive; they alternate between random sampling of data, computing hypotheses, and running inlier counting. Many authors tried different approaches to improve efficiency. One of the major improvements is having a guided sampling, letting the RANSAC cycle stop sooner. This paper presents a new guided sampling process for RANSAC. Previous methods either assume no prior information about the inlier/outlier classification of data points or use some previously computed scores in the sampling. In this paper, we derive a dynamic Bayesian network that updates individual data points' inlier scores while . . .
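The sample, hypothesize, and count-inliers loop described above can be sketched in Python for a 2D line-fitting problem. This is plain RANSAC with uniform random sampling, not the guided sampling proposed in BANSAC; the function names, tolerance, and synthetic data are assumptions made for illustration.

```python
import random


def fit_line(p, q):
    """Return (slope, intercept) of the line through two points."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, iterations=200, tolerance=0.1, seed=0):
    """Plain RANSAC: keep the hypothesis with the most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1) Randomly sample a minimal set (two points define a line).
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # vertical line; not representable as y = a*x + b
        # 2) Compute a hypothesis from the sample.
        slope, intercept = fit_line(p, q)
        # 3) Count the points that agree with the hypothesis.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Synthetic data: ten points on y = 2x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.0, 30.0), (7.0, -5.0), (4.0, 50.0)]

model, inliers = ransac_line(points)
```

Every iteration draws its sample uniformly, regardless of what earlier iterations revealed about which points are likely inliers; guided sampling methods aim to reduce that wasted effort.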

DeepBornFNO — Learned Born Operator for Reflection Tomographic Imaging

Recent developments in wave-based sensor technologies, such as ground penetrating radar (GPR), provide new opportunities for imaging underground scenes. From the scattered electromagnetic wave measurements obtained by GPR, the goal is to estimate the permittivity distribution of the underground scenes. However, such problems are highly ill-posed, difficult to formulate, and computationally expensive. In this paper, we propose to use a novel physics-inspired machine learning-based method to learn the wave-matter interaction under the GPR setting. The learned forward model is combined with a learned signal prior to recover the unknown underground scenes via optimization. We test our approach on a dataset of 400 permittivity maps with three . . .

AVLEN — Audio-Visual-Language Embodied Navigation in 3D Environments

Recent years have seen embodied visual navigation advance in two distinct directions: (i) in equipping the AI agent to follow natural language instructions, and (ii) in making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal, but also often complex, and thus in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation. Similar to audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event via navigating the 3D visual world; however, the agent may also seek help from a human (oracle), . . .

hyper-unmix — Hyperbolic Audio Source Separation

PyTorch implementation for training and interacting with models proposed in our ICASSP 2023 paper, “Hyperbolic Audio Source Separation.” We include the weights for a model pre-trained on the Librispeech Slakh Unmix (LSX) dataset, which hierarchically separates an audio mixture containing music and speech. Furthermore, code for training models using mask cross-entropy, spectrogram, and waveform losses is included. An interface for interacting with the learned hyperbolic embeddings, created using PyQt6, is also provided in this codebase.

SMART-101 — Simple Multimodal Algorithmic Reasoning Task Dataset

Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task (and the associated SMART-101 dataset) for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for young children (ages 6–8). Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their . . .

GODS — Generalized One-class Discriminative Subspaces

One-class learning is the problem of fitting a model to data for which annotations are available only for a single class. Such models are useful for tasks such as anomaly detection, when the normal data is modeled by the 'one' class. In this software release, we are making public our implementation of our Generalized One-Class Discriminative Subspaces (GODS) algorithm (ICCV 2019, TPAMI 2023) for anomaly detection. The key idea of our method is to use a pair of orthonormal frames -- identifying the one-class data subspace -- to "sandwich" the labeled data by jointly optimizing two objectives: i) minimize the distance between the origins of the two frames, and ii) maximize the margin between the hyperplanes and the data. Our method . . .

CFS — Cocktail Fork Separation

PyTorch implementation of the Multi Resolution CrossNet (MRX) model proposed in our ICASSP 2022 paper, "The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks." We include the weights for a model pre-trained on the Divide and Remaster (DnR) dataset, which can separate the audio from a soundtrack (e.g., movie or commercial) into individual speech, music, and sound effects stems. A pytorch_lightning script for model training using the DnR dataset is also included.

Partial-GCNN — Partial Group Convolutional Neural Networks

This software package provides the PyTorch implementation of Partial Group Convolutional Neural Networks described in the NeurIPS 2022 paper "Learning Partial Equivariances from Data". Partial G-CNNs are able to learn layer-wise levels of partial and full equivariance to discrete, continuous groups and combinations thereof, directly from data. Partial G-CNNs retain full equivariance when beneficial, but adjust it whenever it becomes harmful. The software package also provides scripts to reproduce the results in the paper.

kscore — Nonparametric Score Estimators

PyTorch reimplementation of code from "Nonparametric Score Estimators" (Yuhao Zhou, Jiaxin Shi, Jun Zhu. https://arxiv.org/abs/2005.10099). See the original TensorFlow implementation at https://github.com/miskcoo/kscore (MIT license).

SOCKET — SOurce-free Cross-modal KnowledgE Transfer

SOCKET allows transferring knowledge from neural networks trained on a source sensor modality (such as RGB) for one or more domains where large amounts of annotated data may be available to an unannotated target dataset from a different sensor modality (such as infrared or depth). It makes use of task-irrelevant paired source-target images to promote feature alignment between the two modalities as well as distribution matching between the source batch norm features (mean and variance) and the target features.

CISOR — Convergent Inverse Scattering using Optimization and Regularization

This software package implements the CISOR reconstruction algorithm along with other benchmark algorithms that attempt to recover the distribution of refractive indices of an object in a multiple scattering regime. The problem of reconstructing an object from the measurements of the light it scatters is common in numerous imaging applications. While the most popular formulations of the problem are based on linearizing the object-light relationship, there is an increased interest in considering nonlinear formulations that can account for multiple light scattering. Our proposed algorithm for nonlinear diffractive imaging, called Convergent Inverse Scattering using Optimization and Regularization (CISOR), is based on our new variant of fast . . .

InSeGAN-ICCV2021 — Instance Segmentation GAN

This package implements InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. For this task, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vectors), each encoding the 3D pose of an object that is represented by a learned implicit object template. The generator has two distinct modules. The first module, the instance feature generator, uses each encoded pose to transform the implicit template into a feature map representation of each object instance. The second module, the depth image renderer, aggregates all of the . . .

HMIS — Hierarchical Musical Instrument Separation

Many sounds that humans encounter are hierarchical in nature; a piano note is one of many played during a performance, which is one of many instruments in a band, which might be playing in a bar with other noises occurring. Inspired by this, we re-frame the musical source separation problem as hierarchical, combining similar instruments together at certain levels and separating them at other levels. This allows us to deconstruct the same mixture in multiple ways, depending on the appropriate level of the hierarchy for a given application. In this software package, we present PyTorch implementations of various methods for hierarchical musical instrument separation, with some methods focusing on separating specific instruments (like guitars) . . .

AVSGS — Audio Visual Scene-Graph Segmentor

State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to characterize the sources better, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained via co-segmenting the audio spectrogram. At its core, . . .

PyRoboCOP — Python-based Robotic Control & Optimization Package

PyRoboCOP is a lightweight Python-based package for control and optimization of robotic systems described by nonlinear Differential Algebraic Equations (DAEs). In particular, the package can handle systems with contacts that are described by complementarity constraints and provides a general framework for specifying obstacle avoidance constraints. The package performs direct transcription of the DAEs into a set of nonlinear equations by performing orthogonal collocation on finite elements. The resulting optimization problem belongs to the class of Mathematical Programs with Complementarity Constraints (MPCCs). MPCCs fail to satisfy commonly assumed constraint qualifications and require special handling of the complementarity constraints in . . .

MC-PILCO — Monte Carlo Probabilistic Inference for Learning COntrol

This package implements a Model-based Reinforcement Learning algorithm called Monte Carlo Probabilistic Inference for Learning and COntrol (MC-PILCO), for modeling and control of dynamical systems. The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient during optimization. The Monte Carlo approach is shown to be effective for policy optimization thanks to a proper cost function shaping and use of dropout. The possibility of using a Monte Carlo approach allows a more flexible framework for Gaussian Process Regression that leads to more structured and more data-efficient kernels. The algorithm is also extended to work for Partially Measurable Systems and . . .

Safety-RL — Goal directed RL with Safety Constraints

In this paper, we consider the problem of building learning agents that can efficiently learn to navigate in constrained environments. The main goal is to design agents that can efficiently learn to understand and generalize to different environments using high-dimensional inputs (a 2D map), while following feasible paths that avoid obstacles in obstacle-cluttered environments. We test our proposed method in the recently proposed Safety Gym suite, which allows testing of safety constraints during training of learning agents. The provided Python code base allows users to reproduce the results from the IROS 2020 paper.

Sound2Sight — Generating Visual Dynamics from Sound and Context

Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis, a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to . . .

TEAQC — Template Embeddings for Adiabatic Quantum Computation

Quantum Annealing (QA) can be used to quickly obtain near-optimal solutions for Quadratic Unconstrained Binary Optimization (QUBO) problems. In QA hardware, each decision variable of a QUBO should be mapped to one or more adjacent qubits in such a way that pairs of variables defining a quadratic term in the objective function are mapped to some pair of adjacent qubits. However, qubits have limited connectivity in existing QA hardware. This software provides Python code implementing integer linear programs to search for an embedding of the problem graph into certain classes of minors of the QA hardware, which we call template embeddings. In particular, we consider the template embeddings that are minors of the Chimera graph used in D-Wave . . .

ACOT — Adversarially-Contrastive Optimal Transport

In this software release, we provide a PyTorch implementation of the adversarially-contrastive optimal transport (ACOT) algorithm. Through ACOT, we study the problem of learning compact representations for sequential data that capture its implicit spatio-temporal cues. To separate such informative cues from the data, we propose a novel contrastive learning objective via optimal transport. Specifically, our formulation seeks a low-dimensional subspace representation of the data that jointly (i) maximizes the distance of the data (embedded in this subspace) from an adversarial data distribution under a Wasserstein distance, (ii) captures the temporal order, and (iii) minimizes the data distortion. To generate the adversarial distribution, . . .

CME — Circular Maze Environment

In this package, we provide Python code for a circular maze environment (CME), which is a challenging environment for learning manipulation and control. The goal in this system is to tip and tilt the CME so as to drive one (or more) marble(s) from the outermost to the innermost ring. While this system is very intuitive and easy for humans to solve, it can be very difficult and inefficient for standard reinforcement learning algorithms to learn meaningful policies. Consequently, we provide code for this environment so that it can be used as a benchmark for different algorithms that can learn meaningful policies in this environment. We also provide code for iLQR, which can be used to control the motion of marbles in the proposed environment.

LUVLi — Landmarks’ Location, Uncertainty, and Visibility Likelihood

Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We model these as mixed random variables and estimate them using a deep network trained with our proposed Location, Uncertainty, and Visibility Likelihood (LUVLi) loss. In addition, we release an entirely new labeling of a large face alignment dataset with over 19,000 face images in a full range of head poses. Each face is manually labeled with the . . .

CAZSL — Context-Aware Zero Shot Learning

Learning accurate models of the physical world is required for many robotic manipulation tasks. However, during manipulation, robots are expected to interact with unknown workpieces, so building predictive models that can generalize over a number of these objects is highly desirable. We provide code for context-aware zero shot learning (CAZSL) models, an approach utilizing a Siamese network architecture, embedding space masking, and regularization based on context variables, which allows us to learn a model that can generalize to different parameters or features of the interacting objects. The proposed learning algorithm is evaluated on the recently released Omnipush data set that allows testing of meta-learning capabilities using . . .

OFENet — Online Feature Extractor Network

This Python code implements an online feature extractor network (OFENet) that uses neural nets to produce good representations to be used as inputs to deep RL algorithms. Even though the high dimensionality of input is usually supposed to make learning of RL agents more difficult, by using this network, we show that the RL agents in fact learn more efficiently with the high-dimensional representation than with the lower-dimensional state observations. We believe that stronger feature propagation together with larger networks (and thus larger search space) allows RL agents to learn more complex functions of states and thus improves the sample efficiency. The code also contains several test problems. Through numerical experiments on these . . .

MotionNet

The ability to reliably perceive the environmental states, particularly the existence of objects and their motion behavior, is crucial for autonomous driving. In this work, we propose an efficient deep model, called MotionNet, to jointly perform perception and motion prediction from 3D point clouds. MotionNet takes a sequence of LiDAR sweeps as input and outputs a bird's eye view (BEV) map, which encodes the object category and motion information in each grid cell. The backbone of MotionNet is a novel spatio-temporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. To enforce the smoothness of predictions over both space and time, the training of MotionNet is further regularized with novel . . .

FoldingNet_Plus — FoldingNet++

This software is the PyTorch implementation of FoldingNet++, a novel end-to-end graph-based deep autoencoder that achieves compact representations of unorganized 3D point clouds in an unsupervised manner.

The encoder of the proposed networks adopts an architecture similar to that of PointNet, a well-established method for supervised learning tasks on 3D point clouds such as recognition and segmentation. The decoder of the proposed networks involves three novel modules: a folding module, a graph-topology-inference module, and a graph-filtering module. The folding module folds a canonical 2D lattice to the underlying surface of a 3D point cloud, achieving coarse reconstruction; the graph-topology-inference module learns a graph . . .

QNTRPO — Quasi-Newton Trust Region Policy Optimization

We propose a trust region method for policy optimization that employs a Quasi-Newton approximation for the Hessian, called Quasi-Newton Trust Region Policy Optimization (QNTRPO). Gradient descent has become the de facto algorithm for reinforcement learning tasks with continuous controls. The algorithm has achieved state-of-the-art performance on a wide variety of tasks and resulted in several improvements in performance of reinforcement learning algorithms across a wide range of systems. However, the algorithm suffers from a number of drawbacks, including lack of a stepsize selection criterion, slow convergence, and dependence on problem scaling. We investigate the use of a dogleg method with a Quasi-Newton approximation for the Hessian to . . .
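
For readers unfamiliar with the dogleg method mentioned above, the following is a minimal sketch of a textbook dogleg trust-region step. It is not taken from the QNTRPO release: the function names (`solve`, `dogleg_step`), the list-based matrix format, and the assumptions that the Hessian approximation `B` is symmetric positive definite and the gradient `g` is nonzero are all illustrative.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def dogleg_step(g, B, delta):
    """Approximately minimize g.p + 0.5 p.B.p subject to ||p|| <= delta.

    Assumes B is symmetric positive definite and g is nonzero.
    """
    # Full (quasi-)Newton step; accept it if it lies inside the region.
    p_full = solve(B, [-gi for gi in g])
    if norm(p_full) <= delta:
        return p_full
    # Minimizer of the quadratic model along the steepest-descent direction.
    Bg = [dot(row, g) for row in B]
    alpha = dot(g, g) / dot(g, Bg)
    p_sd = [-alpha * gi for gi in g]
    if norm(p_sd) >= delta:
        # Even the steepest-descent minimizer leaves the region: clip it.
        scale = delta / norm(g)
        return [-scale * gi for gi in g]
    # Otherwise walk from p_sd toward p_full until the boundary is hit.
    d = [pf - ps for pf, ps in zip(p_full, p_sd)]
    a, b, c = dot(d, d), 2.0 * dot(p_sd, d), dot(p_sd, p_sd) - delta ** 2
    t = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [ps + t * di for ps, di in zip(p_sd, d)]
```

In a trust-region loop, `delta` would be grown or shrunk depending on how well the quadratic model predicted the actual improvement, which is what removes the need for a hand-tuned step size.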

RIDE — Robust Iterative Data Estimation

Recent studies have demonstrated that as classifiers, deep neural networks (e.g., CNNs) are quite vulnerable to adversarial attacks that only add quasi-imperceptible perturbations to the input data but completely change the predictions of the classifiers. To defend classifiers against such adversarial attacks, here we focus on the white-box adversarial defense where the attackers are granted full access to not only the classifiers but also the defenders, so as to produce as strong an attack as possible. We argue that a successful white-box defender should prevent the attacker from not only direct gradient calculation but also gradient approximation. Therefore we propose viewing the defense from the perspective of a functional, a high-order function . . .

GNI — Gradient-based Nikaido-Isoda

Computing Nash equilibrium (NE) of multiplayer games has witnessed renewed interest due to recent advances in generative adversarial networks (GAN). However, computing equilibrium efficiently is challenging. To this end, we introduce the Gradient-based Nikaido-Isoda (GNI) function which serves as a merit function, vanishing only at the first-order stationary points of each player’s optimization problem. Gradient descent is shown to converge sublinearly to a first-order stationary point of the GNI function. For the particular case of bilinear min-max games and multi-player quadratic games, the GNI function is convex. Hence, the application of gradient descent in this case yields linear convergence to an NE (when one exists).
. . .

DSP — Discriminative Subspace Pooling

Human action recognition from video sequences is one of the fundamental problems in computer vision. In this research, we investigate and propose representation learning approaches towards solving this problem, which we call discriminative subspace pooling. Specifically, we combine recent deep learning approaches with techniques for generating adversarial perturbations into learning novel representations that can summarize long video sequences into compact descriptors – these descriptors capture essential properties of the input videos that are sufficient to achieve good recognition rates. We make two contributions. First, we propose a subspace-based discriminative classifier, similar to a non-linear SVM, but having piecewise-linear . . .

StreetScene — Street Scene Dataset

The Street Scene dataset consists of 46 training video sequences and 35 testing video sequences taken from a static USB camera looking down on a scene of a two-lane street with bike lanes and pedestrian sidewalks. See Figure 1 for a typical frame from the dataset. Videos were collected from the camera at various times during two consecutive summers. All of the videos were taken during the daytime. The dataset is challenging because of the variety of activity taking place such as cars driving, turning, stopping and parking; pedestrians walking, jogging and pushing strollers; and bikers riding in bike lanes. In addition, the videos contain changing shadows, and moving background such as a flag and trees blowing in the wind.
. . .

SSTL — Semi-Supervised Transfer Learning

Successful state-of-the-art machine learning techniques rely on the existence of large well sampled and labeled datasets. Today it is easy to obtain a finely sampled dataset because of the decreasing cost of connected low-energy devices. However, it is often difficult to obtain a large number of labels. The reason for this is two-fold. First, labels are often provided by people whose attention span is limited. Second, even if a person was able to label perpetually, this person would need to be shown data in a large variety of conditions. One approach to addressing these problems is to combine labeled data collected in different sessions through transfer learning. Still even this approach suffers from dataset limitations.

This . . .

1bCRB — One-Bit CRB

Massive multiple-input multiple-output (MIMO) systems can significantly increase the spectral efficiency, mitigate propagation loss by exploiting large array gain, and reduce inter-user interference with high-resolution spatial beamforming. To reduce complexity and power consumption, several transceiver architectures have been proposed for mmWave massive MIMO systems: 1) an analog architecture, 2) a hybrid analog/digital architecture, and 3) a fully digital architecture with low-resolution ADCs.

To this end, we derive the Cramer-Rao bound (CRB) on estimating angular-domain channel parameters including angles-of-departure (AoDs), angles-of-arrival (AoAs), and associated channel path gains. Our analysis provides a simple tool . . .

FoldingNet

Recent deep networks that directly handle points in a point set, e.g., PointNet, have been state-of-the-art for supervised learning tasks on point clouds such as classification and segmentation. In this work, a novel end-to-end deep auto-encoder is proposed to address unsupervised learning challenges on point clouds. On the encoder side, a graph-based enhancement is enforced to promote local structures on top of PointNet. Then, a novel folding-based decoder deforms a canonical 2D grid onto the underlying 3D object surface of a point cloud, achieving low reconstruction errors even for objects with delicate structures. The proposed decoder uses only about 7% of the parameters of a decoder with fully-connected neural networks, yet leads to a more . . .

Kernel Correlation Network

Unlike on images, semantic learning on 3D point clouds using a deep network is challenging due to the naturally unordered data structure. Among existing works, PointNet has achieved promising results by directly learning on point sets. However, it does not take full advantage of a point's local neighborhood that contains fine-grained structural information which turns out to be helpful towards better semantic learning. In this regard, we present two new operations to improve PointNet with a more efficient exploitation of local structures. The first one focuses on local 3D geometric structures. In analogy to a convolution kernel for images, we define a point-set kernel as a set of learnable 3D points that jointly respond to a set of . . .

FRPC — Fast Resampling on Point Clouds via Graphs

We propose a randomized resampling strategy to reduce the cost of storing, processing and visualizing a large-scale point cloud, that selects a representative subset of points while preserving application-dependent features. The strategy is based on graphs, which can represent underlying surfaces and lend themselves well to efficient computation. We use a general feature-extraction operator to represent application-dependent features and propose a general reconstruction error to evaluate the quality of resampling; by minimizing the error, we obtain a general form of optimal resampling distribution. The proposed resampling distribution is guaranteed to be shift-, rotation- and scale-invariant in the 3D space.

PCQM — Point Cloud Quality Metric

It is challenging to measure the geometric distortion of a point cloud introduced by point cloud compression. Conventionally, the errors between point clouds are measured in terms of point-to-point or point-to-surface distances, which either ignore the surface structures or rely heavily on specific surface reconstructions. To overcome these drawbacks, we propose using point-to-plane distances as a measure of geometric distortions on point cloud compression. The intrinsic resolution of the point clouds is proposed as a normalizer to convert the mean square errors to PSNR numbers. In addition, the perceived local planes are investigated at different scales of the point cloud. Finally, the proposed metric is independent of the size of . . .
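
As a rough illustration of the idea, the sketch below computes a point-to-plane mean square error by projecting each error vector onto the normal of the nearest reference point, then converts it to a PSNR value. It is an assumption-laden sketch, not the released metric: the function names, the brute-force nearest-neighbor search, the requirement that unit normals are supplied by the caller, and the conventional form 10 log10(peak^2 / MSE) with `peak` set by the caller (for example, to the intrinsic resolution) are all illustrative choices.

```python
import math

def nearest_index(p, cloud):
    """Index of the point in `cloud` closest to `p` (brute force)."""
    return min(
        range(len(cloud)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cloud[i])),
    )

def point_to_plane_mse(degraded, reference, ref_normals):
    """Mean squared error measured along the reference surface normals.

    `ref_normals[j]` is assumed to be the unit normal at `reference[j]`.
    """
    total = 0.0
    for p in degraded:
        j = nearest_index(p, reference)
        err = [a - b for a, b in zip(p, reference[j])]
        # Only the component of the error along the normal is counted,
        # so sliding along the local plane is not penalized.
        d = sum(e * n for e, n in zip(err, ref_normals[j]))
        total += d * d
    return total / len(degraded)

def psnr(mse, peak):
    """Convert an MSE to PSNR in dB, using `peak` as the normalizer."""
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

With a flat reference patch and a degraded copy that is shifted both within the plane and along the normal, only the shift along the normal contributes to the error, which is the property that distinguishes this measure from a point-to-point distance.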

ROSETA — Robust Online Subspace Estimation and Tracking Algorithm

This script implements a revised version of the robust online subspace estimation and tracking algorithm (ROSETA) that is capable of identifying and tracking a time-varying low dimensional subspace from incomplete measurements and in the presence of sparse outliers. The algorithm minimizes a robust l1 norm cost function between the observed measurements and their projection onto the estimated subspace. The projection coefficients and sparse outliers are computed using a LASSO solver and the subspace estimate is updated using a proximal point iteration with adaptive parameter selection.

CASENet — Deep Category-Aware Semantic Edge Detection

Boundary and edge cues are highly beneficial in improving a wide variety of vision tasks such as semantic segmentation, object recognition, stereo, and object proposal generation. Recently, the problem of edge detection has been revisited and significant progress has been made with deep learning. While classical edge detection is a challenging binary problem in itself, the category-aware semantic edge detection by nature is an even more challenging multi-label problem. We model the problem such that each edge pixel can be associated with more than one class as they appear in contours or junctions belonging to two or more semantic classes. To this end, we propose a novel end-to-end deep semantic edge learning architecture based on ResNet . . .

NDS — Non-negative Dynamical System model

Non-negative data arise in a variety of important signal processing domains, such as power spectra of signals, pixels in images, and count data. We introduce a novel non-negative dynamical system model for sequences of such data. The model we propose is called non-negative dynamical system (NDS), and bridges two active fields, dynamical systems and nonnegative matrix factorization (NMF). Its formulation follows that of linear dynamical systems, but the observation and the latent variables are assumed non-negative, the linear transforms are assumed to involve non-negative coefficients, and the additive random innovations both for the observation and the latent variables are replaced by multiplicative random innovations. The software . . .

MERL_Shopping_Dataset — MERL Shopping Dataset

As part of this research, we collected a new dataset for training and testing action detection algorithms. Our MERL Shopping Dataset consists of 106 videos, each of which is a sequence about 2 minutes long. The videos are from a fixed overhead camera looking down at people shopping in a grocery store setting. Each video contains several instances of the following 5 actions: "Reach To Shelf" (reach hand into shelf), "Retract From Shelf" (retract hand from shelf), "Hand In Shelf" (extended period with hand in the shelf), "Inspect Product" (inspect product while holding it in hand), and "Inspect Shelf" (look at shelf while not touching or reaching for the shelf).

JGU — Joint Geodesic Upsampling

We develop an algorithm utilizing geodesic distances to upsample a low resolution depth image using a registered high resolution color image. Specifically, it computes depth for each pixel in the high resolution image using geodesic paths to the pixels whose depths are known from the low resolution one. Though this is closely related to the all-pairs shortest-path problem, which has O(n^2 log n) complexity, we develop a novel approximation algorithm whose complexity grows linearly with the image size and which achieves real-time performance. We compare our algorithm with the state of the art on the benchmark dataset and show that our approach provides more accurate depth upsampling with fewer artifacts. In addition, we show that the proposed . . .

EBAD — Exemplar-Based Anomaly Detection

Anomaly detection in real-valued time series has important applications in many diverse areas. We have developed a general algorithm for detecting anomalies in real-valued time series that is computationally very efficient. Our algorithm is exemplar-based, which means that a set of exemplars is first learned from a normal time series (i.e., one not containing any anomalies); these exemplars effectively summarize all normal windows in the training time series. Anomalous windows of a testing time series can then be efficiently detected using the exemplar-based model.

The provided code implements our hierarchical exemplar learning algorithm, our exemplar-based anomaly detection algorithm, and a baseline brute-force Euclidean distance anomaly . . .
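
To make the window-based setting concrete, here is a minimal sketch of a brute-force Euclidean-distance scorer of the general kind a baseline might use. It is only an assumption about what such a baseline looks like, not the provided code: the function names, the use of every training window in place of learned exemplars, and the scoring rule (distance to the nearest normal window) are all hypothetical.

```python
import math

def windows(series, width):
    """All contiguous windows of length `width` in `series`."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]

def anomaly_scores(train, test, width):
    """Score each test window by its distance to the nearest training window.

    A large score means no normal window looks like this test window.
    """
    train_windows = windows(train, width)
    return [
        min(math.dist(w, t) for t in train_windows)
        for w in windows(test, width)
    ]
```

The cost of this brute-force version grows with the product of the two series lengths. Replacing the full set of training windows with a small set of exemplars is what would make detection cheaper.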

PEAC — Plane Extraction using Agglomerative Clustering

Real-time plane extraction in 3D point clouds is crucial to many robotics applications. We present a novel algorithm for reliably detecting multiple planes in real time in organized point clouds obtained from devices such as Kinect sensors. By uniformly dividing such a point cloud into non-overlapping groups of points in the image space, we first construct a graph whose nodes and edges represent groups of points and their neighborhoods, respectively. We then perform an agglomerative hierarchical clustering on this graph to systematically merge nodes belonging to the same plane until the plane fitting mean squared error exceeds a threshold. Finally we refine the extracted planes using pixel-wise region growing. Our experiments demonstrate that . . .
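
The merge criterion described above can be sketched as follows. This is an illustrative sketch, not the PEAC implementation: it assumes the plane-fitting mean squared error is the smallest eigenvalue of the covariance matrix of the points (the mean squared orthogonal distance to the best-fit plane), and the function names and the closed-form 3x3 eigenvalue routine are choices made here for self-containment.

```python
import math

def covariance(points):
    """3x3 covariance matrix (normalized by n) of a list of 3D points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
         for j in range(3)]
        for i in range(3)
    ]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def smallest_eigenvalue_sym3(A):
    """Smallest eigenvalue of a symmetric 3x3 matrix (trigonometric form)."""
    p1 = A[0][1] ** 2 + A[0][2] ** 2 + A[1][2] ** 2
    if p1 == 0.0:
        return min(A[0][0], A[1][1], A[2][2])  # already diagonal
    q = (A[0][0] + A[1][1] + A[2][2]) / 3.0
    p2 = sum((A[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    B = [[(A[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    r = max(-1.0, min(1.0, det3(B) / 2.0))  # clamp against rounding
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)

def plane_fit_mse(points):
    """Mean squared orthogonal distance of `points` to their best-fit plane."""
    return max(smallest_eigenvalue_sym3(covariance(points)), 0.0)

def should_merge(group_a, group_b, threshold):
    """Merge two groups only if their union is still well fit by one plane."""
    return plane_fit_mse(group_a + group_b) <= threshold
```

An agglomerative pass would repeatedly pick a pair of neighboring groups, call `should_merge` on them, and stop growing a segment once the fit error of the union exceeds the threshold.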

PQP — Parallel Quadratic Programming

An iterative multiplicative algorithm is proposed for the fast solution of quadratic programming (QP) problems that arise in the real-time implementation of Model Predictive Control (MPC). The proposed algorithm, Parallel Quadratic Programming (PQP), is amenable to fine-grained parallelization. Conditions on the convergence of the PQP algorithm are given and proved. Due to its extreme simplicity, even serial implementations offer considerable speed advantages. To demonstrate, PQP is applied to several simulation examples, including a stand-alone QP problem and two MPC examples. When implemented in MATLAB using single-thread computations, numerical simulations of PQP demonstrate a 5-10x speed-up compared to the MATLAB active-set . . .
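
To give a feel for what an iterative multiplicative QP algorithm looks like, the sketch below applies one such update to the problem of minimizing 0.5 x'Qx + h'x subject to x >= 0. The specific form is an assumption made for illustration and is not taken from the released MATLAB code: `Q` and `h` are split into nonnegative parts, a hypothetical diagonal shift `r` is added to both parts of `Q`, and each coordinate is rescaled by the ratio of the two resulting nonnegative terms. Every coordinate update is independent of the others, which is the property that makes this style of iteration easy to parallelize.

```python
def pqp_solve(Q, h, r=1.0, iters=200, x0=None):
    """Multiplicative-update sketch for min 0.5 x'Qx + h'x with x >= 0.

    Q = Q_pos - Q_neg and h = h_pos - h_neg, with all parts nonnegative.
    The shift `r` on both diagonals is an illustrative assumption.
    """
    n = len(h)
    Q_pos = [[max(Q[i][j], 0.0) + (r if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    Q_neg = [[max(-Q[i][j], 0.0) + (r if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    h_pos = [max(hi, 0.0) for hi in h]
    h_neg = [max(-hi, 0.0) for hi in h]
    x = list(x0) if x0 is not None else [1.0] * n  # must start positive
    for _ in range(iters):
        num = [h_neg[i] + sum(Q_neg[i][j] * x[j] for j in range(n))
               for i in range(n)]
        den = [h_pos[i] + sum(Q_pos[i][j] * x[j] for j in range(n))
               for i in range(n)]
        # Each coordinate is rescaled independently of the others.
        x = [x[i] * num[i] / den[i] if den[i] > 0.0 else 0.0
             for i in range(n)]
    return x
```

At a fixed point, each coordinate is either zero or has a ratio equal to one, which means Qx + h = 0 in that coordinate, so a fixed point is consistent with the optimality conditions of the nonnegatively constrained problem. For example, `pqp_solve([[2.0, 0.0], [0.0, 2.0]], [2.0, -2.0])` drives the first coordinate toward 0 and the second toward 1.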

BRDF — MERL BRDF Database

The MERL BRDF database contains reflectance functions of 100 different materials. Each reflectance function is stored as a densely measured Bidirectional Reflectance Distribution Function (BRDF).

Sample code to read the data is included with the database. Note that parameterization of theta-half has changed.