Attention is All You Need in Speech Separation with @SpeechBrain1 and @huggingface on @GradioML paper: arxiv.org/abs/2010.13154 github: github.com/speechbrain/speec… gradio demo: gradio.app/g/AK391/speechbra…
Can deep RL discover athletic high jump strategies, such as the Fosbury flop, Western roll, and more? Surprisingly, yes! Accepted to #SIGGRAPH2021. Very fun collaboration with Zhiqi Yin, Zeshi Yang, KangKang Yin (SFU). Paper + video: arpspoof.github.io/project/j… 1/2
Pri3D: Can 3D Priors Help 2D Representation Learning? pdf: arxiv.org/pdf/2104.11225.pdf abs: arxiv.org/abs/2104.11225
ImageNet-21K Pretraining for the Masses pdf: arxiv.org/pdf/2104.10972.pdf abs: arxiv.org/abs/2104.10972 github: github.com/Alibaba-MIIL/Imag…
Hierarchical Motion Understanding via Motion Programs pdf: arxiv.org/pdf/2104.11216.pdf abs: arxiv.org/abs/2104.11216 project page: sumith1896.github.io/motion2…
Token Labeling: Training a 85.5% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet pdf: arxiv.org/pdf/2104.10858.pdf abs: arxiv.org/abs/2104.10858
On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation pdf: arxiv.org/pdf/2104.11222.pdf abs: arxiv.org/abs/2104.11222 project page: cs.cmu.edu/~clean-fid/ github: github.com/GaParmar/clean-fi…
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation project page: hangz-nju-cuhk.github.io/pro… github: github.com/Hangz-nju-cuhk/Ta…
Cross-Domain and Disentangled Face Manipulation with 3D Guidance pdf: arxiv.org/pdf/2104.11228.pdf abs: arxiv.org/abs/2104.11228 project page: cassiepython.github.io/sigas…
KeypointDeformer: Unsupervised 3D Keypoint Discovery for Shape Control pdf: arxiv.org/pdf/2104.11224.pdf abs: arxiv.org/abs/2104.11224 project page: tomasjakab.github.io/Keypoin…
Multiscale Vision Transformers "We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models" pdf: arxiv.org/pdf/2104.11227.pdf abs: arxiv.org/abs/2104.11227
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text "We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures" pdf: arxiv.org/pdf/2104.11178.pdf abs: arxiv.org/abs/2104.11178
So-ViT: Mind Visual Tokens for Vision Transformer pdf: arxiv.org/pdf/2104.10935.pdf abs: arxiv.org/abs/2104.10935 "when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models"
DreamerV2 learns a world model of the DeepMind humanoid and solves standup and walking from only pixel inputs 🌍🚀
Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation pdf: arxiv.org/pdf/2104.10674.pdf abs: arxiv.org/abs/2104.10674 github: github.com/GT-RIPL/robo-vln project page: zubair-irshad.github.io/proj…
PP-YOLOv2: A Practical Object Detector pdf: arxiv.org/pdf/2104.10419.pdf abs: arxiv.org/abs/2104.10419