Combining Multi-Modal Data with Deep Generative Models
Rebecca John
Ladoke Akintola University of Technology
Abstract
In recent years, AI systems have increasingly had to process heterogeneous data, including text, audio, images, and sensor readings. Multi-modal data fusion enhances insight by combining information from these different modalities. This work presents a unified framework for multi-modal data fusion using deep generative models, focusing on Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models. A novel architecture is proposed to learn a joint latent representation that captures relationships among multiple modalities even when some modalities are noisy or missing. Extensive evaluations on benchmark datasets demonstrate that the proposed approach outperforms state-of-the-art fusion methods on classification, generation, and cross-modal retrieval tasks. The framework's applicability is further demonstrated in diverse domains, including healthcare analytics and multimedia content creation.
Keywords
Multi-modal learning, Data fusion, Deep generative models, Variational autoencoders (VAEs), Generative adversarial networks (GANs), Cross-modal generation, Representation learning, Diffusion models, Missing modality handling, Joint latent space.
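To make the idea of a shared latent space concrete, the sketch below shows one way the fusion step described in the abstract could be realized: per-modality encoders produce Gaussian posteriors that are combined with a product-of-experts rule, so that absent modalities simply drop out of the product while the shared code still supports reconstruction of every modality. The product-of-experts fusion, the PyTorch framing, the layer sizes, and the modality names are illustrative assumptions, not details taken from this paper; it is a minimal sketch rather than the proposed architecture.

# Minimal sketch of a joint-latent multi-modal VAE with product-of-experts
# (PoE) fusion. All names, dimensions, and the PoE choice are assumptions
# made for illustration only.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Maps one modality to the parameters of a Gaussian posterior."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def poe_fuse(mus, logvars):
    """Product of Gaussian experts over the available modalities,
    including a standard-normal prior expert."""
    # Precision-weighted combination; missing modalities are simply
    # absent from the lists, so the prior dominates for them.
    prior_mu = torch.zeros_like(mus[0])
    prior_logvar = torch.zeros_like(logvars[0])
    mus = [prior_mu] + list(mus)
    logvars = [prior_logvar] + list(logvars)
    precisions = [torch.exp(-lv) for lv in logvars]
    total_prec = sum(precisions)
    joint_mu = sum(m * p for m, p in zip(mus, precisions)) / total_prec
    joint_logvar = -torch.log(total_prec)
    return joint_mu, joint_logvar

class MultiModalVAE(nn.Module):
    def __init__(self, modality_dims: dict, latent_dim: int = 32):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(d, latent_dim) for name, d in modality_dims.items()})
        self.decoders = nn.ModuleDict(
            {name: nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, d))
             for name, d in modality_dims.items()})

    def forward(self, inputs: dict):
        # Encode only the modalities present in this batch.
        mus, logvars = [], []
        for name, x in inputs.items():
            mu, logvar = self.encoders[name](x)
            mus.append(mu)
            logvars.append(logvar)
        mu, logvar = poe_fuse(mus, logvars)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        # Decode every modality from the shared latent code, which is what
        # enables cross-modal generation from partial inputs.
        recons = {name: dec(z) for name, dec in self.decoders.items()}
        return recons, mu, logvar

if __name__ == "__main__":
    model = MultiModalVAE({"image": 784, "text": 300})
    batch = {"image": torch.randn(8, 784)}  # "text" is missing for this batch
    recons, mu, logvar = model(batch)
    print({k: v.shape for k, v in recons.items()})

The precision-weighted product lets each available modality sharpen the joint posterior, which is one common way to tolerate missing inputs without training a separate model for every subset of modalities.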