An introduction to multi-modal learning with a special focus on the machine vision community (Dimitrios Papadopoulos)
An overview of current trends in multi-modal learning, with a focus on the state of the art in the machine vision community.
Multi-modal Learning for Biodiversity Monitoring: Bridging Vision and Genomics at Scale (Graham Taylor and Joakim Bruslund Haurum)
This session explores cutting-edge applications of multi-modal learning to address the global biodiversity crisis through the integration of biological images, DNA barcodes, and taxonomic information. Participants will learn how modern machine learning approaches can fill critical knowledge gaps in biodiversity monitoring by combining visual and genomic data modalities.
The session covers foundational concepts in biological image classification using Vision Transformers, DNA sequence modeling with specialized architectures like BarcodeBERT and BarcodeMamba, and multi-modal contrastive learning techniques that align images, DNA sequences, and taxonomic text in shared embedding spaces. We demonstrate these concepts using the BIOSCAN-5M dataset, which contains over 5 million insect specimens with paired images and DNA barcodes.
Key technical topics include fine-grained taxonomic classification challenges, k-mer tokenization strategies for biological sequences, zero-shot transfer learning for discovering taxonomic structure, and hyperbolic representation learning that captures hierarchical relationships in biological data. The session emphasizes practical evaluation protocols for both closed-world and open-world scenarios, addressing real challenges in biodiversity applications where new species are constantly discovered.
Through hands-on demonstrations and interactive components, participants will explore how multi-modal approaches outperform single-modality methods in taxonomic classification and species identification tasks. The session highlights the unique advantages of using DNA as auxiliary information to improve image-based classification and demonstrates how self-supervised learning can discover meaningful taxonomic structure without extensive labeling.
This work represents a practical application of multi-modal learning to urgent scientific challenges, showcasing how the integration of diverse data types can accelerate biodiversity monitoring efforts needed to understand and preserve global ecosystems.
Session 1: Introduction to Biodiversity Crisis and AI Solutions (30 min) – Graham
This opening session establishes the critical importance of biodiversity monitoring and the role of AI in addressing global knowledge gaps. We will introduce the seven biodiversity shortfalls and demonstrate how machine learning can help fill these gaps at scale. Supplementary reading: “Harnessing AI” Nature Reviews paper (Pollock et al.).
Learning Objectives:
- Understand the scope and urgency of the global biodiversity crisis
- Identify the seven key shortfalls in biodiversity knowledge (Linnaean, Prestonian, Wallacean, etc.)
- Recognize the potential of AI and machine learning to address biodiversity monitoring challenges
- Appreciate the scale of data collection needed for effective biodiversity assessment
Session 2: Image-based Representation Learning with Transformers (30 min) – Joakim
This session focuses on the Vision Transformer as the core architecture and taxonomic classification as the core task. We will cover how modern computer vision approaches can be applied to biological image classification and the unique challenges posed by fine-grained taxonomic distinctions. A minimal fine-tuning sketch follows the learning objectives below.
Learning Objectives:
- Understand Vision Transformer architecture and its advantages for biological image classification
- Learn about fine-grained visual recognition challenges in taxonomic classification
- Master evaluation protocols for taxonomic image classification at different hierarchical levels
- Recognize the importance of large-scale datasets for training robust biological vision models
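To make the fine-tuning recipe concrete, here is a minimal sketch of adapting a pretrained Vision Transformer to taxonomic image classification. It assumes the timm and torchvision libraries; the dataset directory and class count are placeholders, not part of the session materials.

```python
# Minimal sketch (assumptions: timm + torchvision installed; the data
# directory and NUM_CLASSES are placeholders, not session materials).
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

NUM_CLASSES = 1000  # e.g., number of taxa in your label set

# Pretrained ViT backbone with a freshly initialized classification head.
model = timm.create_model("vit_base_patch16_224", pretrained=True,
                          num_classes=NUM_CLASSES)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
# ImageFolder expects one sub-directory per taxon.
dataset = datasets.ImageFolder("data/insect_images", transform=transform)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```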
Session 3: DNA Barcoding Fundamentals and DNA-based representation learning (25 min) – Graham
This session focuses on BarcodeBERT as the architecture and taxonomic classification as the task, introducing the core evaluations: fine-tuned supervised classification and genus-level probing on unseen sequences. We will cover the unique aspects of DNA sequence modeling for biodiversity applications. Supplementary readings: DNA barcoding tutorial paper (Zarubiieva & Taylor) for biological context, and BarcodeBERT paper (Millan Arias et al.) for methodology. A small tokenization sketch follows the learning objectives below.
Learning Objectives:
- Understand the differences between DNA barcode modeling and general genomic sequence analysis
- Learn about k-mer tokenization strategies and their advantages for biological sequences
- Master the evaluation paradigms: fine-tuned classification, linear probing, and 1-NN genus-level probing
- Recognize the importance of domain-specific pretraining for specialized biological applications
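As a concrete illustration of k-mer tokenization, the sketch below splits a barcode sequence into non-overlapping k-mers and maps them to vocabulary ids. The vocabulary construction and special tokens are illustrative assumptions, not the exact BarcodeBERT pipeline.

```python
# Minimal sketch of non-overlapping k-mer tokenization for DNA barcodes.
# The vocabulary and special tokens are illustrative, not BarcodeBERT's.
from itertools import product

K = 4  # 4-mers are a common choice for barcode models

# Vocabulary: special tokens plus all 4^K k-mers over A/C/G/T.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]"]
VOCAB = {tok: i for i, tok in enumerate(
    SPECIALS + ["".join(p) for p in product("ACGT", repeat=K)])}

def tokenize(seq: str) -> list[int]:
    """Split a barcode into non-overlapping k-mers and map them to ids.
    K-mers containing ambiguity codes (e.g. 'N') map to [UNK]."""
    seq = seq.upper()
    kmers = [seq[i:i + K] for i in range(0, len(seq) - K + 1, K)]
    return [VOCAB["[CLS]"]] + [VOCAB.get(km, VOCAB["[UNK]"]) for km in kmers]

print(tokenize("AACATTATATTTTATTTTTGGANCTTGA"))
```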
Session 4: Multi-modal Contrastive Learning and BIOSCAN-5M (30 min) – Graham
This session covers the BIOSCAN-5M dataset and the core CLIBD methodology, introducing how to align images, DNA barcodes, and taxonomic text in a shared embedding space. We will demonstrate the advantages of multimodal approaches over single-modality methods. Core readings: BIOSCAN-5M and CLIBD papers (Gharaee et al., Gong et al.). A contrastive-loss sketch follows the learning objectives below.
Learning Objectives:
- Understand the structure and scale of the BIOSCAN-5M multimodal dataset
- Learn contrastive learning principles for multimodal alignment
- Master the concept of using DNA as auxiliary information to improve image-based classification
- Appreciate the challenges of open-world vs. closed-world evaluation in biodiversity
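To make the alignment objective concrete, the sketch below applies a symmetric InfoNCE loss to every pair of modalities, CLIP-style. The random embeddings stand in for real encoder outputs (a ViT, a DNA transformer, and a text encoder); this shows the shape of the objective, not the CLIBD implementation.

```python
# Minimal sketch of CLIP-style contrastive alignment across three
# modalities. Random embeddings stand in for real encoder outputs.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature   # pairwise similarities
    targets = torch.arange(len(a))   # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

dim = 128  # shared embedding dimension
img = torch.randn(32, dim)  # image encoder output for 32 specimens
dna = torch.randn(32, dim)  # DNA barcode encoder output
txt = torch.randn(32, dim)  # taxonomic-text encoder output

# Align all pairs of modalities in one shared space.
loss = info_nce(img, dna) + info_nce(img, txt) + info_nce(dna, txt)
print(loss.item())
```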
Session 5: Advanced Sequence Models for DNA (30 min) – Graham
This session covers state-space models vs. transformers, with a focus on BarcodeMamba, exploring more efficient alternatives to transformer architectures for DNA sequence modeling. We will compare computational efficiency and performance trade-offs. Supplementary reading: BarcodeMamba paper (Gao & Taylor). A minimal state-space recurrence sketch follows the learning objectives below.
Learning Objectives:
- Compare transformer vs. state space model architectures for DNA sequence modeling
- Understand the computational advantages of structured state space models
- Learn about the selective copying capabilities of Mamba for handling sequence variations
- Recognize parameter efficiency benefits while maintaining biodiversity classification performance
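To ground the efficiency comparison, here is a toy discretized linear state-space recurrence: the O(L)-time, constant-memory-per-step building block that selective models such as Mamba extend with input-dependent parameters. All shapes and values below are illustrative.

```python
# Toy linear state-space recurrence: x_{t+1} = A x_t + B u_t, y_t = C x_t.
# Selective SSMs (Mamba) make B and C depend on the input u_t.
import torch

def ssm_scan(A, B, C, u):
    """Sequential scan: O(L) time, vs O(L^2) for self-attention."""
    state = torch.zeros(A.shape[0])
    ys = []
    for t in range(len(u)):
        state = A @ state + B * u[t]  # update hidden state
        ys.append(C @ state)          # read out an output
    return torch.stack(ys)

N = 16                   # state dimension
A = 0.9 * torch.eye(N)   # toy stable transition matrix
B = torch.randn(N)
C = torch.randn(N)
u = torch.randn(1024)    # a length-1024 scalar input sequence
print(ssm_scan(A, B, C, u).shape)  # torch.Size([1024])
```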
Session 6: Self-Supervised Learning for Clustering (30 min) – Joakim
This session covers the motivation and methodology of the paper “An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders”, exploring how self-supervised encoders can be applied to novel datasets without fine-tuning. We will focus on zero-shot transfer and taxonomic structure discovery. Supplementary reading: zero-shot clustering paper (Lowe, Haurum, et al.). A clustering sketch follows the learning objectives below.
Learning Objectives:
- Understand zero-shot transfer learning principles for biological data
- Learn evaluation protocols for clustering performance across taxonomic hierarchies
- Master the use of self-supervised encoders for discovering structure in unseen datasets
- Recognize the advantages of SSL approaches for biodiversity applications with limited labeled data
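The sketch below illustrates the zero-shot protocol: take frozen embeddings, cluster them without any fine-tuning, and score agreement with ground-truth labels by adjusted mutual information. The simulated features are a stand-in for outputs of an actual self-supervised encoder such as DINO.

```python
# Minimal sketch of zero-shot clustering evaluation. Simulated features
# stand in for frozen self-supervised encoder outputs (e.g. DINO).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
true_labels = rng.integers(0, 10, size=500)   # 10 latent classes
centers = rng.normal(size=(10, 384))          # class centroids
feats = centers[true_labels] + 0.3 * rng.normal(size=(500, 384))

# No fine-tuning: cluster the embeddings directly, then compare to labels.
pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats)
print("AMI:", adjusted_mutual_info_score(true_labels, pred))
```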
Session 7: Beyond Euclidean – Hyperbolic Representation Learning (25 min) – Joakim
This session motivates hyperbolic representation learning in the context of BIOSCAN-5M and reviews the MERU methodology. We will explore how hyperbolic geometry can better capture the hierarchical nature of taxonomic relationships. Supplementary reading: MERU paper on hyperbolic representations (Desai et al.). A small geometry sketch follows the learning objectives below.
Learning Objectives:
- Understand how hyperbolic geometry can capture hierarchical relationships in biological data
- Learn about the limitations of Euclidean spaces for representing taxonomic hierarchies
- Master the concept of entailment and partial order in hyperbolic embeddings
- Recognize future research opportunities in hierarchical multimodal biodiversity analysis
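For a taste of the geometry, the sketch below lifts vectors onto the Lorentz-model hyperboloid and computes hyperbolic distances, the basic machinery that MERU builds on (MERU additionally uses entailment cones, which are omitted here). The curvature and example points are toy values.

```python
# Toy Lorentz-model hyperbolic geometry: distances grow rapidly toward
# the boundary, which suits tree-like taxonomic hierarchies.
import torch

C = 1.0  # curvature magnitude (toy value)

def lift(x_space):
    """Lift a Euclidean vector onto the hyperboloid by adding the
    time coordinate x0 = sqrt(1/C + ||x||^2)."""
    x_time = torch.sqrt(1.0 / C + (x_space ** 2).sum(-1, keepdim=True))
    return torch.cat([x_time, x_space], dim=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + <x_rest, y_rest>."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def hyperbolic_dist(x, y):
    inner = torch.clamp(-C * lorentz_inner(x, y), min=1.0 + 1e-6)
    return torch.acosh(inner) / C ** 0.5

generic = lift(torch.tensor([[0.1, 0.0]]))    # broad taxon near the origin
specific = lift(torch.tensor([[2.0, 0.5]]))   # fine-grained taxon, far out
print(hyperbolic_dist(generic, specific))
```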
Suggested Hands-on Components:
- Session 2: Interactive exploration of taxonomic image classification challenges
- Session 3: Comparison of different tokenization strategies for DNA sequences
- Session 4: Multimodal embedding visualization and retrieval demos
- Session 6: Zero-shot clustering evaluation on biological data
Reading List
Core
- Gharaee Z, Lowe SC, Gong Z, Millan Arias P, Pellegrino N, Wang AT, et al. BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity. Advances in Neural Information Processing Systems (NeurIPS), Datasets & Benchmarks Track. 2024.
- Gong Z, Wang AT, Huo X, Haurum JB, Lowe SC, Taylor GW, et al. CLIBD: Bridging vision and genomics for biodiversity monitoring at scale. International Conference on Learning Representations. 2025.
Supplementary
- Pollock LJ, Kitzes J, Beery S, Gaynor KM, Jarzyna MA, Mac Aodha O, et al. Harnessing artificial intelligence to fill global shortfalls in biodiversity knowledge. Nat Rev Biodivers. 2025;1: 166–182.
- Zarubiieva I, Taylor GW. Unlocking Biodiversity with DNA Barcodes: A Tutorial for Machine Learning Researchers. 2025.
- Millan Arias P, Sadjadi N, Safari M, Gong Z, Wang A, Lowe S, et al. BarcodeBERT: Transformers for Biodiversity Analysis. NeurIPS Workshop on Self-Supervised Learning: Theory and Practice. 2023. Available: http://arxiv.org/abs/2311.02401
- Gao T, Taylor G. BarcodeMamba: State Space Models for Biodiversity Analysis. Neural Information Processing Systems (NeurIPS) Workshop on Foundation Models for Science. 2024.
- Lowe SC, Haurum JB, Oore S, Moeslund TB, Taylor GW. An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders. arXiv [cs.LG]. 2024. Available: http://arxiv.org/abs/2406.02465
- Desai K, Nickel M, Rajpurohit T, Johnson J, Vedantam R. Hyperbolic Image-Text Representations. Proceedings of the International Conference on Machine Learning. 2023.
Multi-modal learning on healthcare records (Mads Nielsen)
Healthcare records consist of structured data such as diagnoses, medications, procedures, and lab tests, as well as unstructured data such as text notes, signals, and images.
Several developments have used transformer-based language technology to embed the structured data. Likewise, tokenized images and text can be represented in simple vector spaces.
We walk through these technologies and give examples of predictive algorithms, causal inference of effect sizes, and alignment of structured and unstructured data.
We present international standardization efforts.
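As a minimal illustration of how structured records become transformer inputs, the sketch below flattens a patient's visits into a BEHRT-style token sequence with visit separators and per-token ages. The codes and special tokens are illustrative; real pipelines map to standard vocabularies (e.g. ICD-10, ATC) and feed the ids to a transformer with age and position embeddings.

```python
# Minimal sketch: a patient's structured record as a BEHRT-style token
# sequence. Codes and special tokens are illustrative placeholders.
visits = [
    {"age": 52, "codes": ["ICD10:E11", "ATC:A10BA02"]},  # diabetes, metformin
    {"age": 53, "codes": ["ICD10:I10"]},                 # hypertension
]

tokens, ages = ["[CLS]"], [visits[0]["age"]]
for visit in visits:
    for code in visit["codes"]:
        tokens.append(code)
        ages.append(visit["age"])
    tokens.append("[SEP]")   # marks a visit boundary
    ages.append(visit["age"])

vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
input_ids = [vocab[t] for t in tokens]
print(tokens)
print(input_ids, ages)  # paired inputs for token + age embeddings
```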
Reading list:
- CORE-BEHRT: A Carefully Optimized and Rigorously Evaluated BEHRT
- https://www.nature.com/articles/s41746-021-00455-y
Industry talk: How can multi-modal data be used for hearing device development (Paula Lopez Diez)
An overview of the types of data that can be used when developing and fitting hearing devices.
Case study on multi-modal learning for emotion recognition (Line Clemmensen)
Using vision, speech, and text for emotion recognition.
Leveraging multimodality for concept-based XAI analysis of vision transformers (Robert Jenssen)
The first part of this talk will outline some basic principles for neural multimodal learning (MML). Focus then shifts to aspects of eXplainable AI (XAI) in the context of MML. Leveraging the multimodal CLIP model, the second part of the talk presents an approach to have vision networks “explain” how they function in terms of capturing concepts (such as colors, patterns, and objects), with particular focus on “transformer” networks. This is important for the general understanding of such AI systems and for shedding light on pretraining vs. fine-tuning. The last part of the talk applies these ideas to the analysis and understanding of mammography image analysis.
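As a rough sketch of the CLIP-Dissect idea: a neuron is labeled with the text concept whose CLIP similarity profile over a probe set best matches the neuron's activation profile. The sketch assumes the open_clip library; the probe features, neuron activations, and concept list are stand-ins, and the correlation score simplifies the paper's similarity functions (e.g. soft WPMI).

```python
# Rough sketch of CLIP-Dissect-style neuron description. Probe image
# features and neuron activations are random stand-ins; in practice the
# image features come from model.encode_image on a real probe set.
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

concepts = ["stripes", "red", "wheel", "fur", "water"]

N = 200
img_feats = F.normalize(torch.randn(N, 512), dim=-1)  # stand-in probe features
neuron_acts = torch.randn(N)                          # stand-in activations

with torch.no_grad():
    txt_feats = F.normalize(model.encode_text(tokenizer(concepts)), dim=-1)

# Concept-activation matrix: similarity of each probe image to each concept.
P = img_feats @ txt_feats.T  # (N, num_concepts)

# Score each concept by correlation with the neuron's activation profile
# (CLIP-Dissect itself uses similarity functions such as soft WPMI).
P_c = (P - P.mean(0)) / P.std(0)
a_c = (neuron_acts - neuron_acts.mean()) / neuron_acts.std()
scores = (P_c * a_c[:, None]).mean(0)
print("best-matching concept:", concepts[scores.argmax()])
```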
Papers:
- Radford et al: Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
- Oikarinen, Weng: CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks https://arxiv.org/abs/2204.10965
- Dorszewski et al: From Colors to Classes: Emergence of Concepts in Vision Transformers https://arxiv.org/abs/2503.24071