Program Schedule

Wednesday, April 7th
9:00–9:15AM Sudhir Kumar

Protein sequences contain rich information about protein evolution, fitness landscapes, and stability. In this talk, I will present novel applications of machine learning to the elucidation of protein fitness landscapes, evolutionary relationships, and stability. I will describe how latent space models trained on protein multiple-sequence alignments (MSAs) with variational auto-encoders can infer these properties from sequences. Using both simulated and real MSAs, I will show how the low-dimensional latent space representation of sequences, calculated using the encoder model, captures both evolutionary and ancestral relationships between sequences. Together with experimental fitness data and Gaussian process regression, the latent space representation also enables learning the protein fitness landscape in a continuous low-dimensional space. Moreover, I will demonstrate how the model is useful in predicting protein mutational stability landscapes and in quantifying the importance of stability in shaping protein evolution. Through this presentation, I hope to illustrate that latent space models learned using variational auto-encoders provide a mechanism for exploring the rich data contained in protein sequences regarding evolution, fitness, and stability, and hence are well suited to help guide protein engineering efforts.


Unsupervised methods in data analysis aim at obtaining a synthetic description of high-dimensional data landscapes, revealing their structure and their salient features. We will describe an approach for charting complex and heterogeneous data spaces, providing a topography of the high-dimensional probability density from which the data are harvested. We obtain information on the number and the height of the probability peaks, the depth of the "valleys" separating them, the relative location of the peaks, and their hierarchical organization. The topography is reconstructed by using an unsupervised variant of Density Peak clustering[1,2] that exploits a non-parametric density estimator[3], which automatically measures the density in the manifold containing the data[4]. Importantly, the density estimator provides an estimate of the error. This is a key feature, which allows distinguishing genuine probability peaks from density fluctuations due to finite sampling. We show that this approach allows identifying the Markov states explored during a protein folding molecular dynamics trajectory directly from the shape of the multidimensional probability density, that is, without exploiting any kinetic information[5].

[1] Science, 1492, vol 322 (2014)
[2] Inf. Sci. (2021)
[3] JCTC, 1206, vol 14 (2018)
[4] Sci. Rep., 12140, vol 7 (2017)
[5] JCTC, 80, vol 1 (2020)
10:15–10:30AM Bio Break (15 minutes)
GCR Program Talks (Chair: Ron Levy)
10:30: Topological Aspects of Protein Sequence Space (Vincenzo Carnevale)
10:45: Sequence-covariation Models of Protein Fitness and Evolution (Allan Haldane)
11:00: Deep Learning Approaches for the Study of Epistasis (Slobodan Vucetic)
11:15: Epistasis in the Somatic Evolution of Prostate Cancer (Jeffrey Townsend)
GCR Lightning Talks (Chair: Allan Haldane)
11:30: Generative Protein Sequence Models and Metrics (Francisco McGee)
11:35: Sequence Generation from Masked Prediction Models (Sandro Hauri)
11:40: Emergence of Invariant Sites in Molecular Evolution by Epistasis (Ravi Patel)
11:45: What is the Origin of HIV Drug-Resistant Viral Strains? (Avik Biswas)
11:50: Hamiltonian Landscape of Kinetic Routes to Drug Resistance (Indrani Choudhuri)
11:55: Epistasis Among Somatic Cancer Mutations (Jorge Alfaro-Murillo)
12:00: Using Sequence Dimensionality Reduction to Investigate Epistasis (Christine Neville)
12:05: Inherent Properties of Anisotropic Networks (Banu Ozkan)
12:10: Teaching Epistasis through Storyboarding (Caryn S. Babaian)
12:15–12:30PM Ron Levy (Summary)
Biophysical Models of Epistasis
12:30–1:00PM Lunch
1:00–2:00PM Roundtable with Advisors (NSF-GCR senior members and advisors only)

Canalization is the evolution of robustness. Under persistent stabilizing selection, species are theorized to evolve not just toward the optimum, but toward the generation of individuals close to the optimum. Consequently, after perturbation of the genetic background or environment, the variance of the trait or the prevalence of disease can change dramatically. I will discuss the theory and present new analyses from the UK Biobank illustrating how the genetic architecture of complex traits is altered with respect to lifestyle, behavior, and socioeconomic status in the cohort study. Implications for the evolution of epistatic interactions among genotypes contributing to the genotype-phenotype map follow from these results.


A major goal in computational biology is the development of algorithms, analysis techniques, and tools toward a deep mechanistic understanding of life at the molecular level. In the process, computational biology must take advantage of new developments in artificial intelligence and machine learning, and then extend beyond pattern analysis to provide testable hypotheses for experimental scientists. This talk will focus on our contributions to this process and on relevant related work. We will first discuss the development of machine learning techniques for partially observable domains such as molecular biology; in particular, methods for accurately estimating the frequency of occurrence of hard-to-measure and rare events, and how such frequencies can be used to calibrate prediction models. We will then show how these methods play key roles in inferring protein cellular roles and the phenotypic effects of genomic variation, with an emphasis on understanding the molecular mechanisms of human genetic disease. We further assessed the value of these methods in the wet lab, where we tested the molecular mechanisms behind selected de novo mutations in a cohort of individuals with neurodevelopmental disorders.

3:00–3:15PM Bio Break (15 minutes)
Contributed Lightning Talks (Chair: Vincenzo Carnevale)
3:15: Effects of Epistasis on Sugar Utilization Kinetics in Yeast (Anjali Mahilkar)
3:20: Evolutionary Adaptation to Thermosensation via Epistasis (Lucia Coronel)
3:25: Sequence and Structural Analysis of GPCRs (Hung Nguyen Do)
3:30: Epistasis from a Machine Learning Perspective (Mindy Shi)
3:35: Residue-Residue Contact Prediction in Disordered Regions (Piyush Borole)
3:40: Modeling the Evolution of Metabolic Pathways (Avery Selberg)
3:45: Modeling Probabilities of Retention of Duplicated Genes (Amanda Wilson)
3:50: Is the Relationship of Individual and Epistatic Mutational Effects Predictable? (Ian Dworkin)
3:55: Evolutionary Consequences of Epistasis on RNA Fitness Landscapes (Eric Hayden)
4:00: Building the Mutational History of SARS-CoV-2 (Sayaka Miura)
4:05: Massive Spatiotemporal Tracking of SARS-CoV-2 Genetic Trends (Steven Weaver)
4:10PM Ron Levy (Closing Remarks)