Recent studies have revealed numerous biosynthetic gene clusters (BGCs) across a wide range of bacterial and fungal species amenable to cultivation. However, little is known about the biosynthetic machinery and natural products of uncultivated organisms. I discuss the bottleneck of identifying BGCs coding for peptidic natural products (PNPs) in metagenomics data and argue that future progress in the exploration of antibiotics critically depends on a transition from the current one-off process of PNP analysis to high-throughput PNP discovery, including PNP discovery from metagenomics data. I further describe recent developments in the metaSPAdes, truSPAdes, and 10XSPAdes assemblers that have significantly increased contig lengths and opened the door to genome mining for PNPs in metagenomics datasets. Finally, I will describe recent advances in computational PNP discovery that span bioinformatics techniques ranging from metagenomics to genome mining to peptidogenomics.
The talk will be devoted to new bioinformatics approaches for diagnosing and studying the history of human autoimmune diseases, using ankylosing spondylitis (Bekhterev's disease) as an example. In addition, we will present a number of strategies that can be used to develop new treatments for autoimmune diseases based on the analysis of the diversity of T- and B-cell receptors.
Recent advances in sequencing technologies have enabled large-scale profiling of adaptive immune receptors: antibodies and T-cell receptors. This progress has allowed many immunological problems to be stated as computational ones and has founded a new field of bioinformatics: immunoinformatics. The starting problem in this young field is the accurate reconstruction of immune receptor repertoires from immunosequencing data. Even the highest-quality immunosequencing data obtained with modern protocols are prone to high error rates. Thus, distinguishing the natural diversity of immune receptors from sample preparation errors is a prerequisite for more advanced immunological problems. One of them is the evolutionary analysis of antibody repertoires. An antibody repertoire is the result of fast evolution driven by various processes of secondary diversification. After multiple cycles of secondary diversification, an antibody repertoire represents a set of clonal lineages with various abundances, and each such lineage can be viewed as a clonal tree. Constructing clonal trees from antibody repertoire sequences during an immune response allows one to detect functional antibodies: typically, the most abundant clonal tree comprises antibodies specific to the invading antigen.
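As a toy illustration of the clonal-tree idea (this is not a published repertoire algorithm, and all sequences and abundances below are made up), one can link each receptor sequence to its closest more-abundant neighbor, so that the most abundant sequence becomes the putative root of the lineage:

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def clonal_tree(reads):
    """Toy clonal-tree heuristic: attach every sequence to its closest
    already-placed, more-abundant neighbor. `reads` maps a (hypothetical)
    CDR3 sequence to its abundance."""
    order = sorted(reads, key=reads.get, reverse=True)
    parent = {order[0]: None}            # most abundant = putative root
    for seq in order[1:]:
        parent[seq] = min(parent, key=lambda p: hamming(seq, p))
    return parent

reads = {"CARDYW": 50, "CARDFW": 10, "CTRDYW": 5, "CARSFW": 2}
tree = clonal_tree(reads)
```

Real repertoire analysis must additionally handle sequences of different lengths, indels, and sequencing errors, all of which this sketch ignores.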
Our immunoinformatics group at the Center for Algorithmic Biotechnology (CAB), SPbU has developed the following toolset for immunoinformatics problems: IgReC, a tool for antibody repertoire reconstruction from Illumina MiSeq reads; BarcodedIgReC, a modification of IgReC that uses unique molecular identifiers (UMIs); IgQUAST, a tool for quality assessment of antibody repertoire construction; IgDiversityAnalyzer, a tool for analysis of antibody repertoire diversity; and AntEvolo, a novel algorithm for constructing clonal trees for antibody repertoires, which is currently being developed at CAB.

Metagenomic sequencing has emerged as the technology of choice for analyzing bacterial populations and discovering novel organisms and genes. While different groups have developed specialized tools for de novo metagenomic assembly, the problem of assembling complex microbial communities is far from being resolved.
The first part of this talk will be devoted to our metaSPAdes software, which integrates proven solutions from the SPAdes toolkit with metagenomics-specific techniques. We will highlight the key differences between SPAdes and metaSPAdes and showcase recently added (or soon-to-be-released) features, e.g. support for third-generation sequencing technologies in hybrid metagenomic assemblies.
In the second part, we will present our novel (as yet unpublished) pipeline for improved reconstruction of individual organisms from metagenomic series (collected over time or across locations). While the availability of sequencing data for multiple related samples provides an unprecedented opportunity for the accurate reconstruction of individual microbial community members, widely used approaches demonstrate major deficiencies and limitations. In an attempt to overcome them, we developed the MTS (Metagenomic Time Series) pipeline, which integrates state-of-the-art differential binning approaches with valuable (and largely underappreciated) ideas from early work on metagenomic series analysis.
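As a minimal sketch of the differential-binning idea underlying such pipelines (this is not the MTS implementation; the contig names and coverage profiles are invented for illustration), contigs whose per-sample coverage co-varies across a series can be grouped greedily by correlation:

```python
import math

def pearson(x, y):
    """Pearson correlation of two coverage profiles (assumed non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy)

def bin_contigs(profiles, threshold=0.95):
    """Greedy differential binning: a contig joins the first bin whose seed
    profile it correlates with above `threshold`; otherwise it seeds a new bin.
    Contigs from the same genome should rise and fall together across samples."""
    bins = []  # list of (seed_profile, [contig names])
    for name, prof in profiles.items():
        for seed, members in bins:
            if pearson(prof, seed) >= threshold:
                members.append(name)
                break
        else:
            bins.append((prof, [name]))
    return [members for _, members in bins]

# Hypothetical per-sample coverages for three contigs over four samples.
profiles = {
    "c1": [10, 50, 5, 80],
    "c2": [11, 52, 6, 79],   # co-varies with c1 -> likely the same genome
    "c3": [90, 4, 60, 3],
}
bins = bin_contigs(profiles)
```

Production binners additionally use sequence composition (k-mer frequencies) and far more robust clustering than this greedy pass.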
Every biotech company has to create new genetic constructs and express proteins. The most popular gene synthesis method is PCR-based assembly from a set of oligonucleotides. To increase synthesis purity, most scientists design oligonucleotides intuitively, guided by one or more heuristics, but this often leads to loss of protein expression.
In this work we present Oligo-Designer (a part of the BIOCAD YLab computational platform), a novel oligonucleotide design tool that optimizes the probability of successful gene synthesis and protein expression in a specified cell culture. We use a genetic algorithm and a set of rational physics-based metrics to solve common problems such as hairpins, low binding energy, and oligonucleotide cross-reactivity, which allows us to obtain an optimal set of oligonucleotides. The algorithm can be used in both single-gene and library modes. Using this algorithm, in 2016 we created 466 gene constructs (exact sequences or libraries) with average expression after transient transfection between 150 and 200 mg/l. We also present a fully automated synthesis pipeline using Biosset and Tecan hardware.
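For illustration, two of the simplest checks of this kind, hairpin detection and GC content, can be sketched as follows (a simplification, not the actual Oligo-Designer code; production tools score folding and binding thermodynamically rather than by exact k-mer matching):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def has_hairpin(oligo, stem=6):
    """Flag an oligo if any `stem`-long window reappears downstream as its
    reverse complement, i.e. the molecule can fold back on itself."""
    for i in range(len(oligo) - stem + 1):
        rc = revcomp(oligo[i:i + stem])
        if rc in oligo[i + stem:]:
            return True
    return False

def gc_content(oligo):
    """Fraction of G/C bases, a rough proxy for melting temperature."""
    return (oligo.count("G") + oligo.count("C")) / len(oligo)
```

A genetic algorithm over candidate oligo sets would use metrics like these as penalty terms in its fitness function.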
Metagenomics revolutionized the field of microbial ecology, giving access to Gb-sized datasets of microbial communities under natural conditions. This enables fine-grained analyses of the functions of community members, studies of their association with phenotypes and environments, as well as of their microevolution and adaptation to changing environmental conditions.
Phylogenetic methods for studying adaptation and evolutionary dynamics cannot cope with big data. Calculating the dN/dS ratio for the large-scale sequence datasets generated in metagenomics and comparative microbial genomics is very challenging due to the excessive run times of current methods.
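To make the computation concrete: at the heart of any dN/dS pipeline is the classification of codon differences as synonymous or nonsynonymous. A minimal sketch (counting differences only; real methods such as Nei-Gojobori also normalize by the numbers of synonymous and nonsynonymous sites and correct for multiple substitutions) might look like this:

```python
# Standard genetic code: codons enumerated TTT, TTC, TTA, ... with bases in TCAG order.
bases = "TCAG"
amino_acids = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = [a + b + c for a in bases for b in bases for c in bases]
codon_table = dict(zip(codons, amino_acids))

def count_codon_differences(seq1, seq2):
    """Count synonymous (Sd) and nonsynonymous (Nd) codon differences
    between two aligned, in-frame coding sequences of equal length."""
    sd = nd = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 != c2:
            if codon_table[c1] == codon_table[c2]:
                sd += 1   # same amino acid: synonymous
            else:
                nd += 1   # amino acid changed: nonsynonymous
    return sd, nd

# TTT->TTC is silent (both Phe); AAA->GAA changes Lys to Glu.
sd, nd = count_codon_differences("TTTGCTAAA", "TTCGCTGAA")
```

Even this trivial per-codon pass must be run over millions of gene pairs in a metagenomic setting, which is where the run-time bottleneck arises.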
EDEN is the first software for the rapid detection of protein families and regions under positive selection, as well as their associated biological processes, from meta- and pangenome data. It provides an interactive result visualization for detailed comparative analyses. EDEN is available as a Docker installation under the GPL 3.0 license, allowing its use on common operating systems, at http://www.github.com/hzi-bifo/eden
We applied EDEN to 66 samples of the HMP project covering six body sites of healthy individuals. Across all body sites, most protein families with significant signs of positive selection in comparison to all other protein families were annotated with transport and binding functions, suggesting the existence of a functional pan-selectome. We also used EDEN to characterize human gut metagenome samples: EDEN determined a significantly higher dN/dS ratio for the protein-coding genes of lean individuals compared to overweight and obese individuals, suggestive of a higher functional diversity in the guts of lean individuals.
The Genome 10K project was launched in 2009 by an international consortium of biologists and genome scientists determined to facilitate whole-genome sequencing and analysis of 10,000 vertebrate species. Since then, the number of species selected and accomplished has risen from ~30 to over 350 species sequenced or in progress with funding, an increase of over 1000% in eight years. I shall summarize the advances and responsibilities to date and lay out the achievements and remaining challenges of reaching the goal. I shall review the status of known vertebrate genome projects, recommend standards for pronouncing a species genome sequenced or completed, and provide a present and future view of the landscape of Genome 10K.
At the Theodosius Dobzhansky Center for Genome Bioinformatics, we have contributed comparative analyses of 12 of the 38 living species of Felidae, a remarkable example of worldwide species radiation and adaptation to various environments. Our study included analyses of the genome sequences of the lion (Panthera leo), tiger (Panthera tigris), snow leopard (Panthera uncia), leopard (Panthera pardus), jaguar (Panthera onca), caracal (Caracal caracal), lynx (Lynx lynx), Asian leopard cat (Prionailurus bengalensis), fishing cat (Prionailurus viverrinus), puma (Puma concolor), cheetah (Acinonyx jubatus), and domestic cat (Felis catus), covering six lineages of the family (Panthera, Caracal, Lynx, Asian leopard cat, Puma, and Domestic cat). For each species, the whole-genome assembly was assessed and annotated, including genes, repeats, variants, and other features. A structural alignment of the genomes was performed to identify homologies and rearrangements between them. Regions of homozygosity were determined based on single-nucleotide variants called in the sequenced specimens. Differences and similarities between the annotated genomes are interpreted in terms of the evolutionary process that began 10.8 million years ago with branching from the last common felid ancestor.
The Genome 10K endeavor is ambitious, bold, expensive and uncertain, but together the Genome 10K Consortium of Scientists (G10KCOS) and the world genomics community are moving deliberately toward their goal of delivering to the coming generation a gift of genome empowerment for many vertebrate species.
Computational methods that were once successful for capillary sequencing have not worked well for massively parallel short-read sequencing, which sparked a flurry of new short-read mapping and assembly methods. More recently, long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore have emerged, producing extremely long, but noisy, reads. Again, this fundamental shift in data type has required new computational methods for routine bioinformatics tasks, but it is also creating many new opportunities. I will discuss applications of long-read sequencing to the problems of genome assembly, alignment, and metagenomics, including the possibility of complete, haplotype-resolved vertebrate genomes and real-time analysis of complex metagenomic samples.
Raw data generated by sequencing machines represent hundreds of gigabytes of information that are systematically processed to extract useful knowledge. However, this mass of genomic data contains a lot of redundancy that can be captured by optimized data structures, such as the de Bruijn graph, allowing the full information to fit into the memory of a standard computer. Dedicated memory architectures are another possibility for quickly processing this mass of information. The talk will discuss these two alternatives, first presenting an optimized implementation of the de Bruijn graph and an associated toolbox called GATB (Genomic Analysis Tool Box). In the second part, we will introduce the PIM (Processing in Memory) concept and present preliminary results on genomic applications using a PIM chip currently being developed by the UPMEM company.
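For intuition, the de Bruijn graph itself can be sketched in a few lines; the hash-based version below is the naive baseline that memory-optimized libraries such as GATB replace with far more compact succinct representations:

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Naive hash-based de Bruijn graph: nodes are (k-1)-mers, and every
    k-mer occurring in a read adds an edge prefix -> suffix; the edge count
    records k-mer multiplicity (a proxy for coverage)."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

# Two overlapping toy reads; the shared k-mers CGT and GTC get multiplicity 2.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
```

The redundancy mentioned above is visible here: overlapping reads collapse onto the same edges, so memory grows with the number of distinct k-mers rather than with the raw data size.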
The talk covers visual analysis of simple (SNV, indel) and structural variants with the help of the New Genome Browser (http://lifescience.opensource.epam.com/ngb, https://github.com/epam/NGB).
NGB is a fast and user-friendly web-based genome browser that responds to requirements and recommendations from the research and clinical communities.
NGB provides various visual tools for DNA and RNA sequence analysis, easy exon/domain-level integration with annotation databases, cloud-based data support, an embedded protein structure viewer, etc.
For structural variant analysis, NGB displays fusion proteins with their domain/exon structure. All these tools will be demonstrated during the talk.