PNAS Preprints

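When it comes to preprint platforms, bioRxiv is usually the first that comes to mind for people working in biology. In recent years, as preprints have gained in importance and acceptance, more and more preprint servers have sprung up, such as preprints.org, agriRxiv, medRxiv, and OSF. What you may not know, however, is that the renowned Proceedings of the National Academy of Sciences (PNAS) rolled out its own "preprints" long ago.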

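At this point, readers familiar with submitting to PNAS have probably guessed it: this is the famous "member track". PNAS currently offers two routes, direct submission and contributed submission. According to official figures, the former accounts for more than 75% of published papers (figure below, left) and is the route most researchers choose. It often involves review that is more cumbersome and stringent than at the vast majority of journals, with three stages (initial screening by an Academy member, a second review by the handling editor, and external peer review), so getting published is extremely difficult; this route is also nicknamed the "commoner track". The latter, the so-called contributed submission, must be submitted by a member of the US National Academy of Sciences (NAS), and each member gets a quota of two per year. Such papers still go through review, but the bar is much lower than for direct submission, hence the nickname "member-contributed track". There used to be a third route as well, Communicated submission (the "member-communicated track", abolished in 2010 [1]): authors first contacted a member, who then referred the paper to the journal. That route did not guarantee publication, but its success rate was clearly higher than that of the commoner track.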

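So, does the quality of papers differ across the three tracks? In 2009, two researchers from Harvard University compared the citation rates of papers from the three tracks [2]. They found that papers from the commoner track (blue, figure below, right) were cited significantly more often than those from the member-contributed track (red, figure below, right). Whether this means commoner-track papers are of a higher standard is, of course, a matter of opinion.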

   

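With the sudden arrival of the global pandemic, the member track has drawn even more controversy. The main concern is that papers on the novel coronavirus going through this track are reviewed less strictly, so if one has serious problems or is contentious, once published it can ride on PNAS's prominence and harm the global pandemic response. The two member-track papers below, for example, have both stirred up considerable controversy. Regarding the second one, Natalie E. Dean, an assistant professor of biostatistics at the University of Florida, went so far as to say that Figure 3 of the paper (from the group of Mario J. Molina, a chemist and NAS member) would never have passed review by any epidemiologist.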

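It is not only rank-and-file scientists; some NAS members also look down on the member track. Last month, Sarah Otto, an NAS member and evolutionary biologist at the University of British Columbia in Canada, said she would neither use this track nor review for it: if a paper is good enough, it should not sidestep regular review. It is worth noting here that for member-contributed papers, the contributing member may choose the reviewers, as long as they have not collaborated with him or her within the past four years [3]. So Professor Otto, given her standing, can naturally decline review invitations from other members. Other scholars might not dare to, which is probably one reason the model has endured.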

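Calling it a preprint is a joke, of course. As noted above, member-track papers do go through review, though many people feel it is far more form than substance. Either way, the PNAS member track has drawn more and more criticism for its "guaranteed admission" style of publication. Here is one possible fix: if PNAS turned the member track into "PNAS preprints", it would keep the PNAS name on the papers while making clear how they differ from commoner-track papers. Wouldn't that be the best of both worlds?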

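In truth, both the commoner track and the member track are beyond what I would dare aspire to, so some envy-driven rambling is hard to avoid. Take all this with a grain of salt. Now that we are done with PNAS's not-so-authentic preprints, let's move on to issue 25 of our bioRxiv bioinformatics highlights and see which preprints worth reading appeared on the real preprint platforms last month.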


1. Salzberg's group at Johns Hopkins University releases Liftoff, a gene annotation mapping tool that has gone viral online

Liftoff: an accurate gene annotation mapping tool

Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however for most species, only the reference genome is well-annotated. One strategy to annotate new or improved genome assemblies is to map or ‘lift over’ the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.4% of human protein-coding genes to a chimpanzee genome assembly with 98.7% sequence identity.

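2. Can mutation itself be adaptive? Take a look at this work, which Max Planck plant biologist Weigel describes as possibly the most provocative from his lab ("Perhaps most provocative study ever from my lab.")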

Mutation bias shapes gene evolution in Arabidopsis thaliana

Classical evolutionary theory maintains that mutation rate variation between genes should be random with respect to fitness 1–4 and evolutionary optimization of genic mutation rates remains controversial 3,5. However, it has now become known that cytogenetic (DNA sequence + epigenomic) features influence local mutation probabilities 6, which is predicted by more recent theory to be a prerequisite for beneficial mutation rates between different classes of genes to readily evolve 7. To test this possibility, we used de novo mutations in Arabidopsis thaliana to create a high resolution predictive model of mutation rates as a function of cytogenetic features across the genome. As expected, mutation rates are significantly predicted by features such as GC content, histone modifications, and chromatin accessibility. Deeper analyses of predicted mutation rates reveal effects of introns and untranslated exon regions in distancing coding sequences from mutational hotspots at the start and end of transcribed regions in A. thaliana. Finally, predicted coding region mutation rates are significantly lower in genes where mutations are more likely to be deleterious, supported by numerous estimates of evolutionary and functional constraint. These findings contradict neutral expectations that mutation probabilities are independent of fitness consequences. Instead they are consistent with the evolution of lower mutation rates in functionally constrained loci due to cytogenetic features, with important implications for evolutionary biology 8.

3. Cutting-edge nanopore technology makes direct sequencing of RNA modifications possible

Direct detection of RNA modifications and structure using single molecule nanopore sequencing

Many methods exist to detect RNA modifications by short-read sequencing, relying on either antibody enrichment of transcripts bearing modified bases or mutational profiling approaches which require conversion to cDNA. Endogenous modifications are present on several major classes of RNA including tRNA, rRNA and mRNA and can modulate diverse biological processes such as genetic recoding, mRNA export and RNA folding. In addition, exogenous modifications can be introduced to RNA molecules to reveal RNA structure and dynamics. Limitations on read length and library size inherent in short-read-based methods dissociate modifications from their native context, preventing single molecule analysis and modification phasing. Here we demonstrate direct RNA nanopore sequencing to detect endogenous and exogenous RNA modifications over long sequence distance at the single molecule level. We demonstrate comprehensive detection of endogenous modifications in E. coli and S. cerevisiae ribosomal RNA (rRNA) using current signal deviations. Notably 2’-O-methyl (Nm) modifications generated a discernible shift in current signal and event level dwell times. We show that dwell times are mediated by the RNA motor protein which sits atop the nanopore. Further, we characterize a recently described small adduct-generating 2’-O-acylation reagent, acetylimidazole (AcIm) for exogenously labeling flexible nucleotides in RNA. Finally, we demonstrate the utility of AcIm for single molecule RNA structural probing using nanopore sequencing.


4. What do the identical sequences found widely across different bacterial phyla mean?

Long identical sequences found in multiple bacterial genomes reveal frequent and widespread exchange of genetic material between distant species

Horizontal transfer of genomic elements is an essential force that shapes microbial genome evolution. This process occurs via various mechanisms and has been studied in detail for a variety of biological systems. However, a coarse-grained, global picture of horizontal gene transfer (HGT) in the microbial world is still missing. One reason is the difficulty to process large amounts of genomic microbial data to find and characterize HGT events, especially for highly distant organisms. Here, we exploit that HGT between distant species creates long identical DNA sequences in distant species, which can be found efficiently using alignment-free methods. We analyzed over 90,000 bacterial genomes and thus identified over 100,000 events of HGT. We further develop a mathematical model to analyze the statistical properties of those long exact matches and thus estimate the transfer rate between any pair of taxa. Our results demonstrate that long-distance gene exchange (across phyla) is very frequent, as more than 8% of the bacterial genomes analyzed have been involved in at least one such event. Finally, we confirm that the function of the transferred sequences strongly impact the transfer rate, as we observe a 3.5 order of magnitude variation between the most and the least transferred categories. Overall, we provide a unique view of horizontal transfer across the bacterial tree of life, illuminating one fundamental process driving bacterial evolution.
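To make the idea of "long exact matches found without alignment" a bit more concrete, here is a minimal Python sketch. It is not the authors' method or code: the function name `shared_exact_matches`, the seed-and-extend approach, and the toy sequences are all assumptions chosen for illustration, and a real analysis of tens of thousands of genomes would need far more efficient data structures.

```python
from collections import defaultdict


def shared_exact_matches(seq_a: str, seq_b: str, min_len: int = 8):
    """Return maximal exact matches as (start_a, start_b, length) tuples.

    Hypothetical helper: only matches of at least min_len bases are reported.
    """
    # Index every k-mer of seq_a by its start positions (k = min_len).
    index = defaultdict(list)
    for i in range(len(seq_a) - min_len + 1):
        index[seq_a[i:i + min_len]].append(i)

    matches = []
    for j in range(len(seq_b) - min_len + 1):
        for i in index.get(seq_b[j:j + min_len], ()):
            # Skip seeds that can be extended to the left: they are part of a
            # longer match that is reported from its leftmost seed.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            # Extend the seed to the right while the bases stay identical.
            length = min_len
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


# Toy example: two unrelated flanks around one shared 12-base segment.
genome_a = "CCCC" + "GATTACAGGCTA" + "TTTT"
genome_b = "AGAG" + "GATTACAGGCTA" + "CACA"
print(shared_exact_matches(genome_a, genome_b, min_len=8))
```

No alignment scoring is involved: a dictionary lookup finds candidate seeds, and a plain character comparison extends them, which is why exact-match searches of this kind can scale well.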

5. Projecting single-cell transcriptomics data onto a reference T cell atlas to interpret immune responses

Single-cell transcriptomics is a transformative technology to explore heterogeneous cell populations such as T cells, one of the most potent weapons against cancer and viral infections. Recent advances in this technology and the computational tools developed in their wake provide unique opportunities to build reference atlases that can be used to systematically compare new single-cell RNA-seq (scRNA-seq) datasets derived from different models or therapeutic conditions. We have developed ProjecTILs (https://github.com/carmonalab/ProjecTILs), a novel computational tool to project new scRNA-seq data into a reference map of T cells, allowing their direct comparison in a stable, annotated system of coordinates. ProjecTILs enables the classification of query cells into curated, discrete states, but also over a continuous space of intermediate states. We illustrate the projection of several datasets from recent publications over two novel cross-study murine T cell reference atlases: the first describing tumor-infiltrating T lymphocytes (TILs), the second characterizing acute and chronic viral infection. ProjecTILs accurately predicted the effects of multiple perturbations, including the ablation of genes controlling T cell differentiation, such as Tox, Ptpn2, miR-155 and Regnase-1, and identified novel gene programs that were altered in these cells (such as a Lag3-Klrc1 inhibitory module), revealing mechanisms of action behind these immunotherapeutic targets and opening new opportunities for the identification of novel targets. By comparing multiple samples over the same reference map, and across alternative embeddings, our method allows exploring the effect of cellular perturbations (e.g. as the result of therapy or genetic engineering) in terms of transcriptional states and altered genetic programs.

6. The North American beaver genome helps reveal the mechanisms behind its longevity and cancer resistance

The genome of North American beaver provides insights into the mechanisms of its longevity and cancer resistance

The North American beaver (Castor canadensis) is an exceptionally long-lived and cancer-resistant rodent species, and thus an excellent model organism for comparative genomic studies of longevity. Here, we utilize a significantly improved beaver genome assembly to assess evolutionary changes in gene coding sequences, copy number, and expression. We found that the beaver Aldh1a1, a stem cell marker gene encoding an enzyme required for detoxification of ethanol and aldehydes, is expanded (~10 copies vs. two in mouse and one in human). We also show that the beaver cells are more resistant to ethanol, and beaver liver extracts show higher ability to metabolize aldehydes than the mouse samples. Furthermore, Hpgd, a tumor suppressor gene, is uniquely duplicated in the beaver among rodents. Our evolutionary analysis identified beaver genes under positive selection which are associated with tumor suppression and longevity. Genes involved in lipid metabolism show positive selection signals, changes in copy number and altered gene expression in beavers. Several genes involved in DNA repair showed a higher expression in beavers which is consistent with the trend observed in other long-lived mammals. In summary, we identified several genes that likely contribute to beaver longevity and cancer resistance, including increased ability to detoxify aldehydes, enhanced tumor suppression and DNA repair, and altered lipid metabolism.

7. Bugs turn up in models built with the population genetics tool msprime, and the authors own up in a paper

Lessons learned from bugs in models of human history

Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors.

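Jerome Kelleher, the software's author and a professor at the University of Oxford, also posted a series of tweets offering a formal apology over the matter.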


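8. IGD, an ultrafast tool for working with genomic coordinates, billed as an order of magnitude faster than bedtools and similar software (from Nathan Sheffield's group at the University of Virginia)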

IGD: high-performance search for large-scale genomic interval datasets

Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.
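The abstract only names a "linear binning method" and gives no details, so the following Python sketch shows the general idea of binning intervals for fast overlap search, not IGD's actual data layout or algorithm. The class name `BinnedIntervals`, the bin width, and the half-open coordinate convention are all assumptions made for illustration.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width in base pairs


class BinnedIntervals:
    """Toy interval index: each interval is stored in every bin it touches."""

    def __init__(self, bin_size: int = BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)  # (chrom, bin number) -> [(start, end, label)]

    def _bin_range(self, start: int, end: int):
        # Intervals are half-open [start, end) with start < end.
        return range(start // self.bin_size, (end - 1) // self.bin_size + 1)

    def add(self, chrom: str, start: int, end: int, label: str) -> None:
        for b in self._bin_range(start, end):
            self.bins[(chrom, b)].append((start, end, label))

    def query(self, chrom: str, start: int, end: int):
        """Return all stored intervals overlapping [start, end), sorted."""
        hits = set()  # a set removes duplicates from intervals spanning several bins
        for b in self._bin_range(start, end):
            for s, e, label in self.bins.get((chrom, b), ()):
                if s < end and start < e:
                    hits.add((s, e, label))
        return sorted(hits)


index = BinnedIntervals(bin_size=100)
index.add("chr1", 50, 250, "peak_a")
index.add("chr1", 300, 350, "peak_b")
print(index.query("chr1", 200, 310))
```

The point of binning is that a query inspects only the bins it overlaps instead of scanning every interval; the trade-off is that long intervals are stored more than once.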

9. Unsupervised translation between Python, C++, and Java (arXiv)

Unsupervised Translation of Programming Languages

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is time-consuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.
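For readers who have not seen the rule-based approach the abstract contrasts itself with, here is a minimal sketch of one "handcrafted rewrite rule applied to the abstract syntax tree", written with Python's standard `ast` module (`ast.unparse` needs Python 3.9 or later). It has nothing to do with the paper's neural model; the class `HasKeyRewriter` and the function `modernize` are hypothetical names, and the single rule shown (the Python 2 idiom `d.has_key(k)` becoming `k in d`) is just an example.

```python
import ast


class HasKeyRewriter(ast.NodeTransformer):
    """One handcrafted rule: rewrite `d.has_key(k)` as `k in d`."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # rewrite nested calls first
        func = node.func
        if (isinstance(func, ast.Attribute) and func.attr == "has_key"
                and len(node.args) == 1 and not node.keywords):
            membership = ast.Compare(
                left=node.args[0],
                ops=[ast.In()],
                comparators=[func.value],
            )
            return ast.copy_location(membership, node)
        return node


def modernize(source: str) -> str:
    """Parse the source, apply the rewrite rule, and return new source text."""
    tree = HasKeyRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


print(modernize("ok = d.has_key(k)"))
```

Every idiom needs its own rule of this kind, written by someone who knows both the source and target conventions, which is the cost the abstract points to.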

10. [COVID-19] An online tool for studying SARS-CoV-2 variation (preprints.org)

CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation

CoV-GLUE is an online web application for the interpretation and analysis of SARS-CoV-2 virus genome sequences, with a focus on amino acid sequence variation. It is based on the GLUE data-centric bioinformatics environment and provides a browsable database of amino acid replacements and coding region indels that have been observed in sequences from the pandemic. Users may also analyse their own SARS-CoV-2 sequences by submitting them to the web application to receive an interactive report containing visualisations of phylogenetic classification and highlighting genomic variation of potentially high impact, for example linked to primer mismatches.

References

1. PNAS will eliminate Communicated submissions in July 2010. Randy Schekman. PNAS September 15, 2009 106 (37) 15518; https://doi.org/10.1073/pnas.

2. Rand DG, Pfeiffer T (2009) Systematic Differences in Impact across Publication Tracks at PNAS. PLoS ONE 4(12): e8092

3. https://www.pnas.org/page/authors/journal-policies
