Prof. WANG Xiujie's group from the Institute of Genetics and Developmental Biology (IGDB) of the Chinese Academy of Sciences (CAS), in collaboration with Prof. PEI Xiaobing's group from Huazhong University of Science and Technology, has developed a cosine similarity-based method, COSine similarity-based marker Gene identification (COSG), for more accurate and scalable marker gene identification.
Results were published in Briefings in Bioinformatics on Jan. 19.
Accurate cell classification is the groundwork for downstream analysis of single-cell sequencing data, and the commonly used methods for cell marker gene identification usually rely on statistical tests to search for genes that are differentially expressed between cells of interest and all other cells in a dataset. However, differential expression analysis-based methods cannot guarantee the expression specificity of identified genes in the target cells, and the commonly used methods also have shortcomings such as low computational efficiency.
Additionally, with the development of single-cell assay for transposase-accessible chromatin (ATAC) sequencing and spatial transcriptomics technologies, the need for a universal method capable of identifying cell marker genes from multiple types of single-cell data modalities is rapidly emerging.
According to the researchers, the basic concept of COSG is to compare two genes within a given cell population by evaluating the angle between the vectors representing each gene’s expression pattern in an n-dimensional cell space, where each dimension represents a cell. Each gene’s representative vector has n components (n being the total number of detected cells), and each component is the gene’s expression level in the corresponding cell.
Consequently, the cosine similarity of two genes equals the cosine value of the angle between the two genes' representative vectors in the cell space. The more similar the expression patterns, the smaller the angle is. If two genes have identical expression patterns, the angle between their representative vectors will be zero, regardless of their expression abundance difference.
Therefore, cosine similarity is independent of expression scale and should be more sensitive in identifying genes specifically expressed in target cells.
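The scale independence described above can be illustrated with a minimal sketch. The gene names and expression values below are made up for illustration, and the helper `cosine_similarity` is a hypothetical function written for this example, not part of COSG itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two gene vectors in cell space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a gene with no expression has no direction
    return dot / (norm_u * norm_v)

# Toy expression levels across five cells (illustrative values only).
gene_a = [0, 0, 2, 4, 0]
gene_b = [0, 0, 20, 40, 0]  # same pattern as gene_a, ten times the abundance
gene_c = [3, 1, 0, 0, 5]    # expressed in a different set of cells

sim_ab = cosine_similarity(gene_a, gene_b)  # close to 1: angle is zero
sim_ac = cosine_similarity(gene_a, gene_c)  # 0: no shared expressing cells
```

Here gene_a and gene_b differ tenfold in abundance yet point in the same direction, so their similarity is 1 up to floating-point rounding, while gene_a and gene_c share no expressing cells and have similarity 0.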
In addition, because single-cell sequencing data contain many zero values and cosine similarity can be computed very efficiently on sparse matrices, COSG is highly efficient in marker gene identification.
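As a rough sketch of why sparsity helps, pairwise cosine similarity between genes can be computed with sparse matrix products that only touch nonzero entries. The example below assumes SciPy's sparse matrices and a toy genes-by-cells matrix; the function `gene_cosine_similarity` is a hypothetical helper for illustration and does not reproduce COSG's actual implementation.

```python
import numpy as np
from scipy import sparse

def gene_cosine_similarity(X):
    """Pairwise cosine similarity between rows (genes) of a genes x cells matrix."""
    X = sparse.csr_matrix(X, dtype=float)
    # Euclidean norm of each gene vector, computed from nonzero entries only.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0  # leave all-zero genes as zero vectors
    X_unit = sparse.diags(1.0 / norms) @ X  # scale each row to unit length
    return X_unit @ X_unit.T  # dot products of unit vectors are cosines

# Toy expression matrix: 3 genes x 5 cells (illustrative values only).
expr = sparse.csr_matrix(np.array([
    [0, 0, 2, 4, 0],
    [0, 0, 20, 40, 0],
    [3, 1, 0, 0, 5],
]))
sim = gene_cosine_similarity(expr).toarray()
```

Because both the normalization and the final product operate on the stored nonzero values, the cost grows with the number of nonzero entries rather than with the full genes-by-cells dimensions.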
COSG is applicable to single-cell RNA sequencing data, single-cell ATAC sequencing data and spatially resolved transcriptome data. It is fast and scalable for ultra-large datasets, identifying marker genes for over one million cells in less than two minutes.
Application to both simulated and real experimental datasets showed that the marker genes or genomic regions identified by COSG have greater cell-type specificity, demonstrating the superior performance of COSG in both accuracy and efficiency compared with other available methods.
This research was supported by grants from the National Key Research and Development Program of China, the Natural Science Foundation of China, the Strategic Priority Research Program of CAS and the Beijing Natural Science Foundation of China.