10X单细胞测序分析软件：Cell ranger，从拆库到定量

撰文：周运来，一个读序列天书的公子哥，稳健，潇洒，大方，靠谱。大型测序工厂的螺丝钉，统计草原上的游牧者。 (
https://www.jianshu.com/u/06ae70ef31bc)

--- 大师，大师，我想学习单细胞··· 闭上眼睛跟我来

单细胞测序有着漫长的过去，却只有短暂的历史—-谁说的！

说她漫长是因为到如今也有十几年的历史了（这里面两个重要任务：汤富酬老师2009年的文章使得单细胞测序成为现实，郭国骥老师2010年的文章揭示对500多细胞进行48个基因的单细胞RT-qPCR就可以进行细胞类型区分，使得单细胞测序方向转向大量细胞低深度测序），说她短暂是因为针对单细胞的分析工具越来越有意义开发周期却越来越短。一般生物信息流程主要由软件（安装与参数）、数据库（结构和生物学意义）和数据分析（统计学和编程）组成，目前单细胞分析用到的软件主要是FastQC、Cellranger和R包Seurat、monocle；数据库有相应物种的参考基因组、KEGG、GO；数据分析部分主要基于count矩阵和差异表达数据用R或者Python来做。

cellranger是10X genomics公司为单细胞RNA测序分析量身打造的数据分析软件，可以直接输入Illumina 原始数据（raw base call ，BCL或FASTQ)直接进行文库拆分、细胞拆分、输出表达定量矩阵、降维（pca），聚类（Graph-based& K-Means）以及可视化（t-SNE）结果，结合配套的Loupe Cell Browser给予研究者更多探索单细胞数据的机会。cellranger的高度集成化，使得单细胞测序数据探索变得更加简单，研究者有更多的时间来做生物学意义的挖掘。

今天小编就要给大家介绍下这个可能成为行业标准的数据分析软件——cellranger。在这之前，还是先来了解一下10X genomics单细胞测序的一般原理吧。

10X genomics

一个油滴 (GEM)=一个单细胞+一个凝胶微珠=一个scRNA-Seq，可以说这就是10X的基本技术原理。

V2试剂盒产生的文库结构：

V3试剂盒产生的文库结构:

reads 2 ：与基因组比对 (配合UMI进行定量)

Barcode: 标记细胞

UMI （Unique Molecular Identifier）：标记转录本

PolyT ：捕获成熟的RNA

10X genomics单细胞测序通过Barcode来标记细胞和细胞计数，UMI 来标记转录本，这样与参考基因组比对后就可以定量基因的表达量 (转录本数量,近乎绝对定量)。

在2019年3月7日10x官方站对“单细胞基因表达入门”的直播视频中提到（只是一场直播提到的信息，仅供参考）：

由于10x单细胞测序的重复性较高，因此如无特殊原因，做生物学重复的意义不大；

如果细胞大小不一致，但直径符合上机要求，对捕获效率没有明显影响；

细胞重悬清洗后要保证钙、镁离子浓度较低，从而不影响反转录；

非常规形态或直径较大的细胞可以采用抽核的方法进行检测;

当被问及测序时需不需要加入标准物（如ERCC）的时候，官方给出的建议是不建议加ERCC（考虑到成本和影响细胞和基因的计数）。

cell ranger pipeline

cellranger单细胞分析流程主要分为：数据拆分 cellranger mkfastq、细胞定量 cellranger count、GEM整合 cellranger aggr、定制调整 cellranger reanalyze。还有一些用户可能会用到的功能：mat2csv、mkgtf、mkref (构建索引)、vdj、mkvdjref、testrun (测试软件是否安装成功和输出结果的结构)、upload、sitecheck。

文库拆分cellranger mkfastq

封装了Illumina’s bcl2fastq软件，用来拆分Illumina 原始数据（raw base call (BCL)），输出 FASTQ 文件。

cellranger-cs/3.0.0 -h # 获取帮助信息，篇幅有限就不展示了，可阅读原文或自己运行阅读# 不展示不代表不重要，做分析不读参数解释就是耍流氓

有以下两种使用方式

$ cellranger mkfastq --id=tiny-bcl                      --run=/path/to/tiny_bcl                      --csv=cellranger-tiny-bcl-simple-1.2.0.csvcellranger mkfastqCopyright (c) 2017 10x Genomics, Inc.  All rights reserved.-------------------------------------------------------------------------------Martian Runtime - 3.0.2-v3.2.0Running preflight checks (please wait)...2019-03-02 16:33:54 [runtime] (ready)           ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:33:57 [runtime] (split_complete)  ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:33:57 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main2019-03-02 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET

cellranger-tiny-bcl-simple-1.2.0.csv文件结构

$ cellranger mkfastq --id=tiny-bcl                      --run=/path/to/tiny_bcl                      --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csvcellranger mkfastqCopyright (c) 2017 10x Genomics, Inc.  All rights reserved.-------------------------------------------------------------------------------Martian Runtime - 3.0.2-v3.2.0Running preflight checks (please wait)...2019-03-02 16:25:49 [runtime] (ready)           ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:25:52 [runtime] (split_complete)  ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:25:52 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main2019-03-02 16:25:58 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET

如果使用samplesheet文件(见下图，一般不用)，需要准备的信息就多一些了，比如调整Reads中的序列长度，而使用简化版的csv文件，cell ranger可以识别所用试剂盒版本，然后自动化的调整reads长度。

拆分好之后的目录结构如下所示

├── fastq_path│   ├── H35KCBCXY│   │   └── test_sample│   │       ├── test_sample_S1_L001_I1_001.fastq.gz #index序列│   │       ├── test_sample_S1_L001_R1_001.fastq.gz #barcode umi│   │       └── test_sample_S1_L001_R2_001.fastq.gz #reads

注意下FASTQ文件的命名规律，如果我们直接拿到fastq文件，只要命名格式符合，也可以直接进行后续分析。(图中sample name，sample order类比一library name, library well number，见下面aggr部分)

拆库时可以加–qc参数输出序列质量情况，保存在outs/qc_summary.json中 (whitelist见aggr部分)：

"sample_qc": {  "Sample1": {    "5": {      "barcode_exact_match_ratio": 0.9336158258904611,      "barcode_q30_base_ratio": 0.9611993091728814,      "bc_on_whitelist": 0.9447542078230667,      "mean_barcode_qscore": 37.770630795934,      "number_reads": 2748155,      "read1_q30_base_ratio": 0.8947676653366835,      "read2_q30_base_ratio": 0.7771883245304577    },    "all": {      "barcode_exact_match_ratio": 0.9336158258904611,      "barcode_q30_base_ratio": 0.9611993091728814,      "bc_on_whitelist": 0.9447542078230667,      "mean_barcode_qscore": 37.770630795934,      "number_reads": 2748155,      "read1_q30_base_ratio": 0.8947676653366835,      "read2_q30_base_ratio": 0.7771883245304577    }  }}

cellranger count

count是cellranger最主要也是最重要的功能：完成细胞和基因的定量，也就是产生了我们用来做各种分析的基因表达矩阵。

cellranger count  -h cellranger count (3.0.0)Copyright (c) 2018 10x Genomics, Inc.  All rights reserved.-------------------------------------------------------------------------------'cellranger count' quantifies single-cell gene expression.The commands below should be preceded by 'cellranger':Usage:    count        --id=ID        [--fastqs=PATH]        [--sample=PREFIX]        --transcriptome=DIR        [options]    count <run_id> <mro> [options]    count -h | --help | --versionArguments:    id      A unique run id, used to name output folder [a-zA-Z0-9_-]+.    fastqs  Path of folder created by mkfastq or bcl2fastq.    sample  Prefix of the filenames of FASTQs to select.    transcriptome   Path of folder containing 10x-compatible reference.Options:# Single Cell Gene Expression    --description=TEXT  Sample description to embed in output files.    --libraries=CSV     CSV file declaring input library data sources.    --expect-cells=NUM  Expected number of recovered cells.    --force-cells=NUM   Force pipeline to use this number of cells, bypassing                            the cell detection algorithm.    --feature-ref=CSV   Feature reference CSV file, declaring feature-barcode                             constructs and associated barcodes.    --nosecondary       Disable secondary analysis, e.g. clustering. Optional.    --r1-length=NUM     Hard trim the input Read 1 to this length before                            analysis.    --r2-length=NUM     Hard trim the input Read 2 to this length before                            analysis.    --chemistry=CHEM    Assay configuration. NOTE: by default the assay                            configuration is detected automatically, which                            is the recommened mode. You usually will not need                            to specify a chemistry. Options are: 'auto' for                            autodetection, 'threeprime' for Single Cell 3',                            'fiveprime' for  Single Cell 5', 'SC3Pv1' or                            'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3,                            'SC5P-PE' or 'SC5P-R2' for Single Cell 5'.                            paired-end/R2-only. Default: auto.    --no-libraries      Proceed with processing using a --feature-ref but no                            feature-barcoding data specified with the                            'libraries' flag.    --lanes=NUMS        Comma-separated lane numbers.    --indices=INDICES   Comma-separated sample index set "SI-001" or sequences.    --project=TEXT      Name of the project folder within a mkfastq or                            bcl2fastq-generated folder to pick FASTQs from.# Martian Runtime    --jobmode=MODE      Job manager to use. Valid options:                            local (default), sge, lsf, or a .template file    --localcores=NUM    Set max cores the pipeline may request at one time.                            Only applies when --jobmode=local.    --localmem=NUM      Set max GB the pipeline may request at one time.                            Only applies when --jobmode=local.    --mempercore=NUM    Set max GB each job may use at one time.                            Only applies in cluster jobmodes.    --maxjobs=NUM       Set max jobs submitted to cluster at one time.                            Only applies in cluster jobmodes.    --jobinterval=NUM   Set delay between submitting jobs to cluster, in ms.                            Only applies in cluster jobmodes.    --overrides=PATH    The path to a JSON file that specifies stage-level                            overrides for cores and memory.  Finer-grained                            than --localcores, --mempercore and --localmem.                            Consult the 10x support website for an example                            override file.    --uiport=PORT       Serve web UI at http://localhost:PORT    --disable-ui        Do not serve the UI.    --noexit            Keep web UI running after pipestance completes or fails.    --nopreflight       Skip preflight checks.    -h --help           Show this message.    --version           Show version.Note: 'cellranger count' can be called in two ways, depending on how youdemultiplexed your BCL data into FASTQ files.1. If you demultiplexed with 'cellranger mkfastq' or directly with   Illumina bcl2fastq, then set --fastqs to the project folder containing   FASTQ files. In addition, set --sample to the name prefixed to the FASTQ   files comprising your sample. For example, if your FASTQs are named:       subject1_S1_L001_R1_001.fastq.gz   then set --sample=subject12. If you demultiplexed with 'cellranger demux', then set --fastqs to a   demux output folder containing FASTQ files. Use the --lanes and --indices   options to specify which FASTQ reads comprise your sample.   This method is deprecated. Please use 'cellranger mkfastq' going forward.

—nosecondary 指定后不进行后续的降维、聚类和可视化分析。一般是只获得矩阵，后续分析用Seurat。

输出文件解释

.outs├── analysis【数据分析文件夹】│   ├── clustering【聚类，图聚类和k-means聚类】│   ├── diffexp【差异分析】│   ├── pca【主成分分析线性降维】│   └── tsne【非线性降维信息】├── cloupe.cloupe【Loupe Cell Browser 输入文件】├── filtered_feature_bc_matrix【很重要，Seurat, Monocle的输入文件】│   ├── barcodes.tsv.gz│   ├── features.tsv.gz│   └── matrix.mtx.gz├── filtered_feature_bc_matrix.h5【过滤掉的barcode信息HDF5 format】├── metrics_summary.csv【CSV format数据摘要】├── molecule_info.h5【aggregate的时候会用到的文件】├── raw_feature_bc_matrix【原始barcode信息】│   ├── barcodes.tsv.gz│   ├── features.tsv.gz│   └── matrix.mtx.gz├──  possorted_genome_bam.bam【比对文件】├──  possorted_genome_bam.bam.bai【索引文件】├── raw_feature_bc_matrix.h5【原始barcode信息HDF5 format】├── web_summary.html【 页简版 告以及可视化】└── *_gene_bar.csv_temp【过程文件】

GEM文库整合cellranger aggr

当实验中用到了多个GEM well，并且想放在一起分析时，需要先用cellranger count分析各个来自于一个GEM well的fastq文件 (与是否同一个lane测序没关系)，然后再用cellranger aggr进行整合。

$ cd /home/jdoe/runs$

                                                        声明：本站部分文章及图片源自用户投稿，如本站任何资料有侵权请您尽早请联系jinwei@zod.com.cn进行处理,非常感谢！