撰文:周运来,一个读序列天书的公子哥,稳健,潇洒,大方,靠谱。大型测序工厂的螺丝钉,统计草原上的游牧者。 (
https://www.jianshu.com/u/06ae70ef31bc)
--- 大师,大师,我想学习单细胞··· 闭上眼睛跟我来
单细胞测序有着漫长的过去,却只有短暂的历史—-谁说的!
说她漫长是因为到如今也有十几年的历史了(这里面两个重要任务:汤富酬老师2009年的文章使得单细胞测序成为现实,郭国骥老师2010年的文章揭示对500多细胞进行48个基因的单细胞RT-qPCR就可以进行细胞类型区分,使得单细胞测序方向转向大量细胞低深度测序),说她短暂是因为针对单细胞的分析工具越来越有意义开发周期却越来越短。一般生物信息流程主要由软件(安装与参数)、数据库(结构和生物学意义)和数据分析(统计学和编程)组成,目前单细胞分析用到的软件主要是FastQC、Cellranger和R包Seurat、monocle;数据库有相应物种的参考基因组、KEGG、GO;数据分析部分主要基于count矩阵和差异表达数据用R或者Python来做。
cellranger是10X genomics公司为单细胞RNA测序分析量身打造的数据分析软件,可以直接输入Illumina 原始数据(raw base call ,BCL或FASTQ)直接进行文库拆分、细胞拆分、输出表达定量矩阵、降维(pca),聚类(Graph-based& K-Means)以及可视化(t-SNE)结果,结合配套的Loupe Cell Browser给予研究者更多探索单细胞数据的机会。cellranger的高度集成化,使得单细胞测序数据探索变得更加简单,研究者有更多的时间来做生物学意义的挖掘。
今天小编就要给大家介绍下这个可能成为行业标准的数据分析软件——cellranger。在这之前,还是先来了解一下10X genomics单细胞测序的一般原理吧。
10X genomics
一个油滴 (GEM)=一个单细胞+一个凝胶微珠=一个scRNA-Seq,可以说这就是10X的基本技术原理。
V2试剂盒产生的文库结构:
V3试剂盒产生的文库结构:
10X genomics单细胞测序通过Barcode来标记细胞和细胞计数,UMI 来标记转录本,这样与参考基因组比对后就可以定量基因的表达量 (转录本数量,近乎绝对定量)。
在2019年3月7日10x官方 站对“单细胞基因表达入门”的直播视频中提到(只是一场直播提到的信息,仅供参考):
cell ranger pipeline
cellranger单细胞分析流程主要分为:数据拆分 cellranger mkfastq、细胞定量 cellranger count、GEM整合 cellranger aggr、定制调整 cellranger reanalyze。还有一些用户可能会用到的功能:mat2csv、mkgtf、mkref (构建索引)、vdj、mkvdjref、testrun (测试软件是否安装成功和输出结果的结构)、upload、sitecheck。
文库拆分cellranger mkfastq
封装了Illumina’s bcl2fastq软件,用来拆分Illumina 原始数据(raw base call (BCL)),输出 FASTQ 文件。
cellranger-cs/3.0.0 -h # 获取帮助信息,篇幅有限就不展示了,可阅读原文或自己运行阅读# 不展示不代表不重要,做分析不读参数解释就是耍流氓
有以下两种使用方式
$ cellranger mkfastq --id=tiny-bcl --run=/path/to/tiny_bcl --csv=cellranger-tiny-bcl-simple-1.2.0.csvcellranger mkfastqCopyright (c) 2017 10x Genomics, Inc. All rights reserved.-------------------------------------------------------------------------------Martian Runtime - 3.0.2-v3.2.0Running preflight checks (please wait)...2019-03-02 16:33:54 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:33:57 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:33:57 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main2019-03-02 16:34:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
cellranger-tiny-bcl-simple-1.2.0.csv文件结构
$ cellranger mkfastq --id=tiny-bcl --run=/path/to/tiny_bcl --samplesheet=cellranger-tiny-bcl-samplesheet-1.2.0.csvcellranger mkfastqCopyright (c) 2017 10x Genomics, Inc. All rights reserved.-------------------------------------------------------------------------------Martian Runtime - 3.0.2-v3.2.0Running preflight checks (please wait)...2019-03-02 16:25:49 [runtime] (ready) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:25:52 [runtime] (split_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET2019-03-02 16:25:52 [runtime] (run:local) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main2019-03-02 16:25:58 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
如果使用samplesheet文件(见下图,一般不用),需要准备的信息就多一些了,比如调整Reads中的序列长度,而使用简化版的csv文件,cell ranger可以识别所用试剂盒版本,然后自动化的调整reads长度。
拆分好之后的目录结构如下所示
├── fastq_path│ ├── H35KCBCXY│ │ └── test_sample│ │ ├── test_sample_S1_L001_I1_001.fastq.gz #index序列│ │ ├── test_sample_S1_L001_R1_001.fastq.gz #barcode umi│ │ └── test_sample_S1_L001_R2_001.fastq.gz #reads
注意下FASTQ文件的命名规律,如果我们直接拿到fastq文件,只要命名格式符合,也可以直接进行后续分析。(图中sample name,sample order类比一library name, library well number,见下面aggr部分)
拆库时可以加–qc参数输出序列质量情况,保存在outs/qc_summary.json中 (whitelist见aggr部分):
"sample_qc": { "Sample1": { "5": { "barcode_exact_match_ratio": 0.9336158258904611, "barcode_q30_base_ratio": 0.9611993091728814, "bc_on_whitelist": 0.9447542078230667, "mean_barcode_qscore": 37.770630795934, "number_reads": 2748155, "read1_q30_base_ratio": 0.8947676653366835, "read2_q30_base_ratio": 0.7771883245304577 }, "all": { "barcode_exact_match_ratio": 0.9336158258904611, "barcode_q30_base_ratio": 0.9611993091728814, "bc_on_whitelist": 0.9447542078230667, "mean_barcode_qscore": 37.770630795934, "number_reads": 2748155, "read1_q30_base_ratio": 0.8947676653366835, "read2_q30_base_ratio": 0.7771883245304577 } }}
cellranger count
count是cellranger最主要也是最重要的功能:完成细胞和基因的定量,也就是产生了我们用来做各种分析的基因表达矩阵。
cellranger count -h cellranger count (3.0.0)Copyright (c) 2018 10x Genomics, Inc. All rights reserved.-------------------------------------------------------------------------------'cellranger count' quantifies single-cell gene expression.The commands below should be preceded by 'cellranger':Usage: count --id=ID [--fastqs=PATH] [--sample=PREFIX] --transcriptome=DIR [options] count <run_id> <mro> [options] count -h | --help | --versionArguments: id A unique run id, used to name output folder [a-zA-Z0-9_-]+. fastqs Path of folder created by mkfastq or bcl2fastq. sample Prefix of the filenames of FASTQs to select. transcriptome Path of folder containing 10x-compatible reference.Options:# Single Cell Gene Expression --description=TEXT Sample description to embed in output files. --libraries=CSV CSV file declaring input library data sources. --expect-cells=NUM Expected number of recovered cells. --force-cells=NUM Force pipeline to use this number of cells, bypassing the cell detection algorithm. --feature-ref=CSV Feature reference CSV file, declaring feature-barcode constructs and associated barcodes. --nosecondary Disable secondary analysis, e.g. clustering. Optional. --r1-length=NUM Hard trim the input Read 1 to this length before analysis. --r2-length=NUM Hard trim the input Read 2 to this length before analysis. --chemistry=CHEM Assay configuration. NOTE: by default the assay configuration is detected automatically, which is the recommened mode. You usually will not need to specify a chemistry. Options are: 'auto' for autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for Single Cell 5', 'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5'. paired-end/R2-only. Default: auto. --no-libraries Proceed with processing using a --feature-ref but no feature-barcoding data specified with the 'libraries' flag. --lanes=NUMS Comma-separated lane numbers. --indices=INDICES Comma-separated sample index set "SI-001" or sequences. --project=TEXT Name of the project folder within a mkfastq or bcl2fastq-generated folder to pick FASTQs from.# Martian Runtime --jobmode=MODE Job manager to use. Valid options: local (default), sge, lsf, or a .template file --localcores=NUM Set max cores the pipeline may request at one time. Only applies when --jobmode=local. --localmem=NUM Set max GB the pipeline may request at one time. Only applies when --jobmode=local. --mempercore=NUM Set max GB each job may use at one time. Only applies in cluster jobmodes. --maxjobs=NUM Set max jobs submitted to cluster at one time. Only applies in cluster jobmodes. --jobinterval=NUM Set delay between submitting jobs to cluster, in ms. Only applies in cluster jobmodes. --overrides=PATH The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and --localmem. Consult the 10x support website for an example override file. --uiport=PORT Serve web UI at http://localhost:PORT --disable-ui Do not serve the UI. --noexit Keep web UI running after pipestance completes or fails. --nopreflight Skip preflight checks. -h --help Show this message. --version Show version.Note: 'cellranger count' can be called in two ways, depending on how youdemultiplexed your BCL data into FASTQ files.1. If you demultiplexed with 'cellranger mkfastq' or directly with Illumina bcl2fastq, then set --fastqs to the project folder containing FASTQ files. In addition, set --sample to the name prefixed to the FASTQ files comprising your sample. For example, if your FASTQs are named: subject1_S1_L001_R1_001.fastq.gz then set --sample=subject12. If you demultiplexed with 'cellranger demux', then set --fastqs to a demux output folder containing FASTQ files. Use the --lanes and --indices options to specify which FASTQ reads comprise your sample. This method is deprecated. Please use 'cellranger mkfastq' going forward.
—nosecondary 指定后不进行后续的降维、聚类和可视化分析。一般是只获得矩阵,后续分析用Seurat。
输出文件解释
.outs├── analysis【数据分析文件夹】│ ├── clustering【聚类,图聚类和k-means聚类】│ ├── diffexp【差异分析】│ ├── pca【主成分分析线性降维】│ └── tsne【非线性降维信息】├── cloupe.cloupe【Loupe Cell Browser 输入文件】├── filtered_feature_bc_matrix【很重要,Seurat, Monocle的输入文件】│ ├── barcodes.tsv.gz│ ├── features.tsv.gz│ └── matrix.mtx.gz├── filtered_feature_bc_matrix.h5【过滤掉的barcode信息HDF5 format】├── metrics_summary.csv【CSV format数据摘要】├── molecule_info.h5【aggregate的时候会用到的文件】├── raw_feature_bc_matrix【原始barcode信息】│ ├── barcodes.tsv.gz│ ├── features.tsv.gz│ └── matrix.mtx.gz├── possorted_genome_bam.bam【比对文件】├── possorted_genome_bam.bam.bai【索引文件】├── raw_feature_bc_matrix.h5【原始barcode信息HDF5 format】├── web_summary.html【 页简版 告以及可视化】└── *_gene_bar.csv_temp【过程文件】
GEM文库整合cellranger aggr
当实验中用到了多个GEM well,并且想放在一起分析时,需要先用cellranger count分析各个来自于一个GEM well的fastq文件 (与是否同一个lane测序没关系),然后再用cellranger aggr进行整合。
$ cd /home/jdoe/runs$
声明:本站部分文章及图片源自用户投稿,如本站任何资料有侵权请您尽早请联系jinwei@zod.com.cn进行处理,非常感谢!