kb_python.count

Module Contents

Functions

kallisto_bus(→ Dict[str, str])

Runs kallisto bus.

kallisto_quant_tcc(→ Dict[str, str])

Runs kallisto quant-tcc.

bustools_project(→ Dict[str, str])

Runs bustools project.

bustools_sort(→ Dict[str, str])

Runs bustools sort.

bustools_inspect(→ Dict[str, str])

Runs bustools inspect.

bustools_correct(→ Dict[str, str])

Runs bustools correct.

bustools_count(→ Dict[str, str])

Runs bustools count.

bustools_capture(→ Dict[str, str])

Runs bustools capture.

bustools_whitelist(→ Dict[str, str])

Runs bustools whitelist.

matrix_to_cellranger(→ Dict[str, str])

Convert bustools count matrix to cellranger-format matrix.

convert_matrix(→ Dict[str, str])

Convert a gene count or TCC matrix to loom or h5ad.

convert_matrices(→ Dict[str, str])

Convert a gene count or TCC matrix to loom or h5ad.

filter_with_bustools(→ Dict[str, str])

Generate filtered count matrices with bustools.

stream_fastqs(→ List[str])

Given a list of fastqs (that may be local or remote paths), stream any

stream_batch(→ str)

Given a path to a batch file, produce a new batch file where all the

copy_or_create_whitelist(→ str)

Copies a pre-packaged whitelist if it is provided. Otherwise, runs

convert_transcripts_to_genes(→ str)

Convert a textfile containing transcript IDs to another textfile containing

write_smartseq3_capture(→ str)

Write the capture sequence for smartseq3.

count(→ Dict[str, Union[str, Dict[str, str]]])

Generates count matrices for single-cell RNA seq.

count_smartseq3(→ Dict[str, Union[str, Dict[str, str]]])

Generates count matrices for Smartseq3.

count_velocity(→ Dict[str, Union[Dict[str, str], str]])

Generates RNA velocity matrices for single-cell RNA seq.

count_velocity_smartseq3(→ Dict[str, Union[str, ...)

Generates count matrices for Smartseq3.

Attributes

INSPECT_PARSER

kb_python.count.INSPECT_PARSER
kb_python.count.kallisto_bus(fastqs: Union[List[str], str], index_path: str, technology: str, out_dir: str, threads: int = 8, n: bool = False, k: bool = False, paired: bool = False, strand: Optional[typing_extensions.Literal[unstranded, forward, reverse]] = None) Dict[str, str]

Runs kallisto bus.

Parameters
  • fastqs – List of FASTQ file paths, or a single path to a batch file

  • index_path – Path to kallisto index

  • technology – Single-cell technology used

  • out_dir – Path to output directory

  • threads – Number of threads to use, defaults to 8

  • n – Include number of read in flag column (used when splitting indices), defaults to False

  • k – Alignment is done per k-mer (used when splitting indices), defaults to False

  • paired – Whether or not to supply the –paired flag, only used for bulk and smartseq2 samples, defaults to False

  • strand – Strandedness, defaults to None

Returns

Dictionary containing paths to generated files

kb_python.count.kallisto_quant_tcc(mtx_path: str, saved_index_path: str, ecmap_path: str, t2g_path: str, out_dir: str, flens_path: Optional[str] = None, l: Optional[int] = None, s: Optional[int] = None, threads: int = 8) Dict[str, str]

Runs kallisto quant-tcc.

Parameters
  • mtx_path – Path to counts matrix

  • saved_index_path – Path to index.saved

  • ecmap_path – Path to ecmap

  • t2g_path – Path to T2G

  • out_dir – Output directory path

  • flens_path – Path to flens.txt, defaults to None

  • l – Mean fragment length, defaults to None

  • s – Standard deviation of fragment length, defaults to None

  • threads – Number of threads to use, defaults to 8

Returns

Dictionary containing path to output files

kb_python.count.bustools_project(bus_path: str, out_path: str, map_path: str, ecmap_path: str, txnames_path: str) Dict[str, str]

Runs bustools project.

bus_path: Path to BUS file to sort out_dir: Path to output directory map_path: Path to file containing source-to-destination mapping ecmap_path: Path to ecmap file, as generated by kallisto bus txnames_path: Path to transcript names file, as generated by kallisto bus

Returns

Dictionary containing path to generated BUS file

kb_python.count.bustools_sort(bus_path: str, out_path: str, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', flags: bool = False) Dict[str, str]

Runs bustools sort.

Parameters
  • bus_path – Path to BUS file to sort

  • out_dir – Path to output BUS path

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Number of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • flags – Whether to supply the –flags argument to sort, defaults to False

Returns

Dictionary containing path to generated index

kb_python.count.bustools_inspect(bus_path: str, out_path: str, whitelist_path: Optional[str] = None, ecmap_path: Optional[str] = None) Dict[str, str]

Runs bustools inspect.

Parameters
  • bus_path – Path to BUS file to sort

  • out_path – Path to output inspect JSON file

  • whitelist_path – Path to whitelist

  • ecmap_path – Path to ecmap file, as generated by kallisto bus

Returns

Dictionary containing path to generated index

kb_python.count.bustools_correct(bus_path: str, out_path: str, whitelist_path: str) Dict[str, str]

Runs bustools correct.

Parameters
  • bus_path – Path to BUS file to correct

  • out_path – Path to output corrected BUS file

  • whitelist_path – Path to whitelist

Returns

Dictionary containing path to generated index

kb_python.count.bustools_count(bus_path: str, out_prefix: str, t2g_path: str, ecmap_path: str, txnames_path: str, tcc: bool = False, mm: bool = False, cm: bool = False, umi_gene: bool = False, em: bool = False) Dict[str, str]

Runs bustools count.

Parameters
  • bus_path – Path to BUS file to correct

  • out_prefix – Prefix of the output files to generate

  • t2g_path – Path to output transcript-to-gene mapping

  • ecmap_path – Path to ecmap file, as generated by kallisto bus

  • txnames_path – Path to transcript names file, as generated by kallisto bus

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • cm – Count multiplicities instead of UMIs. Used for chemitries without UMIs, such as bulk and Smartseq2, defaults to False

  • umi_gene – Whether to use genes to deduplicate umis, defaults to False

  • em – Whether to estimate gene abundances using EM algorithm, defaults to False

Returns

Dictionary containing path to generated index

kb_python.count.bustools_capture(bus_path: str, out_path: str, capture_path: str, ecmap_path: Optional[str] = None, txnames_path: Optional[str] = None, capture_type: typing_extensions.Literal[transcripts, umis, barcode] = 'transcripts', complement: bool = True) Dict[str, str]

Runs bustools capture.

Parameters
  • bus_path – Path to BUS file to capture

  • out_path – Path to BUS file to generate

  • capture_path – Path transcripts-to-capture list

  • ecmap_path – Path to ecmap file, as generated by kallisto bus

  • txnames_path – Path to transcript names file, as generated by kallisto bus

  • capture_type – The type of information in the capture list. Can be one of transcripts, umis, barcode.

  • complement – Whether or not to complement, defaults to True

Returns

Dictionary containing path to generated index

kb_python.count.bustools_whitelist(bus_path: str, out_path: str, threshold: Optional[int] = None) Dict[str, str]

Runs bustools whitelist.

Parameters
  • bus_path – Path to BUS file generate the whitelist from

  • out_path – Path to output whitelist

  • threshold – Barcode threshold to be included in whitelist

Returns

Dictionary containing path to generated index

kb_python.count.matrix_to_cellranger(matrix_path: str, barcodes_path: str, genes_path: str, t2g_path: str, out_dir: str) Dict[str, str]

Convert bustools count matrix to cellranger-format matrix.

Parameters
  • matrix_path – Path to matrix

  • barcodes_path – List of paths to barcodes.txt

  • genes_path – Path to genes.txt

  • t2g_path – Path to transcript-to-gene mapping

  • out_dir – Path to output matrix

Returns

Dictionary of matrix files

kb_python.count.convert_matrix(counts_dir: str, matrix_path: str, barcodes_path: str, genes_path: Optional[str] = None, ec_path: Optional[str] = None, t2g_path: Optional[str] = None, txnames_path: Optional[str] = None, name: str = 'gene', loom: bool = False, h5ad: bool = False, by_name: bool = False, tcc: bool = False, threads: int = 8) Dict[str, str]

Convert a gene count or TCC matrix to loom or h5ad.

Parameters
  • counts_dir – Path to counts directory

  • matrix_path – Path to matrix

  • barcodes_path – List of paths to barcodes.txt

  • genes_path – Path to genes.txt, defaults to None

  • ec_path – Path to ec.txt, defaults to None

  • t2g_path – Path to transcript-to-gene mapping. If this is provided, the third column of the mapping is appended to the anndata var, defaults to None

  • txnames_path – Path to transcripts.txt, defaults to None

  • name – Name of the columns, defaults to “gene”

  • loom – Whether to generate loom file, defaults to False

  • h5ad – Whether to generate h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • tcc – Whether the matrix is a TCC matrix, defaults to False

  • threads – Number of threads to use, defaults to 8

Returns

Dictionary of generated files

kb_python.count.convert_matrices(counts_dir: str, matrix_paths: List[str], barcodes_paths: List[str], genes_paths: Optional[List[str]] = None, ec_paths: Optional[List[str]] = None, t2g_path: Optional[str] = None, txnames_path: Optional[str] = None, name: str = 'gene', loom: bool = False, h5ad: bool = False, by_name: bool = False, nucleus: bool = False, tcc: bool = False, threads: int = 8) Dict[str, str]

Convert a gene count or TCC matrix to loom or h5ad.

Parameters
  • counts_dir – Path to counts directory

  • matrix_paths – List of paths to matrices

  • barcodes_paths – List of paths to barcodes.txt

  • genes_paths – List of paths to genes.txt, defaults to None

  • ec_paths – List of path to ec.txt, defaults to None

  • t2g_path – Path to transcript-to-gene mapping. If this is provided, the third column of the mapping is appended to the anndata var, defaults to None

  • txnames_path – List of paths to transcripts.txt, defaults to None

  • name – Name of the columns, defaults to “gene”

  • loom – Whether to generate loom file, defaults to False

  • h5ad – Whether to generate h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • nucleus – Whether the matrices contain single nucleus counts, defaults to False

  • tcc – Whether the matrix is a TCC matrix, defaults to False

  • threads – Number of threads to use, defaults to 8

Returns

Dictionary of generated files

kb_python.count.filter_with_bustools(bus_path: str, ecmap_path: str, txnames_path: str, t2g_path: str, whitelist_path: str, filtered_bus_path: str, filter_threshold: Optional[int] = None, counts_prefix: Optional[str] = None, tcc: bool = False, mm: bool = False, kite: bool = False, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', count: bool = True, loom: bool = False, h5ad: bool = False, by_name: bool = False, cellranger: bool = False, umi_gene: bool = False, em: bool = False) Dict[str, str]

Generate filtered count matrices with bustools.

Parameters
  • bus_path – Path to sorted, corrected, sorted BUS file

  • ecmap_path – Path to matrix ec file

  • txnames_path – Path to list of transcripts

  • t2g_path – Path to transcript-to-gene mapping

  • whitelist_path – Path to filter whitelist to generate

  • filtered_bus_path – Path to filtered BUS file to generate

  • filter_threshold – Barcode filter threshold for bustools, defaults to None

  • counts_prefix – Prefix of count matrix, defaults to None

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • kite – Whether this is a KITE workflow

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Number of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • count – Whether to run bustools count, defaults to True

  • loom – Whether to convert the final count matrix into a loom file, defaults to False

  • h5ad – Whether to convert the final count matrix into a h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • cellranger – Whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

  • umi_gene – Whether to perform gene-level UMI collapsing, defaults to False

  • em – Whether to estimate gene abundances using EM algorithm, defaults to False

Returns

Dictionary of generated files

kb_python.count.stream_fastqs(fastqs: List[str], temp_dir: str = 'tmp') List[str]

Given a list of fastqs (that may be local or remote paths), stream any remote files. Internally, calls utils.

Parameters
  • fastqs – List of (remote or local) fastq paths

  • temp_dir – Temporary directory

Returns

All remote paths substituted with a local path

kb_python.count.stream_batch(batch_path: str, temp_dir: str = 'tmp') str

Given a path to a batch file, produce a new batch file where all the remote FASTQs are being streamed.

Parameters
  • fastqs – List of (remote or local) fastq paths

  • temp_dir – Temporary directory

Returns

New batch file with all remote paths substituted with a local path

kb_python.count.copy_or_create_whitelist(technology: str, bus_path: str, out_dir: str) str

Copies a pre-packaged whitelist if it is provided. Otherwise, runs bustools whitelist to generate a whitelist.

Parameters
  • technology – Single-cell technology used

  • bus_path – Path to BUS file generate the whitelist from

  • out_dir – Path to output directory

Returns

Path to copied or generated whitelist

kb_python.count.convert_transcripts_to_genes(txnames_path: str, t2g_path: str, genes_path: str) str

Convert a textfile containing transcript IDs to another textfile containing gene IDs, given a transcript-to-gene mapping.

Parameters
  • txnames_path – Path to transcripts.txt

  • t2g_path – Path to transcript-to-genes mapping

  • genes_path – Path to output genes.txt

Returns

Path to written genes.txt

kb_python.count.write_smartseq3_capture(capture_path: str) str

Write the capture sequence for smartseq3.

Parameters

capture_path – Path to write the capture sequence

Returns

Path to written file

kb_python.count.count(index_path: str, t2g_path: str, technology: str, out_dir: str, fastqs: List[str], whitelist_path: Optional[str] = None, tcc: bool = False, mm: bool = False, filter: Optional[typing_extensions.Literal[bustools]] = None, filter_threshold: Optional[int] = None, kite: bool = False, FB: bool = False, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', overwrite: bool = False, loom: bool = False, h5ad: bool = False, by_name: bool = False, cellranger: bool = False, inspect: bool = True, report: bool = False, fragment_l: Optional[int] = None, fragment_s: Optional[int] = None, paired: bool = False, strand: Optional[typing_extensions.Literal[unstranded, forward, reverse]] = None, umi_gene: bool = False, em: bool = False) Dict[str, Union[str, Dict[str, str]]]

Generates count matrices for single-cell RNA seq.

Parameters
  • index_path – Path to kallisto index

  • t2g_path – Path to transcript-to-gene mapping

  • technology – Single-cell technology used

  • out_dir – Path to output directory

  • fastqs – List of FASTQ file paths or a single batch definition file

  • whitelist_path – Path to whitelist, defaults to None

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • filter – Filter to use to generate a filtered count matrix, defaults to None

  • filter_threshold – Barcode filter threshold for bustools, defaults to None

  • kite – Whether this is a KITE workflow

  • FB – Whether 10x Genomics Feature Barcoding technology was used, defaults to False

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Pumber of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • overwrite – Overwrite an existing index file, defaults to False

  • loom – Whether to convert the final count matrix into a loom file, defaults to False

  • h5ad – Whether to convert the final count matrix into a h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • cellranger – Whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

  • inspect – Whether or not to inspect the output BUS file and generate the inspect.json

  • report – Generate an HTMl report, defaults to False

  • fragment_l – Mean length of fragments, defaults to None

  • fragment_s – Standard deviation of fragment lengths, defaults to None

  • paired – Whether the fastqs are paired. Has no effect when a single batch file is provided. Defaults to False

  • strand – Strandedness, defaults to None

  • umi_gene – Whether to perform gene-level UMI collapsing, defaults to False

  • em – Whether to estimate gene abundances using EM algorithm, defaults to False

Returns

Dictionary containing paths to generated files

kb_python.count.count_smartseq3(index_path: str, t2g_path: str, out_dir: str, fastqs: List[str], whitelist_path: Optional[str] = None, tcc: bool = False, mm: bool = False, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', overwrite: bool = False, loom: bool = False, h5ad: bool = False, by_name: bool = False, inspect: bool = True, strand: Optional[typing_extensions.Literal[unstranded, forward, reverse]] = None) Dict[str, Union[str, Dict[str, str]]]

Generates count matrices for Smartseq3.

Parameters
  • index_path – Path to kallisto index

  • t2g_path – Path to transcript-to-gene mapping

  • out_dir – Path to output directory

  • fastqs – List of FASTQ file paths

  • whitelist_path – Path to whitelist, defaults to None

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Pumber of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • overwrite – Overwrite an existing index file, defaults to False

  • loom – Whether to convert the final count matrix into a loom file, defaults to False

  • h5ad – Whether to convert the final count matrix into a h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • inspect – Whether or not to inspect the output BUS file and generate the inspect.json

  • strand – Strandedness, defaults to None

Returns

Dictionary containing paths to generated files

kb_python.count.count_velocity(index_path: str, t2g_path: str, cdna_t2c_path: str, intron_t2c_path: str, technology: str, out_dir: str, fastqs: List[str], whitelist_path: Optional[str] = None, tcc: bool = False, mm: bool = False, filter: Optional[typing_extensions.Literal[bustools]] = None, filter_threshold: Optional[int] = None, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', overwrite: bool = False, loom: bool = False, h5ad: bool = False, by_name: bool = False, cellranger: bool = False, inspect: bool = True, report: bool = False, nucleus: bool = False, fragment_l: Optional[int] = None, fragment_s: Optional[int] = None, paired: bool = False, strand: Optional[typing_extensions.Literal[unstranded, forward, reverse]] = None, umi_gene: bool = False, em: bool = False) Dict[str, Union[Dict[str, str], str]]

Generates RNA velocity matrices for single-cell RNA seq.

Parameters
  • index_path – Path to kallisto index

  • t2g_path – Path to transcript-to-gene mapping

  • cdna_t2c_path – Path to cDNA transcripts-to-capture file

  • intron_t2c_path – Path to intron transcripts-to-capture file

  • technology – Single-cell technology used

  • out_dir – Path to output directory

  • fastqs – List of FASTQ file paths or a single batch definition file

  • whitelist_path – Path to whitelist, defaults to None

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • filter – Filter to use to generate a filtered count matrix, defaults to None

  • filter_threshold – Barcode filter threshold for bustools, defaults to None

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Number of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • overwrite – Overwrite an existing index file, defaults to False

  • loom – Whether to convert the final count matrix into a loom file, defaults to False

  • h5ad – Whether to convert the final count matrix into a h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • cellranger – Whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

  • inspect – Whether or not to inspect the output BUS file and generate the inspect.json

  • report – Generate HTML reports, defaults to False

  • nucleus – Whether this is a single-nucleus experiment. if True, the spliced and unspliced count matrices will be summed, defaults to False

  • fragment_l – Mean length of fragments, defaults to None

  • fragment_s – Standard deviation of fragment lengths, defaults to None

  • paired – Whether the fastqs are paired. Has no effect when a single batch file is provided. Defaults to False

  • strand – Strandedness, defaults to None

  • umi_gene – Whether to perform gene-level UMI collapsing, defaults to False

  • em – Whether to estimate gene abundances using EM algorithm, defaults to False

Returns

Dictionary containing path to generated index

kb_python.count.count_velocity_smartseq3(index_path: str, t2g_path: str, cdna_t2c_path: str, intron_t2c_path: str, out_dir: str, fastqs: List[str], whitelist_path: Optional[str] = None, tcc: bool = False, mm: bool = False, temp_dir: str = 'tmp', threads: int = 8, memory: str = '4G', overwrite: bool = False, loom: bool = False, h5ad: bool = False, by_name: bool = False, inspect: bool = True, strand: Optional[typing_extensions.Literal[unstranded, forward, reverse]] = None) Dict[str, Union[str, Dict[str, str]]]

Generates count matrices for Smartseq3.

Parameters
  • index_path – Path to kallisto index

  • t2g_path – Path to transcript-to-gene mapping

  • out_dir – Path to output directory

  • fastqs – List of FASTQ file paths

  • whitelist_path – Path to whitelist, defaults to None

  • tcc – Whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm – Whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • temp_dir – Path to temporary directory, defaults to tmp

  • threads – Pumber of threads to use, defaults to 8

  • memory – Amount of memory to use, defaults to 4G

  • overwrite – Overwrite an existing index file, defaults to False

  • loom – Whether to convert the final count matrix into a loom file, defaults to False

  • h5ad – Whether to convert the final count matrix into a h5ad file, defaults to False

  • by_name – Aggregate counts by name instead of ID. Only affects when tcc=False.

  • inspect – Whether or not to inspect the output BUS file and generate the inspect.json

  • strand – Strandedness, defaults to None

Returns

Dictionary containing paths to generated files