kb_python.count

Module Contents

Functions

kallisto_pseudo(batch_path, index_path, out_dir, threads=8)

Runs kallisto pseudo.

kallisto_bus(fastqs, index_path, technology, out_dir, threads=8, n=False, k=False)

Runs kallisto bus.

kallisto_bus_split(fastqs, index_paths, technology, out_dir, temp_dir='tmp', threads=8, memory='4G')

Runs kallisto bus with split indices.

bustools_mash(out_dirs, out_dir)

Runs bustools mash. Additionally, combines the `run_info.json`s into one.

bustools_merge(bus_path, out_dir, ecmap_path, txnames_path)

Runs bustools merge.

bustools_project(bus_path, out_path, map_path, ecmap_path, txnames_path)

Runs bustools project.

bustools_sort(bus_path, out_path, temp_dir='tmp', threads=8, memory='4G', flags=False)

Runs bustools sort.

bustools_inspect(bus_path, out_path, whitelist_path, ecmap_path)

Runs bustools inspect.

bustools_correct(bus_path, out_path, whitelist_path)

Runs bustools correct.

bustools_count(bus_path, out_prefix, t2g_path, ecmap_path, txnames_path, tcc=False, mm=False)

Runs bustools count.

bustools_capture(bus_path, out_path, capture_path, ecmap_path, txnames_path, capture_type='transcripts')

Runs bustools capture.

bustools_whitelist(bus_path, out_path)

Runs bustools whitelist.

write_smartseq_batch(fastq_pairs, cell_ids, out_path)

Write a 3-column TSV specifying batch information for Smart-seq reads.

matrix_to_cellranger(matrix_path, barcodes_path, genes_path, t2g_path, out_dir)

Convert bustools count matrix to cellranger-format matrix.

convert_matrix(counts_dir, matrix_path, barcodes_path, genes_path=None, ec_path=None, t2g_path=None, txnames_path=None, name='gene', loom=False, h5ad=False, tcc=False, threads=8)

Convert a gene count or TCC matrix to loom or h5ad.

convert_matrices(counts_dir, matrix_paths, barcodes_paths, genes_paths=None, ec_paths=None, t2g_path=None, txnames_path=None, name='gene', loom=False, h5ad=False, nucleus=False, tcc=False, threads=8)

Convert a gene count or TCC matrix to loom or h5ad.

filter_with_bustools(bus_path, ecmap_path, txnames_path, t2g_path, whitelist_path, filtered_bus_path, counts_prefix=None, tcc=False, mm=False, kite=False, temp_dir='tmp', threads=8, memory='4G', count=True, loom=False, h5ad=False, cellranger=False)

Generate filtered count matrices with bustools.

stream_fastqs(fastqs, temp_dir='tmp')

Given a list of fastqs (that may be local or remote paths), stream any remote files.

copy_or_create_whitelist(technology, bus_path, out_dir)

Copies a pre-packaged whitelist if it is provided. Otherwise, runs bustools whitelist to generate a whitelist.

convert_transcripts_to_genes(txnames_path, t2g_path, genes_path)

Convert a textfile containing transcript IDs to another textfile containing gene IDs, given a transcript-to-gene mapping.

count(index_paths, t2g_path, technology, out_dir, fastqs, whitelist_path=None, tcc=False, mm=False, filter=None, kite=False, FB=False, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False, cellranger=False, inspect=True, report=False)

Generates count matrices for single-cell RNA seq.

count_smartseq(index_paths, t2g_path, technology, out_dir, fastq_pairs, cell_ids=None, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False)

Generates gene or isoform count matrices from Smart-seq reads.

count_velocity(index_paths, t2g_path, cdna_t2c_path, intron_t2c_path, technology, out_dir, fastqs, whitelist_path=None, tcc=False, mm=False, filter=None, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False, cellranger=False, report=False, inspect=True, nucleus=False)

Generates RNA velocity matrices for single-cell RNA seq.

kb_python.count.logger
kb_python.count.INSPECT_PARSER
kb_python.count.kallisto_pseudo(batch_path, index_path, out_dir, threads=8)

Runs kallisto pseudo.

Parameters
  • batch_path (str) – path to textfile containing batch definitions

  • index_path (str) – path to kallisto index

  • out_dir (str) – path to output directory

  • threads (int, optional) – number of threads to use, defaults to 8

Returns

dictionary containing output files

Return type

dict

kb_python.count.kallisto_bus(fastqs, index_path, technology, out_dir, threads=8, n=False, k=False)

Runs kallisto bus.

Parameters
  • fastqs (list) – list of FASTQ file paths

  • index_path (str) – path to kallisto index

  • technology (str) – single-cell technology used

  • out_dir (str) – path to output directory

  • threads (int, optional) – number of threads to use, defaults to 8

  • n (bool, optional) – include the read number in the flag column (used when splitting indices), defaults to False

  • k (bool, optional) – alignment is done per k-mer (used when splitting indices), defaults to False

Returns

dictionary containing paths to generated files

Return type

dict

kb_python.count.kallisto_bus_split(fastqs, index_paths, technology, out_dir, temp_dir='tmp', threads=8, memory='4G')

Runs kallisto bus with split indices.

Parameters
  • fastqs (list) – list of FASTQ file paths or URLs

  • index_paths (list) – paths to kallisto indices

  • technology (str) – single-cell technology used

  • out_dir (str) – path to output directory

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • threads (int, optional) – number of threads to use, defaults to 8

  • memory (str, optional) – amount of memory to use, defaults to 4G

Returns

dictionary containing paths to generated files

Return type

dict

kb_python.count.bustools_mash(out_dirs, out_dir)

Runs bustools mash. Additionally, combines the `run_info.json`s into one.

Parameters
  • out_dirs (list) – list of kallisto bus output directories. Note that BUS files should be sorted by flag

  • out_dir (str) – path to output directory

Returns

dictionary containing paths to generated files

Return type

dict

kb_python.count.bustools_merge(bus_path, out_dir, ecmap_path, txnames_path)

Runs bustools merge.

Parameters
  • bus_path (str) – path to BUS file to merge

  • out_dir (str) – path to output directory, where the merged BUS file and ecmap will be written

  • ecmap_path (str) – path to ecmap file, as generated by kallisto bus

  • txnames_path (str) – path to transcript names file, as generated by kallisto bus

Returns

dictionary containing path to generated BUS file and merged ecmap

Return type

dict

kb_python.count.bustools_project(bus_path, out_path, map_path, ecmap_path, txnames_path)

Runs bustools project.

Parameters
  • bus_path (str) – path to BUS file to project

  • out_path (str) – path to output BUS file

  • map_path (str) – path to file containing source-to-destination mapping

  • ecmap_path (str) – path to ecmap file, as generated by kallisto bus

  • txnames_path (str) – path to transcript names file, as generated by kallisto bus

Returns

dictionary containing path to generated BUS file

Return type

dict

kb_python.count.bustools_sort(bus_path, out_path, temp_dir='tmp', threads=8, memory='4G', flags=False)

Runs bustools sort.

Parameters
  • bus_path (str) – path to BUS file to sort

  • out_path (str) – path to output BUS file

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • threads (int, optional) – number of threads to use, defaults to 8

  • memory (str, optional) – amount of memory to use, defaults to 4G

  • flags (bool, optional) – whether to supply the --flags argument to sort, defaults to False

Returns

dictionary containing path to the sorted BUS file

Return type

dict
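As a rough illustration of how the parameters above could map onto a `bustools sort` invocation, the following sketch assembles an argument list. The helper name `build_bustools_sort_command` is hypothetical, and the option mapping (`-o`, `-T`, `-t`, `-m`, `--flags`) is an assumption based on the bustools command-line interface, not a copy of what kb_python does internally.

```python
def build_bustools_sort_command(bus_path, out_path, temp_dir="tmp",
                                threads=8, memory="4G", flags=False):
    """Assemble an argument list for `bustools sort` (hypothetical helper)."""
    command = ["bustools", "sort"]
    command += ["-o", out_path]      # output BUS file
    command += ["-T", temp_dir]      # location for temporary files
    command += ["-t", str(threads)]  # number of threads
    command += ["-m", memory]        # memory limit, e.g. "4G"
    if flags:
        command.append("--flags")    # sort by the flag column
    command.append(bus_path)         # input BUS file goes last
    return command


cmd = build_bustools_sort_command("output.bus", "output.s.bus", flags=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` if the bustools binary is on the `PATH`.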

kb_python.count.bustools_inspect(bus_path, out_path, whitelist_path, ecmap_path)

Runs bustools inspect.

Parameters
  • bus_path (str) – path to BUS file to inspect

  • out_path (str) – path to output inspect JSON file

  • whitelist_path (str) – path to whitelist

  • ecmap_path (str) – path to ecmap file, as generated by kallisto bus

Returns

dictionary containing path to the generated inspect JSON file

Return type

dict

kb_python.count.bustools_correct(bus_path, out_path, whitelist_path)

Runs bustools correct.

Parameters
  • bus_path (str) – path to BUS file to correct

  • out_path (str) – path to output corrected BUS file

  • whitelist_path (str) – path to whitelist

Returns

dictionary containing path to the corrected BUS file

Return type

dict

kb_python.count.bustools_count(bus_path, out_prefix, t2g_path, ecmap_path, txnames_path, tcc=False, mm=False)

Runs bustools count.

Parameters
  • bus_path (str) – path to BUS file to count

  • out_prefix (str) – prefix of the output files to generate

  • t2g_path (str) – path to transcript-to-gene mapping

  • ecmap_path (str) – path to ecmap file, as generated by kallisto bus

  • txnames_path (str) – path to transcript names file, as generated by kallisto bus

  • tcc (bool, optional) – whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm (bool, optional) – whether to include BUS records that pseudoalign to multiple genes, defaults to False

Returns

dictionary containing paths to generated count files

Return type

dict
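The interaction between `tcc` and `mm` can be sketched as a command builder. This is a minimal illustration under assumptions: the helper name is hypothetical, and the mapping of `tcc=False` to `--genecounts` and `mm=True` to `--multimapping` follows the bustools command-line options rather than quoting kb_python's implementation.

```python
def build_bustools_count_command(bus_path, out_prefix, t2g_path, ecmap_path,
                                 txnames_path, tcc=False, mm=False):
    """Assemble an argument list for `bustools count` (hypothetical helper)."""
    command = ["bustools", "count"]
    command += ["-o", out_prefix]    # prefix of the output files
    command += ["-g", t2g_path]      # transcript-to-gene mapping
    command += ["-e", ecmap_path]    # equivalence class map
    command += ["-t", txnames_path]  # transcript names
    if not tcc:
        # A gene count matrix is requested unless a TCC matrix is wanted.
        command.append("--genecounts")
    if mm:
        # Keep records that pseudoalign to multiple genes.
        command.append("--multimapping")
    command.append(bus_path)
    return command


print(" ".join(build_bustools_count_command(
    "output.s.bus", "counts/cells_x_genes", "t2g.txt", "matrix.ec",
    "transcripts.txt", mm=True)))
```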

kb_python.count.bustools_capture(bus_path, out_path, capture_path, ecmap_path, txnames_path, capture_type='transcripts')

Runs bustools capture.

Parameters
  • bus_path (str) – path to BUS file to capture

  • out_path (str) – path to BUS file to generate

  • capture_path (str) – path to transcripts-to-capture list

  • ecmap_path (str) – path to ecmap file, as generated by kallisto bus

  • txnames_path (str) – path to transcript names file, as generated by kallisto bus

  • capture_type (str, optional) – the type of information in the capture list. Can be one of transcripts, umis, barcode. Defaults to transcripts

Returns

dictionary containing path to generated BUS file

Return type

dict

kb_python.count.bustools_whitelist(bus_path, out_path)

Runs bustools whitelist.

Parameters
  • bus_path (str) – path to BUS file to generate the whitelist from

  • out_path (str) – path to output whitelist

Returns

dictionary containing path to generated whitelist

Return type

dict

kb_python.count.write_smartseq_batch(fastq_pairs, cell_ids, out_path)

Write a 3-column TSV specifying batch information for Smart-seq reads. This file is required to use kallisto pseudo on multiple samples (= cells).

Parameters
  • fastq_pairs (list) – list of pairs of FASTQs

  • cell_ids (list) – list of cell IDs

  • out_path (str) – path to batch file to output

Returns

dictionary containing path to the written batch file

Return type

dict
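A minimal sketch of the 3-column layout described above, assuming the columns are cell ID, first FASTQ, and second FASTQ (the batch format kallisto pseudo reads). The function name `write_batch_tsv` and the `"batch"` key in the returned dictionary are hypothetical and chosen only for illustration.

```python
import csv
import os
import tempfile


def write_batch_tsv(fastq_pairs, cell_ids, out_path):
    """Write one tab-separated line per cell: cell_id, read 1 path, read 2 path."""
    if len(fastq_pairs) != len(cell_ids):
        raise ValueError("fastq_pairs and cell_ids must have the same length")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for cell_id, (read1, read2) in zip(cell_ids, fastq_pairs):
            writer.writerow([cell_id, read1, read2])
    # The key name is an assumption made for this sketch.
    return {"batch": out_path}


with tempfile.TemporaryDirectory() as tmp:
    result = write_batch_tsv(
        [("a_1.fastq.gz", "a_2.fastq.gz"), ("b_1.fastq.gz", "b_2.fastq.gz")],
        ["cell_a", "cell_b"],
        os.path.join(tmp, "batch.txt"),
    )
    with open(result["batch"]) as f:
        print(f.read(), end="")
```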

kb_python.count.matrix_to_cellranger(matrix_path, barcodes_path, genes_path, t2g_path, out_dir)

Convert bustools count matrix to cellranger-format matrix.

Parameters
  • matrix_path (str) – path to matrix

  • barcodes_path (str) – path to barcodes.txt

  • genes_path (str) – path to genes.txt

  • t2g_path (str) – path to transcript-to-gene mapping

  • out_dir (str) – path to output directory for the converted matrix

Returns

dictionary of matrix files

Return type

dict

kb_python.count.convert_matrix(counts_dir, matrix_path, barcodes_path, genes_path=None, ec_path=None, t2g_path=None, txnames_path=None, name='gene', loom=False, h5ad=False, tcc=False, threads=8)

Convert a gene count or TCC matrix to loom or h5ad.

Parameters
  • counts_dir (str) – path to counts directory

  • matrix_path (str) – path to matrix

  • barcodes_path (str) – path to barcodes.txt

  • genes_path (str, optional) – path to genes.txt, defaults to None

  • ec_path (str, optional) – path to ec.txt, defaults to None

  • t2g_path (str, optional) – path to transcript-to-gene mapping. If this is provided, the third column of the mapping is appended to the anndata var, defaults to None

  • txnames_path (str, optional) – path to transcripts.txt, defaults to None

  • name (str, optional) – name of the columns, defaults to “gene”

  • loom (bool, optional) – whether to generate loom file, defaults to False

  • h5ad (bool, optional) – whether to generate h5ad file, defaults to False

  • tcc (bool, optional) – whether the matrix is a TCC matrix, defaults to False

  • threads (int, optional) – number of threads to use, defaults to 8

Returns

dictionary of generated files

Return type

dict

kb_python.count.convert_matrices(counts_dir, matrix_paths, barcodes_paths, genes_paths=None, ec_paths=None, t2g_path=None, txnames_path=None, name='gene', loom=False, h5ad=False, nucleus=False, tcc=False, threads=8)

Convert a gene count or TCC matrix to loom or h5ad.

Parameters
  • counts_dir (str) – path to counts directory

  • matrix_paths (list) – list of paths to matrices

  • barcodes_paths (list) – list of paths to barcodes.txt

  • genes_paths (list, optional) – list of paths to genes.txt, defaults to None

  • ec_paths (list, optional) – list of paths to ec.txt, defaults to None

  • t2g_path (str, optional) – path to transcript-to-gene mapping. If this is provided, the third column of the mapping is appended to the anndata var, defaults to None

  • txnames_path (str, optional) – path to transcripts.txt, defaults to None

  • name (str, optional) – name of the columns, defaults to “gene”

  • loom (bool, optional) – whether to generate loom file, defaults to False

  • h5ad (bool, optional) – whether to generate h5ad file, defaults to False

  • nucleus (bool, optional) – whether the matrices contain single nucleus counts, defaults to False

  • tcc (bool, optional) – whether the matrices are TCC matrices, defaults to False

  • threads (int, optional) – number of threads to use, defaults to 8

Returns

dictionary of generated files

Return type

dict

kb_python.count.filter_with_bustools(bus_path, ecmap_path, txnames_path, t2g_path, whitelist_path, filtered_bus_path, counts_prefix=None, tcc=False, mm=False, kite=False, temp_dir='tmp', threads=8, memory='4G', count=True, loom=False, h5ad=False, cellranger=False)

Generate filtered count matrices with bustools.

Parameters
  • bus_path (str) – path to a BUS file that has been sorted, corrected, and sorted again

  • ecmap_path (str) – path to matrix ec file

  • txnames_path (str) – path to list of transcripts

  • t2g_path (str) – path to transcript-to-gene mapping

  • whitelist_path (str) – path to filter whitelist to generate

  • filtered_bus_path (str) – path to filtered BUS file to generate

  • counts_prefix (str, optional) – prefix of count matrix, defaults to None

  • tcc (bool, optional) – whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm (bool, optional) – whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • kite (bool, optional) – whether this is a KITE workflow, defaults to False

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • threads (int, optional) – number of threads to use, defaults to 8

  • memory (str, optional) – amount of memory to use, defaults to 4G

  • count (bool, optional) – whether to run bustools count, defaults to True

  • loom (bool, optional) – whether to convert the final count matrix into a loom file, defaults to False

  • h5ad (bool, optional) – whether to convert the final count matrix into a h5ad file, defaults to False

  • cellranger (bool, optional) – whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

Returns

dictionary of generated files

Return type

dict

kb_python.count.stream_fastqs(fastqs, temp_dir='tmp')

Given a list of fastqs (that may be local or remote paths), stream any remote files. Internally, calls utils.

Parameters
  • fastqs (list) – list of (remote or local) fastq paths

  • temp_dir (str, optional) – temporary directory, defaults to tmp

Returns

all remote paths substituted with a local path

Return type

list
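The substitution of remote paths by local ones can be sketched as follows. This is only an illustration under stated assumptions: the names `is_remote` and `localize_fastqs`, the set of URL schemes treated as remote, and the caller-supplied `fetch` callable are all hypothetical, and the sketch does not reproduce how kb_python actually streams files.

```python
import os
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (an assumption).
REMOTE_SCHEMES = {"http", "https", "ftp"}


def is_remote(path):
    """Return True if the path looks like a URL with a remote scheme."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def localize_fastqs(fastqs, temp_dir="tmp", fetch=None):
    """Return fastqs with every remote entry replaced by a path under temp_dir.

    fetch(url, local_path), if given, is called once per remote entry and is
    responsible for making the data available at local_path.
    """
    localized = []
    for fastq in fastqs:
        if is_remote(fastq):
            name = os.path.basename(urlparse(fastq).path)
            local_path = os.path.join(temp_dir, name)
            if fetch is not None:
                fetch(fastq, local_path)
            localized.append(local_path)
        else:
            localized.append(fastq)
    return localized


print(localize_fastqs(
    ["sample_R1.fastq.gz", "https://example.com/data/sample_R2.fastq.gz"]))
```

Local paths pass through unchanged, so the returned list keeps the input order.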

kb_python.count.copy_or_create_whitelist(technology, bus_path, out_dir)

Copies a pre-packaged whitelist if it is provided. Otherwise, runs bustools whitelist to generate a whitelist.

Parameters
  • technology (str) – single-cell technology used

  • bus_path (str) – path to BUS file to generate the whitelist from

  • out_dir (str) – path to output directory

Returns

path to copied or generated whitelist

Return type

str

kb_python.count.convert_transcripts_to_genes(txnames_path, t2g_path, genes_path)

Convert a textfile containing transcript IDs to another textfile containing gene IDs, given a transcript-to-gene mapping.

Parameters
  • txnames_path (str) – path to transcripts.txt

  • t2g_path (str) – path to transcript-to-genes mapping

  • genes_path (str) – path to output genes.txt

Returns

path to written genes.txt

Return type

str
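The conversion described above can be sketched with the standard library alone. Assumptions in this sketch: the mapping file is tab-separated with the transcript ID in the first column and the gene ID in the second, and a transcript missing from the mapping is written through unchanged. The function name `transcripts_to_genes` is hypothetical.

```python
import os
import tempfile


def transcripts_to_genes(txnames_path, t2g_path, genes_path):
    """Write one gene ID per line, in the same order as the transcript IDs."""
    t2g = {}
    with open(t2g_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                t2g[fields[0]] = fields[1]

    with open(txnames_path) as tx, open(genes_path, "w") as out:
        for line in tx:
            transcript = line.strip()
            if not transcript:
                continue
            # Fallback to the transcript ID is an assumption of this sketch.
            out.write(t2g.get(transcript, transcript) + "\n")
    return genes_path


with tempfile.TemporaryDirectory() as tmp:
    tx_path = os.path.join(tmp, "transcripts.txt")
    t2g_path = os.path.join(tmp, "t2g.txt")
    with open(tx_path, "w") as f:
        f.write("tx1\ntx2\ntx3\n")
    with open(t2g_path, "w") as f:
        f.write("tx1\tgeneA\tA\ntx2\tgeneA\tA\n")
    with open(transcripts_to_genes(tx_path, t2g_path,
                                   os.path.join(tmp, "genes.txt"))) as f:
        print(f.read(), end="")
```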

kb_python.count.count(index_paths, t2g_path, technology, out_dir, fastqs, whitelist_path=None, tcc=False, mm=False, filter=None, kite=False, FB=False, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False, cellranger=False, inspect=True, report=False)

Generates count matrices for single-cell RNA seq.

Parameters
  • index_paths (list) – paths to kallisto indices

  • t2g_path (str) – path to transcript-to-gene mapping

  • technology (str) – single-cell technology used

  • out_dir (str) – path to output directory

  • fastqs (list) – list of FASTQ file paths

  • whitelist_path (str, optional) – path to whitelist, defaults to None

  • tcc (bool, optional) – whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm (bool, optional) – whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • filter (str, optional) – filter to use to generate a filtered count matrix, defaults to None

  • kite (bool, optional) – whether this is a KITE workflow, defaults to False

  • FB (bool, optional) – whether 10x Genomics Feature Barcoding technology was used, defaults to False

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • threads (int, optional) – number of threads to use, defaults to 8

  • memory (str, optional) – amount of memory to use, defaults to 4G

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

  • loom (bool, optional) – whether to convert the final count matrix into a loom file, defaults to False

  • h5ad (bool, optional) – whether to convert the final count matrix into a h5ad file, defaults to False

  • cellranger (bool, optional) – whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

  • inspect (bool, optional) – whether or not to inspect the output BUS file and generate the inspect.json, defaults to True

  • report (bool, optional) – generate an HTML report, defaults to False

Returns

dictionary containing paths to generated files

Return type

dict

kb_python.count.count_smartseq(index_paths, t2g_path, technology, out_dir, fastq_pairs, cell_ids=None, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False)

Generates gene or isoform count matrices from Smart-seq reads.

kb_python.count.count_velocity(index_paths, t2g_path, cdna_t2c_path, intron_t2c_path, technology, out_dir, fastqs, whitelist_path=None, tcc=False, mm=False, filter=None, temp_dir='tmp', threads=8, memory='4G', overwrite=False, loom=False, h5ad=False, cellranger=False, report=False, inspect=True, nucleus=False)

Generates RNA velocity matrices for single-cell RNA seq.

Parameters
  • index_paths (list) – paths to kallisto indices

  • t2g_path (str) – path to transcript-to-gene mapping

  • cdna_t2c_path (str) – path to cDNA transcripts-to-capture file

  • intron_t2c_path (str) – path to intron transcripts-to-capture file

  • technology (str) – single-cell technology used

  • out_dir (str) – path to output directory

  • fastqs (list) – list of FASTQ file paths

  • whitelist_path (str, optional) – path to whitelist, defaults to None

  • tcc (bool, optional) – whether to generate a TCC matrix instead of a gene count matrix, defaults to False

  • mm (bool, optional) – whether to include BUS records that pseudoalign to multiple genes, defaults to False

  • filter (str, optional) – filter to use to generate a filtered count matrix, defaults to None

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • threads (int, optional) – number of threads to use, defaults to 8

  • memory (str, optional) – amount of memory to use, defaults to 4G

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

  • loom (bool, optional) – whether to convert the final count matrix into a loom file, defaults to False

  • h5ad (bool, optional) – whether to convert the final count matrix into a h5ad file, defaults to False

  • cellranger (bool, optional) – whether to convert the final count matrix into a cellranger-compatible matrix, defaults to False

  • report (bool, optional) – generate HTML reports, defaults to False

  • inspect (bool, optional) – whether or not to inspect the output BUS file and generate the inspect.json, defaults to True

  • nucleus (bool, optional) – whether this is a single-nucleus experiment. If True, the spliced and unspliced count matrices will be summed, defaults to False

Returns

dictionary containing paths to generated files

Return type

dict