kb_python.ref

Module Contents

Functions

generate_kite_fasta(→ Tuple[str, int])

Generate a FASTA file for feature barcoding with the KITE workflow.

create_t2g_from_fasta(→ Dict[str, str])

Parse FASTA headers to get transcripts-to-gene mapping.

create_t2c(→ Dict[str, str])

Creates a transcripts-to-capture list from a FASTA file.

kallisto_index(→ Dict[str, str])

Runs kallisto index.

split_and_index(→ Dict[str, str])

Split a FASTA file into n parts and index each one.

download_reference(→ Dict[str, str])

Downloads a provided reference file from a static url.

decompress_file(→ str)

Decompress the given path if it is a .gz file. Otherwise, return the original path.

get_gtf_attribute_include_func(...)

Helper function to create a filtering function to include certain GTF entries while processing.

get_gtf_attribute_exclude_func(...)

Helper function to create a filtering function to exclude certain GTF entries while processing.

ref(→ Dict[str, str])

Generates files necessary to generate count matrices for single-cell RNA-seq.

ref_kite(→ Dict[str, str])

Generates files necessary for feature barcoding with the KITE workflow.

ref_lamanno(→ Dict[str, str])

Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

exception kb_python.ref.RefError

Bases: Exception

Common base class for all non-exit exceptions.

kb_python.ref.generate_kite_fasta(feature_path: str, out_path: str, no_mismatches: bool = False) → Tuple[str, int]

Generate a FASTA file for feature barcoding with the KITE workflow.

This FASTA contains all sequences that are within Hamming distance 1 of the provided barcodes. The barcode file must be a 2-column TSV with the barcode sequences in the first column and the corresponding feature names in the second column. If the Hamming distance 1 variants of any pair of barcodes collide, the variants for those barcodes are not generated.

Parameters
  • feature_path – Path to TSV containing barcodes and feature names

  • out_path – Path to FASTA to generate

  • no_mismatches – If True, do not generate Hamming distance 1 variants, defaults to False

Returns

Path to generated FASTA, smallest barcode length

Raises

RefError – If there are barcodes of different lengths or if there are duplicate barcodes
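The variant generation and collision rule described above can be sketched in plain Python. This is an illustrative sketch, not the kb_python implementation: the helper names `hamming1_variants` and `variants_without_collisions` are hypothetical, and the four-letter alphabet is an assumption.

```python
from collections import Counter
from typing import Dict, List

ALPHABET = "ACGT"  # assumption: barcodes contain only these four bases


def hamming1_variants(barcode: str) -> List[str]:
    """Return every sequence that differs from `barcode` at exactly one position."""
    variants = []
    for i, base in enumerate(barcode):
        for alt in ALPHABET:
            if alt != base:
                variants.append(barcode[:i] + alt + barcode[i + 1:])
    return variants


def variants_without_collisions(barcodes: List[str]) -> Dict[str, List[str]]:
    """Map each barcode to its distance-1 variants.

    A barcode whose variants (or own sequence) are also claimed by another
    barcode gets an empty list, mirroring the collision rule in the text.
    """
    per_barcode = {bc: set(hamming1_variants(bc)) for bc in barcodes}

    # Count how many barcodes claim each sequence, the barcode itself included.
    claims = Counter()
    for bc, variants in per_barcode.items():
        claims.update(variants | {bc})

    result = {}
    for bc, variants in per_barcode.items():
        collides = any(claims[seq] > 1 for seq in variants | {bc})
        result[bc] = [] if collides else sorted(variants)
    return result
```

For example, `variants_without_collisions(["AAAA", "AAAT", "GGGG"])` yields no variants for `AAAA` and `AAAT`, which are one mismatch apart, while `GGGG` keeps all of its variants.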

kb_python.ref.create_t2g_from_fasta(fasta_path: str, t2g_path: str) → Dict[str, str]

Parse FASTA headers to get transcripts-to-gene mapping.

Parameters
  • fasta_path – Path to FASTA file

  • t2g_path – Path to output transcript-to-gene mapping

Returns

Dictionary containing path to generated t2g mapping

kb_python.ref.create_t2c(fasta_path: str, t2c_path: str) → Dict[str, str]

Creates a transcripts-to-capture list from a FASTA file.

Parameters
  • fasta_path – Path to FASTA file

  • t2c_path – Path to output transcripts-to-capture list

Returns

Dictionary containing path to generated t2c list
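As a rough illustration of what building such a list involves, the sketch below collects one identifier per FASTA record. It assumes the list holds the first whitespace-delimited token of each header, which this reference does not confirm, and `fasta_record_ids` is a hypothetical helper rather than part of kb_python.

```python
from typing import List


def fasta_record_ids(fasta_text: str) -> List[str]:
    """Collect the first whitespace-delimited token of every FASTA header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:  # skip a bare ">" with no identifier
                ids.append(tokens[0])
    return ids
```

Writing the returned identifiers one per line would give a simple transcripts-to-capture style file under that assumption.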

kb_python.ref.kallisto_index(fasta_path: str, index_path: str, k: int = 31) → Dict[str, str]

Runs kallisto index.

Parameters
  • fasta_path – Path to FASTA file

  • index_path – Path to output kallisto index

  • k – K-mer length, defaults to 31

Returns

Dictionary containing path to generated index

kb_python.ref.split_and_index(fasta_path: str, index_prefix: str, n: int = 2, k: int = 31, temp_dir: str = 'tmp') → Dict[str, str]

Split a FASTA file into n parts and index each one.

Parameters
  • fasta_path – Path to FASTA file

  • index_prefix – Prefix of output kallisto indices

  • n – Split the index into n files, defaults to 2

  • k – K-mer length, defaults to 31

  • temp_dir – Path to temporary directory, defaults to tmp

Returns

Dictionary containing path to generated index
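The splitting step can be pictured with the following sketch, which parses FASTA text and distributes the records into n parts. The round-robin strategy is an assumption, since this reference does not say how records are divided, and `parse_fasta` and `split_records` are hypothetical helpers.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records: List[Record], n: int) -> List[List[Record]]:
    """Distribute records round-robin into n parts (assumed strategy)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        parts[i % n].append(record)
    return parts
```

Each part would then be written to its own FASTA file and indexed separately, giving one index per part.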

kb_python.ref.download_reference(reference: kb_python.config.Reference, files: Dict[str, str], temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]

Downloads a provided reference file from a static url.

The configuration for provided references is in config.py.

Parameters
  • reference – A Reference object

  • files – Dictionary with command-line options as keys and paths as values. Used to determine whether all the paths required to download the given reference have been provided

  • temp_dir – Path to temporary directory, defaults to tmp

  • overwrite – Overwrite an existing index file, defaults to False

Returns

Dictionary containing paths to generated file(s)

Raises

RefError – If the required options are not provided

kb_python.ref.decompress_file(path: str, temp_dir: str = 'tmp') → str

Decompress the given path if it is a .gz file. Otherwise, return the original path.

Parameters

  • path – Path to the file

  • temp_dir – Path to temporary directory, defaults to tmp

Returns

Unaltered path if the file is not a .gz file, otherwise path to the uncompressed file
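The behavior described above can be sketched with the standard library. This is a minimal sketch, not the kb_python code: `decompress_if_gz` is a hypothetical name, and placing the output in `temp_dir` under the original file name minus `.gz` is an assumption.

```python
import gzip
import os
import shutil


def decompress_if_gz(path: str, temp_dir: str = "tmp") -> str:
    """Return `path` unchanged unless it ends in .gz; otherwise decompress it."""
    if not path.endswith(".gz"):
        return path

    os.makedirs(temp_dir, exist_ok=True)
    # Assumed naming: same base name without the .gz suffix, inside temp_dir.
    out_path = os.path.join(temp_dir, os.path.basename(path)[: -len(".gz")])

    # Stream the contents so large references are not loaded into memory.
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```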

kb_python.ref.get_gtf_attribute_include_func(include: List[Dict[str, str]]) → Callable[[ngs_tools.gtf.GtfEntry], bool]

Helper function to create a filtering function to include certain GTF entries while processing. The returned function returns True if the entry should be included.

Parameters

include – List of dictionaries representing key-value pairs of attributes to include

Returns

Filter function

kb_python.ref.get_gtf_attribute_exclude_func(exclude: List[Dict[str, str]]) → Callable[[ngs_tools.gtf.GtfEntry], bool]

Helper function to create a filtering function to exclude certain GTF entries while processing. The returned function returns False if the entry should be excluded.

Parameters

exclude – List of dictionaries representing key-value pairs of attributes to exclude

Returns

Filter function
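The include and exclude filters can be illustrated with closures over plain attribute dictionaries. The sketch below stands in a `Dict[str, str]` for the attributes of a GTF entry, and it assumes that an entry matches a dictionary when all of that dictionary's key-value pairs are present, and that matching any one dictionary in the list is enough. Those matching semantics and the helper names are assumptions, not confirmed by this reference.

```python
from typing import Callable, Dict, List

Attributes = Dict[str, str]  # stand-in for the attributes of a GTF entry


def _matches(attributes: Attributes, group: Dict[str, str]) -> bool:
    """True if every key-value pair in `group` is present in `attributes`."""
    return all(attributes.get(key) == value for key, value in group.items())


def make_include_func(include: List[Dict[str, str]]) -> Callable[[Attributes], bool]:
    """Build a filter that returns True for entries that should be included."""
    def include_func(attributes: Attributes) -> bool:
        return any(_matches(attributes, group) for group in include)
    return include_func


def make_exclude_func(exclude: List[Dict[str, str]]) -> Callable[[Attributes], bool]:
    """Build a filter that returns False for entries that should be excluded."""
    def exclude_func(attributes: Attributes) -> bool:
        return not any(_matches(attributes, group) for group in exclude)
    return exclude_func
```

With `make_include_func([{"gene_biotype": "protein_coding"}])`, an entry whose attributes contain that pair passes the filter and any other entry is dropped.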

kb_python.ref.ref(fasta_paths: Union[List[str], str], gtf_paths: Union[List[str], str], cdna_path: str, index_path: str, t2g_path: str, n: int = 1, k: Optional[int] = None, include: Optional[List[Dict[str, str]]] = None, exclude: Optional[List[Dict[str, str]]] = None, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]

Generates files necessary to generate count matrices for single-cell RNA-seq.

Parameters
  • fasta_paths – Path or list of paths to genomic FASTA files

  • gtf_paths – Path or list of paths to GTF files

  • cdna_path – Path to generate the cDNA FASTA file

  • index_path – Path to output kallisto index

  • t2g_path – Path to output transcript-to-gene mapping

  • n – Split the index into n files

  • k – Override the default k-mer length (31), defaults to None

  • include – List of dictionaries representing key-value pairs of attributes to include

  • exclude – List of dictionaries representing key-value pairs of attributes to exclude

  • temp_dir – Path to temporary directory, defaults to tmp

  • overwrite – Overwrite an existing index file, defaults to False

Returns

Dictionary containing paths to generated file(s)

kb_python.ref.ref_kite(feature_path: str, fasta_path: str, index_path: str, t2g_path: str, n: int = 1, k: Optional[int] = None, no_mismatches: bool = False, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]

Generates files necessary for feature barcoding with the KITE workflow.

Parameters
  • feature_path – Path to TSV containing barcodes and feature names

  • fasta_path – Path to generate the FASTA file containing all sequences that are within Hamming distance 1 of the provided barcodes (including the barcodes themselves)

  • index_path – Path to output kallisto index

  • t2g_path – Path to output transcript-to-gene mapping

  • n – Split the index into n files

  • k – Override the calculated optimal k-mer length, defaults to None

  • no_mismatches – If True, do not generate Hamming distance 1 variants, defaults to False

  • temp_dir – Path to temporary directory, defaults to tmp

  • overwrite – Overwrite an existing index file, defaults to False

Returns

Dictionary containing paths to generated file(s)
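This reference does not state how the optimal k-mer length is calculated. One plausible rule, sketched below purely as an assumption, is to take the largest odd value that exceeds neither the shortest barcode length nor kallisto's maximum of 31 (kallisto requires an odd k). `choose_k` is a hypothetical helper, not a kb_python function.

```python
MAX_K = 31  # kallisto's maximum supported k-mer length


def choose_k(min_barcode_length: int, max_k: int = MAX_K) -> int:
    """Largest odd k that exceeds neither the shortest barcode nor max_k.

    Assumed rule for illustration only; the library's calculation may differ.
    """
    if min_barcode_length < 1:
        raise ValueError("barcode length must be positive")
    k = min(min_barcode_length, max_k)
    # kallisto only accepts odd k, so step down by one from an even value.
    return k if k % 2 == 1 else k - 1
```

Under this rule, 15-base barcodes give k = 15 and 16-base barcodes also give k = 15.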

kb_python.ref.ref_lamanno(fasta_paths: Union[List[str], str], gtf_paths: Union[List[str], str], cdna_path: str, intron_path: str, index_path: str, t2g_path: str, cdna_t2c_path: str, intron_t2c_path: str, n: int = 1, k: Optional[int] = None, flank: Optional[int] = None, include: Optional[List[Dict[str, str]]] = None, exclude: Optional[List[Dict[str, str]]] = None, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]

Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

Parameters
  • fasta_paths – Path or list of paths to genomic FASTA files

  • gtf_paths – Path or list of paths to GTF files

  • cdna_path – Path to generate the cDNA FASTA file

  • intron_path – Path to generate the intron FASTA file

  • index_path – Path to output kallisto index

  • t2g_path – Path to output transcript-to-gene mapping

  • cdna_t2c_path – Path to generate the cDNA transcripts-to-capture file

  • intron_t2c_path – Path to generate the intron transcripts-to-capture file

  • n – Split the index into n files

  • k – Override the default k-mer length (31), defaults to None

  • flank – Number of bases to include from the flanking regions when generating the intron FASTA, defaults to None, which sets the flanking region to be k - 1 bases.

  • include – List of dictionaries representing key-value pairs of attributes to include

  • exclude – List of dictionaries representing key-value pairs of attributes to exclude

  • temp_dir – Path to temporary directory, defaults to tmp

  • overwrite – Overwrite an existing index file, defaults to False

Returns

Dictionary containing paths to generated file(s)
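The default flank of k - 1 bases described for the flank parameter can be made concrete with a small interval calculation. The sketch below is illustrative only: `flanked_interval` is a hypothetical helper, and the 0-based half-open coordinates and the clipping at sequence boundaries are assumptions.

```python
from typing import Optional, Tuple

DEFAULT_K = 31  # default k-mer length stated in the parameter list


def flanked_interval(
    start: int,
    end: int,
    seq_len: int,
    k: Optional[int] = None,
    flank: Optional[int] = None,
) -> Tuple[int, int]:
    """Extend the intron [start, end) by `flank` bases on each side.

    When `flank` is None it falls back to k - 1, and when `k` is None the
    default k-mer length is used. The result is clipped to [0, seq_len].
    """
    k = DEFAULT_K if k is None else k
    flank = k - 1 if flank is None else flank
    if flank < 0:
        raise ValueError("flank must be non-negative")
    return max(0, start - flank), min(seq_len, end + flank)
```

With the defaults, an intron at [100, 200) on a 1000-base sequence is extended by 30 bases per side to [70, 230).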