kb_python.ref

Module Contents

Functions

generate_kite_fasta(feature_path, out_path, no_mismatches=False)

Generate a FASTA file for feature barcoding with the KITE workflow.

create_t2g_from_fasta(fasta_path, t2g_path)

Parse FASTA headers to get transcripts-to-gene mapping.

create_t2c(fasta_path, t2c_path)

Creates a transcripts-to-capture list from a FASTA file.

kallisto_index(fasta_path, index_path, k=31)

Runs kallisto index.

split_and_index(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp')

Split a FASTA file into n parts and index each one.

download_reference(reference, files, temp_dir='tmp', overwrite=False)

Downloads a provided reference file from a static url.

decompress_file(path, temp_dir='tmp')

Decompress the given path if it is a .gz file. Otherwise, return the original path.

ref(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate count matrices for single-cell RNA-seq.

ref_kite(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False)

Generates files necessary for feature barcoding with the KITE workflow.

ref_lamanno(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

kb_python.ref.generate_kite_fasta(feature_path, out_path, no_mismatches=False)

Generate a FASTA file for feature barcoding with the KITE workflow.

This FASTA contains all sequences that are at Hamming distance 1 from the provided barcodes. The barcode file must be a 2-column TSV with the barcode sequences in the first column and their corresponding feature names in the second column. If the Hamming distance 1 variants of any pair of barcodes collide, the variants for those barcodes are not generated.

Parameters
  • feature_path (str) – path to TSV containing barcodes and feature names

  • out_path (str) – path to FASTA to generate

  • no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False

Raises
  • Exception – if there are barcodes of different lengths

  • Exception – if there are duplicate barcodes

Returns

(path to generated FASTA, set of barcode lengths)

Return type

tuple
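
The variant and collision rule described above can be sketched as follows. This is an illustrative sketch only, not the library's code: the helper names (hamming1_variants, sequences_per_barcode), the ACGT alphabet, and the reading of "collide" as "two barcodes share at least one distance-1 variant" are all assumptions.

```python
from itertools import combinations

BASES = "ACGT"  # assumed alphabet for barcode sequences


def hamming1_variants(barcode):
    """Return every sequence exactly one substitution away from `barcode`."""
    variants = set()
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                variants.add(barcode[:i] + base + barcode[i + 1:])
    return variants


def sequences_per_barcode(barcodes, no_mismatches=False):
    """Map each barcode to the sequences that would be emitted for it.

    Hypothetical helper: a barcode keeps its variants only if none of them
    is shared with the variants of another barcode.
    """
    variants = {bc: hamming1_variants(bc) for bc in barcodes}

    # Flag both members of every pair whose variant sets overlap.
    collided = set()
    for a, b in combinations(barcodes, 2):
        if variants[a] & variants[b]:
            collided.update((a, b))

    result = {}
    for bc in barcodes:
        sequences = [bc]  # the exact barcode is always kept
        if not no_mismatches and bc not in collided:
            sequences.extend(sorted(variants[bc]))
        result[bc] = sequences
    return result


# AAAA and AAAT share variants, so only CCCC receives mismatch variants.
demo = sequences_per_barcode(["AAAA", "AAAT", "CCCC"])
```

In this sketch a barcode of length L contributes up to 3 × L variants, which is why close barcodes are likely to collide when barcodes are short.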

kb_python.ref.create_t2g_from_fasta(fasta_path, t2g_path)

Parse FASTA headers to get transcripts-to-gene mapping.

Parameters
  • fasta_path (str) – path to FASTA file

  • t2g_path (str) – path to output transcript-to-gene mapping

Returns

dictionary containing path to generated t2g mapping

Return type

dict
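
A minimal sketch of deriving a transcript-to-gene mapping from FASTA headers is shown below. The header layout (a transcript ID followed by space-separated key:value fields such as gene_id and gene_name), the three-column tab-separated output, the function name write_t2g, and the "t2g" dictionary key are assumptions made for illustration; the reference above does not specify them.

```python
import os
import tempfile


def write_t2g(fasta_path, t2g_path):
    """Write one 'transcript<TAB>gene_id<TAB>gene_name' line per FASTA header.

    Hypothetical helper; assumes headers like '>T1 gene_id:G1 gene_name:Abc'.
    """
    with open(fasta_path) as fasta, open(t2g_path, "w") as t2g:
        for line in fasta:
            if not line.startswith(">"):
                continue  # skip sequence lines
            parts = line[1:].split()
            if not parts:
                continue  # skip empty headers
            transcript_id, fields = parts[0], parts[1:]
            attributes = dict(f.split(":", 1) for f in fields if ":" in f)
            gene_id = attributes.get("gene_id", "")
            gene_name = attributes.get("gene_name", "")
            t2g.write(f"{transcript_id}\t{gene_id}\t{gene_name}\n")
    return {"t2g": t2g_path}  # assumed key name


# Small demonstration on a throwaway FASTA file.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "cdna.fa")
    with open(fasta_path, "w") as f:
        f.write(">T1 gene_id:G1 gene_name:Abc\nACGT\n")
        f.write(">T2 gene_id:G1 gene_name:Abc\nTTGA\n")
    result = write_t2g(fasta_path, os.path.join(tmp, "t2g.txt"))
    with open(result["t2g"]) as f:
        t2g_lines = f.read().splitlines()
```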

kb_python.ref.create_t2c(fasta_path, t2c_path)

Creates a transcripts-to-capture list from a FASTA file.

Parameters
  • fasta_path (str) – path to FASTA file

  • t2c_path (str) – path to output transcripts-to-capture list

Returns

dictionary containing path to generated t2c list

Return type

dict

kb_python.ref.kallisto_index(fasta_path, index_path, k=31)

Runs kallisto index.

Parameters
  • fasta_path (str) – path to FASTA file

  • index_path (str) – path to output kallisto index

  • k (int, optional) – k-mer length, defaults to 31

Returns

dictionary containing path to generated index

Return type

dict
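
For orientation, the sketch below shows how a kallisto index call could be assembled from the same three arguments. The -i (output index) and -k (k-mer size) options belong to the kallisto command-line tool; the function names and the "index" dictionary key are hypothetical, and this is not necessarily how kb_python invokes kallisto internally.

```python
import subprocess


def kallisto_index_command(fasta_path, index_path, k=31):
    """Build the argument list for `kallisto index`."""
    return ["kallisto", "index", "-i", index_path, "-k", str(k), fasta_path]


def run_kallisto_index(fasta_path, index_path, k=31):
    """Run kallisto index; requires the kallisto binary on PATH."""
    subprocess.run(kallisto_index_command(fasta_path, index_path, k), check=True)
    return {"index": index_path}  # assumed key name


command = kallisto_index_command("cdna.fa", "index.idx")
```

Building the command separately from running it keeps the argument handling easy to inspect without kallisto installed.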

kb_python.ref.split_and_index(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp')

Split a FASTA file into n parts and index each one.

Parameters
  • fasta_path (str) – path to FASTA file

  • index_prefix (str) – prefix of output kallisto indices

  • n (int, optional) – split the index into n files, defaults to 2

  • k (int, optional) – k-mer length, defaults to 31

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

Returns

dictionary containing path to generated index

Return type

dict
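
The splitting step can be pictured with the sketch below, which distributes FASTA records across n part files in round-robin order. The round-robin strategy, the part file names, and the helper name split_fasta are assumptions; the reference above only states that the FASTA is split into n parts and each part is indexed.

```python
import os
import tempfile


def split_fasta(fasta_path, n, out_dir):
    """Split a FASTA file into n files, assigning records round-robin."""
    part_paths = [os.path.join(out_dir, f"part_{i}.fa") for i in range(n)]
    handles = [open(path, "w") for path in part_paths]
    try:
        record = -1  # index of the record currently being copied
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    record += 1  # a header starts a new record
                if record >= 0:
                    handles[record % n].write(line)
    finally:
        for handle in handles:
            handle.close()
    return part_paths


# Three records split into two parts: records 1 and 3, then record 2.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "all.fa")
    with open(source, "w") as f:
        f.write(">a\nAC\nGT\n>b\nTT\n>c\nGG\n")
    parts = split_fasta(source, 2, tmp)
    part_contents = []
    for path in parts:
        with open(path) as f:
            part_contents.append(f.read())
```

Each part file would then be indexed separately, producing one index per part.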

kb_python.ref.download_reference(reference, files, temp_dir='tmp', overwrite=False)

Downloads a provided reference file from a static url.

The configuration for provided references is in config.py.

Parameters
  • reference (Reference) – a Reference object, as defined in config.py

  • files (dict) – dictionary with command-line options as keys and paths as values. Used to determine whether all the paths required to download the given reference have been provided

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

Raises

Exception – if the required options are not provided

Returns

dictionary containing paths to generated file(s)

Return type

dict

kb_python.ref.decompress_file(path, temp_dir='tmp')

Decompress the given path if it is a .gz file. Otherwise, return the original path.

Parameters

path (str) – path to the file

Returns

unaltered path if the file is not a .gz file, otherwise path to the uncompressed file

Return type

str
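
The documented behavior (return the path untouched unless it ends in .gz) can be sketched with the standard library as follows. The function name decompress_if_gz and the choice to name the output after the input minus its .gz suffix are assumptions for illustration.

```python
import gzip
import os
import shutil
import tempfile


def decompress_if_gz(path, temp_dir="tmp"):
    """Return `path` unchanged, or the path to a decompressed copy of a .gz file."""
    if not path.endswith(".gz"):
        return path
    os.makedirs(temp_dir, exist_ok=True)
    # Assumed naming: same file name without the '.gz' suffix, inside temp_dir.
    out_path = os.path.join(temp_dir, os.path.basename(path)[:-3])
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Demonstration on a throwaway gzip file.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "genes.gtf.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b"chr1\ttest\n")
    unpacked = decompress_if_gz(gz_path, temp_dir=os.path.join(tmp, "work"))
    unpacked_name = os.path.basename(unpacked)
    with open(unpacked, "rb") as f:
        unpacked_bytes = f.read()
    plain = decompress_if_gz(os.path.join(tmp, "genes.gtf"))
    plain_name = os.path.basename(plain)
```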

kb_python.ref.ref(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate count matrices for single-cell RNA-seq.

Parameters
  • fasta_paths (list) – list of paths to genomic FASTA files

  • gtf_paths (list) – list of paths to GTF files

  • cdna_path (str) – path to generate the cDNA FASTA file

  • index_path (str) – path to output kallisto index

  • t2g_path (str) – path to output transcript-to-gene mapping

  • n (int) – split the index into n files

  • k (int, optional) – override default k-mer length (31), defaults to None

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

Returns

dictionary containing paths to generated file(s)

Return type

dict

kb_python.ref.ref_kite(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False)

Generates files necessary for feature barcoding with the KITE workflow.

Parameters
  • feature_path (str) – path to TSV containing barcodes and feature names

  • fasta_path (str) – path to generate FASTA file containing all sequences that are at Hamming distance 1 from the provided barcodes (including the barcodes themselves)

  • index_path (str) – path to output kallisto index

  • t2g_path (str) – path to output transcript-to-gene mapping

  • n (int) – split the index into n files

  • k (int, optional) – override calculated optimal k-mer length, defaults to None

  • no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

Returns

dictionary containing paths to generated file(s)

Return type

dict

kb_python.ref.ref_lamanno(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

Parameters
  • fasta_paths (list) – list of paths to genomic FASTA files

  • gtf_paths (list) – list of paths to GTF files

  • cdna_path (str) – path to generate the cDNA FASTA file

  • intron_path (str) – path to generate the intron FASTA file

  • index_path (str) – path to output kallisto index

  • t2g_path (str) – path to output transcript-to-gene mapping

  • cdna_t2c_path (str) – path to generate the cDNA transcripts-to-capture file

  • intron_t2c_path (str) – path to generate the intron transcripts-to-capture file

  • n (int) – split the index into n files

  • k (int, optional) – override default k-mer length (31), defaults to None

  • flank (int, optional) – number of bases to include from the flanking regions when generating the intron FASTA, defaults to None, which sets the flanking region to be k - 1 bases.

  • temp_dir (str, optional) – path to temporary directory, defaults to tmp

  • overwrite (bool, optional) – overwrite an existing index file, defaults to False

Returns

dictionary containing paths to generated file(s)

Return type

dict
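
The interaction between the k and flank defaults described above can be written out as a small sketch. The helper name resolve_k_and_flank and the DEFAULT_K constant are hypothetical; only the rule itself (k falls back to 31, flank falls back to k - 1) comes from the parameter descriptions.

```python
DEFAULT_K = 31  # default k-mer length stated in the parameter description


def resolve_k_and_flank(k=None, flank=None):
    """Apply the documented defaults: k -> 31, flank -> k - 1."""
    k = DEFAULT_K if k is None else k
    flank = k - 1 if flank is None else flank
    return k, flank


defaults = resolve_k_and_flank()
custom_k = resolve_k_and_flank(k=21)
custom_flank = resolve_k_and_flank(flank=5)
```

With both arguments left as None, the flanking region is therefore 30 bases on each side of an intron.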