`kb_python.ref`

Module Contents

Functions

`sort_gtf`(gtf_path, out_path)	Sorts a GTF file based on its chromosome, start position, line number.
`sort_fasta`(fasta_path, out_path)	Sorts a FASTA file based on its header.
`check_chromosomes`(fasta_chromosomes, gtf_chromosomes)	Compares the two chromosome sets and outputs warnings if there are unique chromosomes in either set.
`create_t2g_from_fasta`(fasta_path, t2g_path)	Parse FASTA headers to get transcripts-to-gene mapping.
`create_t2g_from_gtf`(gtf_path, t2g_path, intron=False)	Creates a transcript-to-gene mapping from a GTF file.
`create_t2c`(fasta_path, t2c_path)	Creates a transcripts-to-capture list from a FASTA file.
`kallisto_index`(fasta_path, index_path, k=31)	Runs kallisto index.
`split_and_index`(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp')	Split a FASTA file into n parts and index each one.
`download_reference`(reference, files, temp_dir='tmp', overwrite=False)	Downloads a provided reference file from a static url.
`decompress_file`(path, temp_dir='tmp')	Decompress the given path if it is a .gz file. Otherwise, return the original path.
`ref`(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False)	Generates files necessary to generate count matrices for single-cell RNA-seq.
`ref_kite`(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False)	Generates files necessary for feature barcoding with the KITE workflow.
`ref_lamanno`(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False)	Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

kb_python.ref.logger

kb_python.ref.sort_gtf(gtf_path, out_path)

Sorts a GTF file based on its chromosome, start position, line number.

Parameters:	gtf_path (str) – path to GTF file
	out_path (str) – path to write the sorted GTF file
Returns:	path to sorted GTF file, set of chromosomes in GTF file
Return type:	tuple
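
The sort order described here (chromosome, then start position, then original line number) can be sketched in plain Python. This is a minimal illustration and not necessarily how kb_python implements it; the helper name `sort_gtf_lines` is hypothetical, and it works on in-memory lines instead of file paths.

```python
def sort_gtf_lines(lines):
    """Sort GTF lines by (chromosome, start, original line number).

    Hypothetical helper for illustration only. Comment lines starting
    with '#' and blank lines are skipped. Returns the sorted lines and
    the set of chromosomes that were seen.
    """
    entries = []
    chromosomes = set()
    for line_number, line in enumerate(lines):
        if not line.strip() or line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        # GTF columns: seqname, source, feature, start, end, ...
        chromosome, start = fields[0], int(fields[3])
        chromosomes.add(chromosome)
        entries.append((chromosome, start, line_number, line))
    # The line number breaks ties between entries with equal start.
    entries.sort(key=lambda entry: entry[:3])
    return [entry[3] for entry in entries], chromosomes


gtf_lines = [
    '2\tsrc\tgene\t500\t900\t.\t+\t.\tgene_id "G3";\n',
    '1\tsrc\tgene\t300\t700\t.\t+\t.\tgene_id "G2";\n',
    '1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n',
]
sorted_lines, chromosomes = sort_gtf_lines(gtf_lines)
```

Returning the chromosome set alongside the sorted data mirrors the tuple return value documented above.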

kb_python.ref.sort_fasta(fasta_path, out_path)

Sorts a FASTA file based on its header.

Parameters:	fasta_path (str) – path to FASTA file
	out_path (str) – path to write the sorted FASTA file
Returns:	path to sorted FASTA file, set of chromosomes in FASTA file
Return type:	tuple
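
As a rough sketch of header-based sorting, the snippet below reorders FASTA records held in a string. The helper name `sort_fasta_records` is hypothetical, and it assumes the chromosome name is the first whitespace-delimited token of each header; the library may parse headers differently.

```python
def sort_fasta_records(text):
    """Sort FASTA records by header; return (sorted text, chromosome set).

    Hypothetical helper for illustration only. Assumes '>' appears only
    at the start of header lines.
    """
    records = []
    for chunk in text.split('>')[1:]:
        header, _, sequence = chunk.partition('\n')
        records.append((header, sequence))
    records.sort(key=lambda record: record[0])
    # Assumption: the chromosome name is the first token of the header.
    chromosomes = {header.split()[0] for header, _ in records}
    sorted_text = ''.join(f'>{header}\n{sequence}' for header, sequence in records)
    return sorted_text, chromosomes


fasta = '>2 second\nGGCC\n>1 first\nAACC\nTTGG\n'
sorted_fasta, chromosomes = sort_fasta_records(fasta)
```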

kb_python.ref.check_chromosomes(fasta_chromosomes, gtf_chromosomes)

Compares the two chromosome sets and outputs warnings if there are unique chromosomes in either set.

Parameters:	fasta_chromosomes (set) – set of chromosomes found in FASTA
	gtf_chromosomes (set) – set of chromosomes found in GTF
Returns:	intersection of the two sets
Return type:	set
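
The comparison can be expressed with ordinary set operations. The sketch below is an illustration under assumptions: the function name `compare_chromosomes` and the warning messages are made up here and are not taken from the library.

```python
import logging

logger = logging.getLogger(__name__)


def compare_chromosomes(fasta_chromosomes, gtf_chromosomes):
    """Warn about chromosomes unique to either set; return the intersection."""
    fasta_only = fasta_chromosomes - gtf_chromosomes
    gtf_only = gtf_chromosomes - fasta_chromosomes
    if fasta_only:
        logger.warning(
            'Chromosomes found only in FASTA: %s', ', '.join(sorted(fasta_only))
        )
    if gtf_only:
        logger.warning(
            'Chromosomes found only in GTF: %s', ', '.join(sorted(gtf_only))
        )
    return fasta_chromosomes & gtf_chromosomes


common = compare_chromosomes({'1', '2', 'MT'}, {'1', '2', 'X'})
```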

kb_python.ref.create_t2g_from_fasta(fasta_path, t2g_path)

Parse FASTA headers to get transcripts-to-gene mapping.

Parameters:	fasta_path (str) – path to FASTA file
	t2g_path (str) – path to output transcript-to-gene mapping
Returns:	dictionary containing path to generated t2g mapping
Return type:	dict

kb_python.ref.create_t2g_from_gtf(gtf_path, t2g_path, intron=False)

Creates a transcript-to-gene mapping from a GTF file.

GTF entries whose feature is transcript are parsed for the transcript_id, gene_id and gene_name.

Parameters:	gtf_path (str) – path to GTF file
	t2g_path (str) – path to output transcript-to-gene mapping
	intron (bool, optional) – whether or not to include intron transcript ids (with the -I prefix), defaults to False
Returns:	dictionary containing path to generated t2g mapping
Return type:	dict
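
To make the parsing step concrete, here is a small sketch that extracts transcript_id, gene_id and gene_name from GTF entries whose feature column is transcript. The generator name `transcript_rows` and the attribute regex are assumptions for illustration; the intron option is not covered, and the library's own parser may behave differently.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def transcript_rows(gtf_lines):
    """Yield (transcript_id, gene_id, gene_name) for 'transcript' entries."""
    for line in gtf_lines:
        if line.startswith('#'):
            continue
        fields = line.rstrip('\n').split('\t')
        # Column 3 (index 2) is the feature type, column 9 the attributes.
        if len(fields) < 9 or fields[2] != 'transcript':
            continue
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        yield (
            attributes['transcript_id'],
            attributes['gene_id'],
            attributes.get('gene_name', ''),  # gene_name may be absent
        )


gtf_lines = [
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "Abc";\n',
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";\n',
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";\n',
]
rows = list(transcript_rows(gtf_lines))
```

Only the transcript line contributes a row; the gene and exon lines are skipped.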

kb_python.ref.create_t2c(fasta_path, t2c_path)

Creates a transcripts-to-capture list from a FASTA file.

Parameters:	fasta_path (str) – path to FASTA file
	t2c_path (str) – path to output transcripts-to-capture list
Returns:	dictionary containing path to generated t2c list
Return type:	dict

kb_python.ref.kallisto_index(fasta_path, index_path, k=31)

Runs kallisto index.

Parameters:	fasta_path (str) – path to FASTA file
	index_path (str) – path to output kallisto index
	k (int, optional) – k-mer length, defaults to 31
Returns:	dictionary containing path to generated index
Return type:	dict
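
For orientation, the sketch below builds the kind of command line such a wrapper might run, assuming the standard `kallisto index` options `-i` (output index) and `-k` (k-mer size). The helper `build_kallisto_index_command` is hypothetical and only constructs the argument list; it does not show how kb_python actually invokes kallisto.

```python
def build_kallisto_index_command(fasta_path, index_path, k=31):
    """Return the argument list for `kallisto index` without running it."""
    return ['kallisto', 'index', '-i', index_path, '-k', str(k), fasta_path]


command = build_kallisto_index_command('cdna.fa', 'index.idx')

# Running the command requires the kallisto binary on PATH, for example:
#   import subprocess
#   subprocess.run(command, check=True)
```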

kb_python.ref.split_and_index(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp')

Split a FASTA file into n parts and index each one.

Parameters:	fasta_path (str) – path to FASTA file
	index_prefix (str) – prefix of output kallisto indices
	n (int, optional) – split the index into n files, defaults to 2
	k (int, optional) – k-mer length, defaults to 31
	temp_dir (str, optional) – path to temporary directory, defaults to tmp
Returns:	dictionary containing path to generated index
Return type:	dict
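
One possible way to split a FASTA file into n parts is to deal records out round-robin, as sketched below. The round-robin strategy and the helper name `split_fasta_records` are assumptions for illustration; the documentation above does not state how the library partitions the records or how it names the resulting indices.

```python
def split_fasta_records(text, n=2):
    """Distribute FASTA records round-robin into n parts (list of strings)."""
    records = ['>' + chunk for chunk in text.split('>')[1:]]
    parts = [[] for _ in range(n)]
    for i, record in enumerate(records):
        parts[i % n].append(record)
    return [''.join(part) for part in parts]


fasta = '>t1\nAAAA\n>t2\nCCCC\n>t3\nGGGG\n'
parts = split_fasta_records(fasta, n=2)
```

Each part would then be indexed separately, for example by passing it to an indexing step with the same k.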

kb_python.ref.download_reference(reference, files, temp_dir='tmp', overwrite=False)

Downloads a provided reference file from a static url.

The configuration for provided references is in config.py.

Parameters:	reference (Reference) – a Reference object, as defined in config.py
	files (dict) – dictionary with command-line options as keys and paths as values; used to determine whether all the paths required to download the given reference have been provided
	temp_dir (str, optional) – path to temporary directory, defaults to tmp
	overwrite (bool, optional) – overwrite an existing index file, defaults to False
Raises:	Exception – if the required options are not provided
Returns:	dictionary containing paths to generated file(s)
Return type:	dict
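
The validation implied by the Raises entry can be sketched as a check that every required option has a path. Everything in this snippet is hypothetical: the helper `check_required_options`, the option names 'i' and 'g', and the error message are illustrative, and the real Reference object defined in config.py is not reproduced here.

```python
def check_required_options(required, files):
    """Raise if any required option has no path in `files`.

    `required` is an iterable of option names; `files` maps option
    names to paths. Returns the subset of `files` that is required.
    """
    missing = [option for option in required if not files.get(option)]
    if missing:
        raise Exception('Missing required option(s): ' + ', '.join(missing))
    return {option: files[option] for option in required}


# Hypothetical option names, chosen only for this example.
paths = check_required_options(('i', 'g'), {'i': 'index.idx', 'g': 't2g.txt'})

try:
    check_required_options(('i', 'g'), {'i': 'index.idx'})
    raised = False
except Exception:
    raised = True
```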

kb_python.ref.decompress_file(path, temp_dir='tmp')

Decompress the given path if it is a .gz file. Otherwise, return the original path.

Parameters:	path (str) – path to the file
Returns:	unaltered path if the file is not a .gz file, otherwise path to the uncompressed file
Return type:	str
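
The behavior described here can be approximated with the standard library. This is a minimal sketch under assumptions: the helper name `decompress_if_gzipped` and the choice to name the output after the input file minus its .gz suffix are illustrative, not confirmed details of the library.

```python
import gzip
import os
import shutil
import tempfile


def decompress_if_gzipped(path, temp_dir='tmp'):
    """Return `path` unchanged unless it ends in .gz; otherwise decompress
    it into `temp_dir` and return the path of the decompressed copy."""
    if not path.endswith('.gz'):
        return path
    os.makedirs(temp_dir, exist_ok=True)
    # Assumption: output keeps the input name without the .gz suffix.
    out_path = os.path.join(temp_dir, os.path.basename(path)[:-len('.gz')])
    with gzip.open(path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as work_dir:
    gz_path = os.path.join(work_dir, 'genes.gtf.gz')
    with gzip.open(gz_path, 'wt') as f:
        f.write('example\n')
    result = decompress_if_gzipped(gz_path, temp_dir=os.path.join(work_dir, 'tmp'))
    with open(result) as f:
        content = f.read()

unchanged = decompress_if_gzipped('genes.gtf')
```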

kb_python.ref.ref(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate count matrices for single-cell RNA-seq.

Parameters:	fasta_paths (list) – list of paths to genomic FASTA files
	gtf_paths (list) – list of paths to GTF files
	cdna_path (str) – path to generate the cDNA FASTA file
	index_path (str) – path to output kallisto index
	t2g_path (str) – path to output transcript-to-gene mapping
	n (int, optional) – split the index into n files, defaults to 1
	k (int, optional) – override the default k-mer length of 31, defaults to None
	temp_dir (str, optional) – path to temporary directory, defaults to tmp
	overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns:	dictionary containing paths to generated file(s)
Return type:	dict

kb_python.ref.ref_kite(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False)

Generates files necessary for feature barcoding with the KITE workflow.

Parameters:	feature_path (str) – path to TSV containing barcodes and feature names
	fasta_path (str) – path to generate the FASTA file containing all sequences that are Hamming distance 1 from the provided barcodes (including the barcodes themselves)
	index_path (str) – path to output kallisto index
	t2g_path (str) – path to output transcript-to-gene mapping
	n (int, optional) – split the index into n files, defaults to 1
	k (int, optional) – override the calculated optimal k-mer length, defaults to None
	no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False
	temp_dir (str, optional) – path to temporary directory, defaults to tmp
	overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns:	dictionary containing paths to generated file(s)
Return type:	dict
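
To illustrate what "sequences that are Hamming distance 1" from a barcode means, the sketch below enumerates every single-base substitution of a barcode together with the barcode itself. The helper `hamming_one_variants` and the ACGT alphabet are assumptions for illustration; the KITE workflow may use a different alphabet or naming scheme.

```python
def hamming_one_variants(barcode, alphabet='ACGT'):
    """Return the barcode plus every sequence at Hamming distance 1 from it."""
    variants = [barcode]
    for i, base in enumerate(barcode):
        for substitute in alphabet:
            if substitute != base:
                # Replace exactly one position, leaving the rest intact.
                variants.append(barcode[:i] + substitute + barcode[i + 1:])
    return variants


variants = hamming_one_variants('ACG')
```

A barcode of length L over a 4-letter alphabet yields 3 * L variants plus the original, so the FASTA grows linearly with barcode length.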

kb_python.ref.ref_lamanno(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False)

Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.

Parameters:	fasta_paths (list) – list of paths to genomic FASTA files
	gtf_paths (list) – list of paths to GTF files
	cdna_path (str) – path to generate the cDNA FASTA file
	intron_path (str) – path to generate the intron FASTA file
	index_path (str) – path to output kallisto index
	t2g_path (str) – path to output transcript-to-gene mapping
	cdna_t2c_path (str) – path to generate the cDNA transcripts-to-capture file
	intron_t2c_path (str) – path to generate the intron transcripts-to-capture file
	n (int, optional) – split the index into n files, defaults to 1
	k (int, optional) – override the default k-mer length of 31, defaults to None
	flank (int, optional) – number of bases to include from the flanking regions when generating the intron FASTA, defaults to None, which sets the flanking region to k - 1 bases
	temp_dir (str, optional) – path to temporary directory, defaults to tmp
	overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns:	dictionary containing paths to generated file(s)
Return type:	dict
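
The interaction between the k and flank defaults can be spelled out in a few lines. This sketch follows the parameter descriptions above (k falls back to 31, flank falls back to k - 1); the helper name `resolve_flank` is hypothetical and the library's internal handling may differ.

```python
DEFAULT_K = 31  # default k-mer length stated in the parameter description


def resolve_flank(k=None, flank=None):
    """Return the flank length, defaulting to k - 1 when flank is None."""
    k = DEFAULT_K if k is None else k
    return k - 1 if flank is None else flank


default_flank = resolve_flank()        # k=31, so flank is 30
short_kmer_flank = resolve_flank(k=21)  # flank follows k: 20
explicit_flank = resolve_flank(flank=5)  # explicit value wins: 5
```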
