kb_python.ref
Module Contents
Functions
sort_gtf(gtf_path, out_path) | Sorts a GTF file by chromosome, start position, and line number.
sort_fasta(fasta_path, out_path) | Sorts a FASTA file based on its header.
check_chromosomes(fasta_chromosomes, gtf_chromosomes) | Compares the two chromosome sets and outputs warnings if there are chromosomes unique to either set.
create_t2g_from_fasta(fasta_path, t2g_path) | Parses FASTA headers to get a transcript-to-gene mapping.
create_t2g_from_gtf(gtf_path, t2g_path, intron=False) | Creates a transcript-to-gene mapping from a GTF file.
create_t2c(fasta_path, t2c_path) | Creates a transcripts-to-capture list from a FASTA file.
kallisto_index(fasta_path, index_path, k=31) | Runs kallisto index.
split_and_index(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp') | Splits a FASTA file into n parts and indexes each one.
download_reference(reference, files, temp_dir='tmp', overwrite=False) | Downloads a provided reference file from a static URL.
decompress_file(path, temp_dir='tmp') | Decompresses the given path if it is a .gz file. Otherwise, returns the original path.
ref(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False) | Generates files necessary to generate count matrices for single-cell RNA-seq.
ref_kite(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False) | Generates files necessary for feature barcoding with the KITE workflow.
ref_lamanno(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False) | Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.
- kb_python.ref.logger
- kb_python.ref.sort_gtf(gtf_path, out_path)
Sorts a GTF file by chromosome, start position, and line number.
Parameters: - gtf_path (str) – path to GTF file
- out_path (str) – path to output the sorted GTF file
Returns: path to sorted GTF file, set of chromosomes in GTF file
Return type: tuple
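The three-part sort key (chromosome, start position, line number) can be illustrated with a small sketch. This is not kb_python's implementation: the helper name `sort_gtf_lines` is hypothetical, it works on in-memory lines rather than file paths, and it assumes tab-separated GTF with the chromosome in column 1 and the start position in column 4.

```python
def sort_gtf_lines(lines):
    """Sort GTF lines by (chromosome, start, original line number).

    Hypothetical helper for illustration only; not part of kb_python.
    Returns the sorted lines and the set of chromosomes seen.
    """
    keyed = []
    chromosomes = set()
    for line_number, line in enumerate(lines):
        # Skip blank lines and GTF comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chromosome, start = fields[0], int(fields[3])
        chromosomes.add(chromosome)
        keyed.append((chromosome, start, line_number, line))
    # The line number breaks ties, so equal entries keep their input order.
    keyed.sort(key=lambda entry: entry[:3])
    return [entry[3] for entry in keyed], chromosomes


example = [
    '#!genome-build example\n',
    '2\tsrc\tgene\t50\t90\t.\t+\t.\tgene_id "G3";\n',
    '1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id "G2";\n',
    '1\tsrc\tgene\t100\t150\t.\t-\t.\tgene_id "G1";\n',
]
sorted_lines, chroms = sort_gtf_lines(example)
```

Chromosome names are compared as strings in this sketch, so "10" sorts before "2"; whether the library does the same is not stated here.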
- kb_python.ref.sort_fasta(fasta_path, out_path)
Sorts a FASTA file based on its header.
Parameters: - fasta_path (str) – path to FASTA file
- out_path (str) – path to output the sorted FASTA file
Returns: path to sorted FASTA file, set of chromosomes in FASTA file
Return type: tuple
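As a rough illustration of sorting a FASTA by header, the sketch below groups sequence lines under their header and orders the records by header text. The helper `sort_fasta_records` is hypothetical and operates on a string rather than a file; taking the first whitespace-delimited token of the header as the chromosome name is an assumption made for this example.

```python
def sort_fasta_records(text):
    """Return (records sorted by header, set of chromosome names).

    Hypothetical helper for illustration only; not part of kb_python.
    Each record is a (header, sequence) tuple.
    """
    records = []
    header, sequence = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(sequence)))
            header, sequence = line, []
        elif line.strip():
            sequence.append(line.strip())
    if header is not None:
        records.append((header, "".join(sequence)))
    records.sort(key=lambda record: record[0])
    # Assumption: the chromosome is the first token after '>'.
    chromosomes = {h[1:].split()[0] for h, _ in records if h[1:].split()}
    return records, chromosomes


fasta_text = ">2 example\nACGT\nAC\n>1 example\nGG\n"
records, fasta_chroms = sort_fasta_records(fasta_text)
```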
- kb_python.ref.check_chromosomes(fasta_chromosomes, gtf_chromosomes)
Compares the two chromosome sets and outputs warnings if there are chromosomes unique to either set.
Parameters: - fasta_chromosomes (set) – set of chromosomes found in FASTA
- gtf_chromosomes (set) – set of chromosomes found in GTF
Returns: intersection of the two sets
Return type: set
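The comparison described above amounts to two set differences and an intersection. The following is a minimal sketch under that reading, not the library's code; the function name `compare_chromosomes` and the warning messages are made up for this example.

```python
import logging

logger = logging.getLogger(__name__)


def compare_chromosomes(fasta_chromosomes, gtf_chromosomes):
    """Warn about chromosomes unique to either set; return the shared ones.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    fasta_only = fasta_chromosomes - gtf_chromosomes
    gtf_only = gtf_chromosomes - fasta_chromosomes
    if fasta_only:
        logger.warning(
            "Chromosomes found only in the FASTA: %s",
            ", ".join(sorted(fasta_only)),
        )
    if gtf_only:
        logger.warning(
            "Chromosomes found only in the GTF: %s",
            ", ".join(sorted(gtf_only)),
        )
    return fasta_chromosomes & gtf_chromosomes


shared = compare_chromosomes({"1", "2", "MT"}, {"1", "2", "X"})
```

Here `shared` holds the chromosomes present in both inputs, and one warning is logged for each side that has extras.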
- kb_python.ref.create_t2g_from_fasta(fasta_path, t2g_path)
Parses FASTA headers to get a transcript-to-gene mapping.
Parameters: - fasta_path (str) – path to FASTA file
- t2g_path (str) – path to output transcript-to-gene mapping
Returns: dictionary containing path to generated t2g mapping
Return type: dict
- kb_python.ref.create_t2g_from_gtf(gtf_path, t2g_path, intron=False)
Creates a transcript-to-gene mapping from a GTF file.
GTF entries whose feature is transcript are parsed for the transcript_id, gene_id and gene_name.
Parameters: - gtf_path (str) – path to GTF file
- t2g_path (str) – path to output transcript-to-gene mapping
- intron (bool, optional) – whether or not to include intron transcript ids (with the -I prefix), defaults to False
Returns: dictionary containing path to generated t2g mapping
Return type: dict
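To make the parsing step concrete, the sketch below keeps only GTF entries whose feature column is transcript and pulls transcript_id, gene_id and gene_name out of the attribute column. It is an approximation under stated assumptions, not kb_python's parser: the helper `transcripts_to_genes` and the regular expression are hypothetical, attributes are assumed to be written as `key "value";`, and a missing gene_name is returned as an empty string.

```python
import re

# Assumption: attributes look like: gene_id "G1"; transcript_id "T1";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def transcripts_to_genes(gtf_lines):
    """Return (transcript_id, gene_id, gene_name) for each transcript entry.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    rows = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 is the feature type; column 9 holds the attributes.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attributes = dict(ATTRIBUTE_PATTERN.findall(fields[8]))
        rows.append((
            attributes["transcript_id"],
            attributes["gene_id"],
            attributes.get("gene_name", ""),
        ))
    return rows


example_gtf = [
    '1\tsrc\tgene\t100\t500\t.\t+\t.\tgene_id "G1"; gene_name "Abc";\n',
    '1\tsrc\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";\n',
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "Abc";\n',
]
t2g_rows = transcripts_to_genes(example_gtf)
```

The gene and exon lines are ignored, so only one row is produced for the single transcript entry.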
- kb_python.ref.create_t2c(fasta_path, t2c_path)
Creates a transcripts-to-capture list from a FASTA file.
Parameters: - fasta_path (str) – path to FASTA file
- t2c_path (str) – path to output transcripts-to-capture list
Returns: dictionary containing path to generated t2c list
Return type: dict
- kb_python.ref.kallisto_index(fasta_path, index_path, k=31)
Runs kallisto index.
Parameters: - fasta_path (str) – path to FASTA file
- index_path (str) – path to output kallisto index
- k (int, optional) – k-mer length, defaults to 31
Returns: dictionary containing path to generated index
Return type: dict
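For orientation, the parameters above line up with the `kallisto index` command-line options `-i` (output index) and `-k` (k-mer size). The sketch below only assembles such a command as an argument list; how kb_python builds and runs the command internally is not shown in this reference, and the helper name `build_kallisto_index_command` and the file names are placeholders.

```python
def build_kallisto_index_command(fasta_path, index_path, k=31):
    """Assemble a `kallisto index` invocation as an argument list.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    return ["kallisto", "index", "-i", index_path, "-k", str(k), fasta_path]


command = build_kallisto_index_command("cdna.fa", "index.idx")
```

Such a list could be handed to `subprocess.run(command, check=True)`, assuming the kallisto binary is available on the PATH.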
- kb_python.ref.split_and_index(fasta_path, index_prefix, n=2, k=31, temp_dir='tmp')
Splits a FASTA file into n parts and indexes each one.
Parameters: - fasta_path (str) – path to FASTA file
- index_prefix (str) – prefix of output kallisto indices
- n (int, optional) – split the index into n files, defaults to 2
- k (int, optional) – k-mer length, defaults to 31
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
Returns: dictionary containing path to generated index
Return type: dict
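This reference does not say how the FASTA records are divided among the n parts. Purely as an illustration, the sketch below distributes records round-robin; that strategy and the helper name `split_records` are assumptions, and the real function may partition differently.

```python
def split_records(records, n=2):
    """Distribute records across n parts in round-robin order.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    parts = [[] for _ in range(n)]
    for position, record in enumerate(records):
        parts[position % n].append(record)
    return parts


parts = split_records(["t1", "t2", "t3", "t4", "t5"], n=2)
```

Each part would then be indexed separately, giving n indices that share the index_prefix.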
- kb_python.ref.download_reference(reference, files, temp_dir='tmp', overwrite=False)
Downloads a provided reference file from a static URL.
The configuration for provided references is in config.py.
Parameters: - reference (Reference) – a Reference object, as defined in config.py
- files (dict) – dictionary with command-line options as keys and paths as values, used to determine whether all the paths required to download the given reference have been provided
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
- overwrite (bool, optional) – overwrite an existing index file, defaults to False
Raises: Exception – if the required options are not provided
Returns: dictionary containing paths to generated file(s)
Return type: dict
- kb_python.ref.decompress_file(path, temp_dir='tmp')
Decompresses the given path if it is a .gz file. Otherwise, returns the original path.
Parameters: - path (str) – path to the file
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
Returns: unaltered path if the file is not a .gz file, otherwise path to the uncompressed file
Return type: str
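The behaviour described above (pass non-.gz paths through, decompress .gz files into the temporary directory) could look roughly like the following. This is a sketch, not kb_python's implementation: the helper `decompress_if_gz` is hypothetical, and naming the output after the input with the `.gz` suffix removed is an assumption.

```python
import gzip
import os
import shutil
import tempfile


def decompress_if_gz(path, temp_dir="tmp"):
    """Return path unchanged unless it ends in .gz; otherwise decompress it.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    if not path.endswith(".gz"):
        return path
    os.makedirs(temp_dir, exist_ok=True)
    # Assumption: the output keeps the input name minus the .gz suffix.
    out_path = os.path.join(temp_dir, os.path.basename(path)[:-len(".gz")])
    with gzip.open(path, "rb") as source, open(out_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return out_path


with tempfile.TemporaryDirectory() as work_dir:
    gz_path = os.path.join(work_dir, "genes.gtf.gz")
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b"example\n")
    plain_path = decompress_if_gz(gz_path, temp_dir=os.path.join(work_dir, "tmp"))
    with open(plain_path, "rb") as handle:
        restored = handle.read()

untouched = decompress_if_gz("genes.gtf")
```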
- kb_python.ref.ref(fasta_paths, gtf_paths, cdna_path, index_path, t2g_path, n=1, k=None, temp_dir='tmp', overwrite=False)
Generates files necessary to generate count matrices for single-cell RNA-seq.
Parameters: - fasta_paths (list) – list of paths to genomic FASTA files
- gtf_paths (list) – list of paths to GTF files
- cdna_path (str) – path to generate the cDNA FASTA file
- index_path (str) – path to output kallisto index
- t2g_path (str) – path to output transcript-to-gene mapping
- n (int, optional) – split the index into n files, defaults to 1
- k (int, optional) – override default k-mer length (31), defaults to None
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
- overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns: dictionary containing paths to generated file(s)
Return type: dict
- kb_python.ref.ref_kite(feature_path, fasta_path, index_path, t2g_path, n=1, k=None, no_mismatches=False, temp_dir='tmp', overwrite=False)
Generates files necessary for feature barcoding with the KITE workflow.
Parameters: - feature_path (str) – path to TSV containing barcodes and feature names
- fasta_path (str) – path to generate the FASTA file containing all sequences that are within Hamming distance 1 of the provided barcodes (including the barcodes themselves)
- index_path (str) – path to output kallisto index
- t2g_path (str) – path to output transcript-to-gene mapping
- n (int, optional) – split the index into n files, defaults to 1
- k (int, optional) – override calculated optimal k-mer length, defaults to None
- no_mismatches (bool, optional) – whether to generate hamming distance 1 variants, defaults to False
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
- overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns: dictionary containing paths to generated file(s)
Return type: dict
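The Hamming distance 1 variants mentioned for fasta_path can be enumerated by substituting each position with every other base. The sketch below illustrates that idea only; the helper `hamming_one_variants` is hypothetical, and restricting the alphabet to A, C, G and T is an assumption about what the library generates.

```python
def hamming_one_variants(barcode, alphabet="ACGT"):
    """Return the barcode followed by every single-substitution variant.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    variants = [barcode]
    for position, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                variants.append(barcode[:position] + base + barcode[position + 1:])
    return variants


variants = hamming_one_variants("ACG")
```

For a barcode of length L over a 4-letter alphabet this yields 1 + 3L sequences, so the 3-base example produces 10.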
- kb_python.ref.ref_lamanno(fasta_paths, gtf_paths, cdna_path, intron_path, index_path, t2g_path, cdna_t2c_path, intron_t2c_path, n=1, k=None, flank=None, temp_dir='tmp', overwrite=False)
Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.
Parameters: - fasta_paths (list) – list of paths to genomic FASTA files
- gtf_paths (list) – list of paths to GTF files
- cdna_path (str) – path to generate the cDNA FASTA file
- intron_path (str) – path to generate the intron FASTA file
- index_path (str) – path to output kallisto index
- t2g_path (str) – path to output transcript-to-gene mapping
- cdna_t2c_path (str) – path to generate the cDNA transcripts-to-capture file
- intron_t2c_path (str) – path to generate the intron transcripts-to-capture file
- n (int, optional) – split the index into n files, defaults to 1
- k (int, optional) – override default k-mer length (31), defaults to None
- flank (int, optional) – number of bases to include from the flanking regions when generating the intron FASTA, defaults to None, in which case the flanking region is k - 1 bases
- temp_dir (str, optional) – path to temporary directory, defaults to tmp
- overwrite (bool, optional) – overwrite an existing index file, defaults to False
Returns: dictionary containing paths to generated file(s)
Return type: dict
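The interaction between k and flank described in the parameter list can be written out as a small sketch: k falls back to 31 when it is None, and flank falls back to k - 1 when it is None. The helper `resolve_flank` is hypothetical and only restates those documented defaults; it is not taken from the library.

```python
def resolve_flank(k=None, flank=None, default_k=31):
    """Return the flank length implied by the documented defaults.

    Hypothetical helper for illustration only; not part of kb_python.
    """
    k = default_k if k is None else k
    return k - 1 if flank is None else flank


default_flank = resolve_flank()               # k defaults to 31
short_kmer_flank = resolve_flank(k=21)        # flank follows k
explicit_flank = resolve_flank(k=21, flank=5) # explicit value wins
```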