kb_python.ref
Module Contents
Functions
| generate_kite_fasta | Generate a FASTA file for feature barcoding with the KITE workflow. |
| create_t2g_from_fasta | Parse FASTA headers to get transcripts-to-gene mapping. |
| create_t2c | Creates a transcripts-to-capture list from a FASTA file. |
| kallisto_index | Runs kallisto index. |
| split_and_index | Split a FASTA file into n parts and index each one. |
| download_reference | Downloads a provided reference file from a static url. |
| decompress_file | Decompress the given path if it is a .gz file. Otherwise, return the original path. |
| get_gtf_attribute_include_func | Helper function to create a filtering function to include certain GTF entries while processing. |
| get_gtf_attribute_exclude_func | Helper function to create a filtering function to exclude certain GTF entries while processing. |
| ref | Generates files necessary to generate count matrices for single-cell RNA-seq. |
| ref_kite | Generates files necessary for feature barcoding with the KITE workflow. |
| ref_lamanno | Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq. |
- exception kb_python.ref.RefError
Bases: Exception
Common base class for all non-exit exceptions.
- kb_python.ref.generate_kite_fasta(feature_path: str, out_path: str, no_mismatches: bool = False) → Tuple[str, int]
Generate a FASTA file for feature barcoding with the KITE workflow.
This FASTA contains the provided barcodes together with all sequences at Hamming distance 1 from them. The barcode file must be a 2-column TSV with the barcode sequences in the first column and the corresponding feature names in the second column. If the Hamming distance 1 variants of any pair of barcodes collide, no variants are generated for those barcodes.
- Parameters
feature_path – Path to TSV containing barcodes and feature names
out_path – Path to FASTA to generate
no_mismatches – If True, do not generate Hamming distance 1 variants, defaults to False
- Returns
Path to generated FASTA, smallest barcode length
- Raises
RefError – If there are barcodes of different lengths or if there are duplicate barcodes
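The variant generation and collision rule described above can be illustrated with a minimal sketch. It is not the kb_python implementation: the helper names, the A/C/G/T alphabet, and the reading of "collide" as "two barcodes share at least one variant" are all assumptions made for illustration.

```python
from typing import Dict, List, Set

BASES = "ACGT"  # assumed alphabet for this sketch


def hamming1_variants(barcode: str) -> Set[str]:
    """Return every sequence that differs from `barcode` at exactly one position."""
    variants = set()
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                variants.add(barcode[:i] + base + barcode[i + 1:])
    return variants


def variants_without_collisions(barcodes: List[str]) -> Dict[str, Set[str]]:
    """Map each barcode to its variants; colliding barcodes get no variants."""
    all_variants = {bc: hamming1_variants(bc) for bc in barcodes}
    colliding = set()
    for i, first in enumerate(barcodes):
        for second in barcodes[i + 1:]:
            # Assumed collision rule: the two variant sets overlap.
            if all_variants[first] & all_variants[second]:
                colliding.update((first, second))
    return {
        bc: (set() if bc in colliding else variants)
        for bc, variants in all_variants.items()
    }
```

With the barcodes AAAA, AAAT and GGGG, the first two share variants such as AAAC, so under this rule only GGGG would keep its 12 variants.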
- kb_python.ref.create_t2g_from_fasta(fasta_path: str, t2g_path: str) → Dict[str, str]
Parse FASTA headers to get transcripts-to-gene mapping.
- Parameters
fasta_path – Path to FASTA file
t2g_path – Path to output transcript-to-gene mapping
- Returns
Dictionary containing path to generated t2g mapping
- kb_python.ref.create_t2c(fasta_path: str, t2c_path: str) → Dict[str, str]
Creates a transcripts-to-capture list from a FASTA file.
- Parameters
fasta_path – Path to FASTA file
t2c_path – Path to output transcripts-to-capture list
- Returns
Dictionary containing path to generated t2c list
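As a rough illustration of building such a list, the sketch below collects one identifier per FASTA record. The helper name is hypothetical, and taking the first whitespace-delimited token of each header as the transcript ID is an assumption; the header layout that kb_python actually expects is not specified here.

```python
from typing import Iterable, List


def transcript_ids_from_fasta(lines: Iterable[str]) -> List[str]:
    """Collect one ID per FASTA record, in file order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited header token.
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids
```

Writing the returned IDs one per line to t2c_path would give a plain-text capture list under these assumptions.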
- kb_python.ref.kallisto_index(fasta_path: str, index_path: str, k: int = 31) → Dict[str, str]
Runs kallisto index.
- Parameters
fasta_path – Path to FASTA file
index_path – Path to output kallisto index
k – K-mer length, defaults to 31
- Returns
Dictionary containing path to generated index
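For orientation, the sketch below assembles (without running) a kallisto index command from the same three parameters. The helper name is hypothetical and the exact command kb_python issues may differ; the -i and -k options and the odd k-mer requirement come from the kallisto command-line help.

```python
from typing import List


def build_kallisto_index_command(
    fasta_path: str, index_path: str, k: int = 31
) -> List[str]:
    """Assemble, but do not execute, a `kallisto index` command line."""
    if k % 2 == 0:
        # kallisto documents the k-mer length as an odd number.
        raise ValueError("k-mer length must be odd")
    return ["kallisto", "index", "-i", index_path, "-k", str(k), fasta_path]
```

The resulting list could be passed to subprocess.run on a machine where the kallisto binary is installed.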
- kb_python.ref.split_and_index(fasta_path: str, index_prefix: str, n: int = 2, k: int = 31, temp_dir: str = 'tmp') → Dict[str, str]
Split a FASTA file into n parts and index each one.
- Parameters
fasta_path – Path to FASTA file
index_prefix – Prefix of output kallisto indices
n – Split the index into n files, defaults to 2
k – K-mer length, defaults to 31
temp_dir – Path to temporary directory, defaults to tmp
- Returns
Dictionary containing paths to the generated indices
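The splitting step can be pictured with a small sketch that parses FASTA records and distributes them into n groups. The round-robin strategy and the helper names are assumptions for illustration; kb_python may partition the records differently.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header line, concatenated sequence)


def read_fasta(lines: Iterable[str]) -> List[Record]:
    """Parse FASTA text lines into (header, sequence) records."""
    records: List[Record] = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records: List[Record], n: int = 2) -> List[List[Record]]:
    """Distribute records round-robin into n parts (assumed strategy)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        parts[i % n].append(record)
    return parts
```

Each part would then be written to its own FASTA file under temp_dir and indexed separately, with the output index names derived from index_prefix.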
- kb_python.ref.download_reference(reference: kb_python.config.Reference, files: Dict[str, str], temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]
Downloads a provided reference file from a static url.
The configuration for provided references is in config.py.
- Parameters
reference – A Reference object
files – Dictionary with command-line options as keys and paths as values. Used to determine whether all the paths required to download the given reference have been provided
temp_dir – Path to temporary directory, defaults to tmp
overwrite – Overwrite an existing index file, defaults to False
- Returns
Dictionary containing paths to generated file(s)
- Raises
RefError – If the required options are not provided
- kb_python.ref.decompress_file(path: str, temp_dir: str = 'tmp') → str
Decompress the given path if it is a .gz file. Otherwise, return the original path.
- Parameters
path – Path to the file
temp_dir – Path to temporary directory, defaults to tmp
- Returns
Unaltered path if the file is not a .gz file, otherwise path to the uncompressed file
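The documented behaviour can be approximated with the standard library, as in the sketch below. The helper name and the choice to place the output in temp_dir under the original file name minus its .gz suffix are assumptions, not necessarily what kb_python does.

```python
import gzip
import os
import shutil


def decompress_if_gz(path: str, temp_dir: str = "tmp") -> str:
    """Return `path` unchanged unless it ends in .gz; otherwise decompress it."""
    if not path.endswith(".gz"):
        return path
    os.makedirs(temp_dir, exist_ok=True)
    # Assumed naming: same base name without the .gz suffix, inside temp_dir.
    out_path = os.path.join(temp_dir, os.path.basename(path)[: -len(".gz")])
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Streaming with shutil.copyfileobj avoids loading a large genome FASTA into memory at once.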
- kb_python.ref.get_gtf_attribute_include_func(include: List[Dict[str, str]]) → Callable[[ngs_tools.gtf.GtfEntry], bool]
Helper function to create a filtering function to include certain GTF entries while processing. The returned function returns True if the entry should be included.
- Parameters
include – List of dictionaries representing key-value pairs of attributes to include
- Returns
Filter function
- kb_python.ref.get_gtf_attribute_exclude_func(exclude: List[Dict[str, str]]) → Callable[[ngs_tools.gtf.GtfEntry], bool]
Helper function to create a filtering function to exclude certain GTF entries while processing. The returned function returns False if the entry should be excluded.
- Parameters
exclude – List of dictionaries representing key-value pairs of attributes to exclude
- Returns
Filter function
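A minimal sketch of how the two filter factories could behave is given below. It uses plain dictionaries of attributes as a stand-in for ngs_tools.gtf.GtfEntry, and it assumes that an entry matches a dictionary when every key-value pair in that dictionary is present in the entry; both the helper names and this matching rule are assumptions.

```python
from typing import Callable, Dict, List

Attributes = Dict[str, str]  # stand-in for the attributes of a GTF entry


def _matches(attributes: Attributes, group: Dict[str, str]) -> bool:
    """True if every key-value pair in `group` is present in `attributes`."""
    return all(attributes.get(key) == value for key, value in group.items())


def make_include_func(include: List[Dict[str, str]]) -> Callable[[Attributes], bool]:
    def include_func(attributes: Attributes) -> bool:
        # Keep the entry if it matches at least one dictionary.
        return any(_matches(attributes, group) for group in include)

    return include_func


def make_exclude_func(exclude: List[Dict[str, str]]) -> Callable[[Attributes], bool]:
    def exclude_func(attributes: Attributes) -> bool:
        # Return False (drop the entry) if it matches any dictionary.
        return not any(_matches(attributes, group) for group in exclude)

    return exclude_func
```

For example, make_include_func([{"gene_biotype": "protein_coding"}]) would keep only entries whose gene_biotype attribute equals protein_coding, while the same list passed to make_exclude_func would drop exactly those entries.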
- kb_python.ref.ref(fasta_paths: Union[List[str], str], gtf_paths: Union[List[str], str], cdna_path: str, index_path: str, t2g_path: str, n: int = 1, k: Optional[int] = None, include: Optional[List[Dict[str, str]]] = None, exclude: Optional[List[Dict[str, str]]] = None, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]
Generates files necessary to generate count matrices for single-cell RNA-seq.
- Parameters
fasta_paths – Path or list of paths to genomic FASTA files
gtf_paths – Path or list of paths to GTF files
cdna_path – Path to generate the cDNA FASTA file
index_path – Path to output kallisto index
t2g_path – Path to output transcript-to-gene mapping
n – Split the index into n files, defaults to 1
k – Override the default k-mer length (31), defaults to None
include – List of dictionaries representing key-value pairs of attributes to include
exclude – List of dictionaries representing key-value pairs of attributes to exclude
temp_dir – Path to temporary directory, defaults to tmp
overwrite – Overwrite an existing index file, defaults to False
- Returns
Dictionary containing paths to generated file(s)
- kb_python.ref.ref_kite(feature_path: str, fasta_path: str, index_path: str, t2g_path: str, n: int = 1, k: Optional[int] = None, no_mismatches: bool = False, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]
Generates files necessary for feature barcoding with the KITE workflow.
- Parameters
feature_path – Path to TSV containing barcodes and feature names
fasta_path – Path to generate the FASTA file containing the provided barcodes and all sequences at Hamming distance 1 from them
index_path – Path to output kallisto index
t2g_path – Path to output transcript-to-gene mapping
n – Split the index into n files, defaults to 1
k – Override the calculated optimal k-mer length, defaults to None
no_mismatches – If True, do not generate Hamming distance 1 variants, defaults to False
temp_dir – Path to temporary directory, defaults to tmp
overwrite – Overwrite an existing index file, defaults to False
- Returns
Dictionary containing paths to generated file(s)
- kb_python.ref.ref_lamanno(fasta_paths: Union[List[str], str], gtf_paths: Union[List[str], str], cdna_path: str, intron_path: str, index_path: str, t2g_path: str, cdna_t2c_path: str, intron_t2c_path: str, n: int = 1, k: Optional[int] = None, flank: Optional[int] = None, include: Optional[List[Dict[str, str]]] = None, exclude: Optional[List[Dict[str, str]]] = None, temp_dir: str = 'tmp', overwrite: bool = False) → Dict[str, str]
Generates files necessary to generate RNA velocity matrices for single-cell RNA-seq.
- Parameters
fasta_paths – Path or list of paths to genomic FASTA files
gtf_paths – Path or list of paths to GTF files
cdna_path – Path to generate the cDNA FASTA file
intron_path – Path to generate the intron FASTA file
index_path – Path to output kallisto index
t2g_path – Path to output transcript-to-gene mapping
cdna_t2c_path – Path to generate the cDNA transcripts-to-capture file
intron_t2c_path – Path to generate the intron transcripts-to-capture file
n – Split the index into n files, defaults to 1
k – Override the default k-mer length (31), defaults to None
flank – Number of bases to include from the flanking regions when generating the intron FASTA, defaults to None, in which case the flanking region is k - 1 bases
include – List of dictionaries representing key-value pairs of attributes to include
exclude – List of dictionaries representing key-value pairs of attributes to exclude
temp_dir – Path to temporary directory, defaults to tmp
overwrite – Overwrite an existing index file, defaults to False
- Returns
Dictionary containing paths to generated file(s)
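The interaction between the k and flank defaults can be made concrete with a short sketch. The helper name is hypothetical; the default k-mer length of 31 and the k - 1 fallback for flank are taken from the parameter descriptions above.

```python
from typing import Optional, Tuple

DEFAULT_K = 31  # default k-mer length stated in the parameter description


def resolve_k_and_flank(
    k: Optional[int] = None, flank: Optional[int] = None
) -> Tuple[int, int]:
    """Fill in the documented defaults for k and flank."""
    resolved_k = DEFAULT_K if k is None else k
    # When flank is not given, it falls back to k - 1 bases.
    resolved_flank = resolved_k - 1 if flank is None else flank
    return resolved_k, resolved_flank
```

With neither argument set this yields k = 31 and flank = 30; passing k=21 alone yields flank = 20.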