kb_python.fasta
Module Contents
Functions
generate_kite_fasta(feature_path, out_path, no_mismatches=False) | Generate a FASTA file for feature barcoding with the KITE workflow.
generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None) | Generate a cDNA FASTA using the genome and GTF.
generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30) | Generate an intron FASTA using the genome and GTF.
generate_spliced_fasta(fasta_path, gtf_path, out_path) | Generate a spliced FASTA using the genome and GTF.
generate_unspliced_fasta(fasta_path, gtf_path, out_path) | Generate an unspliced FASTA using the genome and GTF.

kb_python.fasta.logger

class kb_python.fasta.FASTA(fasta_path)
Utility class to easily read and manipulate FASTA files.
Parameters: fasta_path (str) – path to FASTA file

PARSER
GROUP_PARSER
COMPLEMENT
SEQUENCE_PARSER

static make_header(seq_id, attributes)
Create a correctly formatted FASTA header with the given sequence ID and attributes.
Parameters:
- seq_id (str) – sequence ID
- attributes (list) – list of key-value pairs corresponding to attributes of this sequence
Returns: FASTA header
Return type: str

static parse_header(line)
Parse information from a FASTA header.
Parameters: line (str) – FASTA header line
Returns: parsed information
Return type: dict

static reverse_complement(sequence)
Get the reverse complement of the given DNA sequence.
Parameters: sequence (str) – DNA sequence
Returns: reverse complement
Return type: str
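
To illustrate what reverse complementing involves, here is a minimal standalone sketch. It does not call kb_python: the table COMPLEMENT_SKETCH and the function name reverse_complement_sketch are assumptions made for this example, and the class's own COMPLEMENT mapping may cover a different set of characters.

```python
# Hypothetical stand-in for a complement table; not the library's COMPLEMENT.
COMPLEMENT_SKETCH = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(sequence: str) -> str:
    """Complement each base, then reverse the resulting string."""
    return sequence.translate(COMPLEMENT_SKETCH)[::-1]


print(reverse_complement_sketch("ATGC"))  # GCAT
```

Characters outside the table pass through unchanged in this sketch, which may differ from how the library treats ambiguous bases.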

entries(self, parse=True)
Generator that yields one FASTA entry (sequence ID + sequence) at a time.
Parameters: parse (bool, optional) – whether to parse the header into a dictionary, defaults to True
Returns: a generator that yields each FASTA entry as a tuple
Return type: generator
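
The one-entry-at-a-time behaviour can be sketched with a plain generator. This is an illustration only, not the method's implementation: fasta_entries_sketch is a hypothetical name, it reads from any open text handle, and it yields the raw header string rather than a parsed dictionary.

```python
import io


def fasta_entries_sketch(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines of one entry may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1\nACGT\nAC\n>seq2\nTTTT\n")
for header, sequence in fasta_entries_sketch(example):
    print(header, sequence)
```

Because entries are yielded lazily, only one sequence needs to be held in memory at a time, which matters for genome-sized files.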

sort(self, out_path)
Sort the FASTA file by sequence ID.
Parameters: out_path (str) – path to the sorted FASTA to generate
Returns: (path to sorted FASTA file, set of chromosomes in FASTA file)
Return type: tuple

kb_python.fasta.generate_kite_fasta(feature_path, out_path, no_mismatches=False)
Generate a FASTA file for feature barcoding with the KITE workflow.
This FASTA contains all sequences that are at Hamming distance 1 from the provided barcodes. The file of barcodes must be a 2-column TSV containing the barcode sequences in the first column and their corresponding feature names in the second column. If the Hamming distance 1 variants collide for any pair of barcodes, the variants for those barcodes are not generated.
Parameters:
- feature_path (str) – path to TSV containing barcodes and feature names
- out_path (str) – path to FASTA to generate
- no_mismatches (bool, optional) – whether to skip generating Hamming distance 1 variants, defaults to False
Raises:
- Exception – if there are barcodes of different lengths
- Exception – if there are duplicate barcodes
Returns: (path to generated FASTA, set of barcode lengths)
Return type: tuple
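
The variant generation and collision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the function's actual code: the helper names are hypothetical, the alphabet is assumed to be A, C, G and T, and a "collision" is taken to mean that two barcodes share at least one Hamming distance 1 variant.

```python
from itertools import combinations


def hamming_1_variants(barcode, alphabet="ACGT"):
    """Return every sequence that differs from `barcode` at exactly one position."""
    variants = set()
    for i, base in enumerate(barcode):
        for substitute in alphabet:
            if substitute != base:
                variants.add(barcode[:i] + substitute + barcode[i + 1:])
    return variants


def variants_without_collisions(barcodes):
    """Map each barcode to its variants, or to an empty set if it collides."""
    variants = {bc: hamming_1_variants(bc) for bc in barcodes}
    colliding = set()
    for a, b in combinations(barcodes, 2):
        # Assumed collision rule: the two barcodes share a variant.
        if variants[a] & variants[b]:
            colliding.update((a, b))
    return {bc: (set() if bc in colliding else v) for bc, v in variants.items()}


result = variants_without_collisions(["AAAA", "AAAC", "CCCC"])
for barcode, barcode_variants in result.items():
    print(barcode, len(barcode_variants))
# AAAA 0
# AAAC 0
# CCCC 12
```

A barcode of length L over a four-letter alphabet has 3L such variants, so the output grows linearly with barcode length.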

kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None)
Generate a cDNA FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to cDNA FASTA to generate
- chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes; defaults to None
Returns: path to generated cDNA FASTA
Return type: str

kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30)
Generate an intron FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The intron for a specific transcript is the collection of the following:
1. transcript - exons
2. 5’ UTR
3. 3’ UTR
Additionally, 30-bp flanks (k - 1 where k = 31) are appended to each intron, and sections that overlap are combined into a single FASTA entry.
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to intron FASTA to generate
- chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes; defaults to None
- flank (int, optional) – the size of intron flanks, in bases, defaults to 30
Returns: path to generated intron FASTA
Return type: str
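
The "transcript - exons" step, followed by flanking and merging, can be sketched with plain intervals. This is an illustration under assumptions, not the function's code: introns_with_flanks is a hypothetical helper, coordinates are taken as 0-based half-open (start, end) pairs rather than GTF's 1-based inclusive positions, UTR handling is omitted, and the only clamping applied is at position 0.

```python
def introns_with_flanks(transcript, exons, flank=30):
    """Return flanked, merged intron intervals for one transcript."""
    start, end = transcript

    # Step 1: subtract the exons from the transcript to get the gaps.
    gaps = []
    cursor = start
    for exon_start, exon_end in sorted(exons):
        if exon_start > cursor:
            gaps.append((cursor, exon_start))
        cursor = max(cursor, exon_end)
    if cursor < end:
        gaps.append((cursor, end))

    # Step 2: extend each gap by `flank` bases and merge intervals
    # that overlap or touch after the extension.
    merged = []
    for gap_start, gap_end in gaps:
        s, e = max(0, gap_start - flank), gap_end + flank
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


exons = [(100, 150), (200, 240), (400, 500)]
print(introns_with_flanks((100, 500), exons, flank=30))  # [(120, 430)]
print(introns_with_flanks((100, 500), exons, flank=10))  # [(140, 210), (230, 410)]
```

With a 30-base flank the two introns around the short middle exon overlap and collapse into one interval; with a 10-base flank they stay separate.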

kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)
Generate a spliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to spliced FASTA to generate
Returns: path to generated spliced FASTA
Return type: str
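
The 2 * (k - 1) window around a junction can be illustrated with two exon sequences. This is a sketch only: junction_window is a hypothetical helper, it works on already-extracted exon strings rather than on the genome and GTF, and it does not collapse overlapping regions.

```python
def junction_window(upstream_exon, downstream_exon, k=31):
    """Join the last k - 1 bases of one exon to the first k - 1 of the next."""
    if k < 2:
        raise ValueError("k must be at least 2")
    flank = k - 1
    # Slicing silently shortens the window when an exon has fewer
    # than k - 1 bases; how the library treats that case is not shown here.
    return upstream_exon[-flank:] + downstream_exon[:flank]


window = junction_window("A" * 50, "C" * 50)
print(len(window))  # 60
print(junction_window("AAACCC", "GGGTTT", k=4))  # CCCGGG
```

A window of this size means every k-mer drawn from it spans the junction by at least one base on each side.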

kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)
Generate an unspliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to unspliced FASTA to generate
Returns: path to generated unspliced FASTA
Return type: str