kb_python.fasta
Module Contents
-
kb_python.fasta.logger
-
class kb_python.fasta.FASTA(fasta_path)
Utility class to easily read and manipulate FASTA files.
Parameters: fasta_path (str) – path to FASTA file
-
PARSER
-
GROUP_PARSER
-
BASEPAIRS
-
static make_header(seq_id, attributes)
Create a correctly-formatted FASTA header with the given sequence ID and attributes.
Parameters:
- seq_id (str) – sequence ID
- attributes (list) – list of key-value pairs corresponding to attributes of this sequence
Returns: FASTA header
Return type: str
-
static parse_header(line)
Parse information from a FASTA header.
Parameters: line (str) – FASTA header line
Returns: parsed information
Return type: dict
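The header layout itself is not described in this reference. As a minimal sketch, assuming a header of the form `>seq_id key:value key:value` (an assumption made purely for illustration, not a confirmed description of what `make_header` and `parse_header` do), building and parsing a header might look like the following. The names `make_header_sketch` and `parse_header_sketch`, the dictionary keys, and the example IDs are all hypothetical placeholders.

```python
def make_header_sketch(seq_id, attributes):
    """Build '>seq_id k1:v1 k2:v2' from a sequence ID and (key, value) pairs.

    The 'key:value' layout is an assumption for illustration only.
    """
    attrs = ' '.join(f'{key}:{value}' for key, value in attributes)
    # rstrip() drops the trailing space left when there are no attributes.
    return f'>{seq_id} {attrs}'.rstrip()


def parse_header_sketch(line):
    """Split an assumed '>seq_id k:v ...' header into ID and attribute pairs."""
    seq_id, *fields = line.strip().lstrip('>').split(' ')
    attributes = [tuple(field.split(':', 1)) for field in fields if field]
    return {'sequence_id': seq_id, 'attributes': attributes}


# Placeholder IDs, not taken from any real annotation.
header = make_header_sketch('ENST0001', [('gene_id', 'ENSG0001'), ('strand', '+')])
parsed = parse_header_sketch(header)
```

Keeping attributes as an ordered list of pairs, as the `attributes (list)` parameter suggests, lets the sketch round-trip a header without losing attribute order.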
-
static reverse_complement(sequence)
Get the reverse complement of the given DNA sequence.
Parameters: sequence (str) – DNA sequence
Returns: reverse complement
Return type: str
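For readers unfamiliar with the operation, reverse complementing can be sketched in plain Python as below. This is an independent illustration, not the library's implementation; the name `reverse_complement_sketch` is hypothetical, and the sketch assumes only the bases A, C, G and T need complementing (any other character passes through unchanged).

```python
# Complement table for unambiguous DNA bases, upper and lower case.
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement_sketch(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


result = reverse_complement_sketch('ATGCC')  # 'GGCAT'
```

Applying the function twice returns the original sequence, which is a convenient sanity check.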
-
entries(self)
Generator that yields one FASTA entry (sequence ID + sequence) at a time.
Returns: a generator that yields each FASTA entry as a tuple
Return type: generator
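The entry-at-a-time pattern can be sketched as a small generator over an open text handle. This is a standalone illustration under stated assumptions, not the method's actual code: the name `iter_fasta_entries` is hypothetical, the sequence ID is taken to be the first whitespace-delimited token of the header, and the exact contents of the tuples yielded by `entries` may differ.

```python
import io


def iter_fasta_entries(handle):
    """Yield (sequence_id, sequence) tuples from an open FASTA text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous entry, if there was one.
            if seq_id is not None:
                yield seq_id, ''.join(chunks)
            seq_id, chunks = line[1:].split(' ', 1)[0], []
        else:
            chunks.append(line)
    # Emit the final entry once the input is exhausted.
    if seq_id is not None:
        yield seq_id, ''.join(chunks)


example = io.StringIO('>chr1 description\nACGT\nTTGA\n>chr2\nGGCC\n')
entries = list(iter_fasta_entries(example))
```

Because only one entry is buffered at a time, memory use is bounded by the longest single sequence rather than by the file size.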
-
sort(self, out_path)
Sort the FASTA file by sequence ID.
Parameters: out_path (str) – path at which to write the sorted FASTA
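As a rough sketch of what sorting by sequence ID involves, the following reads all records into memory and writes them back in ID order. It is an assumption-laden illustration rather than the method's implementation: `read_records` and `sort_fasta_sketch` are hypothetical names, IDs are compared lexicographically, and the real method may well avoid holding the whole file in memory.

```python
import io


def read_records(handle):
    """Collect (header_line, sequence_lines) records from a FASTA handle."""
    records = []
    for line in handle:
        line = line.rstrip('\n')
        if line.startswith('>'):
            records.append((line, []))
        elif line and records:
            records[-1][1].append(line)
    return records


def sort_fasta_sketch(in_handle, out_handle):
    """Write records ordered by sequence ID (first token of the header)."""
    records = read_records(in_handle)
    records.sort(key=lambda record: record[0][1:].split(' ', 1)[0])
    for header, lines in records:
        out_handle.write(header + '\n')
        for line in lines:
            out_handle.write(line + '\n')


source = io.StringIO('>chr2\nGGCC\n>chr10\nAAAA\nTTTT\n>chr1\nACGT\n')
target = io.StringIO()
sort_fasta_sketch(source, target)
sorted_text = target.getvalue()
```

Note that lexicographic ordering places `chr10` before `chr2`; whatever ordering is used, the GTF should be sorted consistently so that chromosomes appear in the same order in both files.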
-
-
kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path)
Generate a cDNA FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to cDNA FASTA to generate
Returns: path to generated cDNA FASTA
Return type: str
-
kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, flank=30)
Generate an intron FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The intron for a specific transcript is the collection of the following:
1. transcript - exons (the transcript with its exons removed)
2. 5’ UTR
3. 3’ UTR
Additionally, flanks of `flank` bases (30 by default, i.e. k - 1 where k = 31) are added to each intron, and sections that overlap are combined into a single FASTA entry.
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to intron FASTA to generate
- flank (int, optional) – the size of intron flanks, in bases, defaults to 30
Returns: path to generated intron FASTA
Return type: str
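The flank-and-combine step described for `generate_intron_fasta` can be illustrated with a small interval routine. This is a sketch under assumptions, not the function's actual logic: `flank_and_merge` is a hypothetical name, intervals are taken to be 0-based half-open `(start, end)` pairs, touching intervals are merged along with overlapping ones, and clamping to `lower`/`upper` bounds is an added convenience.

```python
def flank_and_merge(intervals, flank=30, lower=0, upper=None):
    """Extend half-open (start, end) intervals by `flank` and merge overlaps."""
    merged = []
    for start, end in sorted(intervals):
        # Extend on both sides, staying inside the allowed bounds.
        start = max(lower, start - flank)
        end = end + flank if upper is None else min(upper, end + flank)
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: combine them.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = flank_and_merge([(100, 200), (240, 300), (500, 600)], flank=30)
```

In this example the first two intervals are 40 bases apart, so their 30-base flanks overlap and they collapse into one region, while the third stays separate.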
-
kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)
Generate a spliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to spliced FASTA to generate
Returns: path to generated spliced FASTA
Return type: str
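To make the 2 * (k - 1) entry length concrete, the sketch below builds a single window around one exon-exon junction from two exon sequences. It is an illustration only: `junction_window` is a hypothetical helper, it does not collapse overlapping regions, and exons shorter than k - 1 bases simply produce a shorter window here.

```python
def junction_window(upstream_exon, downstream_exon, k=31):
    """Join the last k-1 bases of one exon to the first k-1 of the next."""
    if k < 2:
        raise ValueError('k must be at least 2')
    flank = k - 1
    return upstream_exon[-flank:] + downstream_exon[:flank]


# Synthetic exons, chosen so the junction is easy to see.
exon_a = 'A' * 50
exon_b = 'C' * 40
window = junction_window(exon_a, exon_b)  # 30 bases from each side
```

With k = 31 each side contributes 30 bases, so every 31-mer drawn from the 60-base window spans the junction.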
-
kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)
Generate an unspliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).
Parameters:
- fasta_path (str) – path to genomic FASTA file
- gtf_path (str) – path to GTF file
- out_path (str) – path to unspliced FASTA to generate
Returns: path to generated unspliced FASTA
Return type: str