kb_python.fasta
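Module Contents
Functions
generate_kite_fasta(feature_path, out_path, no_mismatches=False) | Generate a FASTA file for feature barcoding with the KITE workflow.
generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None) | Generate a cDNA FASTA using the genome and GTF.
generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30) | Generate an intron FASTA using the genome and GTF.
generate_spliced_fasta(fasta_path, gtf_path, out_path) | Generate a spliced FASTA using the genome and GTF.
generate_unspliced_fasta(fasta_path, gtf_path, out_path) | Generate an unspliced FASTA using the genome and GTF.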
-
kb_python.fasta.logger
-
class kb_python.fasta.FASTA(fasta_path)
Utility class to easily read and manipulate FASTA files.
- Parameters
fasta_path (str) – path to FASTA file
-
PARSER
-
GROUP_PARSER
-
COMPLEMENT
-
SEQUENCE_PARSER
-
static make_header(seq_id, attributes)
Create a correctly formatted FASTA header with the given sequence ID and attributes.
- Parameters
seq_id (str) – sequence ID
attributes (list) – list of key-value pairs corresponding to attributes of this sequence
- Returns
FASTA header
- Return type
str
-
static parse_header(line)
Parse information from a FASTA header.
- Parameters
line (str) – FASTA header line
- Returns
parsed information
- Return type
dict
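To make the round trip between `make_header` and `parse_header` concrete, here is a minimal sketch. The header layout (`>seq_id key:value ...`), both regular expressions, the keys of the returned dictionary, and the names `make_header_sketch` and `parse_header_sketch` are assumptions for illustration; this page does not specify the actual format.

```python
import re

# Hypothetical header layout: ">seq_id key1:value1 key2:value2".
HEADER_RE = re.compile(r"^>(?P<sequence_id>\S+)(?P<rest>.*)$")
ATTRIBUTE_RE = re.compile(r"(?P<key>\S+?):(?P<value>\S+)")


def make_header_sketch(seq_id, attributes):
    """Join a sequence ID and (key, value) pairs into one header line."""
    parts = [f">{seq_id}"]
    parts.extend(f"{key}:{value}" for key, value in attributes)
    return " ".join(parts)


def parse_header_sketch(line):
    """Split a header line back into its sequence ID and attributes."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a FASTA header: {line!r}")
    attributes = {
        m.group("key"): m.group("value")
        for m in ATTRIBUTE_RE.finditer(match.group("rest"))
    }
    return {"sequence_id": match.group("sequence_id"), "attributes": attributes}
```

Under these assumptions, parsing a header produced by `make_header_sketch` recovers the same sequence ID and attribute pairs.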
-
static reverse_complement(sequence)
Get the reverse complement of the given DNA sequence.
- Parameters
sequence (str) – DNA sequence
- Returns
reverse complement
- Return type
str
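As an illustration of the operation itself, the sketch below complements each base and then reverses the string. The translation table and the name `reverse_complement_sketch` are assumptions; the contents of the class's `COMPLEMENT` attribute are not shown on this page, and ambiguity codes other than N are not handled here.

```python
# Assumed complement table: unambiguous bases plus N, in both cases.
COMPLEMENT_TABLE = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT_TABLE)[::-1]
```

Applying the function twice returns the original sequence, which is a quick sanity check for any complement table.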
-
entries(self, parse=True)
Generator that yields one FASTA entry (sequence ID + sequence) at a time.
- Parameters
parse (bool, optional) – whether or not to parse the header into a dictionary, defaults to True
- Returns
a generator that yields each FASTA entry as a tuple
- Return type
generator
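The generator pattern described above can be sketched as follows. This is a standalone illustration, not the class's implementation: the function name `fasta_entries_sketch` is hypothetical, and it always yields the raw header line rather than a parsed dictionary.

```python
def fasta_entries_sketch(fasta_path):
    """Yield (header, sequence) tuples, one per FASTA entry."""
    header = None
    chunks = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous entry, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line
                chunks = []
            else:
                chunks.append(line)
    # Emit the final entry, which has no following header to close it.
    if header is not None:
        yield header, "".join(chunks)
```

Because entries are yielded one at a time, only a single sequence is held in memory, which matters for genome-sized files.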
-
sort(self, out_path)
Sort the FASTA file by sequence ID.
- Parameters
out_path (str) – path to generate the sorted FASTA
- Returns
path to sorted FASTA file, set of chromosomes in FASTA file
- Return type
tuple
-
kb_python.fasta.generate_kite_fasta(feature_path, out_path, no_mismatches=False)
Generate a FASTA file for feature barcoding with the KITE workflow.
This FASTA contains all sequences that are at Hamming distance 1 from the provided barcodes. The file of barcodes must be a 2-column TSV containing the barcode sequences in the first column and their corresponding feature names in the second column. If the Hamming distance 1 variants of any pair of barcodes collide, the variants for those barcodes are not generated.
- Parameters
feature_path (str) – path to TSV containing barcodes and feature names
out_path (str) – path to FASTA to generate
no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False
- Raises
Exception – if there are barcodes of different lengths
Exception – if there are duplicate barcodes
- Returns
(path to generated FASTA, set of barcode lengths)
- Return type
tuple
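The variant generation and collision rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the function's implementation: it works on an in-memory dict instead of a TSV, the names `hamming_one_variants` and `kite_sequences_sketch` are hypothetical, and a "collision" is taken to mean that two barcodes share any sequence among their barcode-plus-variants sets.

```python
def hamming_one_variants(barcode, alphabet="ACGT"):
    """Return every sequence exactly one substitution away from barcode."""
    variants = set()
    for i, base in enumerate(barcode):
        for other in alphabet:
            if other != base:
                variants.add(barcode[:i] + other + barcode[i + 1:])
    return variants


def kite_sequences_sketch(features, no_mismatches=False):
    """Map each feature name to its barcode plus any collision-free variants.

    features: dict of feature name -> barcode sequence.
    """
    barcodes = list(features.values())
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("duplicate barcodes")
    if len({len(bc) for bc in barcodes}) > 1:
        raise ValueError("barcodes of different lengths")

    result = {name: [barcode] for name, barcode in features.items()}
    if no_mismatches:
        return result

    # Each barcode together with all of its Hamming distance 1 variants.
    neighborhoods = {
        name: hamming_one_variants(bc) | {bc} for name, bc in features.items()
    }

    # Assumed collision rule: two neighborhoods share at least one sequence.
    collided = set()
    names = list(features)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if neighborhoods[first] & neighborhoods[second]:
                collided.update((first, second))

    for name, barcode in features.items():
        if name not in collided:
            result[name].extend(sorted(neighborhoods[name] - {barcode}))
    return result
```

A barcode of length n over a four-letter alphabet has 3n such variants, so a collision-free 3-base barcode yields 10 sequences in total.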
-
kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None)
Generate a cDNA FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
- Parameters
fasta_path (str) – path to genomic FASTA file
gtf_path (str) – path to GTF file
out_path (str) – path to cDNA FASTA to generate
chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes. Defaults to None
- Returns
path to generated cDNA FASTA
- Return type
str
-
kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30)
Generate an intron FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.
The intron for a specific transcript is the collection of the following:
1. the transcript minus its exons
2. 5’ UTR
3. 3’ UTR
Additionally, flanks of `flank` bases (30 by default, which is k - 1 for k = 31) are appended to each intron, and sections that overlap are combined into a single FASTA entry.
- Parameters
fasta_path (str) – path to genomic FASTA file
gtf_path (str) – path to GTF file
out_path (str) – path to intron FASTA to generate
chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes. Defaults to None
flank (int, optional) – the size of intron flanks, in bases, defaults to 30
- Returns
path to generated intron FASTA
- Return type
str
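The flank-and-merge step can be illustrated with plain intervals. The sketch below is an assumption-laden simplification: the name `flanked_intervals_sketch` is hypothetical, coordinates are treated as 0-based and half-open, and intervals that merely touch after flanking are merged along with those that overlap.

```python
def flanked_intervals_sketch(intervals, flank=30, length=None):
    """Extend each (start, end) interval by flank bases and merge overlaps.

    Coordinates are 0-based, half-open; this convention is an assumption.
    length, if given, is the sequence length used to clip the right flank.
    """
    extended = []
    for start, end in sorted(intervals):
        new_start = max(0, start - flank)
        new_end = end + flank if length is None else min(length, end + flank)
        extended.append((new_start, new_end))

    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: grow it instead.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, introns at (100, 200) and (240, 300) are 40 bases apart, so their 30-base flanks overlap and the two collapse into a single interval.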
-
kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)
Generate a spliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).
- Parameters
fasta_path (str) – path to genomic FASTA file
gtf_path (str) – path to GTF file
out_path (str) – path to spliced FASTA to generate
- Returns
path to generated spliced FASTA
- Return type
str
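The 2 * (k - 1) window around an exon-exon junction can be sketched on two exon sequences. The name `junction_window_sketch` is hypothetical, and the handling of exons shorter than k - 1 bases (they simply contribute fewer bases) is an assumption that may differ from the function's behavior.

```python
def junction_window_sketch(upstream_exon, downstream_exon, k=31):
    """Join the last k - 1 bases of one exon to the first k - 1 of the next."""
    if k < 2:
        raise ValueError("k must be at least 2")
    flank = k - 1
    # Any k-mer that spans the junction lies entirely within this window.
    return upstream_exon[-flank:] + downstream_exon[:flank]
```

With k = 31 and exons of at least 30 bases each, the window is 60 bases long, which is the shortest sequence containing every 31-mer that crosses the junction.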
-
kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)
Generate an unspliced FASTA using the genome and GTF.
This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).
- Parameters
fasta_path (str) – path to genomic FASTA file
gtf_path (str) – path to GTF file
out_path (str) – path to unspliced FASTA to generate
- Returns
path to generated unspliced FASTA
- Return type
str