kb_python.fasta

Module Contents

Classes

FASTA Utility class to easily read and manipulate FASTA files.

Functions

generate_kite_fasta(feature_path, out_path, no_mismatches=False) Generate a FASTA file for feature barcoding with the KITE workflow.
generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None) Generate a cDNA FASTA using the genome and GTF.
generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30) Generate an intron FASTA using the genome and GTF.
generate_spliced_fasta(fasta_path, gtf_path, out_path) Generate a spliced FASTA using the genome and GTF.
generate_unspliced_fasta(fasta_path, gtf_path, out_path) Generate an unspliced FASTA using the genome and GTF.
kb_python.fasta.logger
class kb_python.fasta.FASTA(fasta_path)

Utility class to easily read and manipulate FASTA files.

Parameters: fasta_path (str) – path to FASTA file
PARSER
GROUP_PARSER
COMPLEMENT
SEQUENCE_PARSER
static make_header(seq_id, attributes)

Create a correctly-formatted FASTA header with the given sequence ID and attributes.

Parameters:
  • seq_id (str) – sequence ID
  • attributes (list) – list of key-value pairs corresponding to attributes of this sequence
Returns:

FASTA header

Return type:

str

static parse_header(line)

Parse information from a FASTA header.

Parameters: line (str) – FASTA header line
Returns: parsed information
Return type: dict
static reverse_complement(sequence)

Get the reverse complement of the given DNA sequence.

Parameters: sequence (str) – DNA sequence
Returns: reverse complement
Return type: str
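
As a rough illustration of what a reverse complement is, the following minimal sketch reverses the sequence and swaps each base for its complement. The COMPLEMENT mapping here is a hypothetical stand-in; the actual contents of FASTA.COMPLEMENT are not listed on this page and may cover more symbols.

```python
# Hypothetical complement table (an assumption, not the library's actual mapping).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(sequence: str) -> str:
    """Reverse the sequence, then replace each base with its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


print(reverse_complement("ATGC"))  # GCAT
```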
entries(self, parse=True)

Generator that yields one FASTA entry (sequence ID + sequence) at a time.

Parameters: parse (bool, optional) – whether or not to parse the header into a dictionary, defaults to True
Returns: a generator that yields a tuple of the FASTA entry
Return type: generator
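
The one-entry-at-a-time behaviour can be sketched with a plain generator. This is not the library's implementation: iter_fasta is a hypothetical helper that reads from any text handle and yields the raw header (without the leading ">") together with the joined sequence, without parsing the header.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, holding only one entry in memory at a time."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the final entry once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">chr1 note:demo\nACGT\nAC\n>chr2\nGG\n")
records = list(iter_fasta(example))
```

Because the function is a generator, a caller can loop over a large genome without loading every sequence at once.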
sort(self, out_path)

Sort the FASTA file by sequence ID.

Parameters: out_path (str) – path to generate the sorted FASTA
Returns: path to sorted FASTA file, set of chromosomes in FASTA file
Return type: tuple
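
To make the return value concrete, here is a small in-memory sketch of sorting FASTA entries by sequence ID and collecting the set of IDs. It is an assumption-laden illustration: sort_fasta_text is a hypothetical helper that works on a string rather than on files, takes the first whitespace-delimited token of each header as the sequence ID, and assumes ">" appears only at the start of header lines.

```python
from typing import Set, Tuple


def sort_fasta_text(text: str) -> Tuple[str, Set[str]]:
    """Return the FASTA text sorted by sequence ID, plus the set of IDs seen."""
    records = []
    for block in text.split(">")[1:]:
        header, _, body = block.partition("\n")
        seq_id = header.split()[0]  # assumed: ID is the first token of the header
        records.append((seq_id, header, body.replace("\n", "")))
    records.sort(key=lambda record: record[0])
    sorted_text = "".join(f">{header}\n{seq}\n" for _, header, seq in records)
    return sorted_text, {seq_id for seq_id, _, _ in records}


sorted_text, chromosomes = sort_fasta_text(">chr2 x\nGG\nTT\n>chr1\nAC\n")
```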
kb_python.fasta.generate_kite_fasta(feature_path, out_path, no_mismatches=False)

Generate a FASTA file for feature barcoding with the KITE workflow.

This FASTA contains all sequences at Hamming distance 1 from the provided barcodes. The file of barcodes must be a 2-column TSV containing the barcode sequences in the first column and their corresponding feature names in the second column. If the Hamming distance 1 variants of any pair of barcodes collide, the Hamming distance 1 variants for those barcodes are not generated.

Parameters:
  • feature_path (str) – path to TSV containing barcodes and feature names
  • out_path (str) – path to FASTA to generate
  • no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False
Raises:
  • Exception – if there are barcodes of different lengths
  • Exception – if there are duplicate barcodes
Returns:

(path to generated FASTA, set of barcode lengths)

Return type:

tuple
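
The variant generation described above can be sketched as follows. This is one possible reading, not the library's code: hamming1_variants and variants_without_collisions are hypothetical helpers, the alphabet is assumed to be ACGT, and a "collision" is assumed to mean that two barcodes share a sequence once each barcode is combined with its own distance 1 variants.

```python
from itertools import combinations
from typing import Dict, Iterable, Set


def hamming1_variants(barcode: str, alphabet: str = "ACGT") -> Set[str]:
    """Return every sequence that differs from the barcode at exactly one position."""
    variants = set()
    for i, base in enumerate(barcode):
        for substitute in alphabet:
            if substitute != base:
                variants.add(barcode[:i] + substitute + barcode[i + 1:])
    return variants


def variants_without_collisions(barcodes: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each barcode to its variants, or to an empty set if it collides with another."""
    barcodes = list(barcodes)
    variants = {bc: hamming1_variants(bc) for bc in barcodes}
    colliding = set()
    for a, b in combinations(barcodes, 2):
        # Assumed collision rule: the two neighbourhoods share at least one sequence.
        if (variants[a] | {a}) & (variants[b] | {b}):
            colliding.update((a, b))
    return {bc: (set() if bc in colliding else variants[bc]) for bc in barcodes}


result = variants_without_collisions(["AAAA", "AAAT", "CCCC"])
```

Here AAAA and AAAT are a single substitution apart, so neither receives variants, while CCCC keeps all 12 of its single-substitution variants.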

kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None)

Generate a cDNA FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to cDNA FASTA to generate
  • chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes, defaults to None
Returns:

path to generated cDNA FASTA

Return type:

str

kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30)

Generate an intron FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

The intron for a specific transcript is the collection of the following:
  1. the transcript minus its exons
  2. 5’ UTR
  3. 3’ UTR

Additionally, flanks of the size given by flank (30 bp by default, which is k - 1 for k = 31) are appended to each intron, and sections that overlap are combined into a single FASTA entry.

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to intron FASTA to generate
  • chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes, defaults to None
  • flank (int, optional) – the size of intron flanks, in bases, defaults to 30
Returns:

path to generated intron FASTA

Return type:

str
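
The flank-and-merge step described for introns can be sketched with a standard interval merge. This is a minimal illustration under stated assumptions: merge_with_flanks is a hypothetical helper, intervals are (start, end) pairs on one chromosome, and the lower bound of 0 stands in for the start of the sequence. The library's own coordinate conventions are not documented on this page.

```python
from typing import List, Tuple


def merge_with_flanks(
    intervals: List[Tuple[int, int]], flank: int = 30, lower: int = 0
) -> List[Tuple[int, int]]:
    """Extend each interval by `flank` on both sides, then merge any that overlap."""
    extended = sorted((max(lower, start - flank), end + flank) for start, end in intervals)
    merged: List[Tuple[int, int]] = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: grow it instead of adding a new entry.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


introns = [(100, 200), (240, 300), (500, 600)]
merged = merge_with_flanks(introns, flank=30)
```

With a 30-base flank, the first two introns overlap after extension and collapse into one entry, while the third stays separate.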

kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)

Generate a spliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to spliced FASTA to generate
Returns:

path to generated spliced FASTA

Return type:

str
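
To illustrate where the length 2 * (k - 1) comes from, the sketch below builds a window around each exon-exon junction by joining the last k - 1 bases of one exon with the first k - 1 bases of the next. junction_windows is a hypothetical helper operating on exon sequences already in transcript order; it does not collapse overlapping regions, and exons shorter than k - 1 simply yield shorter windows.

```python
from typing import List


def junction_windows(exons: List[str], k: int = 31) -> List[str]:
    """Return one window per exon-exon junction, up to 2 * (k - 1) bases long."""
    if k < 2:
        raise ValueError("k must be at least 2")
    flank = k - 1
    return [left[-flank:] + right[:flank] for left, right in zip(exons, exons[1:])]


# Small k chosen only to keep the example readable.
windows = junction_windows(["AAAAAC", "GTTTTT", "CCCCCC"], k=4)
```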

kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)

Generate an unspliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to unspliced FASTA to generate
Returns:

path to generated unspliced FASTA

Return type:

str