kb_python.fasta

Module Contents

kb_python.fasta.logger
class kb_python.fasta.FASTA(fasta_path)

Utility class to easily read and manipulate FASTA files.

Parameters: fasta_path (str) – path to FASTA file
PARSER
GROUP_PARSER
BASEPAIRS
static make_header(seq_id, attributes)

Create a correctly-formatted FASTA header with the given sequence ID and attributes.

Parameters:
  • seq_id (str) – sequence ID
  • attributes (list) – list of key-value pairs corresponding to attributes of this sequence
Returns: FASTA header
Return type: str

static parse_header(line)

Parse information from a FASTA header.

Parameters: line (str) – FASTA header line
Returns: parsed information
Return type: dict
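
The relationship between make_header and parse_header can be illustrated with a standalone round trip. This is a minimal sketch, not the library's code: the header layout `>seq_id key:value key:value`, the two regular expressions, and the names `make_header_sketch` and `parse_header_sketch` are all assumptions made for illustration, and the keys of the real returned dict may differ.

```python
import re

# Assumed header layout (not confirmed by this page): ">seq_id key:value key:value"
HEADER_RE = re.compile(r"^>(?P<sequence_id>\S+)(?P<group>.*)")
PAIR_RE = re.compile(r"(?P<key>\S+?):(?P<value>\S+)")


def make_header_sketch(seq_id, attributes):
    """Build a header from a sequence ID and a list of (key, value) pairs."""
    pairs = " ".join(f"{key}:{value}" for key, value in attributes)
    return f">{seq_id} {pairs}".rstrip()


def parse_header_sketch(line):
    """Parse a header into a dict holding the sequence ID and its attributes."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a FASTA header: {line!r}")
    attributes = [(m["key"], m["value"]) for m in PAIR_RE.finditer(match["group"])]
    return {"sequence_id": match["sequence_id"], "attributes": attributes}


header = make_header_sketch("ENST0001.1", [("gene_id", "ENSG0001.1"), ("strand", "+")])
parsed = parse_header_sketch(header)
```

Keeping attributes as an ordered list of pairs, rather than a dict, lets a header carry the same key more than once and preserves the order in which the attributes were written.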
static reverse_complement(sequence)

Get the reverse complement of the given DNA sequence.

Parameters: sequence (str) – DNA sequence
Returns: reverse complement
Return type: str
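
For reference, a reverse complement can be computed as below. This is a standalone sketch; the translation table is an assumption and covers only A, C, G, T and N in both cases, whereas the class's BASEPAIRS mapping may handle more symbols.

```python
# Assumed complement table; the real BASEPAIRS mapping may cover more symbols.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


result = reverse_complement_sketch("ATGCCN")  # "NGGCAT"
```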
entries(self)

Generator that yields one FASTA entry (sequence ID + sequence) at a time.

Returns: a generator that yields each FASTA entry as a tuple
Return type: generator
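
The generator pattern described here can be sketched without the library. The function name `iter_fasta_entries` is hypothetical, and yielding the full header line as the first tuple element is an assumption; the real method may yield a parsed header or the bare sequence ID instead.

```python
import io


def iter_fasta_entries(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        else:
            chunks.append(line)
    # Emit the final entry once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


demo = io.StringIO(">chr1 desc\nACGT\nTTAA\n>chr2\nGGCC\n")
entries = list(iter_fasta_entries(demo))
```

Because entries are produced one at a time, only a single sequence has to be held in memory, which matters for chromosome-sized records.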
sort(self, out_path)

Sort the FASTA file by sequence ID.

Parameters: out_path (str) – path to generate the sorted FASTA
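
A minimal sketch of sorting a FASTA by sequence ID follows. It is not the library's implementation: the helper name `sort_fasta_sketch` is hypothetical, the sequence ID is assumed to be the first whitespace-delimited token after `>`, every header is assumed to carry an ID, and all entries are loaded into memory, which the real method may avoid.

```python
import os
import tempfile


def sort_fasta_sketch(in_path, out_path):
    """Write a copy of in_path with entries ordered by sequence ID."""
    entries = []
    header, chunks = None, []
    with open(in_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    entries.append((header, "".join(chunks)))
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks)))

    # Assumption: the ID is the first whitespace-delimited token after ">".
    entries.sort(key=lambda entry: entry[0][1:].split()[0])
    with open(out_path, "w") as out:
        for header, sequence in entries:
            out.write(f"{header}\n{sequence}\n")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.fa")
    dst = os.path.join(tmp, "sorted.fa")
    with open(src, "w") as handle:
        handle.write(">chr2\nGGCC\n>chr1\nACGT\n")
    sort_fasta_sketch(src, dst)
    with open(dst) as handle:
        sorted_text = handle.read()
```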
kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path)

Generate a cDNA FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to cDNA FASTA to generate
Returns: path to generated cDNA FASTA
Return type: str
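
The two ordering assumptions stated above can be checked before calling the function. The sketch below is a hypothetical helper, not part of kb_python: it takes GTF lines and the chromosome names in FASTA order, reads the chromosome from column 1 and the start position from column 4 of each GTF row, and treats a chromosome that is absent from the FASTA as a failure, which is a choice made for this example.

```python
def check_gtf_order(gtf_lines, fasta_chromosomes):
    """Return True if GTF rows follow the FASTA chromosome order and are
    sorted by start position within each chromosome."""
    rank = {name: i for i, name in enumerate(fasta_chromosomes)}
    last_rank, last_start = -1, 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[3])
        if chrom not in rank or rank[chrom] < last_rank:
            return False  # unknown chromosome, or chromosomes out of order
        if rank[chrom] > last_rank:
            last_rank, last_start = rank[chrom], 0  # moved on to the next chromosome
        if start < last_start:
            return False  # start positions went backwards
        last_start = start
    return True


good = [
    'chr1\tsrc\texon\t10\t20\t.\t+\t.\tgene_id "g1";',
    'chr1\tsrc\texon\t30\t40\t.\t+\t.\tgene_id "g1";',
    'chr2\tsrc\texon\t5\t9\t.\t-\t.\tgene_id "g2";',
]
ordered = check_gtf_order(good, ["chr1", "chr2"])
reversed_ok = check_gtf_order(list(reversed(good)), ["chr1", "chr2"])
```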

kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, flank=30)

Generate an intron FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

The intron for a specific transcript is the collection of the following:

  1. transcript - exons
  2. 5’ UTR
  3. 3’ UTR

Flanks of flank bases (30 by default, which is k - 1 for k = 31) are appended to each intron, and sections that overlap are combined into a single FASTA entry.

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to intron FASTA to generate
  • flank (int, optional) – the size of intron flanks, in bases, defaults to 30
Returns: path to generated intron FASTA
Return type: str
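
The "transcript minus exons, plus flanks, merged where overlapping" step can be sketched on plain intervals. This is an illustration under stated assumptions, not the library's algorithm: `flanked_introns` is a hypothetical helper, coordinates are 1-based and inclusive as in GTF, UTR handling is omitted, and flanks are clipped only at position 1 because the chromosome length is not known here.

```python
def flanked_introns(transcript, exons, flank=30):
    """Return merged intron intervals, each extended by `flank` bases.

    `transcript` is a (start, end) pair; `exons` is a list of (start, end)
    pairs sorted by start. All coordinates are 1-based and inclusive.
    """
    t_start, t_end = transcript
    introns = []
    cursor = t_start
    for e_start, e_end in exons:
        if e_start > cursor:
            introns.append((cursor, e_start - 1))  # gap before this exon
        cursor = max(cursor, e_end + 1)
    if cursor <= t_end:
        introns.append((cursor, t_end))  # gap after the last exon

    merged = []
    for start, end in introns:
        start, end = max(1, start - flank), end + flank
        if merged and start <= merged[-1][1]:
            # Overlaps the previous flanked intron: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


exons = [(100, 150), (200, 220), (300, 400)]
wide = flanked_introns((100, 400), exons, flank=30)   # flanks overlap, one interval
narrow = flanked_introns((100, 400), exons, flank=5)  # flanks stay separate
```

With a 30-base flank the two introns are separated by a 21-base exon, so their flanks overlap and they collapse into one entry; with a 5-base flank they remain distinct.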

kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)

Generate a spliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to spliced FASTA to generate
Returns: path to generated spliced FASTA
Return type: str
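
One reading of "entries of length 2 * (k - 1) centered around exon-exon splice junctions" is that each entry joins the last k - 1 bases of the upstream exon to the first k - 1 bases of the downstream exon. The sketch below follows that reading as an assumption; `junction_windows` is a hypothetical helper, exons shorter than k - 1 simply yield shorter windows, and the collapsing of overlapping regions is not shown.

```python
def junction_windows(sequence, exons, k=31):
    """Return one window per exon-exon junction.

    `sequence` is the chromosome sequence; `exons` are 1-based inclusive
    (start, end) pairs sorted by start.
    """
    flank = k - 1
    windows = []
    for (up_start, up_end), (down_start, down_end) in zip(exons, exons[1:]):
        # Last `flank` bases of the upstream exon (or the whole exon if shorter).
        left_start = max(up_start, up_end - flank + 1)
        # First `flank` bases of the downstream exon (or the whole exon if shorter).
        right_end = min(down_end, down_start + flank - 1)
        left = sequence[left_start - 1:up_end]
        right = sequence[down_start - 1:right_end]
        windows.append(left + right)
    return windows


# Small k so the example stays readable; the documented value is k = 31.
genome = "AAACCCGGGTTTAAACCC"
windows = junction_windows(genome, [(1, 6), (10, 15)], k=4)
```

Every k-mer that spans the junction fits inside such a window, which is why k - 1 bases on each side are enough.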

kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)

Generate an unspliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).

Parameters:
  • fasta_path (str) – path to genomic FASTA file
  • gtf_path (str) – path to GTF file
  • out_path (str) – path to unspliced FASTA to generate
Returns: path to generated unspliced FASTA
Return type: str