kb_python.fasta

Module Contents

Classes

FASTA

Utility class to easily read and manipulate FASTA files.

Functions

generate_kite_fasta(feature_path, out_path, no_mismatches=False)

Generate a FASTA file for feature barcoding with the KITE workflow.

generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None)

Generate a cDNA FASTA using the genome and GTF.

generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30)

Generate an intron FASTA using the genome and GTF.

generate_spliced_fasta(fasta_path, gtf_path, out_path)

Generate a spliced FASTA using the genome and GTF.

generate_unspliced_fasta(fasta_path, gtf_path, out_path)

Generate an unspliced FASTA using the genome and GTF.

kb_python.fasta.logger
class kb_python.fasta.FASTA(fasta_path)

Utility class to easily read and manipulate FASTA files.

Parameters

fasta_path (str) – path to FASTA file

PARSER
GROUP_PARSER
COMPLEMENT
SEQUENCE_PARSER
static make_header(seq_id, attributes)

Create a correctly-formatted FASTA header with the given sequence ID and attributes.

Parameters
  • seq_id (str) – sequence ID

  • attributes (list) – list of key-value pairs corresponding to attributes of this sequence

Returns

FASTA header

Return type

str

static parse_header(line)

Parse information from a FASTA header.

Parameters

line (str) – FASTA header line

Returns

parsed information

Return type

dict
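
The sketch below illustrates how building and parsing a header can round-trip. It is not the library's implementation: the header layout (`>SEQ_ID key:value key:value`), the two regular expressions, the function names, and the keys of the returned dictionary are all assumptions made for this example.

```python
import re

# Assumed layout for this sketch only: ">SEQ_ID key:value key:value".
HEADER_RE = re.compile(r"^>(?P<sequence_id>\S+)(?P<rest>.*)$")
PAIR_RE = re.compile(r"(?P<key>\S+?):(?P<value>\S+)")


def make_header_sketch(seq_id, attributes):
    """Join the sequence ID and the key-value pairs into one header line."""
    pairs = " ".join(f"{key}:{value}" for key, value in attributes)
    return f">{seq_id} {pairs}".rstrip()


def parse_header_sketch(line):
    """Split a header line back into its ID and attribute dictionary."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a FASTA header: {line!r}")
    attributes = {
        m.group("key"): m.group("value")
        for m in PAIR_RE.finditer(match.group("rest"))
    }
    # The dictionary keys used here are hypothetical.
    return {"sequence_id": match.group("sequence_id"), "attributes": attributes}


header = make_header_sketch("ENST0001.1", [("gene_id", "ENSG0001.1"), ("strand", "+")])
parsed = parse_header_sketch(header)
```

Passing attributes as a list of pairs, rather than a dictionary, keeps their order stable in the generated header.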

static reverse_complement(sequence)

Get the reverse complement of the given DNA sequence.

Parameters

sequence (str) – DNA sequence

Returns

reverse complement

Return type

str
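
To illustrate what this method computes, here is a minimal standalone sketch. It does not reproduce the library's code: the function name is hypothetical, and the complement table below is an assumption that covers only A, C, G, T and N (the class's own `COMPLEMENT` attribute may differ).

```python
# Hypothetical complement table for this sketch.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement_sketch(sequence):
    """Walk the sequence backwards and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


print(reverse_complement_sketch("ATGC"))  # GCAT
```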

entries(self, parse=True)

Generator that yields one FASTA entry (sequence ID + sequence) at a time.

Parameters

parse (bool, optional) – whether or not to parse the header into a dictionary, defaults to True

Returns

a generator that yields a tuple of the FASTA entry

Return type

generator
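
The generator pattern described above can be sketched as follows. This is an illustration under assumptions, not the class's code: `iter_fasta_entries` is a hypothetical name, it takes an open text handle rather than a path, and it yields the raw header line (roughly what `parse=False` suggests) instead of a parsed dictionary.

```python
import io


def iter_fasta_entries(handle):
    """Yield (header, sequence) tuples one at a time from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    # Emit the final entry once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">chr1 note:demo\nACGT\nTTAA\n>chr2\nGGCC\n")
entries = list(iter_fasta_entries(example))
```

Because entries are yielded one at a time, only a single sequence needs to be held in memory while iterating.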

sort(self, out_path)

Sort the FASTA file by sequence ID.

Parameters

out_path (str) – path to generate the sorted FASTA

Returns

path to sorted FASTA file, set of chromosomes in FASTA file

Return type

tuple
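
A simplified sketch of sorting by sequence ID is shown below. It is an assumption-laden illustration: `sort_fasta_sketch` is a hypothetical function, it loads every entry into memory (which the real method may well avoid for large genomes), and it treats the first whitespace-delimited token of each header as the sequence ID.

```python
import os
import tempfile


def sort_fasta_sketch(fasta_path, out_path):
    """Write entries ordered by sequence ID; return (out_path, set of IDs)."""
    entries = {}
    seq_id = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumed: the ID is the first token after ">".
                seq_id = line[1:].split()[0]
                entries[seq_id] = [line]
            elif seq_id is not None:
                entries[seq_id].append(line)
    with open(out_path, "w") as out:
        for key in sorted(entries):
            out.writelines(entries[key])
    return out_path, set(entries)


# Small demonstration on a temporary two-entry file.
workdir = tempfile.mkdtemp()
unsorted_path = os.path.join(workdir, "genome.fa")
sorted_path = os.path.join(workdir, "genome.sorted.fa")
with open(unsorted_path, "w") as handle:
    handle.write(">chr2\nGGCC\n>chr1\nACGT\n")
result_path, chromosomes = sort_fasta_sketch(unsorted_path, sorted_path)
```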

kb_python.fasta.generate_kite_fasta(feature_path, out_path, no_mismatches=False)

Generate a FASTA file for feature barcoding with the KITE workflow.

This FASTA contains all sequences that are Hamming distance 1 from the provided barcodes. The file of barcodes must be a 2-column TSV containing the barcode sequences in the first column and their corresponding feature names in the second column. If Hamming distance 1 variants collide for any pair of barcodes, the Hamming distance 1 variants for those barcodes are not generated.

Parameters
  • feature_path (str) – path to TSV containing barcodes and feature names

  • out_path (str) – path to FASTA to generate

  • no_mismatches (bool, optional) – if True, Hamming distance 1 variants are not generated, defaults to False

Raises
  • Exception – if there are barcodes of different lengths

  • Exception – if there are duplicate barcodes

Returns

(path to generated FASTA, set of barcode lengths)

Return type

tuple
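
The variant generation and collision rule described above can be sketched as follows. This is a standalone illustration, not the library's code: both function names are hypothetical, the alphabet is assumed to be A, C, G and T, and the sketch assumes that each original barcode is emitted alongside its variants.

```python
from itertools import combinations


def hamming1_variants(barcode, alphabet="ACGT"):
    """Return every sequence that differs from `barcode` at exactly one position."""
    variants = set()
    for i, base in enumerate(barcode):
        for other in alphabet:
            if other != base:
                variants.add(barcode[:i] + other + barcode[i + 1:])
    return variants


def kite_sequences_sketch(features, no_mismatches=False):
    """Map each barcode to the list of sequences emitted for it.

    `features` maps barcode -> feature name (the two TSV columns).
    """
    variants = {bc: hamming1_variants(bc) for bc in features}
    # Barcodes whose variant sets overlap lose all of their variants.
    colliding = set()
    for a, b in combinations(features, 2):
        if variants[a] & variants[b]:
            colliding.update((a, b))
    sequences = {}
    for barcode in features:
        sequences[barcode] = [barcode]
        if not no_mismatches and barcode not in colliding:
            sequences[barcode] += sorted(variants[barcode])
    return sequences


features = {"AAAA": "feature_1", "AATT": "feature_2", "GGGG": "feature_3"}
with_variants = kite_sequences_sketch(features)
exact_only = kite_sequences_sketch(features, no_mismatches=True)
```

In this example `AAAA` and `AATT` share the variants `AAAT` and `AATA`, so only `GGGG` receives its 12 single-mismatch variants.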

kb_python.fasta.generate_cdna_fasta(fasta_path, gtf_path, out_path, chromosomes=None)

Generate a cDNA FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

Parameters
  • fasta_path (str) – path to genomic FASTA file

  • gtf_path (str) – path to GTF file

  • out_path (str) – path to cDNA FASTA to generate

  • chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes, defaults to None

Returns

path to generated cDNA FASTA

Return type

str
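
Because the function relies on the GTF being sorted, a quick pre-check can be useful. The sketch below is a hypothetical helper, not part of the module. It assumes the standard tab-separated GTF layout (sequence name in column 1, start in column 4) and interprets "sorted by start position" as non-decreasing starts within each contiguous chromosome block. It does not compare the chromosome order against the FASTA.

```python
def gtf_is_sorted(lines):
    """Return True if chromosome blocks are contiguous and starts never decrease."""
    seen = set()
    current, last_start = None, 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[3])
        if chrom != current:
            if chrom in seen:
                return False  # a chromosome block reappears later in the file
            seen.add(chrom)
            current, last_start = chrom, start
        elif start < last_start:
            return False
        else:
            last_start = start
    return True


sorted_lines = [
    "#demo annotation\n",
    'chr1\tdemo\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n',
    'chr1\tdemo\texon\t100\t150\t.\t+\t.\tgene_id "g1";\n',
    'chr2\tdemo\tgene\t50\t90\t.\t-\t.\tgene_id "g2";\n',
]
unsorted_lines = [
    'chr1\tdemo\tgene\t300\t500\t.\t+\t.\tgene_id "g1";\n',
    'chr1\tdemo\tgene\t100\t200\t.\t+\t.\tgene_id "g0";\n',
]
```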

kb_python.fasta.generate_intron_fasta(fasta_path, gtf_path, out_path, chromosomes=None, flank=30)

Generate an intron FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position.

The intron for a specific transcript is the collection of the following:
  1. transcript minus exons

  2. 5’ UTR

  3. 3’ UTR

Additionally, flanks of `flank` bases (30 by default, which is k - 1 for k = 31) are appended to each intron, and sections that overlap are combined into a single FASTA entry.

Parameters
  • fasta_path (str) – path to genomic FASTA file

  • gtf_path (str) – path to GTF file

  • out_path (str) – path to intron FASTA to generate

  • chromosomes (set, optional) – set of chromosomes to generate sequences for. If not provided, sequences are generated for all chromosomes, defaults to None

  • flank (int, optional) – the size of intron flanks, in bases, defaults to 30

Returns

path to generated intron FASTA

Return type

str
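
The interval arithmetic behind flanking and merging can be sketched as follows. This is an illustration under assumptions: `intron_intervals` is a hypothetical helper, coordinates are taken to be 1-based and inclusive as in GTF, "transcript - exons" is read as the transcript span with its exons removed, flanks are clipped only at position 1, and UTR handling is left out.

```python
def intron_intervals(transcript, exons, flank=30):
    """Return flanked, merged intron intervals for one transcript.

    `transcript` is a (start, end) tuple and `exons` a list of (start, end)
    tuples, all 1-based and inclusive.
    """
    start, end = transcript
    introns = []
    cursor = start
    # Every gap between consecutive exons inside the transcript is an intron.
    for exon_start, exon_end in sorted(exons):
        if exon_start > cursor:
            introns.append((cursor, exon_start - 1))
        cursor = max(cursor, exon_end + 1)
    if cursor <= end:
        introns.append((cursor, end))

    merged = []
    for lo, hi in introns:
        lo, hi = max(1, lo - flank), hi + flank
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous flanked interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


exons = [(100, 150), (200, 220), (400, 500)]
wide = intron_intervals((100, 500), exons, flank=30)
narrow = intron_intervals((100, 500), exons, flank=5)
```

With 30-base flanks the two introns (151-199 and 221-399) overlap across the short middle exon and collapse into one interval; with 5-base flanks they stay separate.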

kb_python.fasta.generate_spliced_fasta(fasta_path, gtf_path, out_path)

Generate a spliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The spliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-exon splice junctions (any overlapping regions are collapsed).

Parameters
  • fasta_path (str) – path to genomic FASTA file

  • gtf_path (str) – path to GTF file

  • out_path (str) – path to spliced FASTA to generate

Returns

path to generated spliced FASTA

Return type

str
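
A junction-centered entry of length 2 * (k - 1) can be sketched as below. This is a simplified illustration, not the library's method: `junction_windows` is a hypothetical helper that handles a single forward-strand transcript, assumes every exon is at least k - 1 bases long, and does not collapse overlapping regions.

```python
def junction_windows(genome, exons, k=31):
    """Return one sequence per exon-exon junction.

    `genome` is the chromosome sequence; `exons` holds 1-based inclusive
    (start, end) tuples sorted by start position.
    """
    flank = k - 1
    windows = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # Last k - 1 bases of the upstream exon.
        upstream = genome[left_end - flank:left_end]
        # First k - 1 bases of the downstream exon.
        downstream = genome[right_start - 1:right_start - 1 + flank]
        windows.append(upstream + downstream)
    return windows


# A small k keeps the example readable; the documented value is k = 31.
genome = "ACGTACGTAAGGCCTTAC"
windows = junction_windows(genome, [(1, 6), (11, 16)], k=4)
```

Each window has length 2 * (k - 1), so any k-mer that spans the junction is contained in it.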

kb_python.fasta.generate_unspliced_fasta(fasta_path, gtf_path, out_path)

Generate an unspliced FASTA using the genome and GTF.

This function assumes the order in which the chromosomes appear in the genome FASTA is identical to the order in which they appear in the GTF. Additionally, the GTF must be sorted by start position. The unspliced FASTA contains entries of length 2 * (k - 1) for k = 31, centered around exon-intron splice junctions, plus full introns (any overlapping regions are collapsed).

Parameters
  • fasta_path (str) – path to genomic FASTA file

  • gtf_path (str) – path to GTF file

  • out_path (str) – path to unspliced FASTA to generate

Returns

path to generated unspliced FASTA

Return type

str