Fasta

A module for parsing FASTA text files.

class gfftk.fasta.FASTA(fasta_file)

Bases: object

Class for handling FASTA files.

get_seq(contig)

Get the sequence for a contig.

Args:

contig (str): Contig name

Returns:

str: Sequence for the contig

gfftk.fasta.fasta2dict(fasta, full_header=False)

Read FASTA file to dictionary.

Similar to Biopython's SeqIO.to_dict(), but the returned dictionary is keyed by contig name and each value is the sequence as a plain string.

Parameters:
  • fasta (filename) – FASTA input file (can be gzipped)

  • full_header (bool, default=False) – use the full header line as the contig name; by default the header is split at the first space and only the first token is kept

Returns:

seqs – OrderedDict mapping each header to its sequence

Return type:

dict
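The behavior described above can be illustrated with a minimal pure-Python sketch. This is not gfftk's implementation: the function name `read_fasta_to_dict` is hypothetical, and detecting gzip input by a `.gz` suffix is an assumption made for the example.

```python
import gzip
import os
import tempfile
from collections import OrderedDict


def read_fasta_to_dict(path, full_header=False):
    """Hypothetical stand-in for the documented behavior (not gfftk code)."""
    # Assumption: gzipped input is recognized by a ".gz" suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    chunks = OrderedDict()
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                # Default: keep only the text before the first space.
                name = header if full_header else header.split(" ", 1)[0]
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    # Join wrapped sequence lines into one string per record.
    return OrderedDict((key, "".join(parts)) for key, parts in chunks.items())


# Small demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa")
    with open(demo, "w") as out:
        out.write(">contig1 some description\nACGT\nACGT\n>contig2\nGGCC\n")
    short_names = read_fasta_to_dict(demo)
    full_names = read_fasta_to_dict(demo, full_header=True)

print(list(short_names))
print(list(full_names))
```

With the default `full_header=False`, the first record is keyed as `contig1`; with `full_header=True` the key keeps the description as well.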

gfftk.fasta.fasta2headers(fasta, full_header=False)

Read FASTA file to set of headers.

Simple function that reads a FASTA file and returns the set of contig names.

Parameters:
  • fasta (filename) – FASTA input file (can be gzipped)

  • full_header (bool, default=False) – use the full header line as the contig name; by default the header is split at the first space and only the first token is kept

Returns:

headers – set of header names

Return type:

set

gfftk.fasta.fasta2lengths(fasta, full_header=False)

Read FASTA file to dictionary of sequence lengths.

Reads a FASTA file (optionally gzipped) and returns a dictionary with contig header names as keys and sequence lengths as values.

Parameters:
  • fasta (filename) – FASTA input file (can be gzipped)

  • full_header (bool, default=False) – use the full header line as the contig name; by default the header is split at the first space and only the first token is kept

Returns:

seqs – dictionary mapping each header to the length of its sequence

Return type:

dict
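Sequence lengths can be collected in a single pass without holding the sequences in memory. The sketch below shows one way to do that on a gzipped file; `read_fasta_lengths` is a hypothetical name, and both the streaming approach and the `.gz` suffix check are assumptions, not a description of how gfftk does it.

```python
import gzip
import os
import tempfile


def read_fasta_lengths(path, full_header=False):
    """Hypothetical stand-in that counts bases per record (not gfftk code)."""
    # Assumption: gzipped input is recognized by a ".gz" suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = {}
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
                name = header if full_header else header.split(" ", 1)[0]
                lengths[name] = 0
            elif line and name is not None:
                # Add the length of each wrapped sequence line.
                lengths[name] += len(line)
    return lengths


# Small demonstration with a throwaway gzipped file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(demo, "wt") as out:
        out.write(">contig1 some description\nACGT\nACGT\n>contig2\nGGCC\n")
    lengths = read_fasta_lengths(demo)

print(lengths)
```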

gfftk.fasta.getSeqRegions(seqs, header, coordinates, coords=False)

Return the spliced sequence for a list of coordinates from a sequence dictionary.

Takes a sequence dictionary (i.e. from fasta2dict), the contig name (header), and the coordinates to fetch (list of tuples).

Parameters:
  • seqs (dict) – dictionary of sequences keyed by contig name/ header

  • header (str) – contig name (header) for sequence in seqs dictionary

  • coordinates (list of tuples) – list of (start, end) tuples of sequence coordinates to return, e.g. [(1,10), (20,30)]

Returns:

result – spliced DNA sequence

Return type:

str
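A rough sketch of the splicing step follows. The name `splice_regions` is hypothetical, and two details are assumptions rather than documented behavior: coordinates are treated as 1-based and inclusive (as in GFF3), and intervals are joined in ascending order of start position. The `coords` argument is not covered here.

```python
def splice_regions(seqs, header, coordinates):
    """Hypothetical stand-in for the splicing described above (not gfftk code)."""
    contig = seqs[header]
    pieces = []
    # Assumption: intervals are concatenated in ascending start order.
    for start, end in sorted(coordinates):
        # Assumption: 1-based inclusive coordinates, so shift the start by one.
        pieces.append(contig[start - 1:end])
    return "".join(pieces)


seqs = {"contig1": "AAAAACCCCCGGGGGTTTTT"}
spliced = splice_regions(seqs, "contig1", [(11, 15), (1, 5)])
print(spliced)
```

Here the two intervals cover bases 1 to 5 and 11 to 15 of the contig, so the intervening `CCCCC` block is skipped.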

gfftk.fasta.translate(dna, strand, phase, table=1)

Translates DNA sequence into proteins.

Takes a DNA (more precisely, cDNA) sequence and translates it into an amino acid sequence. It requires the DNA sequence, the strand, the translation phase, and the translation table.

Parameters:
  • dna (str) – DNA (cDNA) sequence as nucleotides

  • strand (str, (+/-)) – strand to translate (+ or -)

  • phase (int) – phase at which to start translation (0, 1, or 2)

  • table (int, default=1) – translation table [1]

Returns:

protSeq – string of translated amino acid sequence

Return type:

str
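How strand and phase interact can be sketched with the standard genetic code (NCBI translation table 1). The function `translate_sketch` is hypothetical and is not gfftk's code. It assumes that a minus-strand input is reverse-complemented before `phase` bases are skipped, that a trailing partial codon is dropped, and that codons containing ambiguous bases are reported as `X`.

```python
# Standard genetic code (NCBI translation table 1), built from the compact
# 64-letter layout with bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def translate_sketch(dna, strand, phase):
    """Hypothetical stand-in for the translation described above."""
    # Assumption: the minus strand is reverse-complemented first.
    seq = dna.translate(COMPLEMENT)[::-1] if strand == "-" else dna
    # Assumption: `phase` bases are skipped from the 5' end.
    seq = seq[phase:].upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is dropped.
    for i in range(0, len(seq) - 2, 3):
        # Assumption: codons with ambiguous bases are reported as "X".
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


forward = translate_sketch("ATGGCCTAA", "+", 0)
reverse = translate_sketch("TTAGGCCAT", "-", 0)
shifted = translate_sketch("CATGGCCTAA", "+", 1)
print(forward, reverse, shifted)
```

All three calls read the same open reading frame (`ATG GCC TAA`): directly, from the opposite strand, and after skipping one leading base.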