API

GFFtk works by parsing annotation files and storing the records in a Python dictionary. After initial parsing, the records are sorted by contig and start position, translated into protein space to test whether each gene model is complete, and then returned as a Python OrderedDict(). The structure looks like this:

locustag: {
   'contig': contigName,  #string
   'type': [],  # list of str one for each transcript mRNA/rRNA/tRNA/ncRNA
   'location': (start, end), #integer tuple
   'strand': +/-, #string
   'ids': [transcript/protein IDs], #list
   'mRNA':[[(ex1,ex1),(ex2,ex2)]], #list of lists of tuples (start, end)
   'CDS':[[(cds1,cds1),(cds2,cds2)]], #list of lists of tuples (start, end)
   'transcript': [seq1, seq2], #list of mRNA transcripts
   'cds_transcript': [seq1, seq2], #list of mRNA transcripts (no UTRs)
   'protein': [protseq1,protseq2], #list of CDS translations
   'codon_start': [1,1], #codon start for translations
   'note': [[first note, second note], [first, second, etc]], #list of lists
   'name': genename, # str common gene name
   'product': [hypothetical protein, velvet complex], #list of product definitions
   'gene_synonym': [], # list of gene name Aliases
   'EC_number': [[ec number]], # list of lists
   'go_terms': [[GO:0000001,GO:0000002]],  #list of lists
   'db_xref': [[InterPro:IPR0001,PFAM:004384]],  #list of lists
   'partialStart': [bool], # list of True/False for each transcript
   'partialStop': [bool], # list of True/False for each transcript
   'source': source, # string annotation source
   'phase': [[0,2,1]], #list of lists
   '5UTR': [[(),()]], #list of lists of tuples (start, end)
   '3UTR': [[(),()]] #list of lists of tuples (start, end)
}
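
As a minimal sketch of how a dictionary with this layout can be used, the example below builds two records by hand, sorts them by contig and start position, and reports the CDS length and completeness of each transcript. It does not call GFFtk itself. The locus tags, coordinates, and the helper names `sort_by_position` and `transcript_summary` are hypothetical, and the length calculation assumes 1-based inclusive coordinates, as in GFF3. Only a subset of the keys is filled in.

```python
from collections import OrderedDict

# Hand-built records following the layout described above.
# All values are invented for illustration; they are not GFFtk output.
genes = {
    "GENE_0002": {
        "contig": "contig_1",
        "type": ["mRNA"],
        "location": (5000, 6300),
        "strand": "-",
        "ids": ["GENE_0002-T1"],
        "CDS": [[(5101, 5400), (5601, 6200)]],
        "product": ["hypothetical protein"],
        "partialStart": [False],
        "partialStop": [True],
    },
    "GENE_0001": {
        "contig": "contig_1",
        "type": ["mRNA", "mRNA"],
        "location": (100, 900),
        "strand": "+",
        "ids": ["GENE_0001-T1", "GENE_0001-T2"],
        "CDS": [[(201, 500)], [(201, 350), (401, 550)]],
        "product": ["velvet complex", "velvet complex"],
        "partialStart": [False, False],
        "partialStop": [False, False],
    },
}


def sort_by_position(records):
    """Return an OrderedDict sorted by contig name, then gene start."""
    ordered = sorted(
        records.items(),
        key=lambda item: (item[1]["contig"], item[1]["location"][0]),
    )
    return OrderedDict(ordered)


def transcript_summary(records):
    """Yield (locus_tag, transcript_id, cds_length, is_complete) per transcript."""
    for locus_tag, gene in records.items():
        # Per-transcript lists share the same index as 'ids'.
        for i, transcript_id in enumerate(gene["ids"]):
            # Assumes 1-based inclusive coordinates, hence the +1.
            cds_length = sum(end - start + 1 for start, end in gene["CDS"][i])
            is_complete = not (gene["partialStart"][i] or gene["partialStop"][i])
            yield locus_tag, transcript_id, cds_length, is_complete


sorted_genes = sort_by_position(genes)
summary = list(transcript_summary(sorted_genes))
for row in summary:
    print(row)
```

Because the per-transcript values are parallel lists, index `i` in `ids` lines up with index `i` in `CDS`, `partialStart`, `partialStop`, and the other list-valued keys. Note that the sort here compares contig names as plain strings, so `contig_10` would sort before `contig_2`.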