API
- Convert
- GFF
- Genbank
- Fasta
- Consensus
  add_evidence(), bed2interlap(), best_model_default(), calculate_gene_distance(), calculate_source_order(), check_intron_compatibility(), cluster_by_aed(), cluster_interlap(), cluster_interlap_original(), contained(), de_novo_distance(), ensure_unique_names(), extend_utrs(), fasta_length(), filter4evidence(), filter_models_repeats(), generate_consensus(), getAED(), get_loci(), get_overlap(), gff2interlap(), gff_writer(), gffevidence2dict(), map_coords(), order_sources(), parse_data(), reasonable_model(), refine_cluster(), safe_extract_coordinates(), score_by_evidence(), score_evidence(), select_best_utrs(), src_scaling_factor(), sub_cluster()
- Stats
- Go
- Utils
GFFtk works by parsing annotation files and storing the records in a Python dictionary. After the initial parse, the records are sorted by contig and start position, translated into protein space to test whether each gene model is complete, and then returned as a Python OrderedDict(). The structure looks like this:
locustag: {
'contig': contigName, #string
'type': [], # list of str one for each transcript mRNA/rRNA/tRNA/ncRNA
'location': (start, end), #integer tuple
'strand': +/-, #string
'ids': [transcript/protein IDs], #list
'mRNA':[[(ex1,ex1),(ex2,ex2)]], #list of lists of tuples (start, end)
'CDS':[[(cds1,cds1),(cds2,cds2)]], #list of lists of tuples (start, end)
'transcript': [seq1, seq2], #list of mRNA transcripts
'cds_transcript': [seq1, seq2], #list of mRNA transcripts (no UTRs)
'protein': [protseq1,protseq2], #list of CDS translations
'codon_start': [1,1], #codon start for translations
'note': [[first note, second note], [first, second, etc]], #list of lists
'name': genename, # str common gene name
'product': [hypothetical protein, velvet complex], #list of product definitions
'gene_synonym': [], # list of gene name Aliases
'EC_number': [[ec number]], # list of lists
'go_terms': [[GO:0000001,GO:0000002]], #list of lists
'db_xref': [[InterPro:IPR0001,PFAM:004384]], #list of lists
'partialStart': [bool], # list of True/False for each transcript
'partialStop': [bool], # list of True/False for each transcript
'source': source, # string annotation source
'phase': [[0,2,1]], #list of lists
'5UTR': [[(),()]], #list of lists of tuples (start, end)
'3UTR': [[(),()]] #list of lists of tuples (start, end)
}
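The layout above can be sketched with a few hand-written records. This is a minimal illustration and does not call GFFtk: the locus tags, coordinates, and the helper names `sort_by_position` and `complete_transcripts` are hypothetical, and only a subset of the keys is filled in. It shows the two ideas from the description: ordering records by contig and start position into an OrderedDict, and reading per-transcript lists (here `ids`, `partialStart`, `partialStop`) in parallel.

```python
from collections import OrderedDict

# Hypothetical records written by hand for illustration (not parsed by GFFtk).
# Only a subset of the documented keys is shown.
genes = {
    "gene_2": {
        "contig": "contig_1",
        "type": ["mRNA"],
        "location": (5000, 5900),
        "strand": "-",
        "ids": ["gene_2-T1"],
        "mRNA": [[(5000, 5900)]],
        "CDS": [[(5100, 5800)]],
        "partialStart": [False],
        "partialStop": [False],
        "source": "example",
    },
    "gene_1": {
        "contig": "contig_1",
        "type": ["mRNA", "mRNA"],
        "location": (100, 1200),
        "strand": "+",
        "ids": ["gene_1-T1", "gene_1-T2"],
        # one inner list of (start, end) tuples per transcript
        "mRNA": [[(100, 400), (500, 1200)], [(100, 400), (500, 900)]],
        "CDS": [[(150, 400), (500, 1100)], [(150, 400), (500, 900)]],
        "partialStart": [False, False],
        "partialStop": [False, True],
        "source": "example",
    },
    "gene_3": {
        "contig": "contig_0",
        "type": ["tRNA"],
        "location": (300, 372),
        "strand": "+",
        "ids": ["gene_3-T1"],
        "mRNA": [[(300, 372)]],
        "CDS": [[]],
        "partialStart": [False],
        "partialStop": [False],
        "source": "example",
    },
}


def sort_by_position(records):
    """Return an OrderedDict sorted by contig name, then start coordinate.

    Contig names are compared as plain strings (lexicographic order).
    """
    ordered = sorted(
        records.items(),
        key=lambda item: (item[1]["contig"], item[1]["location"][0]),
    )
    return OrderedDict(ordered)


def complete_transcripts(record):
    """Return IDs of transcripts flagged as neither partial at the start nor the stop."""
    return [
        transcript_id
        for transcript_id, partial_start, partial_stop in zip(
            record["ids"], record["partialStart"], record["partialStop"]
        )
        if not partial_start and not partial_stop
    ]


sorted_genes = sort_by_position(genes)
for locus_tag, record in sorted_genes.items():
    print(locus_tag, record["contig"], record["location"], complete_transcripts(record))
```

Because every per-transcript key holds one entry per transcript, index `i` in `ids` lines up with index `i` in `mRNA`, `CDS`, `partialStart`, and the other list-valued keys, which is why `zip` is enough to walk them together.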