Modules

gfftk.gff

gfftk.gff.dict2combined_gff_fasta(annotation_dict, fasta_dict, output=False, debug=False, source=False)

Write GFFtk annotation dictionary and FASTA sequences to combined GFF3+FASTA format.

Parameters:
  • annotation_dict (dict) – GFFtk standardized annotation dictionary

  • fasta_dict (dict) – Dictionary of sequences keyed by contig name

  • output (str or file handle, default=False) – Output file path or handle. If False, writes to stdout

  • debug (bool, default=False) – Print debug information

  • source (str, default=False) – Override source field in GFF3 output

Return type:

None

gfftk.gff.dict2gff3(infile, output=False, debug=False, source=False, newline=False, url_encode=False)

Convert GFFtk standardized annotation dictionary to GFF3 file.

The input is an annotation dictionary generated by gff2dict or tbl2dict, which this function writes in GFF3 format.

Parameters:
  • infile (dict of dict) – standardized annotation dictionary keyed by locus_tag

  • output (str, default=False) – annotation file in GFF3 format; when False, output is written to sys.stdout

  • debug (bool, default=False) – print debug information to stderr

  • source (str, default=False) – override source field in GFF3 output

  • newline (bool, default=False) – add newline after each gene

  • url_encode (bool, default=False) – URL encode attribute values for downstream tool compatibility

gfftk.gff.dict2gff3alignments(infile, output=False, debug=False, alignments='transcript', source=False, newline=False)

Convert GFFtk standardized annotation dictionary to GFF3 alignments file.

The input is an annotation dictionary generated by gff2dict or tbl2dict. The output is in GFF3 alignment format, also known as EVM evidence format.

Parameters:
  • infile (dict of dict) – standardized annotation dictionary keyed by locus_tag

  • output (str, default=False) – annotation file in GFF3 alignment format; when False, output is written to sys.stdout

  • debug (bool, default=False) – print debug information to stderr

gfftk.gff.dict2gtf(infile, output=False, source=False)

Convert GFFtk standardized annotation dictionary to GTF file.

The input is an annotation dictionary generated by gff2dict or tbl2dict, which this function writes in GTF format. Note that only protein-coding CDS features are written.

Parameters:
  • infile (dict of dict) – standardized annotation dictionary keyed by locus_tag

  • output (str, default=False) – annotation in GTF format; when False, output is written to sys.stdout

gfftk.gff.gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format='auto', logger=sys.stderr.write)

Convert GFF3 and FASTA to standardized GFFtk dictionary format.

Annotation file in GFF3 format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.

Parameters:
  • gff (filename : str) – annotation text file in GFF3 format

  • fasta (filename : str) – genome text file in FASTA format

  • annotation (dict of str) – existing annotation dictionary

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • gap_filter (bool, default=False) – remove gene models that span gaps in sequence

  • logger (handle, default=sys.stderr.write) – where to log messages to

Returns:

annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag

Return type:

dict of dict

gfftk.gff.gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format='auto', logger=sys.stderr.write)

Convert GTF and FASTA to standardized GFFtk dictionary format.

Annotation file in GTF format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.

Parameters:
  • gtf (filename : str) – annotation text file in GTF format

  • fasta (filename : str) – genome text file in FASTA format

  • annotation (dict of str) – existing annotation dictionary

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • gap_filter (bool, default=False) – remove gene models that span gaps in sequence

  • logger (handle, default=sys.stderr.write) – where to log messages to

Returns:

annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag

Return type:

dict of dict

gfftk.gff.is_combined_gff_fasta(filename)

Check if a file contains both GFF3 and FASTA data.

Parameters:

filename (str) – Path to the file to check

Returns:

True if file contains ##FASTA directive, False otherwise

Return type:

bool
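
The check described above can be illustrated with a minimal sketch. This is not the library's implementation: `has_fasta_directive` is a hypothetical helper, and matching a line that is exactly `##FASTA` after stripping whitespace is an assumption about how the directive is detected.

```python
import os
import tempfile


def has_fasta_directive(path):
    """Return True if any line of the file is exactly the ##FASTA directive.

    Hypothetical helper for illustration; not part of gfftk.
    """
    with open(path, "r") as handle:
        for line in handle:
            if line.strip() == "##FASTA":
                return True
    return False


# Build two small throwaway files to exercise the check.
workdir = tempfile.mkdtemp()
gff_body = "##gff-version 3\nchr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n"

combined_path = os.path.join(workdir, "combined.gff3")
with open(combined_path, "w") as out:
    out.write(gff_body + "##FASTA\n>chr1\nATGAAATAG\n")

plain_path = os.path.join(workdir, "plain.gff3")
with open(plain_path, "w") as out:
    out.write(gff_body)

print(has_fasta_directive(combined_path))
print(has_fasta_directive(plain_path))
```

Reading line by line means the scan can stop at the directive without loading the sequence data that follows it.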

gfftk.gff.simplifyGO(inputList)

gfftk.gff.split_combined_gff_fasta(filename)

Split a combined GFF3+FASTA file into separate GFF3 and FASTA components.

Parameters:

filename (str) – Path to the combined file

Returns:

(gff_content, fasta_content) as file-like objects

Return type:

tuple
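
As a rough illustration of the split, the sketch below separates combined text at the `##FASTA` directive and returns two in-memory handles. It is an assumption-laden stand-in, not the library code: `split_combined_text` is a hypothetical name, and it takes the file contents as a string rather than a filename to stay self-contained.

```python
import io


def split_combined_text(text):
    """Split combined GFF3+FASTA text at the first ##FASTA directive.

    Hypothetical helper for illustration; returns (gff, fasta) as
    io.StringIO objects. The directive line itself is dropped.
    """
    gff_lines, fasta_lines = [], []
    in_fasta = False
    for line in text.splitlines(keepends=True):
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue
        if in_fasta:
            fasta_lines.append(line)
        else:
            gff_lines.append(line)
    return io.StringIO("".join(gff_lines)), io.StringIO("".join(fasta_lines))


example = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1\n"
    "ATGAAATAG\n"
)
gff_part, fasta_part = split_combined_text(example)
print(gff_part.getvalue())
print(fasta_part.getvalue())
```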

gfftk.gff.start_end_gap(seq, coords)

gfftk.gff.validate_and_translate_models(k, v, SeqRecords, gap_filter=False, table=1, logger=<built-in method write of _io.TextIOWrapper object>)

gfftk.gff.validate_models(annotation, fadict, logger=<built-in method write of _io.TextIOWrapper object>, table=1, gap_filter=False)

gfftk.consensus

gfftk.consensus.add_evidence(loci, evidence, source='proteins')

Add evidence alignments to loci based on genomic coordinates.

This function associates evidence alignments (proteins or transcripts) with gene loci based on their genomic coordinates. It builds interlap objects for efficient overlap queries and adds the evidence to each locus that it overlaps with.

Parameters:

loci : dict

Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

evidence : dict

Dictionary of evidence alignments

source : str, optional

Type of evidence being added, either “proteins” or “transcripts” (default: “proteins”)

Returns:

None

The function modifies the loci dictionary in place

gfftk.consensus.auto_score_threshold(weights, order, user_weight=6)

gfftk.consensus.bed2interlap(bedfile, inter=False)

Parse a BED file and construct a scaffold/feature interlap dictionary.

This function reads a BED file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.

Parameters:

bedfile : str

Path to the BED file

inter : dict or bool, optional

Existing interlap object to update, or False to create a new one (default: False)

Returns:

tuple

A tuple containing:

  • inter: Dictionary mapping contig names to interlap objects containing features

  • length: Total length of all features in the BED file

gfftk.consensus.best_model_default(locus_name, contig, strand, locus, debug=False, weights={}, order={}, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1, evidence_derived_models=[])

Select the best gene model(s) for a locus using evidence-based scoring.

Parameters:

locus_name : str

Name of the locus

contig : str

Name of the contig

strand : str

Strand of the locus (“+” or “-”)

locus : dict

Dictionary containing gene models and evidence

debug : bool or str, optional

Whether to print debug information or path to debug GFF file

weights : dict, optional

Dictionary of weights for different sources

order : dict, optional

Dictionary of order values for different sources

min_exon : int, optional

Minimum exon length

min_intron : int, optional

Minimum intron length

max_intron : int, optional

Maximum intron length

max_exon : int, optional

Maximum exon length

evidence_derived_models : list, optional

List of evidence-derived models

use_dp : bool, optional

Whether to use dynamic programming approach (not used in this function)

allow_multiple : bool, optional

Whether to allow multiple gene models

min_gap_size : int, optional

Minimum gap size for splitting loci

Returns:

list

List of tuples containing gene model IDs and their information

gfftk.consensus.calculate_gene_distance(locus)

Calculate Annotation Edit Distance (AED) between all pairs of gene models in a locus.

This function computes the AED between each pair of gene models in a locus, which measures how similar their exon structures are. The AED scores are used to determine which gene models have the most agreement with other models.

Parameters:

locus : dict

Dictionary containing gene models for a single locus. Required keys: ‘genes’

Returns:

dict

Nested dictionary mapping gene model names to dictionaries of AED scores with other models {gene1: {gene2: aed_score, gene3: aed_score, …}, gene2: {…}, …}

gfftk.consensus.calculate_source_order(data)

Calculate a rank order of gene prediction sources based on evidence agreement.

This function analyzes how well each gene prediction source agrees with protein and transcript evidence across all loci. It filters the data to include only loci with sufficient evidence, calculates scores for each source based on evidence agreement, and returns a rank-ordered dictionary of sources with their scores.

The rank order is used to prioritize gene models when evidence is not available or is inconclusive. Sources that generally have better agreement with evidence receive higher scores and are ranked higher.

Parameters:

data : dict

Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

Returns:

tuple

A tuple containing:

  • order: OrderedDict mapping source names to their scores, ordered by score (highest first)

  • n_filt: Number of loci that passed the evidence filter

gfftk.consensus.check_intron_compatibility(model_coords, transcript_coords, strand)

Check if transcript has compatible intron/exon boundaries with the gene model.

Parameters:

model_coords : list

List of (start, end) tuples for the gene model exons

transcript_coords : list

List of (start, end) tuples for the transcript alignment

strand : str

Strand of the gene model (‘+’ or ‘-’)

Returns:

bool

True if the transcript has compatible intron/exon boundaries, False otherwise

gfftk.consensus.cluster_by_aed(locus, score=0.005)

Cluster gene models based on their Annotation Edit Distance (AED).

This function groups gene models that have very similar exon structures (low AED) into clusters. It is used to identify gene models that are essentially the same prediction but come from different sources, allowing the consensus module to select the best representative from each cluster.

Parameters:

locus : dict

Dictionary containing gene models for a single locus. Required keys: ‘genes’

score : float, optional

Maximum AED threshold for considering two gene models as part of the same cluster (default: 0.005)

Returns:

list

List of lists, where each inner list contains gene model IDs that belong to the same cluster
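
The grouping idea can be sketched as single-linkage clustering over a table of pairwise AED values. This is only an illustration under stated assumptions: `cluster_by_threshold` is a hypothetical helper, it takes a precomputed nested dictionary of distances (the shape returned by calculate_gene_distance) rather than a locus, and both the single-linkage strategy and the `<=` comparison against the threshold are guesses rather than documented behavior.

```python
def cluster_by_threshold(distances, score=0.005):
    """Group names whose pairwise distance is at or below `score`.

    `distances` is a nested dict {name: {other_name: aed, ...}, ...}.
    Hypothetical helper for illustration; uses a small union-find.
    """
    names = set(distances)
    for others in distances.values():
        names.update(others)
    parent = {name: name for name in names}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, others in distances.items():
        for b, aed in others.items():
            if aed <= score:
                parent[find(a)] = find(b)

    groups = {}
    for name in sorted(names):
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


pairwise = {
    "a": {"b": 0.0, "c": 0.4},
    "b": {"a": 0.0, "c": 0.4},
    "c": {"a": 0.4, "b": 0.4},
}
clusters = cluster_by_threshold(pairwise)
print(clusters)
```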

gfftk.consensus.cluster_interlap(obj)

Cluster genomic features using the interlap.reduce function.

This function takes an interlap object containing genomic features and clusters them based on their coordinates. Features that overlap or are adjacent to each other are grouped into the same cluster. The function then assigns the original feature data to each cluster, ensuring each gene model is only assigned to one locus.

Parameters:

obj : interlap.InterLap

Interlap object containing genomic features

Returns:

list

List of dictionaries, where each dictionary represents a cluster of features. Each dictionary contains:

  • locus: Tuple of (start, end) coordinates for the cluster

  • genes: List of gene features in the cluster

  • proteins: Empty list for protein evidence (filled later)

  • transcripts: Empty list for transcript evidence (filled later)

  • repeats: Empty list for repeat features (filled later)

gfftk.consensus.cluster_interlap_original(obj)

Original cluster_interlap function (before fix) for debugging.

gfftk.consensus.consensus(args)

gfftk.consensus.contained(a, b)

Check if coordinates in a are completely contained within coordinates in b.

This function determines if the interval defined by coordinates a is completely contained within the interval defined by coordinates b. It handles various edge cases and ensures that the coordinates are properly formatted as tuples or lists of integers.

Parameters:

a : tuple or list

Tuple or list of (start, end) coordinates to check if contained

b : tuple or list

Tuple or list of (start, end) coordinates that might contain a

Returns:

bool

True if a is completely contained within b, False otherwise
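
A minimal sketch of the containment test follows. It is not the library's code: `is_contained` is a hypothetical name, and treating both bounds as inclusive (so that identical intervals count as contained) is an assumption.

```python
def is_contained(a, b):
    """Return True if interval a lies entirely within interval b.

    Hypothetical helper for illustration; bounds are treated as inclusive
    and each pair is normalized so that start <= end.
    """
    a_start, a_end = sorted(int(x) for x in a)
    b_start, b_end = sorted(int(x) for x in b)
    return b_start <= a_start and a_end <= b_end


print(is_contained((5, 10), (1, 20)))
print(is_contained((1, 20), (5, 10)))
print(is_contained([1, 10], [1, 10]))
```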

gfftk.consensus.de_novo_distance(locus)

Calculate a score for each gene model based on its similarity to other models.

This function evaluates each gene model in a locus by calculating its Annotation Edit Distance (AED) with all other gene models in the locus. It then computes a score that reflects how similar the gene model is to other models, with higher scores indicating greater similarity.

This score is used as a proxy for prediction confidence when protein or transcript evidence is not available. Gene models that have more agreement with other models receive higher scores.

Parameters:

locus : dict

Dictionary containing gene models for a single locus. Required keys: ‘genes’

Returns:

dict

Dictionary mapping gene model names to their de novo distance scores

gfftk.consensus.ensure_unique_names(genes)

Ensure gene model names are unique by appending a UUID slug.

This function takes a dictionary of gene models and appends a unique identifier to each gene model name to ensure there are no name collisions when combining gene models from multiple sources.

Parameters:

genes : dict

Dictionary of gene models where keys are gene IDs

Returns:

dict

Dictionary of gene models with unique IDs
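
The renaming idea can be sketched with the standard `uuid` module. The details here are assumptions made for illustration: `add_unique_suffix` is a hypothetical helper, and the 8-character slug length and `.` separator are not taken from the library.

```python
import uuid


def add_unique_suffix(genes):
    """Return a new dict whose keys carry a short random UUID-derived slug.

    Hypothetical helper for illustration; slug length and separator are
    assumptions.
    """
    renamed = {}
    for name, model in genes.items():
        slug = uuid.uuid4().hex[:8]
        renamed[f"{name}.{slug}"] = model
    return renamed


models = {"gene1": {"source": "augustus"}, "gene2": {"source": "snap"}}
unique_models = add_unique_suffix(models)
print(sorted(unique_models))
```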

gfftk.consensus.extend_utrs(consensus_models, transcripts, genome_fasta, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)

Extend consensus gene models with UTRs based on transcript evidence.

This function examines transcript alignments that match consensus gene models and extends the models with 5’ and 3’ UTRs if supported by the evidence. Only transcripts with compatible intron/exon boundaries are considered. Spliced UTRs (UTRs containing introns) are handled.

UTR extension is limited to avoid overlapping with neighboring genes on the same strand.

Parameters:

consensus_models : dict

Dictionary of consensus gene models

transcripts : dict

Dictionary of transcript alignments

genome_fasta : str

Path to the genome FASTA file

min_utr_length : int, optional

Minimum length for a UTR to be added (default: 10)

max_utr_length : int, optional

Maximum length for a UTR extension (default: 2000)

log : callable, optional

Function to use for logging (default: sys.stderr.write)

Returns:

dict

Dictionary of consensus gene models with UTRs added where supported

gfftk.consensus.fasta_length(fasta)

Calculate the total length of all sequences in a FASTA file.

This function reads a FASTA file and sums the lengths of all sequences in the file. It can be used to determine the total genome size from a genome FASTA file.

Parameters:

fasta : str

Path to the FASTA file

Returns:

int

Total length of all sequences in the FASTA file
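
Summing sequence lengths needs nothing beyond a line-by-line scan, as in the sketch below. It is a plain-Python stand-in, not the library's implementation: `total_fasta_length` is a hypothetical name, and the sketch assumes an uncompressed text FASTA file.

```python
import os
import tempfile


def total_fasta_length(path):
    """Sum the lengths of all sequence lines, skipping '>' header lines.

    Hypothetical helper for illustration; assumes uncompressed text input.
    """
    total = 0
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                total += len(line)
    return total


# Two records: chr1 has 8 bases over two lines, chr2 has 4 bases.
fasta_path = os.path.join(tempfile.mkdtemp(), "genome.fa")
with open(fasta_path, "w") as out:
    out.write(">chr1\nATGC\nATGC\n>chr2\nAAAA\n")

genome_size = total_fasta_length(fasta_path)
print(genome_size)
```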

gfftk.consensus.filter4evidence(data, n_genes=3, n_evidence=2)

Filter loci to include only those with sufficient gene models and evidence alignments.

This function filters the loci data structure to include only loci that have at least a specified number of gene models and evidence alignments (proteins + transcripts). It is used to identify high-confidence loci for calculating source weights.

Parameters:

data : dict

Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

n_genes : int, optional

Minimum number of gene models required in a locus (default: 3)

n_evidence : int, optional

Minimum number of evidence alignments (proteins + transcripts) required in a locus (default: 2)

Returns:

tuple

A tuple containing:

  • filt: Filtered dictionary of loci with sufficient gene models and evidence

  • n_filt: Number of loci that passed the filter
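
The filter can be sketched over the hierarchical loci structure described above. This is an illustration rather than the library's code: `filter_loci` is a hypothetical helper, and it assumes each locus is a dictionary with ‘genes’, ‘proteins’, and ‘transcripts’ lists and that the thresholds are inclusive minimums.

```python
def filter_loci(data, n_genes=3, n_evidence=2):
    """Keep loci with at least n_genes models and n_evidence alignments.

    `data` has the shape {contig: {"+": [locus, ...], "-": [locus, ...]}}.
    Hypothetical helper for illustration; returns (filtered, count).
    """
    filt = {}
    n_filt = 0
    for contig, strands in data.items():
        for strand, loci in strands.items():
            for locus in loci:
                evidence = len(locus.get("proteins", [])) + len(
                    locus.get("transcripts", [])
                )
                if len(locus.get("genes", [])) >= n_genes and evidence >= n_evidence:
                    filt.setdefault(contig, {"+": [], "-": []})[strand].append(locus)
                    n_filt += 1
    return filt, n_filt


well_supported = {
    "genes": ["g1", "g2", "g3"],
    "proteins": ["p1"],
    "transcripts": ["t1"],
}
sparse = {"genes": ["g4"], "proteins": [], "transcripts": []}
loci = {"chr1": {"+": [well_supported, sparse], "-": []}}

kept, n_kept = filter_loci(loci)
print(n_kept)
```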

gfftk.consensus.filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False)

Filter gene models based on their overlap with repeat regions.

This function filters out gene models that have a significant overlap with repeat regions. It builds an interlap object from the repeat file (GFF3 or BED) and calculates the percentage of each gene model that overlaps with repeats. Gene models with overlap percentage greater than the filter threshold are removed.

Parameters:

fasta : str

Path to the genome FASTA file

repeats : str

Path to the repeats GFF3 or BED file

gene_models : dict

Dictionary of gene models to filter

filter_threshold : int, optional

Maximum percentage of gene model that can overlap with repeats (default: 90)

log : callable or bool, optional

Function to use for logging, or False to disable logging (default: False)

Returns:

tuple

A tuple containing:

  • filtered: Dictionary of filtered gene models

  • dropped: Number of gene models that were filtered out

gfftk.consensus.generate_consensus(fasta, genes, proteins, transcripts, weights, out, debug=False, minscore=False, repeats=False, repeat_overlap=90, tiebreakers='calculated', min_exon=3, min_intron=11, max_intron=-1, max_exon=-1, evidence_derived_models=[], num_processes=None, utrs=True, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)

Generate consensus gene models from multiple gene prediction sources and evidence.

This function is the main entry point for the consensus module. It takes gene predictions from multiple sources, along with protein and transcript evidence, and generates consensus gene models by selecting the best model at each locus based on evidence and source weights.

The function performs the following steps:

  1. Parse input GFF3 files and cluster gene models into loci

  2. Calculate source weights based on evidence if tiebreakers=“calculated”

  3. Select the best gene model at each locus based on evidence and source weights

  4. Filter out gene models that overlap with repeats (if repeats are provided)

  5. Write the consensus gene models to a GFF3 file

Parameters:

fasta : str

Path to the genome FASTA file

genes : list

List of paths to gene prediction GFF3 files

proteins : list

List of paths to protein alignment GFF3 files

transcripts : list

List of paths to transcript alignment GFF3 files

weights : list

List of source:weight pairs for weighting gene prediction sources

out : str

Path to the output GFF3 file

debug : bool or str, optional

Whether to print debug information or path to debug GFF file (default: False)

minscore : bool or int, optional

Minimum score threshold for gene models, or False to calculate automatically (default: False)

repeats : bool or str, optional

Path to repeats GFF3 or BED file, or False to skip repeat filtering (default: False)

repeat_overlap : int, optional

Maximum percentage of gene model that can overlap with repeats (default: 90)

tiebreakers : str, optional

Method for calculating source weights, either “calculated” or “user” (default: “calculated”)

min_exon : int, optional

Minimum exon length in nucleotides (default: 3)

min_intron : int, optional

Minimum intron length in nucleotides (default: 11)

max_intron : int, optional

Maximum intron length in nucleotides, or -1 for no limit (default: -1)

max_exon : int, optional

Maximum exon length in nucleotides, or -1 for no limit (default: -1)

evidence_derived_models : list, optional

List of sources that are derived from evidence and should be treated differently (default: [])

num_processes : int or None, optional

Number of processes to use for parallel execution, or None for sequential (default: None)

log : callable, optional

Function to use for logging (default: sys.stderr.write)

Returns:

dict

Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)

gfftk.consensus.getAED(query, reference)

Calculate Annotation Edit Distance (AED) between two transcript coordinates.

AED measures the similarity between two gene models by comparing their exon structures. It is calculated as AED = 1 - (SN + SP) / 2, where:

  • SN (sensitivity) is the fraction of reference bases that are correctly predicted

  • SP (specificity) is the fraction of prediction bases that overlap with the reference

An AED of 0 means the gene models are identical, while an AED of 1 means they are completely different.

Parameters:

query : list

List of (start, end) coordinate tuples for the query gene model’s exons

reference : list

List of (start, end) coordinate tuples for the reference gene model’s exons

Returns:

float

AED score between 0 (identical) and 1 (completely different)
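
The formula above can be worked through with a small set-based sketch. It is not the library's implementation: `aed` and `exon_bases` are hypothetical names, coordinates are assumed to be 1-based and inclusive (as in GFF3), and returning 1.0 for an empty model is an assumption.

```python
def exon_bases(coords):
    """Expand (start, end) exon tuples into a set of covered positions.

    Hypothetical helper; coordinates are treated as inclusive.
    """
    bases = set()
    for start, end in coords:
        lo, hi = min(start, end), max(start, end)
        bases.update(range(lo, hi + 1))
    return bases


def aed(query, reference):
    """Compute 1 - (SN + SP) / 2 from shared exon bases."""
    q, r = exon_bases(query), exon_bases(reference)
    if not q or not r:
        return 1.0  # assumption: an empty model shares nothing
    shared = len(q & r)
    sn = shared / len(r)  # fraction of reference bases recovered
    sp = shared / len(q)  # fraction of query bases that hit the reference
    return 1 - (sn + sp) / 2


print(aed([(1, 100)], [(1, 100)]))
print(aed([(1, 100)], [(51, 150)]))
print(aed([(1, 100)], [(201, 300)]))
```

Expanding to per-base sets keeps the sketch short; for long exons an interval-arithmetic version would use far less memory.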

gfftk.consensus.get_loci(annot_dict)

Organize gene models into loci based on genomic coordinates and strand.

This function takes a dictionary of gene models and organizes them into loci based on their genomic coordinates and strand. It creates interlap objects for efficient overlap queries and clusters overlapping gene models into loci. It also filters out pseudogenes and gene models with multiple stop codons.

Parameters:

annot_dict : dict

Dictionary of gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, CDS, etc.)

Returns:

tuple

A tuple containing:

  • loci: Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

  • n_loci: Total number of loci

  • pseudo: List of pseudogenes that were filtered out

gfftk.consensus.get_overlap(a, b)

Calculate the overlap between two genomic intervals.

This function calculates the number of base pairs that overlap between two genomic intervals. If the intervals do not overlap, it returns 0.

Parameters:

a : tuple or list

Tuple or list of (start, end) coordinates for the first interval

b : tuple or list

Tuple or list of (start, end) coordinates for the second interval

Returns:

int

Number of base pairs that overlap between the two intervals, or 0 if they don’t overlap
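
A short sketch of the overlap arithmetic is given here. It is illustrative only: `overlap_bp` is a hypothetical name, and counting both end positions (the `+ 1`, matching GFF3's inclusive coordinates) is an assumption, since the text above does not say whether the library counts the end base.

```python
def overlap_bp(a, b):
    """Return the number of shared positions between two intervals.

    Hypothetical helper for illustration; assumes inclusive (start, end)
    coordinates with start <= end, and returns 0 when they do not overlap.
    """
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


print(overlap_bp((1, 10), (5, 20)))
print(overlap_bp((1, 10), (11, 20)))
print(overlap_bp((1, 10), (20, 30)))
```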

gfftk.consensus.gff2interlap(infile, fasta, inter=False)

Parse a GFF3 file and construct a scaffold/gene interlap dictionary.

This function reads a GFF3 file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.

Parameters:

infile : str

Path to the GFF3 file

fasta : str

Path to the genome FASTA file

inter : dict or bool, optional

Existing interlap object to update, or False to create a new one (default: False)

Returns:

tuple

A tuple containing:

  • inter: Dictionary mapping contig names to interlap objects containing features

  • length: Total length of all features in the GFF3 file

gfftk.consensus.gff_writer(input, output)

Write consensus gene models to a GFF3 file.

This function takes a dictionary of consensus gene models and writes them to a GFF3 file. It sorts the gene models by contig and start location, and assigns sequential locus tags to each gene model. It also handles the conversion of gene model coordinates to GFF3 features (gene, mRNA, exon, CDS).

Parameters:

input : dict

Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)

output : str

Path to the output GFF3 file

Returns:

None

The function writes to the specified output file but does not return a value

gfftk.consensus.gffevidence2dict(file, Evi)

Parse evidence alignments from a GFF3 file into a dictionary.

This function reads a GFF3 file containing evidence alignments (proteins or transcripts) and converts it into a dictionary mapping alignment IDs to their information. It handles multi-exon alignments by combining exons with the same ID into a single entry.

Parameters:

file : str

Path to the GFF3 file containing evidence alignments

Evi : dict

Existing dictionary to update with new evidence alignments

Returns:

dict

Dictionary mapping alignment IDs to their information (target, type, source, strand, phase, contig, coords, location, score)

gfftk.consensus.map_coords(g_coords, e_coords)

Map evidence coordinates onto gene model coordinates.

This function takes evidence coordinates (protein or transcript alignments) and maps them onto gene model coordinates. It calculates the offset between each evidence coordinate and the corresponding gene model coordinate, which is used to determine how well the evidence aligns with the gene model.

Parameters:

g_coords : list

List of (start, end) coordinate tuples for the gene model’s exons

e_coords : list

List of (start, end) coordinate tuples for the evidence alignments

Returns:

list

List of lists, where each inner list contains the offset between an evidence coordinate and the corresponding gene model coordinate. The list has the same length as g_coords, with empty lists for gene model coordinates that don’t have a corresponding evidence coordinate.

gfftk.consensus.order_sources(locus)

Calculate evidence-based scores for gene models in a locus for source ranking.

This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. Unlike score_by_evidence, this function is used specifically for ranking gene prediction sources based on their agreement with evidence.

Parameters:

locus : dict

Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’

Returns:

dict

Dictionary mapping gene model names to information:

  • source: Source of the gene model

  • coords: Coordinates of the gene model

  • score: Combined evidence score for the gene model

gfftk.consensus.parse_data(genome, gene, protein, transcript, log=sys.stderr.write)

Parse input data files and build a locus data structure.

This function reads gene prediction files, protein alignment files, and transcript alignment files, and organizes them into a hierarchical locus data structure. It assigns unique identifiers to gene models to avoid name collisions and tracks the sources of all predictions and evidence.

Parameters:

genome : str

Path to the genome FASTA file

gene : list

List of paths to gene prediction GFF3 files

protein : list

List of paths to protein alignment GFF3 files

transcript : list

List of paths to transcript alignment GFF3 files

log : callable, optional

Function to use for logging (default: sys.stderr.write)

Returns:

dict

Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

gfftk.consensus.reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)

Check if a gene model has reasonable exon and intron lengths.

This function evaluates a gene model to determine if it has reasonable exon and intron lengths based on specified thresholds. It checks minimum exon length, maximum exon length, minimum intron length, maximum intron length, and minimum protein length.

Parameters:

coords : list

List of (start, end) coordinate tuples for the gene model’s exons

min_protein : int, optional

Minimum protein length in amino acids (default: 30)

min_exon : int, optional

Minimum exon length in nucleotides (default: 3)

min_intron : int, optional

Minimum intron length in nucleotides (default: 10)

max_intron : int, optional

Maximum intron length in nucleotides, or -1 for no limit (default: -1)

max_exon : int, optional

Maximum exon length in nucleotides, or -1 for no limit (default: -1)

Returns:

bool or str

True if the gene model is reasonable, or a string describing the reason it’s not reasonable
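
The length checks listed above can be sketched as follows. This is a hedged illustration, not the library's code: `check_model` is a hypothetical helper, the order of the checks and the wording of the returned messages are invented, coordinates are assumed inclusive, and protein length is approximated as total exon length divided by 3.

```python
def check_model(coords, min_protein=30, min_exon=3, min_intron=10,
                max_intron=-1, max_exon=-1):
    """Return True if exon/intron/protein lengths pass, else a reason string.

    Hypothetical helper for illustration; -1 disables a maximum.
    """
    exons = sorted((min(s, e), max(s, e)) for s, e in coords)
    coding_length = 0
    for start, end in exons:
        length = end - start + 1
        coding_length += length
        if length < min_exon:
            return f"exon shorter than {min_exon} nt"
        if max_exon > 0 and length > max_exon:
            return f"exon longer than {max_exon} nt"
    # Introns are the gaps between consecutive sorted exons.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = next_start - prev_end - 1
        if intron < min_intron:
            return f"intron shorter than {min_intron} nt"
        if max_intron > 0 and intron > max_intron:
            return f"intron longer than {max_intron} nt"
    if coding_length // 3 < min_protein:
        return f"protein shorter than {min_protein} aa"
    return True


print(check_model([(1, 300)]))
print(check_model([(1, 100), (105, 300)]))
print(check_model([(1, 30)]))
```

Returning either True or a message mirrors the documented `bool or str` return type, so callers should test `result is True` rather than truthiness, because a non-empty string is also truthy.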

gfftk.consensus.refine_cluster(locus, derived=[])

Identify potential sub-loci within a locus based on non-overlapping gene models.

This function analyzes a locus to determine if it contains multiple non-overlapping gene models from the same source, which might indicate that the locus should be split into multiple sub-loci. It focuses on ab initio gene predictors, which typically don’t predict overlapping models, so multiple models from the same source in a locus suggest the presence of multiple genes.

Parameters:

locus : dict

Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’

derived : list, optional

List of sources that are derived from evidence and should be ignored (default: [])

Returns:

dict or bool

Dictionary mapping sub-locus indices to lists of gene models if sub-loci are found, or False if no sub-loci are identified

gfftk.consensus.safe_extract_coordinates(coords)

Safely extract min and max coordinates from a nested coordinate structure.

This function handles various coordinate formats and structures, extracting the minimum and maximum coordinates while gracefully handling errors. It’s designed to work with potentially malformed or inconsistent coordinate data.

Parameters:

coords : list or tuple

Nested coordinate structure (list of lists, tuples, etc.)

Returns:

tuple or None

A tuple containing (min_coord, max_coord) if extraction succeeds, or None if extraction fails

gfftk.consensus.score_aggregator(locus_name, locus, weights, order, de_novo_aed_scores, evidence_scores, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)

gfftk.consensus.score_by_evidence(locus, weights={}, derived=[])

Calculate evidence-based scores for gene models in a locus.

This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. It assigns scores based on the alignment quality and the weights assigned to different gene prediction sources.

Parameters:

locus : dict

Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’

weights : dict, optional

Dictionary mapping gene model sources to weight values (default: {})

derived : list, optional

List of sources that are derived from evidence and should not be scored (default: [])

Returns:

dict

Dictionary mapping gene model names to evidence scores:

  • protein_evidence_score: Sum of protein evidence scores

  • transcript_evidence_score: Sum of transcript evidence scores

gfftk.consensus.score_evidence(g_coords, e_coords, weight=2)

Calculate a score for how well evidence aligns with a gene model.

This function evaluates how well evidence coordinates (protein or transcript alignments) match a gene model’s exon structure. It considers both the coverage (percentage of the gene model covered by evidence) and the matching of intron junctions (splice sites).

The scoring system ranges from 0 to 10 (before applying the weight multiplier):

  • 10: Perfect match (evidence exactly matches the gene model)

  • 5-9: Partial match (evidence partially covers the gene model or has some matching junctions)

  • 0: No match (evidence does not overlap with the gene model)

For multi-exon genes, the score is adjusted based on:

  • Base score from exon coverage (0-10 for each exon)

  • Percent coverage of the entire gene model

  • Ratio of matching intron junctions

Parameters:

g_coords : list

List of (start, end) coordinate tuples for the gene model’s exons

e_coords : list

List of (start, end) coordinate tuples for the evidence alignments

weight : int, optional

Weight multiplier to apply to the final score (default: 2)

Returns:

int

Score indicating how well the evidence supports the gene model, ranging from 0 (no support) to higher values (strong support)

gfftk.consensus.select_best_utrs(utr_exons_list, strand, min_length=10, max_length=2000)

Select the best UTR exons from multiple transcript evidence.

This function implements several strategies for selecting the most representative UTR structure from multiple transcript alignments.

Parameters:

utr_exons_list : list

List of lists, where each inner list contains UTR exon tuples (start, end)

strand (str)

Strand of the gene model ('+' or '-')

min_length (int, optional)

Minimum total length for a UTR to be considered (default: 10)

max_length (int, optional)

Maximum total length for a UTR to be considered (default: 2000)

Returns:

tuple

(best_utrs, method_used)

  • best_utrs: List of (start, end) tuples representing the best UTR exons

  • method_used: String describing the method used to select the UTRs
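The min_length and max_length parameters can be pictured as a filter on total UTR length. The sketch below is only an illustration of that filter, not of the selection strategies themselves, which are not described here. The helper names utr_length and filter_utrs_by_length are hypothetical, coordinates are assumed to be 1-based and inclusive, and both bounds are assumed to be inclusive.

```python
def utr_length(utr_exons):
    """Total length of a UTR given as a list of inclusive (start, end) tuples."""
    return sum(end - start + 1 for start, end in utr_exons)


def filter_utrs_by_length(utr_exons_list, min_length=10, max_length=2000):
    """Keep only candidate UTRs whose total length lies within the bounds.

    Hypothetical helper; treating both bounds as inclusive is an assumption.
    """
    return [
        utr for utr in utr_exons_list
        if min_length <= utr_length(utr) <= max_length
    ]


candidates = [
    [(1, 5)],                 # total length 5: too short
    [(1, 50), (101, 150)],    # total length 100: kept
    [(1, 3000)],              # total length 3000: too long
]
print(filter_utrs_by_length(candidates))  # [[(1, 50), (101, 150)]]
```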

gfftk.consensus.solve_sub_loci(result)
gfftk.consensus.src_scaling_factor(obj)

Calculate a scaling factor based on the diversity of gene prediction sources that agree.

This function analyzes the AED scores between gene models to determine how many different gene prediction sources agree with each other. It returns a scaling factor that reflects the proportion of unique sources that have only one model in agreement with others.

The scaling factor is used to adjust de novo distance scores to favor gene models that have agreement across multiple different sources rather than multiple models from the same source.

Parameters:

obj (dict)

Dictionary mapping gene model names to their AED scores with other gene models

Returns:

float

Scaling factor between 0 and 1, where 1 indicates all sources have only one model in agreement with others, and lower values indicate multiple models from the same source agree with others
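One possible reading of this description is sketched below. It is an interpretation, not the src_scaling_factor implementation: the function name source_diversity_factor, the agreeing_models list, and the source_of mapping are all hypothetical, and the sketch assumes the set of models that agree with others has already been derived from the AED scores.

```python
from collections import Counter


def source_diversity_factor(agreeing_models, source_of):
    """Fraction of sources that contribute exactly one agreeing model.

    agreeing_models: names of models found to agree with at least one other
    source_of: mapping of model name to prediction source
    Both inputs are assumptions made for this illustration.
    """
    counts = Counter(source_of[name] for name in agreeing_models)
    if not counts:
        return 0.0
    single = sum(1 for n in counts.values() if n == 1)
    return single / len(counts)


source_of = {"a1": "augustus", "a2": "augustus", "b1": "snap", "c1": "genemark"}
# Two of the three sources contribute a single agreeing model.
factor = source_diversity_factor(["a1", "a2", "b1", "c1"], source_of)
print(round(factor, 3))  # 0.667
```

Under this reading, a cluster supported by one model from each of three sources gets a factor of 1, while repeated models from a single source pull the factor down.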

gfftk.consensus.sub_cluster(obj)

Split a cluster of gene models into sub-clusters based on source.

This function analyzes a cluster of gene models to determine if it contains multiple models from the same source. If it does, it splits the cluster into sub-clusters, where each sub-cluster contains models that are more likely to belong together based on their overlap.

Parameters:

obj (list)

List of gene model tuples, where each tuple contains: (name, source, coords, codon_start)

Returns:

list

List of lists, where each inner list contains gene models that belong to the same sub-cluster
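The first step described above, deciding whether a cluster contains several models from the same source, can be sketched as follows. The helper names models_by_source and needs_splitting are hypothetical, and the overlap-based assignment of models to sub-clusters is not shown because its details are not given here.

```python
from collections import defaultdict


def models_by_source(cluster):
    """Group model names by source for (name, source, coords, codon_start) tuples."""
    grouped = defaultdict(list)
    for name, source, _coords, _codon_start in cluster:
        grouped[source].append(name)
    return dict(grouped)


def needs_splitting(cluster):
    """True if any source contributes more than one model to the cluster."""
    return any(len(names) > 1 for names in models_by_source(cluster).values())


cluster = [
    ("g1", "augustus", [(1, 100)], 1),
    ("g2", "augustus", [(500, 700)], 1),
    ("g3", "snap", [(1, 700)], 1),
]
print(models_by_source(cluster))  # {'augustus': ['g1', 'g2'], 'snap': ['g3']}
print(needs_splitting(cluster))   # True
```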

gfftk.convert

gfftk.convert.convert(args)
gfftk.convert.gff2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to CDS transcript [no UTRs] FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – CDS transcripts in FASTA format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]
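The difference between a CDS transcript (no UTRs) and a full transcript comes down to which coordinate segments are spliced together. The sketch below illustrates that idea on a toy sequence; it is not the GFFtk implementation, the helper names revcomp and spliced_sequence are hypothetical, and coordinates are assumed to be 1-based and inclusive as in GFF3.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def spliced_sequence(contig_seq, segments, strand):
    """Join 1-based inclusive (start, end) segments from a contig.

    Hypothetical helper: segments are joined in genomic order, then
    reverse complemented for minus-strand features.
    """
    ordered = sorted(segments)
    seq = "".join(contig_seq[start - 1:end] for start, end in ordered)
    return revcomp(seq) if strand == "-" else seq


contig = "GGATGAAACCTTTTAGCC"
exons = [(1, 8), (14, 18)]   # full transcript, including UTRs
cds = [(3, 8), (14, 16)]     # coding segments only

print(spliced_sequence(contig, exons, "+"))  # GGATGAAATAGCC
print(spliced_sequence(contig, cds, "+"))    # ATGAAATAG
```

The CDS transcript drops the two leading and two trailing bases that belong to the UTRs, leaving a sequence that starts with ATG and ends with a stop codon.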

gfftk.convert.gff2combined(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 and FASTA to combined GFF3+FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to combined GFF3+FASTA format with both annotations and sequences.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – combined GFF3+FASTA format file

  • grep (list, default=[]) – filter results to only include gene models with locus_tag matching grep

  • grepv (list, default=[]) – filter results to exclude gene models with locus_tag matching grepv
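In the GFF3 specification, a combined file places the annotation lines first, followed by a ##FASTA directive and then the sequences. The sketch below shows that layout only; it is not GFFtk's writer, the function name write_combined is hypothetical, and details such as sequence line wrapping are ignored.

```python
import io


def write_combined(gff_lines, sequences, handle):
    """Write annotation lines, a ##FASTA directive, then FASTA records.

    Hypothetical helper illustrating the combined layout; sequences is a
    dict keyed by contig name, and each sequence is written on one line.
    """
    handle.write("##gff-version 3\n")
    for line in gff_lines:
        handle.write(line.rstrip("\n") + "\n")
    handle.write("##FASTA\n")
    for name, seq in sequences.items():
        handle.write(f">{name}\n{seq}\n")


buffer = io.StringIO()
write_combined(
    ["chr1\texample\tgene\t1\t9\t.\t+\t.\tID=gene1"],
    {"chr1": "ATGAAATAG"},
    buffer,
)
print(buffer.getvalue(), end="")
```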

gfftk.convert.gff2gbff(gff, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert GFF3 format to GenBank format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.gff2gff3(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[], url_encode=False)

Convert GFF3 format to GFF3 format with filtering.

Will parse GFF3 format into GFFtk annotation dictionary, apply filtering, and then write to GFF3 output. This is useful for filtering GFF3 files. Default is to write to stdout.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – annotation file in GFF3 format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.gff2gtf(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to GTF format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – annotation file in GTF format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])

Convert GFF3 format to translated protein FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • strip_stop (bool, default=False) – remove stop codons (*) from translation

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format

gfftk.convert.gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to NCBI TBL format.

Will parse GFF3 annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – annotation file in NCBI TBL format

gfftk.convert.gff2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to transcript FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GFF3 format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – transcripts in FASTA format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.gtf2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to CDS transcript [no UTRs] FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – CDS transcripts in FASTA format

gfftk.convert.gtf2gbff(gtf, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert GTF format to GenBank format.

Will parse GTF format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:
  • gtf (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.gtf2gff(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to GFF3 format.

Will parse GTF format into GFFtk annotation dictionary and then write to GFF3 output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:
  • gff (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – annotation file in GFF3 format

gfftk.convert.gtf2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])

Convert GTF format to translated protein FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • strip_stop (bool, default=False) – remove stop codons (*) from translation

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format

gfftk.convert.gtf2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to NCBI TBL format.

Will parse GTF annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.

Parameters:
  • gff (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – annotation file in NCBI TBL format

gfftk.convert.gtf2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to transcript FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:
  • gff (filename) – genome annotation text file in GTF format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • debug (bool, default=False) – print debug information to stderr

  • output (str, default=sys.stdout) – transcripts in FASTA format

gfftk.convert.tbl2cdstranscripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to CDS transcript [no UTRs] FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – CDS transcripts in FASTA format

gfftk.convert.tbl2gbff(tbl, fasta, output=False, table=1, organism=False, strain=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert NCBI TBL format to GenBank format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to GFF3 format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GFF3 output. Default is to write to stdout.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – annotation file in GFF3 format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.tbl2gtf(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to GTF format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – annotation file in GTF format

  • grep (list, default=[]) – Filter gene models, keep matches. [key:value]

  • grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.tbl2proteins(tbl, fasta, output=False, table=1, strip_stop=False, grep=[], grepv=[])

Convert NCBI TBL format to translated protein FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • strip_stop (bool, default=False) – remove stop codons (*) from translation

  • output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format

gfftk.convert.tbl2transcripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to transcript FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:
  • tbl (filename) – genome annotation text file in NCBI tbl format

  • fasta (filename) – genome sequence in FASTA format

  • table (int, default=1) – codon table [1]

  • output (str, default=sys.stdout) – transcripts in FASTA format

gfftk.genbank

gfftk.genbank.dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False)
gfftk.genbank.dict2tbl(genesDict, scaff2genes, scaffLen, SeqCenter, SeqRefNum, skipList, output=False, annotations=False, external=False)

Convert funannotate annotation dictionaries to NCBI tbl output.

gfftk.genbank.drop_alt_coords(info, idxs)
gfftk.genbank.duplicate_coords(cds)
gfftk.genbank.fetch_coords(v, i=0, feature='gene')
gfftk.genbank.findUTRs(cds, mrna, strand)
gfftk.genbank.reformatGO(term, goDict={})
gfftk.genbank.sbt_writer(out)
gfftk.genbank.table2asn(tbl, genome, output=False, sbt=False, organism=False, strain=False, tmpdir='/tmp', table=1, cleanup=True)
gfftk.genbank.tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False)

Parse NCBI tbl format directly into a funannotate annotation dictionary. Converting directly from tbl avoids the conversion problems that arise with GBK files containing multiple transcripts; once the dictionary is loaded from tbl, the other output formats can be written directly from it.