Modules

gfftk.gff

gfftk.gff.dict2combined_gff_fasta(annotation_dict, fasta_dict, output=False, debug=False, source=False)

Write GFFtk annotation dictionary and FASTA sequences to combined GFF3+FASTA format.

Parameters:

annotation_dict (dict) – GFFtk standardized annotation dictionary
fasta_dict (dict) – Dictionary of sequences keyed by contig name
output (str or file handle, default=False) – Output file path or handle. If False, writes to stdout
debug (bool, default=False) – Print debug information
source (str, default=False) – Override source field in GFF3 output

Return type:

None

gfftk.gff.dict2gff3(infile, output=False, debug=False, source=False, newline=False, url_encode=False)

Convert GFFtk standardized annotation dictionary to GFF3 file.

An annotation dictionary generated by gff2dict or tbl2dict is passed as input and written out in GFF3 format.

Parameters:

infile (dict of dict) – standardized annotation dictionary keyed by locus_tag
output (str, default=sys.stdout) – annotation file in GFF3 format
debug (bool, default=False) – print debug information to stderr
source (str, default=False) – override source field in GFF3 output
newline (bool, default=False) – add newline after each gene
url_encode (bool, default=False) – URL encode attribute values for downstream tool compatibility

gfftk.gff.dict2gff3alignments(infile, output=False, debug=False, alignments='transcript', source=False, newline=False)

Convert GFFtk standardized annotation dictionary to GFF3 alignments file.

An annotation dictionary generated by gff2dict or tbl2dict is passed as input. The output format is GFF3 alignment, also known as EVM evidence format.

Parameters:

infile (dict of dict) – standardized annotation dictionary keyed by locus_tag
output (str, default=sys.stdout) – annotation file in GFF3 format
debug (bool, default=False) – print debug information to stderr

gfftk.gff.dict2gtf(infile, output=False, source=False)

Convert GFFtk standardized annotation dictionary to GTF file.

An annotation dictionary generated by gff2dict or tbl2dict is passed as input and written out in GTF format. Note that this function only writes protein-coding CDS features.

Parameters:

infile (dict of dict) – standardized annotation dictionary keyed by locus_tag
output (str, default=sys.stdout) – annotation in GTF format

gfftk.gff.gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format='auto', logger=sys.stderr.write)

Convert GFF3 and FASTA to standardized GFFtk dictionary format.

Annotation file in GFF3 format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.

Parameters:

gff (filename : str) – annotation text file in GFF3 format
fasta (filename : str) – genome text file in FASTA format
annotation (dict of str) – existing annotation dictionary
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
gap_filter (bool, default=False) – remove gene models that span gaps in sequence
logger (handle, default=sys.stderr.write) – where to log messages to

Returns:

annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag

Return type:

dict of dict

gfftk.gff.gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format='auto', logger=sys.stderr.write)

Convert GTF and FASTA to standardized GFFtk dictionary format.

Annotation file in GTF format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.

Parameters:

gtf (filename : str) – annotation text file in GTF format
fasta (filename : str) – genome text file in FASTA format
annotation (dict of str) – existing annotation dictionary
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
gap_filter (bool, default=False) – remove gene models that span gaps in sequence
logger (handle, default=sys.stderr.write) – where to log messages to

Returns:

annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag

Return type:

dict of dict

gfftk.gff.is_combined_gff_fasta(filename)

Check if a file contains both GFF3 and FASTA data.

Parameters:

filename (str) – Path to the file to check

Returns:

True if the file contains a ##FASTA directive, False otherwise

Return type:

bool
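The check described above can be pictured with a short sketch. This is not the gfftk implementation; `has_fasta_directive` is a hypothetical helper that assumes the directive appears alone on a line, as the GFF3 specification requires.

```python
import os
import tempfile


def has_fasta_directive(path):
    """Return True if any line in the file is the ##FASTA directive."""
    with open(path) as handle:
        for line in handle:
            if line.strip() == "##FASTA":
                return True
    return False


# Build a tiny combined file to exercise the check (made-up content).
combined = (
    "##gff-version 3\n"
    "contig1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">contig1\nATGAAATAG\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".gff3", delete=False) as tmp:
    tmp.write(combined)
result = has_fasta_directive(tmp.name)
os.remove(tmp.name)
print(result)
```

Scanning line by line stops at the first match, so a plain GFF3 file is read to the end while a combined file is only read up to the directive.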

gfftk.gff.simplifyGO(inputList)

gfftk.gff.split_combined_gff_fasta(filename)

Split a combined GFF3+FASTA file into separate GFF3 and FASTA components.

Parameters:

filename (str) – Path to the combined file

Returns:

(gff_content, fasta_content) as file-like objects

Return type:

tuple
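A minimal sketch of the splitting idea, assuming the file is cut at the first ##FASTA line and each half is wrapped in an in-memory handle. `split_combined_text` is a hypothetical helper that takes text rather than a filename, so it is not a drop-in for the function documented above.

```python
import io


def split_combined_text(text):
    """Split combined GFF3+FASTA text at the ##FASTA directive."""
    gff_lines, fasta_lines = [], []
    in_fasta = False
    for line in text.splitlines(keepends=True):
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True  # the directive line itself is dropped
            continue
        (fasta_lines if in_fasta else gff_lines).append(line)
    return io.StringIO("".join(gff_lines)), io.StringIO("".join(fasta_lines))


# Made-up combined content for illustration.
combined = (
    "##gff-version 3\n"
    "contig1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">contig1\nATGAAATAG\n"
)
gff_handle, fasta_handle = split_combined_text(combined)
gff_text = gff_handle.read()
fasta_text = fasta_handle.read()
```

Returning file-like objects lets downstream parsers treat both halves as if they had been opened from separate files.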

gfftk.gff.start_end_gap(seq, coords)

gfftk.gff.validate_and_translate_models(k, v, SeqRecords, gap_filter=False, table=1, logger=<built-in method write of _io.TextIOWrapper object>)

gfftk.gff.validate_models(annotation, fadict, logger=<built-in method write of _io.TextIOWrapper object>, table=1, gap_filter=False)

gfftk.consensus

gfftk.consensus.add_evidence(loci, evidence, source='proteins')

Add evidence alignments to loci based on genomic coordinates.

This function associates evidence alignments (proteins or transcripts) with gene loci based on their genomic coordinates. It builds interlap objects for efficient overlap queries and adds the evidence to each locus that it overlaps with.

Parameters:

loci (dict) – Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
evidence (dict) – Dictionary of evidence alignments
source (str, optional) – Type of evidence being added, either “proteins” or “transcripts” (default: “proteins”)

Returns:

None: The function modifies the loci dictionary in place

gfftk.consensus.auto_score_threshold(weights, order, user_weight=6)

gfftk.consensus.bed2interlap(bedfile, inter=False)

Parse a BED file and construct a scaffold/feature interlap dictionary.

This function reads a BED file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.

Parameters:

bedfile (str) – Path to the BED file
inter (dict or bool, optional) – Existing interlap object to update, or False to create a new one (default: False)

Returns:

tuple: A tuple containing:
- inter: Dictionary mapping contig names to interlap objects containing features
- length: Total length of all features in the BED file

gfftk.consensus.best_model_default(locus_name, contig, strand, locus, debug=False, weights={}, order={}, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1, evidence_derived_models=[])

Select the best gene model(s) for a locus using evidence-based scoring.

Parameters:

locus_name (str) – Name of the locus
contig (str) – Name of the contig
strand (str) – Strand of the locus (“+” or “-”)
locus (dict) – Dictionary containing gene models and evidence
debug (bool or str, optional) – Whether to print debug information, or path to a debug GFF file
weights (dict, optional) – Dictionary of weights for different sources
order (dict, optional) – Dictionary of order values for different sources
min_exon (int, optional) – Minimum exon length
min_intron (int, optional) – Minimum intron length
max_intron (int, optional) – Maximum intron length
max_exon (int, optional) – Maximum exon length
evidence_derived_models (list, optional) – List of evidence-derived models
use_dp (bool, optional) – Whether to use dynamic programming approach (not used in this function)
allow_multiple (bool, optional) – Whether to allow multiple gene models
min_gap_size (int, optional) – Minimum gap size for splitting loci

Returns:

list: List of tuples containing gene model IDs and their information

gfftk.consensus.calculate_gene_distance(locus)

Calculate Annotation Edit Distance (AED) between all pairs of gene models in a locus.

This function computes the AED between each pair of gene models in a locus, which measures how similar their exon structures are. The AED scores are used to determine which gene models have the most agreement with other models.

Parameters:

locus (dict) – Dictionary containing gene models for a single locus. Required keys: ‘genes’

Returns:

dict: Nested dictionary mapping gene model names to dictionaries of AED scores with other models {gene1: {gene2: aed_score, gene3: aed_score, …}, gene2: {…}, …}

gfftk.consensus.calculate_source_order(data)

Calculate a rank order of gene prediction sources based on evidence agreement.

This function analyzes how well each gene prediction source agrees with protein and transcript evidence across all loci. It filters the data to include only loci with sufficient evidence, calculates scores for each source based on evidence agreement, and returns a rank-ordered dictionary of sources with their scores.

The rank order is used to prioritize gene models when evidence is not available or is inconclusive. Sources that generally have better agreement with evidence receive higher scores and are ranked higher.

Parameters:

data (dict) – Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

Returns:

tuple: A tuple containing:
- order: OrderedDict mapping source names to their scores, ordered by score (highest first)
- n_filt: Number of loci that passed the evidence filter

gfftk.consensus.check_intron_compatibility(model_coords, transcript_coords, strand)

Check if transcript has compatible intron/exon boundaries with the gene model.

Parameters:

model_coords (list) – List of (start, end) tuples for the gene model exons
transcript_coords (list) – List of (start, end) tuples for the transcript alignment
strand (str) – Strand of the gene model (‘+’ or ‘-’)

Returns:

bool: True if the transcript has compatible intron/exon boundaries, False otherwise

gfftk.consensus.cluster_by_aed(locus, score=0.005)

Cluster gene models based on their Annotation Edit Distance (AED).

This function groups gene models that have very similar exon structures (low AED) into clusters. It is used to identify gene models that are essentially the same prediction but come from different sources, allowing the consensus module to select the best representative from each cluster.

Parameters:

locus (dict) – Dictionary containing gene models for a single locus. Required keys: ‘genes’
score (float, optional) – Maximum AED threshold for considering two gene models as part of the same cluster (default: 0.005)

Returns:

list: List of lists, where each inner list contains gene model IDs that belong to the same cluster

gfftk.consensus.cluster_interlap(obj)

Cluster genomic features using the interlap.reduce function.

This function takes an interlap object containing genomic features and clusters them based on their coordinates. Features that overlap or are adjacent to each other are grouped into the same cluster. The function then assigns the original feature data to each cluster, ensuring each gene model is only assigned to one locus.

Parameters:

obj (interlap.InterLap) – Interlap object containing genomic features

Returns:

list: List of dictionaries, where each dictionary represents a cluster of features. Each dictionary contains:
- locus: Tuple of (start, end) coordinates for the cluster
- genes: List of gene features in the cluster
- proteins: Empty list for protein evidence (filled later)
- transcripts: Empty list for transcript evidence (filled later)
- repeats: Empty list for repeat features (filled later)

gfftk.consensus.cluster_interlap_original(obj)

Original cluster_interlap function (before fix), kept for debugging.

gfftk.consensus.consensus(args)

gfftk.consensus.contained(a, b)

Check if coordinates in a are completely contained within coordinates in b.

This function determines if the interval defined by coordinates a is completely contained within the interval defined by coordinates b. It handles various edge cases and ensures that the coordinates are properly formatted as tuples or lists of integers.

Parameters:

a (tuple or list) – Tuple or list of (start, end) coordinates to check if contained
b (tuple or list) – Tuple or list of (start, end) coordinates that might contain a

Returns:

bool: True if a is completely contained within b, False otherwise
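As a rough illustration of the containment test, the sketch below compares interval endpoints. `is_contained` is a hypothetical helper; casting to int and sorting each pair so that start <= end is an assumption about what "properly formatted" means, not a description of the library's edge-case handling.

```python
def is_contained(a, b):
    """True if interval a = (start, end) lies entirely within interval b."""
    a_start, a_end = sorted(int(x) for x in a)
    b_start, b_end = sorted(int(x) for x in b)
    return b_start <= a_start and a_end <= b_end


inside = is_contained((150, 300), (100, 500))
partial = is_contained((50, 300), (100, 500))
```

Here the first interval sits fully inside (100, 500), while the second starts before it and so is only a partial overlap.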

gfftk.consensus.de_novo_distance(locus)

Calculate a score for each gene model based on its similarity to other models.

This function evaluates each gene model in a locus by calculating its Annotation Edit Distance (AED) with all other gene models in the locus. It then computes a score that reflects how similar the gene model is to other models, with higher scores indicating greater similarity.

This score is used as a proxy for prediction confidence when protein or transcript evidence is not available. Gene models that have more agreement with other models receive higher scores.

Parameters:

locus (dict) – Dictionary containing gene models for a single locus. Required keys: ‘genes’

Returns:

dict: Dictionary mapping gene model names to their de novo distance scores

gfftk.consensus.ensure_unique_names(genes)

Ensure gene model names are unique by appending a UUID slug.

This function takes a dictionary of gene models and appends a unique identifier to each gene model name to ensure there are no name collisions when combining gene models from multiple sources.

Parameters:

genes (dict) – Dictionary of gene models where keys are gene IDs

Returns:

dict: Dictionary of gene models with unique IDs
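The renaming idea can be sketched as follows. `add_unique_suffix`, the `.` separator, and the eight-character slug length are all illustrative assumptions; the source names in the sample data are made up.

```python
import uuid


def add_unique_suffix(genes):
    """Return a new dict whose keys carry a short random suffix."""
    renamed = {}
    for gene_id, model in genes.items():
        slug = uuid.uuid4().hex[:8]  # slug length is an arbitrary choice here
        renamed[f"{gene_id}.{slug}"] = model
    return renamed


genes = {"gene1": {"source": "predictorA"}, "gene2": {"source": "predictorB"}}
unique = add_unique_suffix(genes)
```

Because each key gets its own random slug, two input files that both contain a model named gene1 no longer collide once their dictionaries are merged.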

gfftk.consensus.extend_utrs(consensus_models, transcripts, genome_fasta, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)

Extend consensus gene models with UTRs based on transcript evidence.

This function examines transcript alignments that match consensus gene models and extends the models with 5’ and 3’ UTRs if supported by the evidence. Only transcripts with compatible intron/exon boundaries are considered. Properly handles spliced UTRs (UTRs containing introns).

UTR extension is limited to avoid overlapping with neighboring genes on the same strand.

Parameters:

consensus_models (dict) – Dictionary of consensus gene models
transcripts (dict) – Dictionary of transcript alignments
genome_fasta (str) – Path to the genome FASTA file
min_utr_length (int, optional) – Minimum length for a UTR to be added (default: 10)
max_utr_length (int, optional) – Maximum length for a UTR extension (default: 2000)
log (callable, optional) – Function to use for logging (default: sys.stderr.write)

Returns:

dict: Dictionary of consensus gene models with UTRs added where supported

gfftk.consensus.fasta_length(fasta)

Calculate the total length of all sequences in a FASTA file.

This function reads a FASTA file and sums the lengths of all sequences in the file. It can be used to determine the total genome size from a genome FASTA file.

Parameters:

fasta (str) – Path to the FASTA file

Returns:

int: Total length of all sequences in the FASTA file
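A minimal sketch of the summation, assuming plain uncompressed FASTA text. `total_fasta_length` is a hypothetical helper that works on an iterable of lines instead of a file path, so it can be shown without touching disk.

```python
def total_fasta_length(lines):
    """Sum sequence lengths over an iterable of FASTA lines."""
    total = 0
    for line in lines:
        line = line.strip()
        # Header lines start with ">" and carry no sequence.
        if line and not line.startswith(">"):
            total += len(line)
    return total


fasta_text = ">contig1\nATGAAATAG\nACGT\n>contig2\nGGCC\n"
length = total_fasta_length(fasta_text.splitlines())
print(length)
```

The same function accepts an open file handle, since iterating a handle also yields lines.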

gfftk.consensus.filter4evidence(data, n_genes=3, n_evidence=2)

Filter loci to include only those with sufficient gene models and evidence alignments.

This function filters the loci data structure to include only loci that have at least a specified number of gene models and evidence alignments (proteins + transcripts). It is used to identify high-confidence loci for calculating source weights.

Parameters:

data (dict) – Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
n_genes (int, optional) – Minimum number of gene models required in a locus (default: 3)
n_evidence (int, optional) – Minimum number of evidence alignments (proteins + transcripts) required in a locus (default: 2)

Returns:

tuple: A tuple containing:
- filt: Filtered dictionary of loci with sufficient gene models and evidence
- n_filt: Number of loci that passed the filter
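The filter can be sketched against the hierarchical loci structure described above. `filter_loci` is a hypothetical stand-in, and the sample loci contain placeholder strings where real gene models and alignments would be; only the ‘genes’, ‘proteins’, and ‘transcripts’ keys are taken from this documentation.

```python
def filter_loci(data, n_genes=3, n_evidence=2):
    """Keep loci with enough gene models and enough evidence alignments."""
    filtered = {}
    kept = 0
    for contig, strands in data.items():
        for strand, loci in strands.items():
            for locus in loci:
                evidence = len(locus["proteins"]) + len(locus["transcripts"])
                if len(locus["genes"]) >= n_genes and evidence >= n_evidence:
                    filtered.setdefault(contig, {}).setdefault(strand, []).append(locus)
                    kept += 1
    return filtered, kept


data = {
    "contig1": {
        "+": [
            {"genes": ["a", "b", "c"], "proteins": ["p1"], "transcripts": ["t1"]},
            {"genes": ["d"], "proteins": ["p2", "p3"], "transcripts": []},
        ],
        "-": [{"genes": ["e", "f", "g"], "proteins": [], "transcripts": []}],
    }
}
filt, n_filt = filter_loci(data)
```

Only the first locus passes: the second has too few gene models and the third has no evidence.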

gfftk.consensus.filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False)

Filter gene models based on their overlap with repeat regions.

This function filters out gene models that have a significant overlap with repeat regions. It builds an interlap object from the repeat file (GFF3 or BED) and calculates the percentage of each gene model that overlaps with repeats. Gene models with overlap percentage greater than the filter threshold are removed.

Parameters:

fasta (str) – Path to the genome FASTA file
repeats (str) – Path to the repeats GFF3 or BED file
gene_models (dict) – Dictionary of gene models to filter
filter_threshold (int, optional) – Maximum percentage of gene model that can overlap with repeats (default: 90)
log (callable or bool, optional) – Function to use for logging, or False to disable logging (default: False)

Returns:

tuple: A tuple containing:
- filtered: Dictionary of filtered gene models
- dropped: Number of gene models that were filtered out

gfftk.consensus.generate_consensus(fasta, genes, proteins, transcripts, weights, out, debug=False, minscore=False, repeats=False, repeat_overlap=90, tiebreakers='calculated', min_exon=3, min_intron=11, max_intron=-1, max_exon=-1, evidence_derived_models=[], num_processes=None, utrs=True, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)

Generate consensus gene models from multiple gene prediction sources and evidence.

This function is the main entry point for the consensus module. It takes gene predictions from multiple sources, along with protein and transcript evidence, and generates consensus gene models by selecting the best model at each locus based on evidence and source weights.

The function performs the following steps:
1. Parse input GFF3 files and cluster gene models into loci
2. Calculate source weights based on evidence if tiebreakers="calculated"
3. Select the best gene model at each locus based on evidence and source weights
4. Filter out gene models that overlap with repeats (if repeats are provided)
5. Write the consensus gene models to a GFF3 file

Parameters:

fasta (str) – Path to the genome FASTA file
genes (list) – List of paths to gene prediction GFF3 files
proteins (list) – List of paths to protein alignment GFF3 files
transcripts (list) – List of paths to transcript alignment GFF3 files
weights (list) – List of source:weight pairs for weighting gene prediction sources
out (str) – Path to the output GFF3 file
debug (bool or str, optional) – Whether to print debug information, or path to a debug GFF file (default: False)
minscore (bool or int, optional) – Minimum score threshold for gene models, or False to calculate automatically (default: False)
repeats (bool or str, optional) – Path to repeats GFF3 or BED file, or False to skip repeat filtering (default: False)
repeat_overlap (int, optional) – Maximum percentage of gene model that can overlap with repeats (default: 90)
tiebreakers (str, optional) – Method for calculating source weights, either “calculated” or “user” (default: “calculated”)
min_exon (int, optional) – Minimum exon length in nucleotides (default: 3)
min_intron (int, optional) – Minimum intron length in nucleotides (default: 11)
max_intron (int, optional) – Maximum intron length in nucleotides, or -1 for no limit (default: -1)
max_exon (int, optional) – Maximum exon length in nucleotides, or -1 for no limit (default: -1)
evidence_derived_models (list, optional) – List of sources that are derived from evidence and should be treated differently (default: [])
num_processes (int or None, optional) – Number of processes to use for parallel execution, or None for sequential (default: None)
log (callable, optional) – Function to use for logging (default: sys.stderr.write)

Returns:

dict: Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)

gfftk.consensus.getAED(query, reference)

Calculate Annotation Edit Distance (AED) between two transcript coordinates.

AED measures the similarity between two gene models by comparing their exon structures. It is calculated as 1 - (SN + SP) / 2, where:
- SN (Sensitivity) is the fraction of reference bases that are correctly predicted
- SP (Specificity) is the fraction of prediction bases that overlap with the reference

An AED of 0 means the gene models are identical, while an AED of 1 means they are completely different.

Parameters:

query (list) – List of (start, end) coordinate tuples for the query gene model’s exons
reference (list) – List of (start, end) coordinate tuples for the reference gene model’s exons

Returns:

float: AED score between 0 (identical) and 1 (completely different)
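The formula 1 - (SN + SP) / 2 can be worked through with a small sketch. `exon_bases` and `aed` are hypothetical helpers, not the gfftk code; coordinates are assumed to be inclusive, as in GFF3, and expanding exons into sets of positions is chosen for clarity rather than speed.

```python
def exon_bases(coords):
    """Expand inclusive (start, end) exon tuples into a set of positions."""
    bases = set()
    for start, end in coords:
        bases.update(range(min(start, end), max(start, end) + 1))
    return bases


def aed(query, reference):
    """Annotation Edit Distance: 1 - (SN + SP) / 2."""
    q, r = exon_bases(query), exon_bases(reference)
    shared = len(q & r)
    if shared == 0:
        return 1.0  # no shared bases: completely different
    sensitivity = shared / len(r)  # fraction of reference bases recovered
    specificity = shared / len(q)  # fraction of query bases that are shared
    return 1 - (sensitivity + specificity) / 2


identical = aed([(1, 100), (201, 300)], [(1, 100), (201, 300)])
disjoint = aed([(1, 100)], [(501, 600)])
partial = aed([(1, 100)], [(51, 150)])
```

In the partial case the two 100-base exons share 50 bases, so SN and SP are both 0.5 and the AED is 0.5.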

gfftk.consensus.get_loci(annot_dict)

Organize gene models into loci based on genomic coordinates and strand.

This function takes a dictionary of gene models and organizes them into loci based on their genomic coordinates and strand. It creates interlap objects for efficient overlap queries and clusters overlapping gene models into loci. It also filters out pseudogenes and gene models with multiple stop codons.

Parameters:

annot_dict (dict) – Dictionary of gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, CDS, etc.)

Returns:

tuple: A tuple containing:
- loci: Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
- n_loci: Total number of loci
- pseudo: List of pseudogenes that were filtered out

gfftk.consensus.get_overlap(a, b)

Calculate the overlap between two genomic intervals.

This function calculates the number of base pairs that overlap between two genomic intervals. If the intervals do not overlap, it returns 0.

Parameters:

a (tuple or list) – Tuple or list of (start, end) coordinates for the first interval
b (tuple or list) – Tuple or list of (start, end) coordinates for the second interval

Returns:

int: Number of base pairs that overlap between the two intervals, or 0 if they don’t overlap
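A short sketch of interval overlap, under the assumption that both endpoints are inclusive (the GFF3 convention). `overlap_bp` is a hypothetical helper; whether the library counts the end position in the same way is not stated here, so the exact value may differ by one.

```python
def overlap_bp(a, b):
    """Number of shared positions between two inclusive intervals."""
    start = max(min(a), min(b))  # rightmost of the two starts
    end = min(max(a), max(b))    # leftmost of the two ends
    return max(0, end - start + 1)


shared = overlap_bp((100, 200), (150, 300))
none = overlap_bp((1, 10), (20, 30))
```

The first pair shares positions 150 through 200; the second pair does not touch, so the clamp to zero applies.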

gfftk.consensus.gff2interlap(infile, fasta, inter=False)

Parse a GFF3 file and construct a scaffold/gene interlap dictionary.

This function reads a GFF3 file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.

Parameters:

infile (str) – Path to the GFF3 file
fasta (str) – Path to the genome FASTA file
inter (dict or bool, optional) – Existing interlap object to update, or False to create a new one (default: False)

Returns:

tuple: A tuple containing:
- inter: Dictionary mapping contig names to interlap objects containing features
- length: Total length of all features in the GFF3 file

gfftk.consensus.gff_writer(input, output)

Write consensus gene models to a GFF3 file.

This function takes a dictionary of consensus gene models and writes them to a GFF3 file. It sorts the gene models by contig and start location, and assigns sequential locus tags to each gene model. It also handles the conversion of gene model coordinates to GFF3 features (gene, mRNA, exon, CDS).

Parameters:

input (dict) – Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)
output (str) – Path to the output GFF3 file

Returns:

None: The function writes to the specified output file but does not return a value

gfftk.consensus.gffevidence2dict(file, Evi)

Parse evidence alignments from a GFF3 file into a dictionary.

This function reads a GFF3 file containing evidence alignments (proteins or transcripts) and converts it into a dictionary mapping alignment IDs to their information. It handles multi-exon alignments by combining exons with the same ID into a single entry.

Parameters:

file (str) – Path to the GFF3 file containing evidence alignments
Evi (dict) – Existing dictionary to update with new evidence alignments

Returns:

dict: Dictionary mapping alignment IDs to their information (target, type, source, strand, phase, contig, coords, location, score)

gfftk.consensus.map_coords(g_coords, e_coords)

Map evidence coordinates onto gene model coordinates.

This function takes evidence coordinates (protein or transcript alignments) and maps them onto gene model coordinates. It calculates the offset between each evidence coordinate and the corresponding gene model coordinate, which is used to determine how well the evidence aligns with the gene model.

Parameters:

g_coords (list) – List of (start, end) coordinate tuples for the gene model’s exons
e_coords (list) – List of (start, end) coordinate tuples for the evidence alignments

Returns:

list: List of lists, where each inner list contains the offset between an evidence coordinate and the corresponding gene model coordinate. The list has the same length as g_coords, with empty lists for gene model coordinates that don’t have a corresponding evidence coordinate.

gfftk.consensus.order_sources(locus)

Calculate evidence-based scores for gene models in a locus for source ranking.

This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. Unlike score_by_evidence, this function is used specifically for ranking gene prediction sources based on their agreement with evidence.

Parameters:

locus (dict) – Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’

Returns:

dict: Dictionary mapping gene model names to information:
- source: Source of the gene model
- coords: Coordinates of the gene model
- score: Combined evidence score for the gene model

gfftk.consensus.parse_data(genome, gene, protein, transcript, log=sys.stderr.write)

Parse input data files and build a locus data structure.

This function reads gene prediction files, protein alignment files, and transcript alignment files, and organizes them into a hierarchical locus data structure. It assigns unique identifiers to gene models to avoid name collisions and tracks the sources of all predictions and evidence.

Parameters:

genome (str) – Path to the genome FASTA file
gene (list) – List of paths to gene prediction GFF3 files
protein (list) – List of paths to protein alignment GFF3 files
transcript (list) – List of paths to transcript alignment GFF3 files
log (callable, optional) – Function to use for logging (default: sys.stderr.write)

Returns:

dict: Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}

gfftk.consensus.reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)

Check if a gene model has reasonable exon and intron lengths.

This function evaluates a gene model to determine if it has reasonable exon and intron lengths based on specified thresholds. It checks minimum exon length, maximum exon length, minimum intron length, maximum intron length, and minimum protein length.

Parameters:

coords (list) – List of (start, end) coordinate tuples for the gene model’s exons
min_protein (int, optional) – Minimum protein length in amino acids (default: 30)
min_exon (int, optional) – Minimum exon length in nucleotides (default: 3)
min_intron (int, optional) – Minimum intron length in nucleotides (default: 10)
max_intron (int, optional) – Maximum intron length in nucleotides, or -1 for no limit (default: -1)
max_exon (int, optional) – Maximum exon length in nucleotides, or -1 for no limit (default: -1)

Returns:

bool or str: True if the gene model is reasonable, or a string describing the reason it’s not reasonable
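The five checks listed above can be sketched as a single pass over the exon list. `check_model` is a hypothetical helper; the order of the checks, the wording of the reason strings, and estimating protein length as total exon length divided by three are assumptions made for illustration.

```python
def check_model(coords, min_protein=30, min_exon=3, min_intron=10,
                max_intron=-1, max_exon=-1):
    """Return True if every length check passes, else a reason string."""
    exons = sorted((min(s, e), max(s, e)) for s, e in coords)
    exon_lengths = [end - start + 1 for start, end in exons]
    # Intron length = gap between consecutive exons (inclusive coordinates).
    intron_lengths = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]

    if sum(exon_lengths) // 3 < min_protein:
        return f"protein shorter than {min_protein} aa"
    if any(n < min_exon for n in exon_lengths):
        return f"exon shorter than {min_exon} nt"
    if max_exon > 0 and any(n > max_exon for n in exon_lengths):
        return f"exon longer than {max_exon} nt"
    if any(n < min_intron for n in intron_lengths):
        return f"intron shorter than {min_intron} nt"
    if max_intron > 0 and any(n > max_intron for n in intron_lengths):
        return f"intron longer than {max_intron} nt"
    return True


good = check_model([(1, 150), (201, 350)])
short_intron = check_model([(1, 150), (156, 350)])
tiny = check_model([(1, 30)])
```

Because a non-empty string is truthy, callers of a True-or-reason function should compare the result with `is True` rather than rely on truthiness.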

gfftk.consensus.refine_cluster(locus, derived=[])

Identify potential sub-loci within a locus based on non-overlapping gene models.

This function analyzes a locus to determine if it contains multiple non-overlapping gene models from the same source, which might indicate that the locus should be split into multiple sub-loci. It focuses on ab initio gene predictors, which typically don’t predict overlapping models, so multiple models from the same source in a locus suggest the presence of multiple genes.

Parameters:

locus (dict) – Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’
derived (list, optional) – List of sources that are derived from evidence and should be ignored (default: [])

Returns:

dict or bool: Dictionary mapping sub-locus indices to lists of gene models if sub-loci are found, or False if no sub-loci are identified

gfftk.consensus.safe_extract_coordinates(coords)

Safely extract min and max coordinates from a nested coordinate structure.

This function handles various coordinate formats and structures, extracting the minimum and maximum coordinates while gracefully handling errors. It’s designed to work with potentially malformed or inconsistent coordinate data.

Parameters:

coords (list or tuple) – Nested coordinate structure (list of lists, tuples, etc.)

Returns:

tuple or None: A tuple containing (min_coord, max_coord) if extraction succeeds, or None if extraction fails

gfftk.consensus.score_aggregator(locus_name, locus, weights, order, de_novo_aed_scores, evidence_scores, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)

gfftk.consensus.score_by_evidence(locus, weights={}, derived=[])

Calculate evidence-based scores for gene models in a locus.

This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. It assigns scores based on the alignment quality and the weights assigned to different gene prediction sources.

Parameters:

locus (dict) – Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’
weights (dict, optional) – Dictionary mapping gene model sources to weight values (default: {})
derived (list, optional) – List of sources that are derived from evidence and should not be scored (default: [])

Returns:

dict: Dictionary mapping gene model names to evidence scores:
- protein_evidence_score: Sum of protein evidence scores
- transcript_evidence_score: Sum of transcript evidence scores

gfftk.consensus.score_evidence(g_coords, e_coords, weight=2)

Calculate a score for how well evidence aligns with a gene model.

This function evaluates how well evidence coordinates (protein or transcript alignments) match a gene model’s exon structure. It considers both the coverage (percentage of the gene model covered by evidence) and the matching of intron junctions (splice sites).

The scoring system ranges from 0 to 10 (before applying the weight multiplier):
- 10: Perfect match (evidence exactly matches the gene model)
- 5-9: Partial match (evidence partially covers the gene model or has some matching junctions)
- 0: No match (evidence does not overlap with the gene model)

For multi-exon genes, the score is adjusted based on:
- Base score from exon coverage (0-10 for each exon)
- Percent coverage of the entire gene model
- Ratio of matching intron junctions

Parameters:

g_coords (list) – List of (start, end) coordinate tuples for the gene model’s exons
e_coords (list) – List of (start, end) coordinate tuples for the evidence alignments
weight (int, optional) – Weight multiplier to apply to the final score (default: 2)

Returns:

int: Score indicating how well the evidence supports the gene model, ranging from 0 (no support) to higher values (strong support)

gfftk.consensus.select_best_utrs(utr_exons_list, strand, min_length=10, max_length=2000)

Select the best UTR exons from multiple transcript evidence.

This function implements several strategies for selecting the most representative UTR structure from multiple transcript alignments.

Parameters:

utr_exons_list (list) – List of lists, where each inner list contains UTR exon tuples (start, end)
strand (str) – Strand of the gene model (‘+’ or ‘-’)
min_length (int, optional) – Minimum total length for a UTR to be considered (default: 10)
max_length (int, optional) – Maximum total length for a UTR to be considered (default: 2000)

Returns:

tuple: (best_utrs, method_used)
- best_utrs: List of (start, end) tuples representing the best UTR exons
- method_used: String describing the method used to select the UTRs
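
The selection strategies themselves are not listed here, so the following is only a sketch of one plausible approach: drop candidates whose total length falls outside min_length..max_length, then prefer the structure seen most often. The function name pick_utr and the method labels "most_common", "first_valid" and "none" are hypothetical, and the strand argument is left out because this simplified vote does not need it.

```python
from collections import Counter


def pick_utr(utr_exons_list, min_length=10, max_length=2000):
    """Return (best_utrs, method_used) via a length filter plus majority vote."""
    candidates = []
    for exons in utr_exons_list:
        total = sum(end - start + 1 for start, end in exons)
        if min_length <= total <= max_length:
            # Tuples are hashable, so identical structures can be counted.
            candidates.append(tuple(sorted(exons)))
    if not candidates:
        return [], "none"
    best, count = Counter(candidates).most_common(1)[0]
    # With no repeated structure, most_common falls back to the first candidate.
    method = "most_common" if count > 1 else "first_valid"
    return list(best), method


print(pick_utr([[(100, 150)], [(100, 150)], [(90, 150)], [(1, 5)]]))
```

Returning the method label alongside the coordinates mirrors the documented return shape and makes it easy to audit which rule produced each UTR.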

gfftk.consensus.solve_sub_loci(result)

gfftk.consensus.src_scaling_factor(obj)

Calculate a scaling factor based on the diversity of gene prediction sources that agree.

This function analyzes the AED scores between gene models to determine how many different gene prediction sources agree with each other. It returns a scaling factor that reflects the proportion of unique sources that have only one model in agreement with others.

The scaling factor is used to adjust de novo distance scores to favor gene models that have agreement across multiple different sources rather than multiple models from the same source.

Parameters:

obj (dict) – Dictionary mapping gene model names to their AED scores with other gene models

Returns:

float: Scaling factor between 0 and 1, where 1 indicates all sources have only one model in agreement with others, and lower values indicate multiple models from the same source agree with others
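
A minimal sketch of the idea, under stated assumptions: the source of each model is supplied in a separate mapping (how GFFtk derives the source from obj is not described here), and two models "agree" when their AED is at or below a hypothetical max_aed cutoff. The function name source_diversity_factor is likewise made up for illustration.

```python
from collections import Counter


def source_diversity_factor(aed, sources, max_aed=0.5):
    """Proportion of sources contributing exactly one model that agrees with another.

    aed:     {model: {other_model: aed_score}}
    sources: {model: source_name}  (hypothetical input for this sketch)
    """
    agreeing = {
        model
        for model, others in aed.items()
        if any(score <= max_aed for other, score in others.items() if other != model)
    }
    per_source = Counter(sources[model] for model in agreeing)
    if not per_source:
        return 0.0
    singles = sum(1 for n in per_source.values() if n == 1)
    return singles / len(per_source)


aed = {
    "a1": {"a2": 0.2, "b1": 0.1},
    "a2": {"a1": 0.2, "b1": 0.9},
    "b1": {"a1": 0.1, "a2": 0.9},
}
sources = {"a1": "augustus", "a2": "augustus", "b1": "braker"}
print(source_diversity_factor(aed, sources))
```

In this made-up example one of the two sources contributes two agreeing models, so only half of the sources count as independent support.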

gfftk.consensus.sub_cluster(obj)

Split a cluster of gene models into sub-clusters based on source.

This function analyzes a cluster of gene models to determine if it contains multiple models from the same source. If it does, it splits the cluster into sub-clusters, where each sub-cluster contains models that are more likely to belong together based on their overlap.

Parameters:

obj (list) – List of gene model tuples, where each tuple contains: (name, source, coords, codon_start)

Returns:

list: List of lists, where each inner list contains gene models that belong to the same sub-cluster
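
As an illustration only, the sketch below splits a cluster when some source contributes more than one model: those models seed the sub-clusters, and every other model joins the seed it overlaps most. The helper names span, overlap and split_by_source are hypothetical, and the real function may weigh overlap differently.

```python
from collections import defaultdict


def span(coords):
    """Outer (start, end) of a list of exon tuples."""
    return min(s for s, _ in coords), max(e for _, e in coords)


def overlap(a, b):
    """Number of bases shared by two inclusive (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def split_by_source(models):
    """Split a cluster of (name, source, coords, codon_start) tuples."""
    by_source = defaultdict(list)
    for model in models:
        by_source[model[1]].append(model)
    seeds = max(by_source.values(), key=len)
    if len(seeds) < 2:
        return [list(models)]  # no source is duplicated, keep the cluster whole
    clusters = [[seed] for seed in seeds]
    for model in models:
        if model in seeds:
            continue
        scores = [overlap(span(model[2]), span(seed[2])) for seed in seeds]
        clusters[scores.index(max(scores))].append(model)
    return clusters


cluster = [
    ("a1", "augustus", [(1, 100)], 1),
    ("a2", "augustus", [(500, 700)], 1),
    ("g1", "genemark", [(10, 90)], 1),
    ("g2", "genemark", [(520, 690)], 1),
]
for sub in split_by_source(cluster):
    print([m[0] for m in sub])
```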

gfftk.convert

gfftk.convert.convert(args)

gfftk.convert.gff2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to CDS transcript [no UTRs] FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – CDS transcripts (no UTRs) in FASTA format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
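
To make "CDS transcript [no UTRs]" concrete, the sketch below joins the CDS segments of one gene model and reverse-complements the result on the minus strand. It assumes GFF3-style 1-based, inclusive coordinates; the function name cds_transcript is hypothetical and this is not the code GFFtk runs.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def cds_transcript(contig_seq, cds_coords, strand):
    """Join CDS segments (1-based, inclusive) and orient them 5' to 3'."""
    # Slicing with start - 1 converts 1-based inclusive coordinates to Python slices.
    pieces = [contig_seq[start - 1:end] for start, end in sorted(cds_coords)]
    seq = "".join(pieces)
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


print(cds_transcript("ccATGaaaGCCTAAgg", [(3, 5), (9, 14)], "+"))
print(cds_transcript("TTAGGCCAT", [(1, 9)], "-"))
```

A full transcript would be built the same way from exon coordinates instead of CDS coordinates, which is the difference between the cdstranscripts and transcripts converters.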

gfftk.convert.gff2combined(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 and FASTA to combined GFF3+FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to combined GFF3+FASTA format with both annotations and sequences.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – combined GFF3+FASTA format file
grep (list, default=[]) – filter results to only include gene models with locus_tag matching grep
grepv (list, default=[]) – filter results to exclude gene models with locus_tag matching grepv
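
In the GFF3 specification, a combined file places the annotation lines first, then a ##FASTA directive, then the sequences. The sketch below assembles such a file as a string; the function name combined_gff3_fasta, the 60-column line width, and the example feature line are illustrative assumptions rather than GFFtk output.

```python
def combined_gff3_fasta(gff3_lines, sequences, width=60):
    """Return GFF3 annotation lines followed by ##FASTA and wrapped sequences."""
    out = ["##gff-version 3"]
    # Avoid emitting the version pragma twice if the input already has one.
    out.extend(line for line in gff3_lines if not line.startswith("##gff-version"))
    out.append("##FASTA")
    for name, seq in sequences.items():
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


features = ["contig1\texample\tgene\t1\t9\t.\t+\t.\tID=gene1"]
print(combined_gff3_fasta(features, {"contig1": "ATGGCCTAA"}), end="")
```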

gfftk.convert.gff2gbff(gff, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert GFF3 format to GenBank format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.gff2gff3(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[], url_encode=False)

Convert GFF3 format to GFF3 format with filtering.

Will parse GFF3 format into GFFtk annotation dictionary, apply filtering, and then write to GFF3 output. This is useful for filtering GFF3 files. Default is to write to stdout.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in GFF3 format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
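
The key:value filters can be pictured as follows. This is a hedged sketch, not GFFtk's matching logic: it assumes case-sensitive substring matching, assumes a model must satisfy every grep pattern, and uses made-up dictionary keys (contig, product) purely for illustration. The function name filter_models is hypothetical.

```python
def filter_models(models, grep=(), grepv=()):
    """Keep models matching all grep patterns and none of the grepv patterns."""

    def matches(model, pattern):
        key, _, value = pattern.partition(":")
        field = model.get(key, "")
        # Fields may hold one value or a list (e.g. one entry per transcript).
        values = field if isinstance(field, (list, tuple)) else [field]
        return any(value in str(v) for v in values)

    kept = {}
    for locus_tag, model in models.items():
        if grep and not all(matches(model, p) for p in grep):
            continue
        if any(matches(model, p) for p in grepv):
            continue
        kept[locus_tag] = model
    return kept


models = {
    "g1": {"contig": "chr1", "product": ["kinase"]},
    "g2": {"contig": "chr2", "product": ["hypothetical protein"]},
}
print(list(filter_models(models, grepv=["product:hypothetical"])))
```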

gfftk.convert.gff2gtf(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to GTF format.

Will parse GFF3 format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in GTF format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])

Convert GFF3 format to translated protein FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format
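
The table and strip_stop parameters can be illustrated with a small translation routine for codon table 1 (the standard genetic code). The sketch is self-contained and hypothetical: the function name translate is made up, unknown or ambiguous codons are rendered as X, and strip_stop is assumed to remove only trailing stop symbols, which may differ from what GFFtk does.

```python
from itertools import product

BASES = "TCAG"
# NCBI translation table 1, listed in TCAG order with the third base cycling fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE_1 = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, strip_stop=False):
    """Translate a CDS with codon table 1; optionally drop trailing '*'."""
    cds = cds.upper()
    protein = "".join(
        CODON_TABLE_1.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    return protein.rstrip("*") if strip_stop else protein


print(translate("ATGGCCTAA"))
print(translate("ATGGCCTAA", strip_stop=True))
```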

gfftk.convert.gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to NCBI TBL format.

Will parse GFF3 annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in NCBI TBL format

gfftk.convert.gff2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GFF3 format to transcript FASTA format.

Will parse GFF3 format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:

gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – transcripts in FASTA format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.gtf2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to CDS transcript [no UTRs] FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:

gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – CDS transcripts (no UTRs) in FASTA format

gfftk.convert.gtf2gbff(gtf, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert GTF format to GenBank format.

Will parse GTF format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:

gtf (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.gtf2gff(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to GFF3 format.

Will parse GTF format into GFFtk annotation dictionary and then write to GFF3 output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:

gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in GFF3 format

gfftk.convert.gtf2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])

Convert GTF format to translated protein FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:

gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format

gfftk.convert.gtf2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to NCBI TBL format.

Will parse GTF annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.

Parameters:

gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in NCBI TBL format

gfftk.convert.gtf2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])

Convert GTF format to transcript FASTA format.

Will parse GTF format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:

gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – transcripts in FASTA format

gfftk.convert.tbl2cdstranscripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to CDS transcript [no UTRs] FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – CDS transcripts (no UTRs) in FASTA format

gfftk.convert.tbl2gbff(tbl, fasta, output=False, table=1, organism=False, strain=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])

Convert NCBI TBL format to GenBank format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GenBank output.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GenBank format

gfftk.convert.tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to GFF3 format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GFF3 output. Default is to write to stdout.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GFF3 format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.tbl2gtf(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to GTF format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GTF format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]

gfftk.convert.tbl2proteins(tbl, fasta, output=False, table=1, strip_stop=False, grep=[], grepv=[])

Convert NCBI TBL format to translated protein FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format

gfftk.convert.tbl2transcripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])

Convert NCBI TBL format to transcript FASTA format.

Will parse NCBI TBL format into GFFtk annotation dictionary and then write transcripts in FASTA format.

Parameters:

tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – transcripts in FASTA format

gfftk.genbank

gfftk.genbank.dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False)

gfftk.genbank.dict2tbl(genesDict, scaff2genes, scaffLen, SeqCenter, SeqRefNum, skipList, output=False, annotations=False, external=False)

Take funannotate annotation dictionaries and convert them to NCBI tbl output.

gfftk.genbank.drop_alt_coords(info, idxs)

gfftk.genbank.duplicate_coords(cds)

gfftk.genbank.fetch_coords(v, i=0, feature='gene')

gfftk.genbank.findUTRs(cds, mrna, strand)

gfftk.genbank.reformatGO(term, goDict={})

gfftk.genbank.sbt_writer(out)

gfftk.genbank.table2asn(tbl, genome, output=False, sbt=False, organism=False, strain=False, tmpdir='/tmp', table=1, cleanup=True)

gfftk.genbank.tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False)

Convert directly from NCBI tbl format into the funannotate annotation dictionary. This avoids conversion problems with GBK files that have multiple transcripts: once the dictionary is loaded directly from tbl format, the other output formats can be written from it directly.