Modules
gfftk.gff
- gfftk.gff.dict2combined_gff_fasta(annotation_dict, fasta_dict, output=False, debug=False, source=False)
Write GFFtk annotation dictionary and FASTA sequences to combined GFF3+FASTA format.
- Parameters:
annotation_dict (dict) – GFFtk standardized annotation dictionary
fasta_dict (dict) – Dictionary of sequences keyed by contig name
output (str or file handle, default=False) – Output file path or handle. If False, writes to stdout
debug (bool, default=False) – Print debug information
source (str, default=False) – Override source field in GFF3 output
- Return type:
None
- gfftk.gff.dict2gff3(infile, output=False, debug=False, source=False, newline=False, url_encode=False)
Convert GFFtk standardized annotation dictionary to GFF3 file.
Takes an annotation dictionary generated by gff2dict or tbl2dict as input and writes it in GFF3 format.
- Parameters:
infile (dict of dict) – standardized annotation dictionary keyed by locus_tag
output (str, default=False) – path of the GFF3 annotation file to write. If False, writes to stdout
debug (bool, default=False) – print debug information to stderr
source (str, default=False) – override source field in GFF3 output
newline (bool, default=False) – add newline after each gene
url_encode (bool, default=False) – URL encode attribute values for downstream tool compatibility
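The effect of URL encoding on an attribute value can be pictured with the standard library. This is only a sketch: `encode_attribute` is a hypothetical helper, and the set of characters GFFtk actually escapes is not stated here, so leaving spaces unescaped is an assumption.

```python
from urllib.parse import quote


def encode_attribute(value):
    """Percent-encode characters that have special meaning in GFF3 column 9.

    Hypothetical helper, not part of gfftk. Spaces are kept readable;
    ';', '=', ',', '&' and '%' are escaped by urllib.parse.quote.
    """
    return quote(value, safe=" ")


print(encode_attribute("kinase; ATP=binding, 50%"))
```

Letters, digits, and `_.-~` are never escaped by `quote`, so ordinary identifiers pass through unchanged.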
- gfftk.gff.dict2gff3alignments(infile, output=False, debug=False, alignments='transcript', source=False, newline=False)
Convert GFFtk standardized annotation dictionary to GFF3 alignments file.
Takes an annotation dictionary generated by gff2dict or tbl2dict as input. The output format is GFF3 alignment, also known as EVM evidence format.
- gfftk.gff.dict2gtf(infile, output=False, source=False)
Convert GFFtk standardized annotation dictionary to GTF file.
Takes an annotation dictionary generated by gff2dict or tbl2dict as input and writes it in GTF format. Note that this function only writes protein-coding CDS features.
- gfftk.gff.gff2dict(gff, fasta, annotation=False, table=1, debug=False, gap_filter=False, gff_format='auto', logger=sys.stderr.write)
Convert GFF3 and FASTA to standardized GFFtk dictionary format.
Annotation file in GFF3 format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.
- Parameters:
gff (filename : str) – annotation text file in GFF3 format
fasta (filename : str) – genome text file in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
gap_filter (bool, default=False) – remove gene models that span gaps in sequence
logger (handle, default=sys.stderr.write) – where to log messages to
- Returns:
annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag
- Return type:
OrderedDict
- gfftk.gff.gtf2dict(gtf, fasta, annotation=False, table=1, debug=False, gap_filter=False, gtf_format='auto', logger=sys.stderr.write)
Convert GTF and FASTA to standardized GFFtk dictionary format.
Annotation file in GTF format and genome FASTA file are parsed. The result is a dictionary that is keyed by locus_tag (gene name) and the value is a nested dictionary containing feature information.
- Parameters:
gtf (filename : str) – annotation text file in GTF format
fasta (filename : str) – genome text file in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
gap_filter (bool, default=False) – remove gene models that span gaps in sequence
logger (handle, default=sys.stderr.write) – where to log messages to
- Returns:
annotation – standardized annotation dictionary (OrderedDict) keyed by locus_tag
- Return type:
OrderedDict
- gfftk.gff.is_combined_gff_fasta(filename)
Check if a file contains both GFF3 and FASTA data.
- gfftk.gff.simplifyGO(inputList)
- gfftk.gff.split_combined_gff_fasta(filename)
Split a combined GFF3+FASTA file into separate GFF3 and FASTA components.
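In the GFF3 format, embedded sequences follow a `##FASTA` directive, so detecting and splitting a combined file comes down to finding that line. The sketch below works on text rather than on a filename, and `split_gff_fasta_text` and `looks_combined` are hypothetical names; the return values of the gfftk functions are not documented here.

```python
def split_gff_fasta_text(text):
    """Split combined GFF3+FASTA text at the ##FASTA directive.

    Returns (gff_text, fasta_text); fasta_text is empty when no
    directive is present.
    """
    gff_lines, fasta_lines = [], []
    in_fasta = False
    for line in text.splitlines():
        if line.strip() == "##FASTA":
            in_fasta = True  # everything after this line is sequence data
            continue
        (fasta_lines if in_fasta else gff_lines).append(line)
    return "\n".join(gff_lines), "\n".join(fasta_lines)


def looks_combined(text):
    """True when the text has annotation lines and a FASTA section."""
    gff, fasta = split_gff_fasta_text(text)
    return bool(gff.strip()) and fasta.lstrip().startswith(">")


combined = (
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1\n"
    "ATGAAATAG\n"
)
gff_part, fasta_part = split_gff_fasta_text(combined)
```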
- gfftk.gff.start_end_gap(seq, coords)
- gfftk.gff.validate_and_translate_models(k, v, SeqRecords, gap_filter=False, table=1, logger=<built-in method write of _io.TextIOWrapper object>)
- gfftk.gff.validate_models(annotation, fadict, logger=<built-in method write of _io.TextIOWrapper object>, table=1, gap_filter=False)
gfftk.consensus
- gfftk.consensus.add_evidence(loci, evidence, source='proteins')
Add evidence alignments to loci based on genomic coordinates.
This function associates evidence alignments (proteins or transcripts) with gene loci based on their genomic coordinates. It builds interlap objects for efficient overlap queries and adds the evidence to each locus that it overlaps with.
Parameters:
- loci (dict)
Hierarchical dictionary of loci organized by contig and strand: {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
- evidence (dict)
Dictionary of evidence alignments
- source (str, optional)
Type of evidence being added, either “proteins” or “transcripts” (default: “proteins”)
Returns:
- None
The function modifies the loci dictionary in place
- gfftk.consensus.auto_score_threshold(weights, order, user_weight=6)
- gfftk.consensus.bed2interlap(bedfile, inter=False)
Parse a BED file and construct a scaffold/feature interlap dictionary.
This function reads a BED file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.
Parameters:
- bedfile (str)
Path to the BED file
- inter (dict or bool, optional)
Existing interlap object to update, or False to create a new one (default: False)
Returns:
- tuple
A tuple (inter, length): inter is a dictionary mapping contig names to interlap objects containing features; length is the total length of all features in the BED file
- gfftk.consensus.best_model_default(locus_name, contig, strand, locus, debug=False, weights={}, order={}, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1, evidence_derived_models=[])
Select the best gene model(s) for a locus using evidence-based scoring.
Parameters:
- locus_name (str)
Name of the locus
- contig (str)
Name of the contig
- strand (str)
Strand of the locus (“+” or “-”)
- locus (dict)
Dictionary containing gene models and evidence
- debug (bool or str, optional)
Whether to print debug information or path to debug GFF file
- weights (dict, optional)
Dictionary of weights for different sources
- order (dict, optional)
Dictionary of order values for different sources
- min_exon (int, optional)
Minimum exon length
- min_intron (int, optional)
Minimum intron length
- max_intron (int, optional)
Maximum intron length
- max_exon (int, optional)
Maximum exon length
- evidence_derived_models (list, optional)
List of evidence-derived models
- use_dp (bool, optional)
Whether to use dynamic programming approach (not used in this function)
- allow_multiple (bool, optional)
Whether to allow multiple gene models
- min_gap_size (int, optional)
Minimum gap size for splitting loci
Returns:
- list
List of tuples containing gene model IDs and their information
- gfftk.consensus.calculate_gene_distance(locus)
Calculate Annotation Edit Distance (AED) between all pairs of gene models in a locus.
This function computes the AED between each pair of gene models in a locus, which measures how similar their exon structures are. The AED scores are used to determine which gene models have the most agreement with other models.
Parameters:
- locus (dict)
Dictionary containing gene models for a single locus. Required keys: ‘genes’
Returns:
- dict
Nested dictionary mapping gene model names to dictionaries of AED scores with other models: {gene1: {gene2: aed_score, gene3: aed_score, …}, gene2: {…}, …}
- gfftk.consensus.calculate_source_order(data)
Calculate a rank order of gene prediction sources based on evidence agreement.
This function analyzes how well each gene prediction source agrees with protein and transcript evidence across all loci. It filters the data to include only loci with sufficient evidence, calculates scores for each source based on evidence agreement, and returns a rank-ordered dictionary of sources with their scores.
The rank order is used to prioritize gene models when evidence is not available or is inconclusive. Sources that generally have better agreement with evidence receive higher scores and are ranked higher.
Parameters:
- data (dict)
Hierarchical dictionary of loci organized by contig and strand: {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
Returns:
- tuple
A tuple (order, n_filt): order is an OrderedDict mapping source names to their scores, ordered by score (highest first); n_filt is the number of loci that passed the evidence filter
- gfftk.consensus.check_intron_compatibility(model_coords, transcript_coords, strand)
Check if transcript has compatible intron/exon boundaries with the gene model.
Parameters:
- model_coords (list)
List of (start, end) tuples for the gene model exons
- transcript_coords (list)
List of (start, end) tuples for the transcript alignment
- strand (str)
Strand of the gene model (‘+’ or ‘-’)
Returns:
- bool
True if the transcript has compatible intron/exon boundaries, False otherwise
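One plausible reading of "compatible intron/exon boundaries" is that every transcript intron lying inside the gene model's span must also be an intron of the model. The sketch below implements only that reading; the helper names are hypothetical, the strand argument is omitted, and cases such as a transcript exon spanning a model intron are deliberately not checked.

```python
def introns(exons):
    """Derive intron (start, end) intervals from exon intervals (inclusive)."""
    exons = sorted((min(s, e), max(s, e)) for s, e in exons)
    return [(cur[1] + 1, nxt[0] - 1) for cur, nxt in zip(exons, exons[1:])]


def introns_compatible(model_coords, transcript_coords):
    """True if no transcript intron inside the model span contradicts the model.

    Simplified assumption: compatibility is judged on introns only.
    """
    model_introns = set(introns(model_coords))
    model_start = min(min(c) for c in model_coords)
    model_end = max(max(c) for c in model_coords)
    for intron in introns(transcript_coords):
        inside = model_start <= intron[0] and intron[1] <= model_end
        if inside and intron not in model_introns:
            return False
    return True


model = [(100, 200), (300, 400)]
matching = [(50, 200), (300, 450)]     # same splice junction, longer ends
conflicting = [(50, 180), (300, 450)]  # different donor site
```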
- gfftk.consensus.cluster_by_aed(locus, score=0.005)
Cluster gene models based on their Annotation Edit Distance (AED).
This function groups gene models that have very similar exon structures (low AED) into clusters. It is used to identify gene models that are essentially the same prediction but come from different sources, allowing the consensus module to select the best representative from each cluster.
Parameters:
- locus (dict)
Dictionary containing gene models for a single locus. Required keys: ‘genes’
- score (float, optional)
Maximum AED threshold for considering two gene models as part of the same cluster (default: 0.005)
Returns:
- list
List of lists, where each inner list contains gene model IDs that belong to the same cluster
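Threshold clustering of this kind can be sketched with a small union-find over a pairwise AED table shaped like the output of calculate_gene_distance. The linkage rule (single linkage, with `<=` against the threshold) and the name `cluster_by_threshold` are assumptions; the actual rule used by cluster_by_aed is not specified here.

```python
def cluster_by_threshold(distances, score=0.005):
    """Group names whose pairwise distance is at or below `score`.

    `distances` is a nested dict: {name: {other_name: aed, ...}, ...}.
    """
    parent = {name: name for name in distances}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, others in distances.items():
        for b, dist in others.items():
            if b in parent and dist <= score:
                parent[find(a)] = find(b)

    groups = {}
    for name in distances:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


aed_table = {
    "g1": {"g2": 0.0, "g3": 0.4},
    "g2": {"g1": 0.0, "g3": 0.4},
    "g3": {"g1": 0.4, "g2": 0.4},
}
clusters = cluster_by_threshold(aed_table)
```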
- gfftk.consensus.cluster_interlap(obj)
Cluster genomic features using the interlap.reduce function.
This function takes an interlap object containing genomic features and clusters them based on their coordinates. Features that overlap or are adjacent to each other are grouped into the same cluster. The function then assigns the original feature data to each cluster, ensuring each gene model is only assigned to one locus.
Parameters:
- obj (interlap.InterLap)
Interlap object containing genomic features
Returns:
- list
List of dictionaries, where each dictionary represents a cluster of features. Each dictionary contains: locus, a tuple of (start, end) coordinates for the cluster; genes, the list of gene features in the cluster; and proteins, transcripts, and repeats, which are empty lists filled later with protein evidence, transcript evidence, and repeat features
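The grouping step can be pictured without the interlap library as a sweep over sorted intervals. In this sketch, `cluster_features` is a hypothetical stand-in that takes plain (start, end, name) tuples, and "adjacent" is taken to mean that the next feature starts at most one base after the current locus ends; both are assumptions.

```python
def cluster_features(features):
    """Merge overlapping or adjacent (start, end, name) features into loci."""
    loci = []
    for start, end, name in sorted(features):
        if loci and start <= loci[-1]["locus"][1] + 1:
            # Extend the current locus and record the feature in it.
            cur = loci[-1]
            cur["locus"] = (cur["locus"][0], max(cur["locus"][1], end))
            cur["genes"].append(name)
        else:
            loci.append({
                "locus": (start, end),
                "genes": [name],
                "proteins": [],     # filled later
                "transcripts": [],  # filled later
                "repeats": [],      # filled later
            })
    return loci


example_loci = cluster_features([(1, 100, "a"), (50, 150, "b"), (300, 400, "c")])
```

Because each feature is appended to exactly one locus during the sweep, no gene model ends up in two loci.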
- gfftk.consensus.cluster_interlap_original(obj)
Original cluster_interlap function (before fix) for debugging.
- gfftk.consensus.consensus(args)
- gfftk.consensus.contained(a, b)
Check if coordinates in a are completely contained within coordinates in b.
This function determines if the interval defined by coordinates a is completely contained within the interval defined by coordinates b. It handles various edge cases and ensures that the coordinates are properly formatted as tuples or lists of integers.
Parameters:
- a (tuple or list)
Tuple or list of (start, end) coordinates to check if contained
- b (tuple or list)
Tuple or list of (start, end) coordinates that might contain a
Returns:
- bool
True if a is completely contained within b, False otherwise
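A minimal sketch of the containment test, assuming that identical intervals count as contained and that coordinates may arrive in either order; `is_contained` is a hypothetical name and the edge-case handling of the real function is not reproduced.

```python
def is_contained(a, b):
    """True if interval a lies entirely within interval b (inclusive)."""
    a_start, a_end = sorted(int(x) for x in a)
    b_start, b_end = sorted(int(x) for x in b)
    return b_start <= a_start and a_end <= b_end
```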
- gfftk.consensus.de_novo_distance(locus)
Calculate a score for each gene model based on its similarity to other models.
This function evaluates each gene model in a locus by calculating its Annotation Edit Distance (AED) with all other gene models in the locus. It then computes a score that reflects how similar the gene model is to other models, with higher scores indicating greater similarity.
This score is used as a proxy for prediction confidence when protein or transcript evidence is not available. Gene models that have more agreement with other models receive higher scores.
Parameters:
- locus (dict)
Dictionary containing gene models for a single locus. Required keys: ‘genes’
Returns:
- dict
Dictionary mapping gene model names to their de novo distance scores
- gfftk.consensus.ensure_unique_names(genes)
Ensure gene model names are unique by appending a UUID slug.
This function takes a dictionary of gene models and appends a unique identifier to each gene model name to ensure there are no name collisions when combining gene models from multiple sources.
Parameters:
- genes (dict)
Dictionary of gene models where keys are gene IDs
Returns:
- dict
Dictionary of gene models with unique IDs
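Renaming with a UUID-derived suffix might look like the sketch below. The separator, the eight-character slug length, and the name `with_unique_names` are assumptions made for illustration; the exact slug format used by ensure_unique_names is not documented here.

```python
import uuid


def with_unique_names(genes):
    """Return a new dict whose keys carry a short random suffix."""
    renamed = {}
    for name, model in genes.items():
        slug = uuid.uuid4().hex[:8]  # assumed slug length
        renamed[f"{name}.{slug}"] = model
    return renamed


models = {"g1": {"source": "augustus"}, "g2": {"source": "genemark"}}
unique = with_unique_names(models)
```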
- gfftk.consensus.extend_utrs(consensus_models, transcripts, genome_fasta, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)
Extend consensus gene models with UTRs based on transcript evidence.
This function examines transcript alignments that match consensus gene models and extends the models with 5’ and 3’ UTRs if supported by the evidence. Only transcripts with compatible intron/exon boundaries are considered. Properly handles spliced UTRs (UTRs containing introns).
UTR extension is limited to avoid overlapping with neighboring genes on the same strand.
Parameters:
- consensus_models (dict)
Dictionary of consensus gene models
- transcripts (dict)
Dictionary of transcript alignments
- genome_fasta (str)
Path to the genome FASTA file
- min_utr_length (int, optional)
Minimum length for a UTR to be added (default: 10)
- max_utr_length (int, optional)
Maximum length for a UTR extension (default: 2000)
- log (callable, optional)
Function to use for logging (default: sys.stderr.write)
Returns:
- dict
Dictionary of consensus gene models with UTRs added where supported
- gfftk.consensus.fasta_length(fasta)
Calculate the total length of all sequences in a FASTA file.
This function reads a FASTA file and sums the lengths of all sequences in the file. It can be used to determine the total genome size from a genome FASTA file.
Parameters:
- fasta (str)
Path to the FASTA file
Returns:
- int
Total length of all sequences in the FASTA file
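Summing sequence lengths needs nothing beyond skipping header lines. The sketch below reads from an open handle instead of a path so that it can be tried on an in-memory string; `total_fasta_length` is a hypothetical name, not the gfftk function.

```python
import io


def total_fasta_length(handle):
    """Sum the lengths of all sequence lines in a FASTA handle."""
    total = 0
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


genome = io.StringIO(">contig1\nACGT\nAC\n>contig2\nGG\n")
print(total_fasta_length(genome))
```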
- gfftk.consensus.filter4evidence(data, n_genes=3, n_evidence=2)
Filter loci to include only those with sufficient gene models and evidence alignments.
This function filters the loci data structure to include only loci that have at least a specified number of gene models and evidence alignments (proteins + transcripts). It is used to identify high-confidence loci for calculating source weights.
Parameters:
- data (dict)
Hierarchical dictionary of loci organized by contig and strand: {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
- n_genes (int, optional)
Minimum number of gene models required in a locus (default: 3)
- n_evidence (int, optional)
Minimum number of evidence alignments (proteins + transcripts) required in a locus (default: 2)
Returns:
- tuple
A tuple (filt, n_filt): filt is the filtered dictionary of loci with sufficient gene models and evidence; n_filt is the number of loci that passed the filter
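The filter can be sketched directly on the contig/strand structure described above. It assumes each locus is a dict with ‘genes’, ‘proteins’, and ‘transcripts’ lists, as in the cluster_interlap output; `filter_loci` is a hypothetical name.

```python
def filter_loci(data, n_genes=3, n_evidence=2):
    """Keep loci with enough gene models and enough evidence alignments."""
    filt = {}
    n_filt = 0
    for contig, strands in data.items():
        for strand, loci in strands.items():
            for locus in loci:
                evidence = (len(locus.get("proteins", []))
                            + len(locus.get("transcripts", [])))
                if len(locus.get("genes", [])) >= n_genes and evidence >= n_evidence:
                    filt.setdefault(contig, {"+": [], "-": []})[strand].append(locus)
                    n_filt += 1
    return filt, n_filt


loci_data = {
    "chr1": {
        "+": [
            {"genes": ["a", "b", "c"], "proteins": ["p1"], "transcripts": ["t1"]},
            {"genes": ["d"], "proteins": [], "transcripts": []},
        ],
        "-": [],
    }
}
kept, n_kept = filter_loci(loci_data)
```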
- gfftk.consensus.filter_models_repeats(fasta, repeats, gene_models, filter_threshold=90, log=False)
Filter gene models based on their overlap with repeat regions.
This function filters out gene models that have a significant overlap with repeat regions. It builds an interlap object from the repeat file (GFF3 or BED) and calculates the percentage of each gene model that overlaps with repeats. Gene models with overlap percentage greater than the filter threshold are removed.
Parameters:
- fasta (str)
Path to the genome FASTA file
- repeats (str)
Path to the repeats GFF3 or BED file
- gene_models (dict)
Dictionary of gene models to filter
- filter_threshold (int, optional)
Maximum percentage of gene model that can overlap with repeats (default: 90)
- log (callable or bool, optional)
Function to use for logging, or False to disable logging (default: False)
Returns:
- tuple
A tuple (filtered, dropped): filtered is the dictionary of filtered gene models; dropped is the number of gene models that were filtered out
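The threshold rule can be illustrated on bare coordinates: compute the percentage of a model's bases covered by repeats and drop the model when that percentage exceeds the threshold. The sketch uses per-base sets instead of interlap objects and assumes inclusive coordinates on a single contig; both function names are hypothetical.

```python
def repeat_overlap_percent(exons, repeats):
    """Percentage of exon bases covered by any repeat interval."""
    gene = set()
    for start, end in exons:
        gene.update(range(start, end + 1))
    masked = set()
    for start, end in repeats:
        masked.update(range(start, end + 1))
    return 100 * len(gene & masked) / len(gene)


def filter_by_repeats(models, repeats, filter_threshold=90):
    """Drop models whose repeat overlap is greater than the threshold."""
    kept = {
        name: exons
        for name, exons in models.items()
        if repeat_overlap_percent(exons, repeats) <= filter_threshold
    }
    return kept, len(models) - len(kept)


gene_models = {"g1": [(1, 100)], "g2": [(200, 299)]}
repeat_regions = [(1, 95), (250, 260)]
filtered, dropped = filter_by_repeats(gene_models, repeat_regions)
```

Per-base sets are only practical for small examples; an interval structure scales better on whole genomes.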
- gfftk.consensus.generate_consensus(fasta, genes, proteins, transcripts, weights, out, debug=False, minscore=False, repeats=False, repeat_overlap=90, tiebreakers='calculated', min_exon=3, min_intron=11, max_intron=-1, max_exon=-1, evidence_derived_models=[], num_processes=None, utrs=True, min_utr_length=10, max_utr_length=2000, log=sys.stderr.write)
Generate consensus gene models from multiple gene prediction sources and evidence.
This function is the main entry point for the consensus module. It takes gene predictions from multiple sources, along with protein and transcript evidence, and generates consensus gene models by selecting the best model at each locus based on evidence and source weights.
The function performs the following steps:
1. Parse input GFF3 files and cluster gene models into loci
2. Calculate source weights based on evidence if tiebreakers=“calculated”
3. Select the best gene model at each locus based on evidence and source weights
4. Filter out gene models that overlap with repeats (if repeats are provided)
5. Write the consensus gene models to a GFF3 file
Parameters:
- fasta (str)
Path to the genome FASTA file
- genes (list)
List of paths to gene prediction GFF3 files
- proteins (list)
List of paths to protein alignment GFF3 files
- transcripts (list)
List of paths to transcript alignment GFF3 files
- weights (list)
List of source:weight pairs for weighting gene prediction sources
- out (str)
Path to the output GFF3 file
- debug (bool or str, optional)
Whether to print debug information or path to debug GFF file (default: False)
- minscore (bool or int, optional)
Minimum score threshold for gene models, or False to calculate automatically (default: False)
- repeats (bool or str, optional)
Path to repeats GFF3 or BED file, or False to skip repeat filtering (default: False)
- repeat_overlap (int, optional)
Maximum percentage of gene model that can overlap with repeats (default: 90)
- tiebreakers (str, optional)
Method for calculating source weights, either “calculated” or “user” (default: “calculated”)
- min_exon (int, optional)
Minimum exon length in nucleotides (default: 3)
- min_intron (int, optional)
Minimum intron length in nucleotides (default: 11)
- max_intron (int, optional)
Maximum intron length in nucleotides, or -1 for no limit (default: -1)
- max_exon (int, optional)
Maximum exon length in nucleotides, or -1 for no limit (default: -1)
- evidence_derived_models (list, optional)
List of sources that are derived from evidence and should be treated differently (default: [])
- num_processes (int or None, optional)
Number of processes to use for parallel execution, or None for sequential (default: None)
- log (callable, optional)
Function to use for logging (default: sys.stderr.write)
Returns:
- dict
Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)
- gfftk.consensus.getAED(query, reference)
Calculate Annotation Edit Distance (AED) between two transcript coordinates.
AED measures the similarity between two gene models by comparing their exon structures. It is calculated as AED = 1 - (SN + SP) / 2, where SN (sensitivity) is the fraction of reference bases that are correctly predicted and SP (specificity) is the fraction of prediction bases that overlap with the reference.
An AED of 0 means the gene models are identical, while an AED of 1 means they are completely different.
Parameters:
- query (list)
List of (start, end) coordinate tuples for the query gene model’s exons
- reference (list)
List of (start, end) coordinate tuples for the reference gene model’s exons
Returns:
- float
AED score between 0 (identical) and 1 (completely different)
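The formula 1 - (SN + SP) / 2 can be worked through at base level with plain sets. This is a sketch, not the gfftk implementation: `aed` and `exon_bases` are hypothetical names, and inclusive coordinates are assumed.

```python
def exon_bases(coords):
    """Expand (start, end) exon tuples into a set of base positions."""
    bases = set()
    for start, end in coords:
        lo, hi = min(start, end), max(start, end)
        bases.update(range(lo, hi + 1))
    return bases


def aed(query, reference):
    """Annotation Edit Distance between two exon coordinate lists."""
    q, r = exon_bases(query), exon_bases(reference)
    if not q or not r:
        return 1.0
    shared = len(q & r)
    sn = shared / len(r)  # fraction of reference bases recovered
    sp = shared / len(q)  # fraction of query bases that hit the reference
    return 1 - (sn + sp) / 2


print(aed([(1, 100)], [(1, 50)]))
```

In the printed example the query covers all 50 reference bases (SN = 1) but only half of its own bases overlap (SP = 0.5), giving an AED of 0.25.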
- gfftk.consensus.get_loci(annot_dict)
Organize gene models into loci based on genomic coordinates and strand.
This function takes a dictionary of gene models and organizes them into loci based on their genomic coordinates and strand. It creates interlap objects for efficient overlap queries and clusters overlapping gene models into loci. It also filters out pseudogenes and gene models with multiple stop codons.
Parameters:
- annot_dict (dict)
Dictionary of gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, CDS, etc.)
Returns:
- tuple
A tuple (loci, n_loci, pseudo): loci is the hierarchical dictionary of loci organized by contig and strand, {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}; n_loci is the total number of loci; pseudo is the list of pseudogenes that were filtered out
- gfftk.consensus.get_overlap(a, b)
Calculate the overlap between two genomic intervals.
This function calculates the number of base pairs that overlap between two genomic intervals. If the intervals do not overlap, it returns 0.
Parameters:
- a (tuple or list)
Tuple or list of (start, end) coordinates for the first interval
- b (tuple or list)
Tuple or list of (start, end) coordinates for the second interval
Returns:
- int
Number of base pairs that overlap between the two intervals, or 0 if they don’t overlap
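Interval overlap reduces to comparing the larger start with the smaller end. The sketch assumes 1-based inclusive coordinates, as used in GFF3; whether get_overlap counts the end base is not stated here, so the `+ 1` is an assumption, and `interval_overlap` is a hypothetical name.

```python
def interval_overlap(a, b):
    """Number of bases shared by two (start, end) intervals, or 0."""
    start = max(min(a), min(b))
    end = min(max(a), max(b))
    return max(0, end - start + 1)  # inclusive coordinates assumed
```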
- gfftk.consensus.gff2interlap(infile, fasta, inter=False)
Parse a GFF3 file and construct a scaffold/gene interlap dictionary.
This function reads a GFF3 file and creates an interlap object containing the genomic features defined in the file. The interlap object allows for efficient overlap queries.
Parameters:
- infile (str)
Path to the GFF3 file
- fasta (str)
Path to the genome FASTA file
- inter (dict or bool, optional)
Existing interlap object to update, or False to create a new one (default: False)
Returns:
- tuple
A tuple (inter, length): inter is a dictionary mapping contig names to interlap objects containing features; length is the total length of all features in the GFF3 file
- gfftk.consensus.gff_writer(input, output)
Write consensus gene models to a GFF3 file.
This function takes a dictionary of consensus gene models and writes them to a GFF3 file. It sorts the gene models by contig and start location, and assigns sequential locus tags to each gene model. It also handles the conversion of gene model coordinates to GFF3 features (gene, mRNA, exon, CDS).
Parameters:
- input (dict)
Dictionary of consensus gene models, where keys are gene IDs and values are dictionaries containing gene model information (contig, location, strand, source, coords, etc.)
- output (str)
Path to the output GFF3 file
Returns:
- None
The function writes to the specified output file but does not return a value
- gfftk.consensus.gffevidence2dict(file, Evi)
Parse evidence alignments from a GFF3 file into a dictionary.
This function reads a GFF3 file containing evidence alignments (proteins or transcripts) and converts it into a dictionary mapping alignment IDs to their information. It handles multi-exon alignments by combining exons with the same ID into a single entry.
Parameters:
- file (str)
Path to the GFF3 file containing evidence alignments
- Evi (dict)
Existing dictionary to update with new evidence alignments
Returns:
- dict
Dictionary mapping alignment IDs to their information (target, type, source, strand, phase, contig, coords, location, score)
- gfftk.consensus.map_coords(g_coords, e_coords)
Map evidence coordinates onto gene model coordinates.
This function takes evidence coordinates (protein or transcript alignments) and maps them onto gene model coordinates. It calculates the offset between each evidence coordinate and the corresponding gene model coordinate, which is used to determine how well the evidence aligns with the gene model.
Parameters:
- g_coords (list)
List of (start, end) coordinate tuples for the gene model’s exons
- e_coords (list)
List of (start, end) coordinate tuples for the evidence alignments
Returns:
- list
List of lists, where each inner list contains the offset between an evidence coordinate and the corresponding gene model coordinate. The list has the same length as g_coords, with empty lists for gene model coordinates that don’t have a corresponding evidence coordinate.
- gfftk.consensus.order_sources(locus)
Calculate evidence-based scores for gene models in a locus for source ranking.
This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. Unlike score_by_evidence, this function is used specifically for ranking gene prediction sources based on their agreement with evidence.
Parameters:
- locus (dict)
Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’
Returns:
- dict
Dictionary mapping gene model names to their information: source (source of the gene model), coords (coordinates of the gene model), and score (combined evidence score for the gene model)
- gfftk.consensus.parse_data(genome, gene, protein, transcript, log=sys.stderr.write)
Parse input data files and build a locus data structure.
This function reads gene prediction files, protein alignment files, and transcript alignment files, and organizes them into a hierarchical locus data structure. It assigns unique identifiers to gene models to avoid name collisions and tracks the sources of all predictions and evidence.
Parameters:
- genome (str)
Path to the genome FASTA file
- gene (list)
List of paths to gene prediction GFF3 files
- protein (list)
List of paths to protein alignment GFF3 files
- transcript (list)
List of paths to transcript alignment GFF3 files
- log (callable, optional)
Function to use for logging (default: sys.stderr.write)
Returns:
- dict
Hierarchical dictionary of loci organized by contig and strand {contig: {“+”: [locus1, locus2, …], “-”: [locus1, locus2, …]}}
- gfftk.consensus.reasonable_model(coords, min_protein=30, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)
Check if a gene model has reasonable exon and intron lengths.
This function evaluates a gene model to determine if it has reasonable exon and intron lengths based on specified thresholds. It checks minimum exon length, maximum exon length, minimum intron length, maximum intron length, and minimum protein length.
Parameters:
- coords (list)
List of (start, end) coordinate tuples for the gene model’s exons
- min_protein (int, optional)
Minimum protein length in amino acids (default: 30)
- min_exon (int, optional)
Minimum exon length in nucleotides (default: 3)
- min_intron (int, optional)
Minimum intron length in nucleotides (default: 10)
- max_intron (int, optional)
Maximum intron length in nucleotides, or -1 for no limit (default: -1)
- max_exon (int, optional)
Maximum exon length in nucleotides, or -1 for no limit (default: -1)
Returns:
- bool or str
True if the gene model is reasonable, or a string describing the reason it’s not reasonable
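Because the return value is either True or a non-empty string, callers need an identity check (`is True`) instead of a truthiness test. The sketch below shows the pattern with simplified rules: `check_model` is a hypothetical name, the coordinates are assumed to be inclusive coding exons, and protein length is approximated as total exon length divided by three.

```python
def check_model(coords, min_protein=30, min_exon=3, min_intron=10,
                max_intron=-1, max_exon=-1):
    """Return True for a plausible model, else a string naming the problem."""
    exons = sorted((min(s, e), max(s, e)) for s, e in coords)
    exon_lengths = [end - start + 1 for start, end in exons]
    intron_lengths = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    if sum(exon_lengths) // 3 < min_protein:
        return f"protein shorter than {min_protein} aa"
    if min(exon_lengths) < min_exon:
        return f"exon shorter than {min_exon} nt"
    if max_exon > 0 and max(exon_lengths) > max_exon:
        return f"exon longer than {max_exon} nt"
    if intron_lengths:
        if min(intron_lengths) < min_intron:
            return f"intron shorter than {min_intron} nt"
        if max_intron > 0 and max(intron_lengths) > max_intron:
            return f"intron longer than {max_intron} nt"
    return True


result = check_model([(1, 100), (105, 200)])
if result is not True:
    print("rejected:", result)
```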
- gfftk.consensus.refine_cluster(locus, derived=[])
Identify potential sub-loci within a locus based on non-overlapping gene models.
This function analyzes a locus to determine if it contains multiple non-overlapping gene models from the same source, which might indicate that the locus should be split into multiple sub-loci. It focuses on ab initio gene predictors, which typically don’t predict overlapping models, so multiple models from the same source in a locus suggest the presence of multiple genes.
Parameters:
- locus (dict)
Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’
- derived (list, optional)
List of sources that are derived from evidence and should be ignored (default: [])
Returns:
- dict or bool
Dictionary mapping sub-locus indices to lists of gene models if sub-loci are found, or False if no sub-loci are identified
- gfftk.consensus.safe_extract_coordinates(coords)
Safely extract min and max coordinates from a nested coordinate structure.
This function handles various coordinate formats and structures, extracting the minimum and maximum coordinates while gracefully handling errors. It’s designed to work with potentially malformed or inconsistent coordinate data.
Parameters:
- coords (list or tuple)
Nested coordinate structure (list of lists, tuples, etc.)
Returns:
- tuple or None
A tuple containing (min_coord, max_coord) if extraction succeeds, or None if extraction fails
- gfftk.consensus.score_aggregator(locus_name, locus, weights, order, de_novo_aed_scores, evidence_scores, min_exon=3, min_intron=10, max_intron=-1, max_exon=-1)
- gfftk.consensus.score_by_evidence(locus, weights={}, derived=[])
Calculate evidence-based scores for gene models in a locus.
This function evaluates each gene model in a locus by calculating how well it aligns with protein and transcript evidence. It assigns scores based on the alignment quality and the weights assigned to different gene prediction sources.
Parameters:
- locus (dict)
Dictionary containing gene models and evidence for a single locus. Required keys: ‘genes’, ‘proteins’, ‘transcripts’
- weights (dict, optional)
Dictionary mapping gene model sources to weight values (default: {})
- derived (list, optional)
List of sources that are derived from evidence and should not be scored (default: [])
Returns:
- dict
Dictionary mapping gene model names to evidence scores: protein_evidence_score (sum of protein evidence scores) and transcript_evidence_score (sum of transcript evidence scores)
- gfftk.consensus.score_evidence(g_coords, e_coords, weight=2)
Calculate a score for how well evidence aligns with a gene model.
This function evaluates how well evidence coordinates (protein or transcript alignments) match a gene model’s exon structure. It considers both the coverage (percentage of the gene model covered by evidence) and the matching of intron junctions (splice sites).
The scoring system ranges from 0 to 10 (before applying the weight multiplier): 10 is a perfect match (evidence exactly matches the gene model); 5-9 is a partial match (evidence partially covers the gene model or has some matching junctions); 0 is no match (evidence does not overlap with the gene model).
For multi-exon genes, the score is adjusted based on the base score from exon coverage (0-10 for each exon), the percent coverage of the entire gene model, and the ratio of matching intron junctions.
Parameters:
- g_coords (list)
List of (start, end) coordinate tuples for the gene model’s exons
- e_coords (list)
List of (start, end) coordinate tuples for the evidence alignments
- weight (int, optional)
Weight multiplier to apply to the final score (default: 2)
Returns:
- int
Score indicating how well the evidence supports the gene model, ranging from 0 (no support) to higher values (strong support)
- gfftk.consensus.select_best_utrs(utr_exons_list, strand, min_length=10, max_length=2000)
Select the best UTR exons from multiple transcript evidence.
This function implements several strategies for selecting the most representative UTR structure from multiple transcript alignments.
Parameters:
- utr_exons_list (list)
List of lists, where each inner list contains UTR exon tuples (start, end)
- strand (str)
Strand of the gene model (‘+’ or ‘-’)
- min_length (int, optional)
Minimum total length for a UTR to be considered (default: 10)
- max_length (int, optional)
Maximum total length for a UTR to be considered (default: 2000)
Returns:
- tuple
A tuple (best_utrs, method_used): best_utrs is a list of (start, end) tuples representing the best UTR exons; method_used is a string describing the method used to select the UTRs
- gfftk.consensus.solve_sub_loci(result)
- gfftk.consensus.src_scaling_factor(obj)
Calculate a scaling factor based on the diversity of gene prediction sources that agree.
This function analyzes the AED scores between gene models to determine how many different gene prediction sources agree with each other. It returns a scaling factor that reflects the proportion of unique sources that have only one model in agreement with others.
The scaling factor is used to adjust de novo distance scores to favor gene models that have agreement across multiple different sources rather than multiple models from the same source.
Parameters:
- obj (dict)
Dictionary mapping gene model names to their AED scores with other gene models
Returns:
- float
Scaling factor between 0 and 1, where 1 indicates all sources have only one model in agreement with others, and lower values indicate multiple models from the same source agree with others
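As a rough illustration of the proportion described above, the sketch below counts how many sources contribute exactly one agreeing model. Everything about the data layout is an assumption made for this example: the input is taken to be a dictionary of model name to `{other_model: AED}`, a model is treated as "in agreement" when it has any AED below `max_aed`, and the source is derived from the model name by a caller-supplied `source_of` function. None of these names come from gfftk, and returning 0.0 when nothing agrees is likewise a choice made here.

```python
from collections import Counter


def source_scaling(aed_by_model, source_of, max_aed=1.0):
    """Proportion of agreeing sources that contribute exactly one agreeing model."""
    # A model agrees with others if at least one pairwise AED is below the cutoff.
    agreeing = [
        name
        for name, others in aed_by_model.items()
        if any(aed < max_aed for aed in others.values())
    ]
    per_source = Counter(source_of(name) for name in agreeing)
    if not per_source:
        return 0.0
    single = sum(1 for count in per_source.values() if count == 1)
    return single / len(per_source)
```

With two agreeing models from one source and one from another, the result is 0.5; with one agreeing model per source it is 1.0.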
- gfftk.consensus.sub_cluster(obj)
Split a cluster of gene models into sub-clusters based on source.
This function analyzes a cluster of gene models to determine if it contains multiple models from the same source. If it does, it splits the cluster into sub-clusters, where each sub-cluster contains models that are more likely to belong together based on their overlap.
Parameters:
- obj (list)
List of gene model tuples, where each tuple contains: (name, source, coords, codon_start)
Returns:
- list
List of lists, where each inner list contains gene models that belong to the same sub-cluster
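One way to picture the splitting step is the sketch below. It is an assumption-laden illustration, not the gfftk algorithm: each model from the most frequently repeated source seeds its own sub-cluster, and every remaining model joins the seed it shares the most bases with. The names `overlap` and `split_by_source` are hypothetical, and `coords` is assumed to be a list of 1-based inclusive (start, end) exon tuples.

```python
from collections import Counter


def overlap(a, b):
    """Number of shared bases between two exon lists (1-based inclusive)."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start) + 1)
        for a_start, a_end in a
        for b_start, b_end in b
    )


def split_by_source(models):
    """Split (name, source, coords, codon_start) tuples into sub-clusters."""
    if not models:
        return []
    counts = Counter(source for _, source, _, _ in models)
    top_source, n = counts.most_common(1)[0]
    if n == 1:
        return [list(models)]  # no source is repeated: keep one cluster

    # Each model from the repeated source seeds its own sub-cluster.
    seeds = [m for m in models if m[1] == top_source]
    clusters = [[seed] for seed in seeds]
    for model in models:
        if model[1] == top_source:
            continue
        # Assign the model to the seed it overlaps most.
        best = max(range(len(seeds)), key=lambda i: overlap(model[2], seeds[i][2]))
        clusters[best].append(model)
    return clusters
```

The idea behind seeding from the repeated source is that two models from the same predictor at one locus usually represent two distinct genes, so each should anchor a separate sub-cluster.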
gfftk.convert
- gfftk.convert.convert(args)
- gfftk.convert.gff2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GFF3 format to CDS transcript [no UTRs] FASTA format.
Will parse GFF3 format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – CDS transcripts in FASTA format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
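The grep and grepv parameters take `key:value` strings. The exact matching rules are not documented here, so the sketch below only illustrates the keep/remove idea on a simplified, hypothetical annotation dictionary with flat string fields. It assumes substring matching, that a model is kept when any grep rule matches, and that a model is dropped when any grepv rule matches; `filter_models` is not a gfftk function.

```python
def filter_models(annotation, grep=(), grepv=()):
    """Keep models matching any grep rule and drop models matching any grepv rule."""

    def matches(record, rule):
        # Split "key:value" on the first colon only, so values may contain colons.
        key, _, value = rule.partition(":")
        return value in str(record.get(key, ""))

    kept = {}
    for gene_id, record in annotation.items():
        if grep and not any(matches(record, rule) for rule in grep):
            continue
        if any(matches(record, rule) for rule in grepv):
            continue
        kept[gene_id] = record
    return kept
```

For example, `filter_models(annotation, grep=["contig:chr1"])` would keep only models whose `contig` field contains `chr1` in this simplified layout.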
- gfftk.convert.gff2combined(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GFF3 and FASTA to combined GFF3+FASTA format.
Will parse GFF3 format into GFFtk annotation dictionary and then write to combined GFF3+FASTA format with both annotations and sequences.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – combined GFF3+FASTA format file
grep (list, default=[]) – filter results to only include gene models with locus_tag matching grep
grepv (list, default=[]) – filter results to exclude gene models with locus_tag matching grepv
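In the GFF3 specification, a combined file lists the annotation lines first, then a `##FASTA` directive, then the sequences in FASTA format. The sketch below writes that layout from already-formatted GFF3 lines and a dictionary of sequences. The function name `write_combined` and the 60-column line wrapping are choices made for this example and may differ from what gfftk emits.

```python
import io


def write_combined(gff_lines, sequences, handle, width=60):
    """Write GFF3 feature lines followed by a ##FASTA section to a file handle."""
    handle.write("##gff-version 3\n")
    for line in gff_lines:
        handle.write(line.rstrip("\n") + "\n")
    # The ##FASTA directive marks the end of annotations and the start of sequences.
    handle.write("##FASTA\n")
    for name, seq in sequences.items():
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Usage: write to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_combined(
    ["chr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1"],
    {"chr1": "ACGTACGTA"},
    buffer,
    width=4,
)
```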
- gfftk.convert.gff2gbff(gff, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])
Convert GFF3 format to GenBank format.
Will parse GFF3 format into GFFtk annotation dictionary and then write to GenBank output.
- gfftk.convert.gff2gff3(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[], url_encode=False)
Convert GFF3 format to GFF3 format with filtering.
Will parse GFF3 format into GFFtk annotation dictionary, apply filtering, and then write to GFF3 output. This is useful for filtering GFF3 files. Default is to write to stdout.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in GFF3 format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
url_encode (bool, default=False) – URL encode attribute values for downstream tool compatibility
- gfftk.convert.gff2gtf(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GFF3 format to GTF format.
Will parse GFF3 format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – annotation file in GTF format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
- gfftk.convert.gff2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])
Convert GFF3 format to translated protein FASTA format.
Will parse GFF3 format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format
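To show what the table and strip_stop parameters control, here is a minimal standalone translation sketch using the standard genetic code (NCBI table 1). It is not the gfftk code path: `translate` and `CODON_TABLE_1` are names invented for this example, unknown codons are rendered as `X`, and strip_stop is assumed to remove only a trailing `*`.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for NCBI translation table 1, in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE_1 = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, strip_stop=False):
    """Translate a coding sequence; optionally drop the trailing stop symbol."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete final codon
    protein = "".join(CODON_TABLE_1.get(cds[i:i + 3], "X") for i in range(0, usable, 3))
    return protein.rstrip("*") if strip_stop else protein
```

Here `translate("ATGAAATGGTAA")` gives `MKW*`, and passing `strip_stop=True` gives `MKW`.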
- gfftk.convert.gff2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GFF3 format to NCBI TBL format.
Will parse GFF3 annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.
- gfftk.convert.gff2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GFF3 format to transcript FASTA format.
Will parse GFF3 format into GFFtk annotation dictionary and then write transcripts in FASTA format.
- Parameters:
gff (filename) – genome annotation text file in GFF3 format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – transcripts in FASTA format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
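The difference between transcript and CDS-transcript output comes down to which intervals are spliced together: exon intervals include UTRs, while CDS intervals do not. The sketch below shows the splicing step under stated assumptions: coordinates are 1-based inclusive as in GFF3, and minus-strand features are reverse complemented after joining. The names `reverse_complement` and `spliced_sequence` are hypothetical and not part of gfftk.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(contig_seq, intervals, strand):
    """Join 1-based inclusive intervals in genomic order, then orient by strand."""
    joined = "".join(contig_seq[start - 1:end] for start, end in sorted(intervals))
    return reverse_complement(joined) if strand == "-" else joined
```

Passing exon intervals yields the full transcript, and passing only the CDS intervals of the same model yields the CDS transcript without UTRs.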
- gfftk.convert.gtf2cdstranscripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GTF format to CDS transcript [no UTRs] FASTA format.
Will parse GTF format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.
- gfftk.convert.gtf2gbff(gtf, fasta, output=False, table=1, organism=False, strain=False, debug=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])
Convert GTF format to GenBank format.
Will parse GTF format into GFFtk annotation dictionary and then write to GenBank output.
- gfftk.convert.gtf2gff(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GTF format to GFF3 format.
Will parse GTF format into GFFtk annotation dictionary and then write to GFF3 output. Only coding genes are output with this method. Default is to write to stdout.
- gfftk.convert.gtf2proteins(gff, fasta, output=False, table=1, strip_stop=False, debug=False, grep=[], grepv=[])
Convert GTF format to translated protein FASTA format.
Will parse GTF format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.
- Parameters:
gff (filename) – genome annotation text file in GTF format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
debug (bool, default=False) – print debug information to stderr
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format
- gfftk.convert.gtf2tbl(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GTF format to NCBI TBL format.
Will parse GTF annotation format into GFFtk annotation dictionary and then write to NCBI TBL output. Default is to write to stdout.
- gfftk.convert.gtf2transcripts(gff, fasta, output=False, table=1, debug=False, grep=[], grepv=[])
Convert GTF format to transcript FASTA format.
Will parse GTF format into GFFtk annotation dictionary and then write transcripts in FASTA format.
- gfftk.convert.tbl2cdstranscripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])
Convert NCBI TBL format to CDS transcript [no UTRs] FASTA format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write CDS transcripts in FASTA format.
- gfftk.convert.tbl2gbff(tbl, fasta, output=False, table=1, organism=False, strain=False, tmpdir='/tmp', cleanup=True, grep=[], grepv=[])
Convert NCBI TBL format to GenBank format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GenBank output.
- gfftk.convert.tbl2gff3(tbl, fasta, output=False, table=1, grep=[], grepv=[])
Convert NCBI TBL format to GFF3 format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GFF3 output. Default is to write to stdout.
- Parameters:
tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GFF3 format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
- gfftk.convert.tbl2gtf(tbl, fasta, output=False, table=1, grep=[], grepv=[])
Convert NCBI TBL format to GTF format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write to GTF output. Only coding genes are output with this method. Default is to write to stdout.
- Parameters:
tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
output (str, default=sys.stdout) – annotation file in GTF format
grep (list, default=[]) – Filter gene models, keep matches. [key:value]
grepv (list, default=[]) – Filter gene models, remove matches [key:value]
- gfftk.convert.tbl2proteins(tbl, fasta, output=False, table=1, strip_stop=False, grep=[], grepv=[])
Convert NCBI TBL format to translated protein FASTA format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write protein coding translations to FASTA format.
- Parameters:
tbl (filename) – genome annotation text file in NCBI tbl format
fasta (filename) – genome sequence in FASTA format
table (int, default=1) – codon table [1]
strip_stop (bool, default=False) – remove stop codons (*) from translation
output (str, default=sys.stdout) – translated amino acids (proteins) in FASTA format
- gfftk.convert.tbl2transcripts(tbl, fasta, output=False, table=1, grep=[], grepv=[])
Convert NCBI TBL format to transcript FASTA format.
Will parse NCBI TBL format into GFFtk annotation dictionary and then write transcripts in FASTA format.
gfftk.genbank
- gfftk.genbank.dict2gbff(annots, seqs, outfile, organism=None, circular=False, lowercase=False)
- gfftk.genbank.dict2tbl(genesDict, scaff2genes, scaffLen, SeqCenter, SeqRefNum, skipList, output=False, annotations=False, external=False)
Take funannotate annotation dictionaries and convert them to NCBI tbl output.
- gfftk.genbank.drop_alt_coords(info, idxs)
- gfftk.genbank.duplicate_coords(cds)
- gfftk.genbank.fetch_coords(v, i=0, feature='gene')
- gfftk.genbank.findUTRs(cds, mrna, strand)
- gfftk.genbank.reformatGO(term, goDict={})
- gfftk.genbank.sbt_writer(out)
- gfftk.genbank.table2asn(tbl, genome, output=False, sbt=False, organism=False, strain=False, tmpdir='/tmp', table=1, cleanup=True)
- gfftk.genbank.tbl2dict(inputfile, fasta, annotation=False, table=1, debug=False)
Convert directly from NCBI tbl format to several output formats, avoiding conversion problems with GBK files that have multiple transcripts. Loading the funannotate dictionary directly from tbl format allows the other formats to be written directly.