Filtering Annotations
GFFtk's convert command can filter gene annotations with the --grep and --grepv options. Both options match a pattern, either a plain string or a regular expression, against any annotation attribute.
Basic Usage
Filter Syntax
The basic syntax for filtering is:
- --grep key:pattern - Keep annotations matching the pattern
- --grepv key:pattern - Remove annotations matching the pattern
Pattern Format
Patterns can be specified in several formats:
- key:pattern - Basic string matching
- key:pattern:flags - Pattern with regex flags
- key:regex_pattern - Regular expression patterns
Supported flags:
- i - Case-insensitive matching
- m - Multiline matching
- s - Dot matches all characters including newlines
Common Examples
Filter by Gene Product
# Keep only kinase genes
gfftk convert -i input.gff3 -f genome.fasta -o kinases.gff3 --grep product:kinase
# Keep transporter genes
gfftk convert -i input.gff3 -f genome.fasta -o transporters.gff3 --grep product:transporter
# Remove hypothetical proteins
gfftk convert -i input.gff3 -f genome.fasta -o known_proteins.gff3 \
--grepv product:"hypothetical.*protein"
Filter by Annotation Source
# Keep only augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o augustus_only.gff3 --grep source:augustus
# Remove augustus predictions
gfftk convert -i input.gff3 -f genome.fasta -o no_augustus.gff3 --grepv source:augustus
# Keep predictions from a single trusted source (here, genemark)
gfftk convert -i input.gff3 -f genome.fasta -o high_conf.gff3 --grep source:genemark
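Before filtering on source, it can help to see which source values a file actually contains. The following is a minimal sketch using standard shell tools, assuming the source key corresponds to column 2 of the GFF3 file (that mapping is an assumption, not something this page confirms). The sample records are made up so the example runs on its own.

```shell
# Build a tiny, made-up GFF3 file so the example is self-contained.
demo=$(mktemp)
printf '##gff-version 3\n' > "$demo"
printf 'chr1\taugustus\tgene\t100\t900\t.\t+\t.\tID=g1\n' >> "$demo"
printf 'chr1\tgenemark\tgene\t1500\t2400\t.\t-\t.\tID=g2\n' >> "$demo"
printf 'chr2\taugustus\tgene\t50\t700\t.\t+\t.\tID=g3\n' >> "$demo"

# Skip comment lines, take column 2 (source), and count each distinct value.
source_counts=$(grep -v '^#' "$demo" | cut -f2 | sort | uniq -c)
printf '%s\n' "$source_counts"

rm -f "$demo"
```

Replacing the temporary file with your own input.gff3 lists the exact spellings to use after source: in a filter.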
Filter by Location
# Genes on specific chromosome
gfftk convert -i input.gff3 -f genome.fasta -o chr1_genes.gff3 --grep contig:chr1
# Genes on multiple chromosomes (regex)
gfftk convert -i input.gff3 -f genome.fasta -o main_chrs.gff3 --grep contig:"^chr[1-5]$"
# Plus strand genes only
gfftk convert -i input.gff3 -f genome.fasta -o plus_strand.gff3 --grep strand:\\+
# Minus strand genes only
gfftk convert -i input.gff3 -f genome.fasta -o minus_strand.gff3 --grep strand:-
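The difference between contig:chr1 and an anchored pattern such as contig:"^chr1$" can be previewed without running GFFtk. This sketch uses grep -E as a stand-in and assumes GFFtk patterns match anywhere in the value unless anchored, which the anchors in the example above suggest but this page does not state outright. The contig names are hypothetical.

```shell
# Hypothetical contig names, one per line.
contigs=$(printf '%s\n' chr1 chr2 chr10 chr11 scaffold_chr1)

# Unanchored: any name containing "chr1" matches.
loose=$(printf '%s\n' "$contigs" | grep -E 'chr1')

# Anchored: only the exact name "chr1" matches.
exact=$(printf '%s\n' "$contigs" | grep -E '^chr1$')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$loose" "$exact"
```

If the unanchored list contains names you did not intend, anchoring the pattern is the likely fix.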
Advanced Filtering
Case-Insensitive Matching
# Case-insensitive search for kinases
gfftk convert -i input.gff3 -f genome.fasta -o kinases.gff3 --grep product:KINASE:i
# Case-insensitive source filtering
gfftk convert -i input.gff3 -f genome.fasta -o augustus.gff3 --grep source:AUGUSTUS:i
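To get a feel for what the i flag changes, the same comparison can be made with grep, whose -i option plays an analogous role. This is only an approximation: grep's extended regex dialect may differ in details from the engine GFFtk uses, and the product strings below are invented.

```shell
# Made-up product descriptions with mixed capitalisation.
products=$(printf '%s\n' 'Protein kinase C' 'serine/threonine KINASE' 'hypothetical protein')

# Case-sensitive: "kinase" matches only the lowercase spelling.
sensitive=$(printf '%s\n' "$products" | grep -E -c 'kinase')

# Case-insensitive (analogous to the :i flag): both kinase lines match.
insensitive=$(printf '%s\n' "$products" | grep -E -i -c 'kinase')

printf 'case-sensitive: %s\ncase-insensitive: %s\n' "$sensitive" "$insensitive"
```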
Regular Expression Patterns
# Genes starting with specific pattern
gfftk convert -i input.gff3 -f genome.fasta -o pattern_genes.gff3 --grep name:"^gene[0-9]+"
# Genes whose product mentions any of several enzyme classes
gfftk convert -i input.gff3 -f genome.fasta -o domains.gff3 \
--grep product:"(kinase|phosphatase|transferase)"
# Exclude ribosomal proteins
gfftk convert -i input.gff3 -f genome.fasta -o no_ribosomal.gff3 \
--grepv product:"ribosomal.*protein"
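A regular expression can be dry-run against the product values in a file before it is handed to --grep or --grepv. The sketch below assumes products are stored as a product= tag in column 9 of the GFF3 file, which may not hold for every input, and it uses grep -E as an approximate stand-in for GFFtk's matching. Real GFF3 files may also percent-encode some characters in attribute values. The records are made up.

```shell
# Tiny, made-up GFF3 records with product attributes.
demo=$(mktemp)
printf 'chr1\taugustus\tmRNA\t100\t900\t.\t+\t.\tID=t1;product=protein kinase\n' > "$demo"
printf 'chr1\taugustus\tmRNA\t1500\t2400\t.\t-\t.\tID=t2;product=60S ribosomal protein L3\n' >> "$demo"
printf 'chr2\tgenemark\tmRNA\t50\t700\t.\t+\t.\tID=t3;product=hypothetical protein\n' >> "$demo"

# Pull the product=... value out of column 9 for every feature line.
products=$(grep -v '^#' "$demo" | cut -f9 | tr ';' '\n' | sed -n 's/^product=//p')

# Dry-run a candidate pattern against those values.
pattern='ribosomal.*protein'
matched=$(printf '%s\n' "$products" | grep -E "$pattern")
printf 'would match:\n%s\n' "$matched"

rm -f "$demo"
```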
Multiple Filters
You can combine multiple filters for complex selection:
# Multiple grep patterns (OR logic)
gfftk convert -i input.gff3 -f genome.fasta -o enzymes.gff3 \
--grep product:kinase --grep product:phosphatase
# Combined grep and grepv (AND logic)
gfftk convert -i input.gff3 -f genome.fasta -o filtered.gff3 \
--grep product:kinase --grepv source:augustus
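# Several filters at once: by the OR rule above, keeps genes on chr1 or with a kinase product, then drops anything noted as a pseudogene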
gfftk convert -i input.gff3 -f genome.fasta -o complex_filter.gff3 \
--grep contig:chr1 --grep product:kinase --grepv note:pseudogene
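The OR and AND behaviour described in the comments above has a close analogue in plain grep, which can make the logic easier to reason about. This is an illustration of the logic only, not of GFFtk's implementation: grep here matches whole rows rather than a named key, and the rows are invented.

```shell
# Made-up rows: product <TAB> source.
rows=$(printf '%s\t%s\n' \
  'protein kinase' augustus \
  'protein kinase' genemark \
  'acid phosphatase' genemark \
  'hypothetical protein' genemark)

# Two -e patterns behave like two --grep options: a row is kept if EITHER matches.
kept=$(printf '%s\n' "$rows" | grep -E -e 'kinase' -e 'phosphatase')

# Piping into grep -v behaves like adding --grepv: kept rows must ALSO not match.
final=$(printf '%s\n' "$kept" | grep -v -E 'augustus')

printf '%s\n' "$final"
```

Three rows survive the first step and two survive the second, mirroring "kinase or phosphatase, and not from augustus".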
Filter Keys
You can filter on any annotation attribute. Common keys include:
Core Attributes
- product - Gene product/function description
- source - Annotation source (augustus, genemark, etc.)
- name - Gene name or identifier
- contig - Chromosome or contig name
- strand - DNA strand (+ or -)
- type - Feature type (mRNA, CDS, etc.)
Annotation Details
- note - Gene notes and comments
- db_xref - Database cross-references
- go_terms - Gene Ontology terms
- EC_number - Enzyme Commission numbers
- gene_synonym - Alternative gene names
Output Formats
Filtering works with all output formats:
# Filter and convert to proteins
gfftk convert -i input.gff3 -f genome.fasta -o kinases.faa \
--output-format proteins --grep product:kinase
# Filter and convert to GTF
gfftk convert -i input.gff3 -f genome.fasta -o filtered.gtf \
--output-format gtf --grepv source:augustus
# Filter and convert to TBL
gfftk convert -i input.gff3 -f genome.fasta -o filtered.tbl \
--output-format tbl --grep contig:chr1
# Filter and extract transcripts
gfftk convert -i input.gff3 -f genome.fasta -o transcripts.fasta \
--output-format transcripts --grep product:kinase
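After filtering into a FASTA output, a quick record count is a cheap sanity check that the filter kept roughly what you expected. The sketch below counts header lines in a stand-in file; the sequences and header text are made up and are not meant to reflect the exact headers GFFtk writes.

```shell
# Stand-in for a filtered protein FASTA such as kinases.faa.
faa=$(mktemp)
printf '>t1 protein kinase\nMSTLKQ\n>t4 tyrosine kinase\nMAVRPE\n' > "$faa"

# Each record begins with ">", so counting header lines counts sequences.
n_records=$(grep -c '^>' "$faa")
printf 'records written: %s\n' "$n_records"

rm -f "$faa"
```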
Tips and Best Practices
- Test filters first: Use --grep to see what matches before using --grepv
- Quote complex patterns: Use quotes around patterns with spaces or special characters
- Use anchors: Use ^ and $ for exact matches (e.g., ^chr1$ vs chr1)
- Combine logically: Multiple --grep = OR logic, --grep + --grepv = AND logic
- Validate regex: Test complex regex patterns with online tools before using
- Case sensitivity: Remember to add :i flag for case-insensitive matching
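Quoting matters because the shell rewrites an argument before any program sees it. The small helper below (show_args is a hypothetical name used only for this illustration) prints each argument exactly as a command would receive it, which makes it easy to check a pattern before passing it to --grep.

```shell
# Helper: print each argument on its own line, exactly as a program would receive it.
show_args() { printf '[%s]\n' "$@"; }

# Unquoted \\+ : the shell collapses the double backslash, so the program sees \+
unquoted=$(show_args strand:\\+)

# Single quotes pass the text through untouched.
quoted=$(show_args 'strand:\+')

# A pattern containing a space must be quoted to stay one argument.
spaced=$(show_args 'product:hypothetical protein')

printf '%s\n' "$unquoted" "$quoted" "$spaced"
```

The first two forms reach the program as the same string, so the single-quoted spelling is often the easier one to read.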
Error Handling
Common issues and solutions:
- Invalid regex: Check your regex syntax if you get pattern errors
- No matches: Verify the key name and pattern are correct
- Case sensitivity: Add :i flag if case doesn’t match
- Special characters: Escape special regex characters with backslash
- Empty results: Check that your filter criteria aren’t too restrictive
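When a filter returns nothing, two quick checks with standard tools can narrow down the cause: which attribute tags the file actually carries, and how many feature lines a file contains. This sketch assumes filter keys such as product and note correspond to tags in column 9 of the GFF3 file, which is an assumption rather than documented behaviour. The records are made up.

```shell
# Tiny, made-up GFF3 file.
demo=$(mktemp)
printf '##gff-version 3\n' > "$demo"
printf 'chr1\taugustus\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1;product=protein kinase\n' >> "$demo"
printf 'chr1\taugustus\tmRNA\t1500\t2400\t.\t-\t.\tID=t2;Parent=g2;product=hypothetical protein;note=pseudogene\n' >> "$demo"

# Tally the attribute tags present in column 9.
tags=$(grep -v '^#' "$demo" | cut -f9 | tr ';' '\n' | cut -d= -f1 | sort | uniq -c)
printf '%s\n' "$tags"

# Count feature (non-comment) lines; 0 would indicate an empty result.
n_features=$(grep -v -c '^#' "$demo")
printf 'feature lines: %s\n' "$n_features"

rm -f "$demo"
```

Running the tag tally on the input and the line count on the filtered output shows whether the key exists at all and whether anything survived the filter.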
For more examples, see the Tutorial documentation.