Tutorial
========

This tutorial will guide you through common tasks using gfftk.

Installation
-----------

You can install gfftk using pip:

.. code-block:: bash

    pip install gfftk

For more installation options, see the :doc:`installation guide <install>`.

Basic GFF3 Operations
--------------------

Parsing a GFF3 File
~~~~~~~~~~~~~~~~~~

Let's start by parsing a GFF3 file:

.. code-block:: python

    import gfftk

    # Parse a GFF3 file
    gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")

    # Print the number of genes
    print(f"Number of genes: {len(gff_dict)}")

Modifying Gene Annotations
~~~~~~~~~~~~~~~~~~~~~~~~~

You can modify gene annotations in the parsed GFF3 data:

.. code-block:: python

    import gfftk

    # Parse a GFF3 file
    gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")

    # Modify gene annotations
    for gene_id, gene in gff_dict.items():
        # Add a note to each gene
        if "note" not in gene:
            gene["note"] = []
        gene["note"].append("Modified by gfftk")

        # Update the source
        gene["source"] = "gfftk"

        # Update mRNA sources
        for mrna in gene.get("mRNA", []):
            mrna["source"] = "gfftk"

    # Write the modified data back to a GFF3 file
    gfftk.gff.dict2gff3(gff_dict, output="modified.gff3")

Filtering Genes
~~~~~~~~~~~~~

You can filter genes based on various criteria using both the Python API and command-line interface.

**Manual Filtering with Python API:**

.. code-block:: python

    import gfftk

    # Parse a GFF3 file
    gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")

    # Filter genes by length
    filtered_genes = {}
    for gene_id, gene in gff_dict.items():
        gene_length = gene["location"][1] - gene["location"][0] + 1
        if gene_length >= 1000:  # Only keep genes >= 1000 bp
            filtered_genes[gene_id] = gene

    # Write the filtered data back to a GFF3 file
    gfftk.gff.dict2gff3(filtered_genes, output="filtered.gff3")

**Built-in Filtering with Convert Command:**

The convert command provides built-in filtering options using ``--grep`` and ``--grepv`` flags:

.. code-block:: bash

    # Keep only kinase genes
    gfftk convert -i input.gff3 -f genome.fasta -o kinases.gff3 --grep product:kinase

    # Remove augustus predictions
    gfftk convert -i input.gff3 -f genome.fasta -o filtered.gff3 --grepv source:augustus

    # Case-insensitive filtering
    gfftk convert -i input.gff3 -f genome.fasta -o results.gff3 --grep product:KINASE:i

    # Combined filtering: keep kinases but remove augustus predictions
    gfftk convert -i input.gff3 -f genome.fasta -o filtered.gff3 \
        --grep product:kinase --grepv source:augustus

**Filter Pattern Syntax:**

- Basic pattern: ``key:pattern`` (e.g., ``product:kinase``)
- Case-insensitive: ``key:pattern:i`` (e.g., ``product:KINASE:i``)
- Regex patterns: ``key:regex_pattern`` (e.g., ``contig:^chr[0-9]+$``)
- Multiple patterns: Use multiple ``--grep`` or ``--grepv`` flags

**Common Filter Examples:**

.. code-block:: bash

    # Filter by gene product
    gfftk convert -i input.gff3 -f genome.fasta -o transporters.gff3 --grep product:transporter

    # Filter by annotation source
    gfftk convert -i input.gff3 -f genome.fasta -o genemark_only.gff3 --grep source:genemark

    # Filter by chromosome/contig
    gfftk convert -i input.gff3 -f genome.fasta -o chr1_genes.gff3 --grep contig:chr1

    # Filter by strand
    gfftk convert -i input.gff3 -f genome.fasta -o plus_strand.gff3 --grep strand:\\+

    # Remove hypothetical proteins
    gfftk convert -i input.gff3 -f genome.fasta -o known_proteins.gff3 \
        --grepv product:"hypothetical.*protein"

**Available Filter Keys:**

You can filter on any annotation attribute including:

- ``product`` - Gene product/function
- ``source`` - Annotation source (augustus, genemark, etc.)
- ``name`` - Gene name
- ``note`` - Gene notes/comments
- ``contig`` - Chromosome/contig name
- ``strand`` - DNA strand (+ or -)
- ``type`` - Feature type
- ``db_xref`` - Database cross-references
- ``go_terms`` - Gene Ontology terms

Format Conversion
---------------

Converting GFF3 to GTF
~~~~~~~~~~~~~~~~~~~~~

You can convert a GFF3 file to GTF format using the command line:

.. code-block:: bash

    gfftk convert -i input.gff3 -f genome.fasta -o output.gtf

Or using the Python API:

.. code-block:: python

    import gfftk

    # Convert GFF3 to GTF
    gfftk.convert.gff2gtf("input.gff3", "genome.fasta", "output.gtf")

**Converting with Filtering:**

You can combine format conversion with filtering:

.. code-block:: bash

    # Convert only kinase genes to GTF
    gfftk convert -i input.gff3 -f genome.fasta -o kinases.gtf --grep product:kinase

    # Convert to GTF excluding augustus predictions
    gfftk convert -i input.gff3 -f genome.fasta -o filtered.gtf --grepv source:augustus

Converting GFF3 to BED
~~~~~~~~~~~~~~~~~~~~~

You can convert a GFF3 file to BED format using the command line:

.. code-block:: bash

    gfftk convert -i input.gff3 -f bed -o output.bed

Or using the Python API:

.. code-block:: python

    import gfftk

    # Convert GFF3 to BED
    gfftk.convert.gff2bed("input.gff3", "output.bed")

Converting GFF3 to TBL
~~~~~~~~~~~~~~~~~~~~~

You can convert a GFF3 file to TBL format (for GenBank submission) using the command line:

.. code-block:: bash

    gfftk convert -i input.gff3 -f tbl -g genome.fasta -o output.tbl

Or using the Python API:

.. code-block:: python

    import gfftk

    # Convert GFF3 to TBL
    gfftk.convert.gff2tbl("input.gff3", "genome.fasta", "output.tbl")

Extracting Protein Sequences
~~~~~~~~~~~~~~~~~~~~~~~~~

You can extract protein sequences from a GFF3 file using the command line:

.. code-block:: bash

    gfftk convert -i input.gff3 -f genome.fasta -o proteins.fasta --output-format proteins

Or using the Python API:

.. code-block:: python

    import gfftk

    # Extract protein sequences
    gfftk.convert.gff2proteins("input.gff3", "genome.fasta", "proteins.fasta")

**Extracting Filtered Protein Sequences:**

You can extract proteins for specific gene sets:

.. code-block:: bash

    # Extract only kinase proteins
    gfftk convert -i input.gff3 -f genome.fasta -o kinases.faa \
        --output-format proteins --grep product:kinase

    # Extract proteins excluding hypothetical proteins
    gfftk convert -i input.gff3 -f genome.fasta -o known_proteins.faa \
        --output-format proteins --grepv product:"hypothetical.*protein"

Extracting Transcript Sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can extract transcript sequences from a GFF3 file using the command line:

.. code-block:: bash

    gfftk convert -i input.gff3 -f genome.fasta -o transcripts.fasta --output-format transcripts

Or using the Python API:

.. code-block:: python

    import gfftk

    # Extract transcript sequences
    gfftk.convert.gff2transcripts("input.gff3", "genome.fasta", "transcripts.fasta")

**Extracting Filtered Transcript Sequences:**

You can extract transcripts for specific gene sets:

.. code-block:: bash

    # Extract transcripts from specific chromosome
    gfftk convert -i input.gff3 -f genome.fasta -o chr1_transcripts.fasta \
        --output-format transcripts --grep contig:chr1

    # Extract transcripts from high-confidence predictions
    gfftk convert -i input.gff3 -f genome.fasta -o confident_transcripts.fasta \
        --output-format transcripts --grepv source:augustus

Consensus Gene Models
-------------------

Generating Consensus Gene Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can generate consensus gene models from multiple sources using the command line:

.. code-block:: bash

    gfftk consensus -i input1.gff3 input2.gff3 -f genome.fasta -o consensus.gff3

Or using the Python API:

.. code-block:: python

    import gfftk

    # Generate consensus gene models
    consensus = gfftk.consensus.generate_consensus(
        ["input1.gff3", "input2.gff3"],
        "genome.fasta",
        weights={"input1": 1, "input2": 2},
        threshold=3,
    )

    # Write the consensus gene models to a GFF3 file
    gfftk.gff.dict2gff3(consensus, output="consensus.gff3")

Using Weights for Consensus Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can assign different weights to different input sources:

.. code-block:: bash

    gfftk consensus -i input1.gff3 input2.gff3 input3.gff3 -f genome.fasta -o consensus.gff3 -w weights.json

Where weights.json is a JSON file with the following structure:

.. code-block:: json

    {
        "input1": 1,
        "input2": 2,
        "input3": 3
    }

Advanced Topics
-------------

Working with GenBank Files
~~~~~~~~~~~~~~~~~~~~~~~~

You can convert between GFF3 and GenBank formats:

.. code-block:: python

    import gfftk

    # Convert GFF3 to TBL (for GenBank submission)
    gfftk.genbank.gff2tbl("input.gff3", "genome.fasta", "output.tbl")

    # Convert GFF3 to GenBank
    gfftk.genbank.gff2gbk("input.gff3", "genome.fasta", "output.gbk")

    # Convert GenBank to GFF3
    gfftk.genbank.gbk2gff("input.gbk", "output.gff3")

Comparing GFF3 Files
------------------

You can compare two GFF3 files to identify differences using the command line:

.. code-block:: bash

    gfftk compare -i input1.gff3 -c input2.gff3 -f genome.fasta -o comparison.txt

Or using the Python API:

.. code-block:: python

    import gfftk

    # Parse the GFF3 files
    gff_dict1 = gfftk.gff.gff2dict("input1.gff3", "genome.fasta")
    gff_dict2 = gfftk.gff.gff2dict("input2.gff3", "genome.fasta")

    # Compare the GFF3 files
    comparison = gfftk.compare.compareAnnotations(gff_dict1, gff_dict2, "genome.fasta")

    # Print the comparison results
    print(f"Shared genes: {len(comparison['shared'])}")
    print(f"Unique to input1: {len(comparison['unique1'])}")
    print(f"Unique to input2: {len(comparison['unique2'])}")

Working with FASTA Files
~~~~~~~~~~~~~~~~~~~~~

gfftk provides functions for working with FASTA files:

.. code-block:: python

    import gfftk

    # Parse a FASTA file
    fasta_dict = gfftk.fasta.fasta2dict("genome.fasta")

    # Get the length of each sequence
    for seq_id, seq in fasta_dict.items():
        print(f"{seq_id}: {len(seq)} bp")

    # Reverse complement a sequence
    rev_comp = gfftk.fasta.RevComp(fasta_dict["seq1"])

    # Translate a sequence
    protein = gfftk.fasta.translate(fasta_dict["seq1"], "+", 0)

    # Extract a region from a sequence
    region = gfftk.fasta.getSeqRegions(fasta_dict, [["seq1", 1, 100]])[0]

    # Write a FASTA file
    gfftk.fasta.dict2fasta(fasta_dict, "output.fasta")

Working with Combined GFF3+FASTA Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

gfftk supports combined GFF3+FASTA files, which contain both annotation data and sequence data in a single file. This format is commonly used by some working groups and databases.

**Reading Combined Files**

You can read combined GFF3+FASTA files by passing ``None`` for the FASTA parameter:

.. code-block:: python

    import gfftk

    # Parse a combined GFF3+FASTA file
    gff_dict = gfftk.gff.gff2dict("combined.gff", None)

    # The function automatically detects the ##FASTA directive and splits the content
    print(f"Number of genes: {len(gff_dict)}")

**Writing Combined Files**

You can create combined GFF3+FASTA files using the ``dict2combined_gff_fasta`` function:

.. code-block:: python

    import gfftk

    # Parse separate GFF3 and FASTA files
    gff_dict = gfftk.gff.gff2dict("input.gff3", "genome.fasta")
    fasta_dict = gfftk.fasta.fasta2dict("genome.fasta")

    # Write to combined format
    gfftk.gff.dict2combined_gff_fasta(gff_dict, fasta_dict, output="combined.gff")

**Using the Command Line**

You can also use the command-line interface to work with combined files:

.. code-block:: bash

    # Create a combined file from separate GFF3 and FASTA files
    gfftk convert -i input.gff3 -f genome.fasta --output-format combined -o combined.gff

    # Convert a combined file back to separate GFF3 format
    gfftk convert -i combined.gff --output-format gff3 -o output.gff3

**Non-Standard GFF3 Features**

gfftk now supports several non-standard GFF3 features commonly used by some annotation pipelines:

* ``intron`` - Intron features
* ``noncoding_exon`` - Non-coding exon features
* ``five_prime_UTR_intron`` - 5' UTR intron features
* ``pseudogenic_exon`` - Pseudogenic exon features

These features are automatically recognized and parsed when present in GFF3 files.

GFF3 File Manipulation
~~~~~~~~~~~~~~~~~~~

gfftk provides several commands for manipulating GFF3 files:

1. **Sorting GFF3 Files**

.. code-block:: bash

    gfftk sort -i input.gff3 -o sorted.gff3

2. **Sanitizing GFF3 Files**

.. code-block:: bash

    gfftk sanitize -i input.gff3 -o sanitized.gff3

3. **Renaming Features in GFF3 Files**

.. code-block:: bash

    gfftk rename -i input.gff3 -o renamed.gff3 -p PREFIX