Model formats


Currently, there are essentially the following formats:

  1. Unquantized Models

    These consist of a folder containing the actual model files. The folder name typically does not include “AWQ”, “EXL2”, or “GPTQ”. The loader in OTGW is “Transformers”.
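    If you want to load such a folder directly with the Hugging Face Transformers library (which is what the OTGW loader is built on), a minimal sketch might look like the following; the model path is a placeholder, and device_map="auto" assumes the accelerate package is installed:

    ```python
    # Minimal sketch: loading an unquantized model folder with Hugging Face Transformers.
    # "models/MyModel" is a placeholder; point it at your own model folder.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_dir = "models/MyModel"  # folder with config.json, tokenizer files, *.safetensors

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir,
        device_map="auto",    # place weights on the GPU(s) automatically (needs accelerate)
        torch_dtype="auto",   # use the dtype stored in the checkpoint (e.g. float16/bfloat16)
    )

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```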

  2. AWQ (Activation-aware Weight Quantization)

    AWQ inference occurs entirely on the GPU. Currently, only Nvidia graphics cards are supported.

    AWQ works with variable quantization depths and attempts to mitigate quality losses while still saving space. AWQ models are very fast.

    AWQ used to work in OTGW, but support has since been removed due to version conflicts. Perhaps it will return in the future...

    You can find more information here.
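    Since the OTGW loader is gone, here is a rough sketch of loading an AWQ checkpoint directly with the Hugging Face Transformers library, which can read the AWQ quantization settings from the model's config when the autoawq package is installed; the repo name below is just a placeholder:

    ```python
    # Sketch: loading a pre-quantized AWQ checkpoint with Transformers.
    # Assumes the "autoawq" package is installed and an Nvidia GPU is available.
    # "someone/some-model-AWQ" is a placeholder repo/folder name.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "someone/some-model-AWQ"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",  # AWQ inference runs on the GPU
    )

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    ```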

  3. EXL2 / EXL3 (Exllama v2 / v3)

    EXL2/3 is also a format that runs entirely on the GPU.

    Most models are offered at several fixed quantization depths, specified in BPW (bits per weight). 8 BPW is considered “overkill” by some; the highest quantization worth recommending is probably 6 or 6.5 BPW. Below 3 BPW, quality deteriorates significantly, so I would recommend choosing a quantization between 4 and 6.5 BPW if your VRAM allows.

    EXL2/3 requires a bit more VRAM than GGUF but is also somewhat faster.

    The loader for EXL2 is “ExLlamav2_HF”; for EXL3 it is “Exllamav3”.

    You can find more information here.
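    As a back-of-the-envelope illustration of what BPW means for VRAM: the weights alone take roughly parameters × BPW / 8 bytes, and the KV cache and activations come on top. A small sketch with ballpark numbers:

    ```python
    # Rough estimate of weight size for an EXL2/EXL3 quantization at a given BPW.
    # This ignores KV cache and activation overhead, so real VRAM use is higher.
    def weight_size_gb(params_billion: float, bpw: float) -> float:
        bytes_total = params_billion * 1e9 * bpw / 8  # bits per weight -> bytes
        return bytes_total / 1e9                      # size in GB

    for bpw in (3.0, 4.0, 5.0, 6.5, 8.0):
        print(f"70B model at {bpw} BPW: ~{weight_size_gb(70, bpw):.0f} GB of weights")
    ```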

  4. GGUF

    GGUF can run entirely on the GPU, but it is the only format that also offers the possibility to offload individual layers of the LLM into RAM. This significantly impacts speed, but better to run slowly than not at all! It also makes it possible to run “really large” models.

    The loader for GGUF is “llama.cpp”.

    GGUF is my preferred format, as it seems to produce the best quality and, beyond that, offers the flexibility to move some layers of the model into CPU RAM. That is much slower, but it allows running models that exceed, or even far exceed, the available VRAM.

    You can find more information here.
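    Outside OTGW, the same layer offloading can be sketched with the llama-cpp-python bindings; the decisive knob is the number of layers kept on the GPU. The file name below is a placeholder:

    ```python
    # Sketch: running a GGUF model with llama-cpp-python and partial GPU offloading.
    # "model.Q4_K_M.gguf" is a placeholder file name.
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.Q4_K_M.gguf",
        n_gpu_layers=35,   # layers kept on the GPU; the rest stay in CPU RAM (-1 = offload everything)
        n_ctx=4096,        # context length
    )

    out = llm("Q: What is GGUF? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```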

  5. GPTQ

    GPTQ is also a format that runs entirely on the GPU. There are mainly 4-bit and 8-bit quantizations, as well as variable “Group Size” and “Act Order” settings. For 8-bit, the file size in gigabytes roughly corresponds to the number of parameters in billions; for 4-bit, it is about half that.

    The loader for GPTQ is also “ExLlamav2_HF”.

    You can find more information here.
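    The size rule above is just the bits-per-weight arithmetic again: 8 bits are one byte per weight, so the gigabytes roughly match the billions of parameters. A tiny worked example with rough figures:

    ```python
    # Rough check of the GPTQ size rule of thumb (weights only, overhead ignored).
    params_billion = 13                      # e.g. a 13B model
    size_8bit_gb = params_billion * 8 / 8    # 8 bits = 1 byte per weight    -> ~13 GB
    size_4bit_gb = params_billion * 4 / 8    # 4 bits = 0.5 bytes per weight -> ~6.5 GB
    print(f"8-bit: ~{size_8bit_gb} GB, 4-bit: ~{size_4bit_gb} GB")
    ```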