Model formats

These are the main formats you will encounter:

  1. Unquantized models

    These consist of a folder that contains the actual model files. The name of the folder usually does not contain “AWQ”, “EXL2”, or “GPTQ”. The loader in OTGW is “Transformers”.
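
    As a rough illustration of what the “Transformers” loader does under the hood, here is a minimal sketch using the Hugging Face transformers library directly (the folder path is a placeholder, and `device_map="auto"` assumes the accelerate package is installed):

    ```python
    # Minimal sketch: loading an unquantized model folder with Hugging Face
    # transformers. "path/to/model-folder" is a placeholder for your local
    # model directory.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("path/to/model-folder")
    model = AutoModelForCausalLM.from_pretrained(
        "path/to/model-folder",
        device_map="auto",   # spread layers over the available GPU(s)/CPU
        torch_dtype="auto",  # keep the precision stored in the checkpoint
    )

    inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    ```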

  2. AWQ (Activation-aware Weight Quantization)

    AWQ inference runs purely on the GPU. Currently only Nvidia GPUs are supported.

    AWQ works with variable quantization depths and tries to minimize quality loss while saving resources at the same time. AWQ models are very fast.

    AWQ used to work in OTGW, but its support was dropped because of version conflicts. Maybe it will return someday... In the meantime you can still run AWQ models outside OTGW; see the sketch below.

    You can find more information here.
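
    If you want to try that, a minimal sketch with the AutoAWQ library might look like this (my assumption, not something OTGW does; it presumes `pip install autoawq`, an Nvidia GPU, and a placeholder model path):

    ```python
    # Sketch: loading an AWQ-quantized checkpoint with the AutoAWQ library.
    # "path/to/model-AWQ" is a placeholder for a local AWQ model folder.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_quantized(
        "path/to/model-AWQ",
        fuse_layers=True,  # fuse attention/MLP layers for faster inference
    )
    tokenizer = AutoTokenizer.from_pretrained("path/to/model-AWQ")
    ```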

  3. EXL2 (Exllama v2)

    EXL2 is another format that is executed only on the GPU.

    For most models you will find several constant quantization depths, expressed in BPW (bits per weight). 8 BPW is regarded as “overkill” by most; the highest recommended quantization seems to be 6 or 6.5 BPW. Below 3 BPW the quality losses become huge, so I recommend staying between 4 and 6.5 BPW if your VRAM allows for it (a rough size estimate follows below).

    Compared with the other GPU-only formats, EXL2 is a bit heavier on VRAM usage but also a bit faster.

    The loader for EXL2 is “ExLlamav2_HF”.

    You can find more information here.
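
    To judge what your VRAM allows, a rough back-of-the-envelope estimate for the weights alone is parameters × BPW / 8 bytes (my own approximation; the KV cache and context come on top of this):

    ```python
    # Rough weight-size estimate for an EXL2 quant: params * BPW / 8 bytes.
    # Weights only: the KV cache and context overhead come on top, so treat
    # the result as a lower bound for the VRAM you need.
    def exl2_weights_gb(params_billion: float, bpw: float) -> float:
        return params_billion * bpw / 8  # the 1e9 factors cancel out

    for bpw in (3.0, 4.0, 5.0, 6.0, 6.5, 8.0):
        print(f"70B model at {bpw} BPW: ~{exl2_weights_gb(70, bpw):.1f} GB")
    ```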

  4. GGUF

    GGUF is the only format that can be executed on the CPU as well as on the GPU, or on a mixture of both. That way you can run some layers of a model on the GPU in VRAM and keep the rest on the CPU in “normal” RAM if the whole model does not fit into your VRAM (see the sketch below). You will lose a lot of computation speed, but a slowly running model is better than a model that does not run at all! This also lets you experiment with really big models if you want to test-drive them.

    The loader for GGUF is “llama.cpp”.

    Currently GGUF is my favourite format because it offers more flexibility for running large models, and it also seems to produce the best-quality output.

    You can find more information here.
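
    Here is a minimal sketch of that layer splitting using llama-cpp-python, the Python binding for llama.cpp (the model path and layer count are placeholders; tune n_gpu_layers to your VRAM):

    ```python
    # Sketch: partial GPU offload of a GGUF model with llama-cpp-python.
    # "model.gguf" is a placeholder; n_gpu_layers says how many layers go
    # into VRAM, while the remaining layers run on the CPU from normal RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="model.gguf",
        n_gpu_layers=20,  # raise until VRAM is full; -1 offloads all layers
        n_ctx=4096,       # context window
    )
    out = llm("Q: What is GGUF? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```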

  5. GPTQ

    GPTQ is also a format that is executed only on the GPU. You'll mostly find 4-bit and 8-bit quantizations, but with varying “group sizes” and “act orders”. The file size in gigabytes is roughly the number of parameters in billions (10⁹) for 8 bit, and half of that for 4 bit (a worked example follows below).

    The loader for GPTQ is also “ExLlamav2_HF”.

    You can find more information here.
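
    A worked example of that size rule of thumb (weights only, ignoring group-size overhead):

    ```python
    # Worked example of the GPTQ size rule of thumb: the weights take roughly
    # (parameters in billions) GB at 8 bit and half of that at 4 bit.
    def gptq_weights_gb(params_billion: float, bits: int) -> float:
        return params_billion * bits / 8  # bits / 8 = bytes per weight

    print(gptq_weights_gb(13, 8))  # 13.0 -> ~13 GB for a 13B model at 8 bit
    print(gptq_weights_gb(13, 4))  # 6.5  -> half of that at 4 bit
    ```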