Doing more with less: LLM quantization (part 2)
What if you could get similar results from your large language model (LLM) with 75% less GPU memory? In my previous article, we discussed the benefits of smaller LLMs and some of the techniques for shrinking them. In this article, we'll put this to the test by comparing the results of the smaller and larger versions of the same LLM.

As you'll recall, quantization is one of the techniques for reducing the size of an LLM. Quantization achieves this by representing the LLM's parameters (e.g. weights) in lower-precision formats: from 32-bit floating point (FP32) down to 8-bit integer (INT8) or 4-bit integer (INT4).
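To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names and the per-tensor scheme are illustrative assumptions for this sketch, not the API of any particular quantization library:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights onto the INT8 range [-127, 127] (symmetric, per-tensor)."""
    # One FP32 scale for the whole tensor; assumes the tensor is not all zeros.
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, float(scale)

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# A toy "weight matrix": storage drops from 4 bytes to 1 byte per parameter.
w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
print(f"FP32: {w.nbytes / 1e6:.1f} MB -> INT8: {q.nbytes / 1e6:.1f} MB")
print(f"max rounding error: {np.max(np.abs(w - dequantize_int8(q, scale))):.5f}")
```

The 4x drop in bytes per parameter is where the "75% less GPU memory" figure comes from; INT4 halves storage again, at the cost of coarser rounding.

The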