Big day for people who run AI locally. According to the benchmarks, this is a big step forward for free, small LLMs.

  • just another dev@lemmy.my-box.dev
    5 months ago

    Ah, that’s a wonderful use case. One of my favourite models has a storytelling LoRA applied to it; maybe that would be useful to you too?

    At any rate, if you end up publishing your model, I’d love to hear about it.

      • just another dev@lemmy.my-box.dev
        5 months ago

        Oof - not on my 12GB 3060 it doesn’t :/ Even at 48k context and the Q4_K quantization, ollama is doing a lot of offloading to the CPU. What kind of hardware are you running it on?

        • brucethemoose@lemmy.world
          5 months ago

          A 3090.

          But it should be fine on a 3060, with zero offloading.

          Dump ollama for long context. Grab a 5-6bpw exl2 quantization and load it with Q4 or Q6 cache depending on how much context you want. I personally use EXUI, but text-gen-webui and tabbyapi (with some other frontend) will also load them.
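
For a rough feel of why the bpw and cache choices matter on a 12GB card, here is a back-of-the-envelope sketch. It is only an estimate: the model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128) is a hypothetical placeholder, not the model discussed in this thread, and the formulas ignore quantization scale overhead, activations, and framework buffers.

```python
GIB = 2**30


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, cache_bits: float) -> float:
    """Approximate KV cache size in GiB.

    The leading 2 accounts for storing both keys and values.
    cache_bits is the nominal bits per cached element (16 for FP16,
    roughly 4 or 6 for a Q4 or Q6 cache); scale overhead is ignored.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * cache_bits / 8 / GIB


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    n_params, n_layers, n_kv_heads, head_dim = 8e9, 32, 8, 128
    context_len = 48 * 1024  # "48k" context

    for bpw in (5.0, 6.0):
        for cache_bits in (4, 6, 16):
            total = (weights_gib(n_params, bpw)
                     + kv_cache_gib(n_layers, n_kv_heads, head_dim,
                                    context_len, cache_bits))
            print(f"{bpw} bpw weights, {cache_bits}-bit cache: ~{total:.1f} GiB")
```

Swap in the real layer count, KV head count, and head dimension from your model's config to get numbers that mean something; if the total lands well under your VRAM, there should be little reason to offload.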