BLU Discuss list archive


[Discuss] Anyone Played with Programming Local LLMs, Such as Llama?



I don't consider myself an expert on what you're asking but I dabble so
I'll take a crack at it (since nobody else did!).  it's a complex and
*very* rapidly changing subject, there's a lot to say; broad questions can
lead to long answers... (sorry to everyone in advance!)

On Sat, Nov 30, 2024 at 02:09:42PM -0800, Kent Borg wrote:
> I think I should play with the technology from
> a programming angle to get a feel for what they can do and what they can't.

when you say "from a programming angle" do you mean:
 * using LLMs to program (which they can /almost/ do... at the beginning of
   this year they couldn't at all, but around april they started showing a
   bit of promise, and now they're scoring 10x-20x what they were at the
   beginning of the year on programming benchmarks...)
 * writing a program which calls LLMs?  typically code will interact with
   LLMs over the openai api, sending the prompt in a json message and
   getting the completion back, also as json.  there's not much to this
   interaction... you can point it at someone else's computer, or at a
   model running locally, but the api doesn't care.  (minimal sketch
   after this list.)
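
for concreteness, here's a minimal sketch of that second case in python,
assuming the "openai" pip package and any openai-compatible endpoint; the
base_url, api key, and model name are placeholders for whatever you're
actually pointing at:

  # minimal sketch: call an openai-compatible endpoint (local or hosted)
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:11434/v1",  # e.g. a local ollama server, or a provider's url
      api_key="not-needed-locally",          # hosted providers want a real key here
  )

  response = client.chat.completions.create(
      model="llama3.1",                      # whatever model the endpoint serves
      messages=[
          {"role": "system", "content": "You are a terse assistant."},
          {"role": "user", "content": "In one sentence, what is an API?"},
      ],
  )

  print(response.choices[0].message.content)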

> I want to understand how their very broad training can be directed in very
> specific ways, such as how can I get them to recognize patterns of my
> choice,

this last bit falls into the category of "fine tuning", and there are
several ways to do that:
 * most expensively, you can fine-tune the entire model by incrementally
   retraining its weights on new data - you'd typically use a lot of data
   for this, and a lot of gpus, and a lot of days or weeks.
 * ...or similar retraining on a well-chosen subset of weights.  still
   needs a lot of data and gpus, still fairly expensive and slow.
 * a much more common method is called LoRA (low-rank adaptation), which
   can be done fast & cheap with even just hundreds of examples.  people
   have made google colabs which can train loras for 32-billion-weight
   models and still fit in the free tier of colab (with free gpus).
   (rough sketch after this list.)
 * the easiest way, especially when experimenting, falls into the category
   of "prompt engineering" -- you give it the example inputs & outputs
   in every query as context, and its completion is guided by those
   patterns.  last year context windows got long enough to do something
   reasonable with this, and this summer models started getting released
   with context windows 5x-10x longer still.
 * (more that I'll skip because this is already too long...)
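
for concreteness, here's roughly what attaching lora adapters looks like
with the huggingface transformers + peft libraries.  this is a sketch, not
a recipe: the model name and hyperparameters are just illustrative, and a
real run still needs a tokenized dataset and a training loop on top:

  # rough sketch of lora fine-tuning with huggingface peft; model name,
  # target modules, and hyperparameters are illustrative only.
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import LoraConfig, get_peft_model

  base = "Qwen/Qwen2.5-7B-Instruct"         # any causal LM from the hub
  model = AutoModelForCausalLM.from_pretrained(base)
  tokenizer = AutoTokenizer.from_pretrained(base)

  lora_cfg = LoraConfig(
      r=16,                                 # rank of the low-rank update matrices
      lora_alpha=32,                        # scaling factor for the update
      lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],  # which layers get adapters
      task_type="CAUSAL_LM",
  )

  model = get_peft_model(model, lora_cfg)
  model.print_trainable_parameters()        # usually well under 1% of the weights

  # ...then train as usual (e.g. with transformers.Trainer) on your few
  # hundred examples; model.save_pretrained("my-lora") writes out just the
  # small adapter weights.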

> I think I would like to play with images, but I'm not stubborn in
> that regard. Certainly text is useful, so is audio?

images as input or images as output?  those will typically be handled by
different sets of models.  and audio in is a different set, and audio out
is a different set... there are generalist models which try to have all of
those as input (but typically only text as output), but the specialist
models not surprisingly can often do better.

> Googling about (well, duckduckgoing about) I see that Meta's Llama isn't an
> option that is free to play with*. There are other free to use LLMs:
> Granite, Mistral, and Gemma are the ones I have found so far.

??? meta's been open-sourcing their inference code from the start,
beginning with llama 1 (last year - ancient history!).  they didn't
release the llama 1 weights (though those leaked), but they have released
all model weights since llama 2 (also last year).

qwen seems to currently have a leading set of open generalist/chat models.
deepseek has some good ones too.  and also mistral, google's gemma, even
cohere's command-r.  and there are lots of fine-tunes of these (especially
of llama), and many specialist models in each modality & domain. literally
over a million models are available for you to download from huggingface
and run on your own hardware (or on someone else's!).
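
grabbing weights is a one-liner with the huggingface_hub package; the repo
id below is just an example (and note that the big ones are tens of GB of
downloads):

  # sketch: the repo id is an example - substitute whatever you want to run
  from huggingface_hub import snapshot_download

  path = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
  print("weights are in", path)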

> * Pedantic observation: As far as I can tell *none* of the LLMs are open
> source. Sure, the compilable code might be open source but that's not where
> the intellectual property lies. The billions of model parameters are the
> secret sauce, and they are the ultimate opaque blob.

it's accurate that the model weights are the secret sauce, but as above,
most of these are openly downloadable.  they're "opaque" in the sense that
nobody in the world understands exactly why they do what they do, but the
weights themselves are available for more than a million models, including
every one that's been mentioned in this thread up to this point.

the pedants will say that the *training* data and code which produced
those weights after $100m worth of compute aren't released, so nobody can
reproduce them from scratch (but if I had $100m that's not what I'd spend
it on).  I think this is true with the sole exception of the allen
institute, which has released training data, training code, training logs,
fine-tuning code, etc.

> Question: Has anyone here played with writing code to drive LLMs? Any
> pointers for getting out of the mud easily? (Any warnings?)

if what you want to do is see what llms can do - and I know this isn't in
line with the ethos of this group (me included) - don't write off paid
providers (especially if you don't have beefy gpus); paid queries are
usually pennies to fractions of a penny each, and if you're not doing that
many this might add up to a dozen dollars a month.

open models you self-host have recently gotten very very good, but unless
you have multiple modern gpus you won't be able to run the good versions
of them at reasonable speed.  the models you can run won't compare (at
least in the chat/generalist domain; I think in audio they will compare,
and in image generation I think they will compare but just be slow).

there are providers like openrouter and together.ai who will bounce your
queries to most model providers and even have some good free(-beer)
models, which makes it easy to experiment.  oh, and google gives you 1500
free queries/day on some of their midrange models (50 free/day on their
best).

if you are self-hosting, ollama is the easiest way to get started (with the
limited models they support), and I'd also have a serious look at jan.ai
(don't write it off as a chat gui, it allegedly makes it very easy to
download any model and run it with an api port to hit).
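
for example, once ollama is running (it listens on port 11434 by default)
and you've pulled a model (e.g. "ollama pull llama3.1"), hitting it from
python looks something like this - the model name is just an example:

  # assumes an ollama server on the default port and a model already pulled
  import requests

  resp = requests.post(
      "http://localhost:11434/api/chat",
      json={
          "model": "llama3.1",
          "messages": [{"role": "user", "content": "why is the sky blue?"}],
          "stream": False,   # one json blob back instead of a token stream
      },
  )
  print(resp.json()["message"]["content"])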

quantization really helps large models fit into your vram and even dram.
you probably want 5-6 bits/weight.  complex topic.
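
to make that concrete: a 32-billion-weight model stored at 16 bits/weight
is ~64GB, which won't fit on any normal gpu; at ~5 bits/weight it's
roughly 32e9 * 5 / 8 ≈ 20GB, which just squeezes into a 24GB card with a
little room left over for context.  (back-of-the-envelope numbers,
ignoring kv-cache and other overhead.)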

"tool use" is a hot topic these days, letting an llm issue commands to a
set of pre-cooked functionalities you provide (e.g. web search, or fetch
weather or stock price, or even issue bash commands).
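
roughly, over the openai-style api it looks like this: you describe your
function in json-schema form, and the model may answer with a request to
call it.  the function, model name, and messages here are all made up for
illustration, and it assumes OPENAI_API_KEY in the environment (or point
base_url at a local server as in the earlier sketch):

  # sketch of tool use over the openai-style api; get_weather is made up
  import json
  from openai import OpenAI

  client = OpenAI()

  tools = [{
      "type": "function",
      "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"],
          },
      },
  }]

  resp = client.chat.completions.create(
      model="gpt-4o-mini",   # placeholder; any tool-capable model
      messages=[{"role": "user", "content": "Do I need an umbrella in Boston?"}],
      tools=tools,
  )

  msg = resp.choices[0].message
  if msg.tool_calls:                        # the model decided to use the tool
      call = msg.tool_calls[0]
      print(call.function.name, json.loads(call.function.arguments))
      # ...run the real function, append its result as a "tool" message, and
      # call the api again so the model can write the final answer.
  else:
      print(msg.content)                    # it answered directly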

"rag" (retrieval-augmented generation) is popular: retrieve data relevant
to the query - from the web, or a database, or document collection, etc, -
and provide that as context with the query.  (similar to the prompt
engineering I mentioned above)
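
a toy sketch of the shape of it - the "retrieval" here is crude keyword
overlap purely for illustration (real systems use embedding similarity and
a vector database), and the documents are made up:

  # toy rag: pick the most relevant documents, paste them in as context
  docs = [
      "LoRA fine-tuning trains small low-rank adapter matrices.",
      "Quantization stores weights in fewer bits to save memory.",
      "Context windows limit how much text a model can attend to.",
  ]

  def retrieve(query, k=2):
      words = set(query.lower().split())
      ranked = sorted(docs, key=lambda d: -len(words & set(d.lower().split())))
      return ranked[:k]

  query = "how does lora fine-tuning work?"
  context = "\n".join(retrieve(query))

  prompt = (
      "Answer using only the context below.\n\n"
      f"Context:\n{context}\n\n"
      f"Question: {query}\n"
  )
  print(prompt)   # ...then send this to whatever model/api you're using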

imho nobody puts enough effort into the broader form of prompt engineering,
which is crafting the instructions you're giving the llms.  everyone asks a
one-line question and assumes the llms will know what to do; they don't -
you have to guide them very carefully.

have llms be verbose: they "think" by generating streams of output.  if you
have them be brief, you're limiting their "thinking" and output quality
suffers.  (current name for this is CoT, chain-of-thought)
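
as a contrived example of both points: instead of asking "is this email
spam?  yes or no", something like "you are screening mail for a small
business.  first list anything suspicious about the message below (sender
domain, links, urgency language), then reason through whether those add up
to spam, and only then give a verdict with a one-line justification" both
tells the model what you actually care about and gives it room to "think"
before committing to an answer.  (that prompt is mine, not a magic
incantation.)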

warnings?  don't ever believe anything an llm says, no matter how confident
it sounds.  it lies.  a lot.  have it cite sources and check them yourself.

> -kb, the Kent who expects he will be using his Framework 13 laptop: a 6-core
> (12-thread) AMD Ryzen 7640U CPU with 64GB of RAM, but maybe he plugs in a
> Hailo M.2 AI module, too.

I don't know anything about the hailo m.2; I briefly looked it up just now,
and what I didn't see was any hbm/vram - which makes me question its value.
these days, llms are memory bandwidth-bound (and stats on their website
were from 2021?! different world!).  how much memory you have in your gpu
is what determines what models you can run at a tolerable speed.  your
system ram bandwidth is good by cpu standards, but it's only 50-100GB/s;
the leading gpus which all serious players are using have >1500GB/s and
80GB of memory.  even midrange gamer gpus have 24GB at ~1000GB/s now.
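
rough arithmetic for why bandwidth dominates: generating each token means
streaming essentially all of the model's weights past the processor, so
tokens/second ≈ memory bandwidth / model size.  a ~20GB quantized model
over 50GB/s system ram is ~2-3 tokens/s; the same model on a 1000GB/s gpu
is ~50 tokens/s.  (hand-wavy - batching, kv cache, and compute all
complicate it - but it's the right first-order picture.)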

you absolutely can run llms cpu-only, but you'll either need to choose
smaller models (maybe a few billion weights at most?) or be okay with
waiting a long time for a response.  I'm unsure whether that hailo module
will make any difference; your compute likely outruns your memory already.
plus, as above, $200 buys a lot (10k-100k?) of paid queries from the very
best models on the very best hardware.

yikes, this is way too long (sorry!), I'll stop now.
--grg