February 14, 2024
Digital Twin v2

Some breakthroughs in evolving my AI Clone

A few months ago I did a little experiment that I know a lot of us have done: I tried to fine tune a model to mimic me based on a bunch of old rubber-ducky chats with myself, a bunch of old writings and content, etc. I had my reasons, but the short version is that I wanted to see how well the model would represent me. Results were novel, but chats at times ended up a bit "off", and I wanted to see what I could do in the way of building something a little more comprehensive - greater accuracy, a little bit better decision-making capability, contextual awareness, and conversation flow. The "holy grail" that I'm after just as a longer-term personal thing is a synthetic co-host for use during twitch streams - someone to bounce some entertaining ideas off of, find additional opportunities for fun commentary, and help highlight or interact with messages from twitch chat.

I tried a bunch of stuff, and figured I'd post some of my findings. I refined my data set a bit, tried out some different models (results were pretty similar all around), built a system for some better context delivery, and did some experimenting with hybridizing models. The REAL goal of these immediate experiments was to build a model that could be used to minimize the work of expanding our data set.

Technical Refinements

Unfortunately, since this is just a conversational chatbot for what is, essentially, some very complex roleplay, this use-case doesn't really fit very cleanly into typical LLM capability benchmarks, so the best evidence of improvement I can offer is "general vibe", which is admittedly not a great metric. But still... I figured it was worth outlining the steps I've been taking.

Dataset Refinement I mentioned initially that my data set was a bit noisy. I did a bit of trimming of the training data, removing lines that didn't seem to contribute much in the way of example data - usually lines where some context was missing: a reference to an image that wasn't there, or text written with the assumption of some tertiary info (a livestream, etc.). This did seem to help; responses came out quite a bit more coherent.
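The trimming pass was nothing fancy - roughly this kind of filter, where the markers, thresholds, and file names are illustrative rather than my exact heuristics:

```python
import json

# Rough heuristics for lines that lean on context the model won't have:
# attachment placeholders, stream references, and near-empty messages.
NOISE_MARKERS = ["[image]", "[attachment]", "!overlay", "!clip"]

def is_noisy(text: str) -> bool:
    lowered = text.lower()
    if len(lowered.split()) < 3:          # one-word reactions, "lol", etc.
        return True
    return any(marker in lowered for marker in NOISE_MARKERS)

with open("chats_raw.jsonl") as src, open("chats_trimmed.jsonl", "w") as dst:
    for line in src:
        entry = json.loads(line)
        if not is_noisy(entry["text"]):
            dst.write(json.dumps(entry) + "\n")
```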

Dataset Expansion One important thing that I realized as I was refining this data set was that there was still a lot of conversational ground I needed to cover that wasn't represented in it. I did some looking through datasets used to train other popular chat models, and realized that while these aren't generally instruction datasets, the use cases broadly represented in them were instructive ones: "answer questions about this article", "give me a recipe", "write a blog post", etc. I realized pretty quickly that this wasn't going to result in a very strong data set for our uses, because generally I would probably just turn down requests for stuff like this.

"Write me a tweet"

So after a bit of discussion with ChatGPT (I would've linked it, but I've unfortunately lost the conversation) it seemed like the Cornell Movie Dialogue Corpus was up to the task. I wrote a script to take a couple of movie scripts that I thought were personally relevant (Hackers (1995) and The Matrix, for starters) and format them for training, replacing character names with my name to signal that a character's responses were the ones we were targeting. It was a fun experiment, and added a couple hundred lines to the data set, but it was ultimately just a stop-gap for the eventual dataset development that I'd like to have this model help me do. Still, useful lil' trick!
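The formatting script itself was dead simple - something along these lines, where the character mapping and the sample scene are obviously just illustrative, not pulled from the actual corpus files:

```python
# Minimal sketch of the script-to-training-data pass.
TARGET_CHARACTERS = {"DADE", "NEO"}   # characters whose lines stand in for "me"
MY_NAME = "Joey"

def format_scene(lines):
    """lines: list of (character, dialogue) tuples from one scene."""
    formatted = []
    for character, dialogue in lines:
        speaker = MY_NAME if character in TARGET_CHARACTERS else character.title()
        formatted.append(f"{speaker}: {dialogue}")
    return "\n".join(formatted)

scene = [
    ("CEREAL", "Did you check the garbage file?"),
    ("DADE", "Yeah. It's a worm AND a virus."),
]
print(format_scene(scene))
# Cereal: Did you check the garbage file?
# Joey: Yeah. It's a worm AND a virus.
```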

Model Selection I did try fine tuning a slew of different models. My intent was to go with Mixtral, but I haven't got the money to train on an A100 - and after giving it a little bit of thought, I figured I'd experiment with models I could run locally anyway. Maybe in the future I'll run Mixtral for "production" use cases (during streams, etc) and lower-end models for everyday use. Nevertheless, I trained a couple of different smaller models (including llama-2 as before, mistral-7b, as well as gpt-3.5-turbo on a curated "public-friendly" data set from which I had manually stripped anything I had privacy concerns about) and ended up landing on Zephyr-7b-beta, because the results generally seemed quite a bit closer to how I felt I would respond to a private, conversational message.

This makes sense, given all of the work that went into Zephyr's alignment (or de-alignment, as it were). While Zephyr's alignment bears its own ethical considerations, it was important that this digital twin model bore an alignment specific to me, representing my preferences, desired outcomes, etc. By contrast, it frankly doesn't quite make sense for what is ultimately a representation of not just language but behavior-influenced language to try to squeeze through those safety layers. Again - it bears its own considerations and its own discussion, but on all accounts I don't feel like a square peg like this would be a great fit for the round hole of professional considerations. Nobody's going to be using this thing but me, anyway.

/endrant

Worth mentioning: I wanted to try out Huggingface's tools, so I largely relied on Huggingface's AutoTrain spaces for fine tuning, and stood up inference for testing on Huggingface's inference endpoints (via AWS). I downloaded the models/LoRAs that I liked to spin up locally in text-generation-webui.
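For quick local sanity checks, loading one of those downloaded LoRAs with peft is straightforward - something like this, where the adapter path is just wherever the download landed and the prompt is made up:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "HuggingFaceH4/zephyr-7b-beta"
adapter = "./loras/digital-twin-zephyr"   # hypothetical local path

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)  # attach the fine-tuned LoRA

prompt = "joeyone: hey, what are we building on stream tonight?\nJoey:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```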

Minimal Journaling App One thing that I wanted to make available to this system was the use of retrieval for some real-time, unstructured data for context. To do this, I built a minimal (and I do mean minimal) journaling application.

Minimal Journaling App
That's it. That's the app.

The goal here was pretty simple - I needed to build something that minimized the friction of adding new entries, and that allowed journaling without any contextual interference. I built an ADD support chatbot a while back and was looking to build a journal for that as well, for some behavioral analysis. For that use case it was super important to have a system that would let me journal without being exposed to the content of other entries, to make sure that entries were honest, unbiased, and not written to build an overarching narrative. The minimization of friction was also super important for making sure that I could just throw in random thoughts as they occurred. As a bonus of building a journal for that system, I realized that it would also be some really excellent context to provide to this one!

I built this app to write data that was as interoperable (for journaling and analysis) as I could make it - initially, when validating the idea, I used tiddlyhost as a stand-in, and once I realized that this was likely something I wanted to continue working with, I wrote a client that stored tiddlywiki-formatted entries in S3 and sent them over to Pinecone for retrieval. The results were about what you'd expect, which... was good? It's good. I'd like to play around with some different retrieval methods (graph traversal, generating interrogative prompts to vectorize rather than the posts themselves, etc) but for now it works well enough. If we talk about something specific, it pulls up info that I've journaled about the thing.
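For the curious, the write path is roughly this - bucket, index, and embedding model names are placeholders, and the SDK calls assume fairly recent boto3/openai/pinecone client versions, so treat it as a sketch rather than the actual client:

```python
import time
import boto3
from openai import OpenAI
from pinecone import Pinecone

s3 = boto3.client("s3")
openai_client = OpenAI()
index = Pinecone(api_key="...").Index("journal")   # hypothetical index name

def save_entry(title: str, text: str):
    # Store the entry as a tiddlywiki-style .tid file in S3...
    key = f"journal/{int(time.time())}-{title}.tid"
    tiddler = f"title: {title}\ncreated: {int(time.time())}\n\n{text}"
    s3.put_object(Bucket="digital-twin-journal", Key=key, Body=tiddler.encode())

    # ...then embed it and upsert the vector into Pinecone for retrieval.
    embedding = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=text
    ).data[0].embedding
    index.upsert(vectors=[{"id": key, "values": embedding, "metadata": {"title": title}}])
```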

Overcoming Technical Challenges

Mismatched Dataset Formats I did experiment with transforming the data set to meet different prompt formats (sharegpt, alpaca, etc) but consistently ran into issues with formatting because, frankly, the shape of my data set doesn't work super well with the "call and response" structure that some of these models are designed to accommodate - repeat messages from the same user, and so on. While more drastic transformations are something I want to try sometime, I decided in the end to stick with a text completion format and spend that time developing a newer, more comprehensive data set with a model trained on this data.
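To make that concrete, here's roughly what I mean by a text completion row - the chat snippet is made up and the exact field names depend on how the trainer ingests data, so this is illustrative only:

```python
# Collapse a chat snippet (including back-to-back messages from the same
# person) into a single plain-text completion row, rather than forcing it
# into a call-and-response template like alpaca or sharegpt.
import json

chat = [
    ("joeyone", "ok so the render finished"),
    ("joeyone", "and it's 40 gigs lol"),
    ("friend", "how??"),
    ("joeyone", "I forgot to turn off the lossless intermediate codec again"),
]

text = "\n".join(f"{speaker}: {message}" for speaker, message in chat)
row = {"text": text}
print(json.dumps(row))
```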

Mismatched Contexts After integrating our RAG pipeline to pull from the journal, one pretty significant issue I ran into was that our LLM would typically misinterpret journal entries as chat entries - which makes sense, given that our training data was built without a structure that would accommodate the injection of context. Unfortunately, I don't have any examples to share, but if you've been involved in fine tuning or developing datasets for fine tuning at all, I'm sure you can imagine.

So, my first thought was to jump into dataset transformation. I built a script to add a system prompt with context from the journal preceding each output, and the results were - for lack of a better term - disastrous. In inspecting the resulting data set, the context just seemed to add additional noise, and wasn't always a great fit: some of the data comprising that data set is quite old (which meant factual inaccuracies between the context and output), and there was quite a significant disparity in topical coverage - some journal entries were rarely, if ever, used while others were pulled in way more often than I'd thought they would be.

"Random" topic switching One trend that I'd noticed as I was attempting to pipe some additional contextual info into conversations was that the model seemed to regularly just try to change the subject to seemingly unrelated topics. Unrelated to the context that I was supplying it, and unrelated to the conversation. Initially I figured that this was likely just due to noise or missing context in the training data - it would make sense that if we'd been exposed to two separate concepts in a contextually unique environment, our model might assume the same uniqueness was present in a new contextual environment, right? Noise.

I had just figured that this was a fundamental weakness of the data set that I'd given it to train on, until I tried having my model spit out a few generations where the eos token was banned, allowing it to just generate the rest of the conversation until it hit the token cap. Looking at those generations, I realized that what this system was attempting to represent (and actually representing quite well) was the natural "topic drift" that was likely to happen over the course of a conversation or a piece of written work. I might start out a conversation writing about Gunpla, and might end the conversation discussing video production (two fundamentally separate topics that, for me, are intertwined in some ways that are difficult to squeeze into a single prompt).
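If you want to reproduce that kind of experiment outside of text-generation-webui (which has an option for banning the eos token), this is roughly the equivalent call with plain transformers - assuming `model` and `tokenizer` are already loaded as in the LoRA snippet above, and with sampling parameters that are just what I'd reach for, not a recipe:

```python
prompt = "joeyone: I finally got paint on the Sazabi tonight\nJoey:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.9,
    bad_words_ids=[[tokenizer.eos_token_id]],  # never emit EOS, so it keeps rambling
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```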

Honestly, this was super surprising to me, and it sparked some of the discoveries below that I decided were worth just "running with".

Breakthroughs and Insights

Abstracting Context Hybridization One major breakthrough for me was the inclusion of an LLM as an abstraction layer for marrying these disparate contexts. On the one hand, we had contextual data (from my private journal) written in a non-conversational tone, and on the other hand we had a really effective (if factually flawed) model of a linguistic style to shoot for and proclivity toward topic drift. I realized that a really solid way of marrying the two (if too slow for our "Twitch Co-host" use case) was to just allow our conversational model to sim out the entire conversation, and then pass our simulated chat and context data in a single instruction prompt to a general-use LLM (in this case, OpenAI's chat models) with some instruction about what each context meant and how we wanted it to marry them.

[INSTRUCTION]
You are joey zero, a digital twin system of the youtube and twitch personality of the same name. You will be provided with information in a [CONTEXT] section, as well as a [SIMULATED CHAT]. Your [CONTEXT] represents your thoughts, feelings, preferences, etc. The [CONTEXT] is retrieved from Joey Zero's private journal, via semantic search from the actual chat conversation that you are looking to formulate a response to. Discussion is currently centered around this topic, and you should seek to contribute in a joey'ish way. The [SIMULATED CHAT] section contains a snippet of simulated chat based on a model fine-tuned on previous chats and personal writings from joey zero. The user names and chat messages in this section are completely synthetic, but are modeled to give you an excellent idea of the ways in which you should speak and the direction that a conversation is likely to go. You may notice that some themes are present between the two, but also may notice that Joey's answers diverge from the original topic in accordance with fine tuned conversation heuristics. If applicable, you should look to keep your response open to this divergence without explicitly mentioning the chat that is modeled in the [SIMULATED CHAT] section. In many cases, the best response is the first response from joey zero, but you should look to guide the topic as modeled in [SIMULATED CHAT] while ensuring that your output is factually accurate in accordance with the [CONTEXT] - while the topic heuristics and linguistic styling that you are looking to achieve are accurately represented, there is a high likelihood of hallucination in the [SIMULATED CHAT], so the [CONTEXT] is provided to maintain factual accuracy and topical guidance. The user "joeyone" is actually the original "Joey Zero" - your progenitor and the developer that built and maintains your systems.
So, in short:
- The [CONTEXT] is factually accurate, and a perfect representation of current happenings in Joey's world, but isn't always the best descriptor of Joey's conversational mannerisms.
- The [SIMULATED CHAT] is a perfect representation of Joey's conversational mannerisms, but is prone to inaccuracy.
[/INSTRUCTION]
[CAVEATS]
- Your response space is very short. Be mindful of this, and try to keep your messages to roughly the same length as Joey's message from the [SIMULATED CHAT]
[/CAVEATS]
[CONTEXT]
- I think I'd like to spend some time working on my MGEX strike freedom this week.
- I have wayyy too many kits in progress. The Strike Freedom is like halfway through initial assembly (started months ago), but I've also got my metal coat Sazabi that has been waiting on paints to arrive *forever*. Initial assembly on the Liger Zero X is done, but I haven't been able to take it apart and get to cleaning up the nubs and painting yet. Same for the Nightingale. I just haven't had the time... Le sigh.
- Tonight on stream we built the HG Nightingale. It was a super fun build!
[/CONTEXT]
[SIMULATED CHAT]
...some simulated conversation
[/SIMULATED CHAT]

It's a bit long-winded (I feel), but it works. As the conversation grows, the system tends to fall apart eventually, but for our immediate needs (re: building a stronger data set), this works super well!
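Stripped way down, the whole hybridization step looks something like this - `simulate_chat` and `retrieve_context` are placeholders for the local generation and the Pinecone lookup described above, and the model name is just what I'd reach for today, so take it as a sketch rather than the exact implementation:

```python
from openai import OpenAI

client = OpenAI()

def hybrid_reply(chat_history: str, simulate_chat, retrieve_context) -> str:
    # Pull journal entries related to the live conversation.
    context = "\n".join(f"- {entry}" for entry in retrieve_context(chat_history))
    # Let the fine-tuned local model sim the rest of the conversation.
    simulated = simulate_chat(chat_history)

    prompt = (
        "[INSTRUCTION] ...the long instruction block above... [/INSTRUCTION]\n"
        f"[CONTEXT]\n{context}\n[/CONTEXT]\n"
        f"[SIMULATED CHAT]\n{simulated}\n[/SIMULATED CHAT]"
    )
    # Hand both contexts to a general-purpose chat model to marry them.
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": chat_history},
        ],
        max_tokens=120,
    )
    return response.choices[0].message.content
```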

Not really LLM related (but still cool)

I built a minimal UI for chat inference with this model, and I was asked to take a look at ElevenLabs for work so I ended up doing some instant training on my voice and using that for TTS from this model. There really *is* something uniquely impactful about hearing your own voice (or an approximation of it, anyway) read those generations back to you. It's pretty wild.

Ethical Considerations: TLDR

While my own ethical considerations are still very much in line with the previous post (I won't re-hash them here, and I think that ethical discussion should probably be a dedicated post here - I could rant, but this probably isn't the right place lol), here's the short version:

  • The work that was done with Zephyr-7b-Beta is important. Having a model that's generally usable but lacking in broad alignment is really important for a lot more reasons than just having the sexy roleplay time with an LLM.

  • I used sections of a publicly available data set derived from copyrighted works in a private model. Copyright-wise, I think that one could make an argument for fair use there, but like... copyright is so far behind that it's almost laughable. Still, worth considering.

  • Privacy is a concern, and I still haven't built any safeguards into this system to ensure that any sensitive entries in the journal aren't just repeated verbatim in an output. That's a big one.

  • On that last point, those journal entries and simulated conversation are still passed to the OpenAI API (which frankly might lend itself to some alignment issues). I haven't run into any issues with moderation yet, and don't really expect that I will, but it's a consideration given that the depth of context that we're looking to encapsulate does touch on some territory that OpenAI isn't likely to play well with.

Moving On...

I think next steps for me would be to move that abstraction layer to a private service. Then, I can look to start actually building an adequate dataset. I did some playing around with ultrachat_200k and while I don't feel that the dataset was particularly fitting for our use case, I think that incorporating those initial prompts into a more comprehensive dataset is a worthwhile endeavor for the sake of getting strong topical coverage. If we kick off the first prompt and pass the rest of the 'user' conversation to some other LLM, I feel that we are likely to end up with a data set that is pretty robust.

Ultrachat Example
"Lol you do it"

I'm definitely still on the hunt for additional datasets to start from as well. After that, I'll probably see about mimicking the DPO steps that were taken with Zephyr-7b BUT I'd definitely have to see if there's some way that I can do that with human feedback (re: mine) at a reasonable rate. Still not quite sure how well I'd be able to build a pipeline to move through that manually to get our alignment dialed in, but I think it's a reasonable place to start.
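The one piece of that I can sketch out already is the shape of the preference data - trl's DPOTrainer wants prompt/chosen/rejected triples, so the manual review pass would basically just be me picking between two candidate generations per prompt. The helper below is hypothetical, purely to show the shape:

```python
import json

def record_preference(prompt: str, candidate_a: str, candidate_b: str,
                      pick_a: bool, path: str = "dpo_pairs.jsonl") -> None:
    # Whichever candidate I pick becomes "chosen"; the other is "rejected".
    chosen, rejected = (candidate_a, candidate_b) if pick_a else (candidate_b, candidate_a)
    with open(path, "a") as f:
        f.write(json.dumps({"prompt": prompt, "chosen": chosen, "rejected": rejected}) + "\n")
```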

I'm not sure if this is the time to get into it yet or not, but it might be worthwhile to see if I can build this dataset in a way that's a little more interoperable. Multi-user chat and really complex conversational structures are things that I want to accommodate, and I think it bears some consideration as to whether or not it's doable to try to adapt that into a structure that'll fit more conventional chat structures.

Finally, I'd like to build some more complex function-calling (or some equivalent) into it, which would undoubtedly require the development of a secondary training dataset. I did play around with trying to fine tune the model to understand some basic chat commands for retrieval, but results were pretty 'meh'. I need to test this on a smaller scale, tbh.

Conclusion

The end results are pretty fun, and quite a bit more coherent over the course of a conversation than my initial attempts at training up a "digital twin" model. It can talk about recent happenings a little better without that same "LLM drawl" (what I'm calling it) where LLMs try to be helpful chatbots and trip over themselves to demonstrate their contextual awareness by injecting suppositional info into a conversation. That last one sounded a bit vague, I'm sure, but if you've ever typed your name into your ChatGPT custom instructions, you know exactly what I mean.

It's clear how far the recent improvements have brought this system! It went from something that kinda sorta wrote outputs that *sounded* like me, to a system that demonstrates contextual awareness and some conversational nuance that I'd be likely to inject. With the addition of some quirky (if a bit dull-sounding) TTS, my closest friends and family have noted that they likely wouldn't really be able to tell a difference if I just put it on the phone with them, which is pretty cool.

Anyway, if you want to discuss at all, feel free to jump into the Discord as usual ✌ We've got a machine learning channel, and I'll fire up this system every once in a while for some kooky conversations with others.