Personal AI Announces MODEL-3: The Next Evolution of Enterprise AI Teams

Jan 23, 2025 | SAN DIEGO, CALIFORNIA

In MODEL-3, we are introducing new capabilities tailored to enterprise needs in the legal, finance, and healthcare industries. Our upgraded multi-memory system recalls relevant interactions from previous sessions (e.g., patient visits), enabling comprehensive decision making; our upgraded multi-modal retrieval capabilities interpret and recall visual information alongside text responses, maintaining the integrity of the original source data (e.g., charts, graphs, tables); and our new multi-AI system enables collaboration among multiple humans and AI agents across end-to-end vertical workflows.

Multi-Memory - Longitudinal Multi-Session Memory

Imagine going to your favorite restaurant, where the maître d’ not only remembers your name but also the dishes you ordered during that special occasion last year. Human relationships are built on memories shared and collected across multiple interactions.

In a typical single-AI system (e.g., ChatGPT, Claude), the AI has access to long-term knowledge encapsulated in the LLM and to short-term conversational memory consisting of the current conversation history with the user, which is limited to the single session.

This “forgetfulness” between sessions limits the utility of the AI to a single session of interactions. 

In MODEL-3, we introduce an upgraded memory system capable of persisting memory across multiple conversational sessions between the AI and the user – a multi-conversational memory. The AI’s memory context extends across a longitudinal view of every session, so it can refer back to previous conversations where similar context occurred.

The multi-memory system encodes the user’s entire conversational history with the AI and is accessed during every session. Because the history is time-aware and context-aware, the AI can refer to past conversations related to the current prompt and construct consolidated responses.
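
To make the idea concrete, here is a minimal sketch of how a time- and context-aware multi-session memory store might work. The class and method names (MultiSessionMemory, recall, etc.) are illustrative assumptions, not Personal AI's actual API; the embedder is passed in as a plain function.

```python
from dataclasses import dataclass, field
from datetime import datetime


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb or 1.0)


@dataclass
class MemoryEntry:
    """A single remembered exchange, tagged with its session and timestamp."""
    session_id: str
    timestamp: datetime
    text: str
    embedding: list[float]


@dataclass
class MultiSessionMemory:
    """Illustrative store that persists memories across sessions."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def add(self, session_id: str, text: str, embed) -> None:
        # Every exchange is stored with its session and time, so later
        # recall can reason longitudinally (e.g., lab values over visits).
        self.entries.append(
            MemoryEntry(session_id, datetime.now(), text, embed(text))
        )

    def recall(self, query: str, embed, top_k: int = 3) -> list[MemoryEntry]:
        """Return the top_k most relevant memories from *any* past session."""
        q = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda m: -cosine_similarity(q, m.embedding),
        )
        return ranked[:top_k]
```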

Being able to “remember” between sessions expands the utility of the AI in critical industries such as healthcare, coaching, and customer service, where information carried from session to session is pertinent for ongoing decision making. The example below shows a patient interacting with their clinician and the clinician’s AI persona across three separate visits. With single-session memory, the AI only remembers lab values (e.g., cholesterol) discussed in the same session; with multi-session memory, it recalls similar lab values from previous visits and analyzes the patient’s longitudinal trends.

Multi-Modal - Combined Memory Retrieval

Current multi-modal systems support two familiar modes of retrieval and generation.

In the first use case, the user’s intent is to retrieve or interpret existing images or audio that match the prompt, where the input itself can be multi-modal: text, image, or audio. This is known as semantic search over images or audio (e.g., “show me pictures of dogs”). In this case, the user’s query is converted to a query vector that is matched against the vectorized images or audio.
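
Below is a small sketch of this query-vector matching, assuming a CLIP-style dual encoder so that text and image embeddings share one space. The function name, the `embed_text` callable, and the pre-computed `image_index` are all assumptions for illustration.

```python
import numpy as np


def semantic_image_search(query: str,
                          embed_text,
                          image_index: dict[str, np.ndarray],
                          top_k: int = 5) -> list[str]:
    """Match a text query against pre-computed image embeddings.

    `embed_text` is any text encoder sharing an embedding space with the
    image encoder used to build `image_index` (a CLIP-style dual encoder,
    for example). Both are illustrative assumptions, not a specific API.
    """
    q = embed_text(query)
    q = q / np.linalg.norm(q)

    # Cosine similarity between the query vector and each image vector.
    scores = {
        path: float(np.dot(q, vec / np.linalg.norm(vec)))
        for path, vec in image_index.items()
    }
    # Highest-scoring images first, e.g. for "show me pictures of dogs".
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```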

In the second use case, the user’s intent is to generate or edit images that the AI creates, where the output can be multi-modal: text, image, or audio. This is the generative approach, typically built on diffusion techniques. The user’s prompt is encoded by a CLIP model trained to score similarity between text and images (e.g., trained on image–caption pairs), and a GAN or diffusion model iteratively generates images that move closer and closer to the user’s prompt.
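
As a minimal, hedged example of this prompt-guided generation loop, the open-source Hugging Face diffusers library can serve as a stand-in for the CLIP-conditioned diffusion process described above; the checkpoint name is simply a public example.

```python
import torch
from diffusers import StableDiffusionPipeline

# Public example checkpoint; any compatible text-to-image model would do.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The CLIP text encoder scores the prompt; the denoising model iteratively
# refines latents toward an image that matches it.
image = pipe("a watercolor painting of a golden retriever").images[0]
image.save("dog.png")
```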

In the Enterprise domain, most of the data is multi-modal, combining text with images or audio (e.g., PowerPoint decks, PDFs). Therefore, a third use case is critical for Enterprises: the ability to interpret and retrieve images/audio and text in a combined context.

Naive chunking and processing of Enterprise data introduce inaccuracies and incomplete AI responses. For an 800-page legal document, overlapping chunks would create an incorrect cross-over between different sections of legal codes that cannot be mixed. Simple text extraction from a financial deck full of tables and charts would yield a string of numbers without the context of axes and row labels.

In MODEL-3, we introduce both the ability to chunk and interpret images/audio together with their corresponding text and the ability to retrieve the precise images along with textual responses (e.g., the information contained on a slide).

Our multi-modal memory system first pre-processes semi-structured formats such as PDFs, slides, and CSVs into structured semantic sections and chunks. Each chunk includes multi-modal input such as text (e.g., the steps of a user manual) and images (e.g., screenshots from that manual) as one unified memory that maintains the strong relationships between text, image, and audio. These multi-modal chunks are embedded and processed together so the model can retrieve them later. The output is a combined textual and visual response containing the relevant text and images.
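
The sketch below illustrates one way such unified chunks might be represented and built, assuming an upstream parser has already split a deck into slides. The data class, field names, and the `embed` callable are hypothetical, not Personal AI's internal schema.

```python
from dataclasses import dataclass


@dataclass
class MultiModalChunk:
    """One semantic section of a document, kept as a single memory unit.

    Text and the images that belong with it (a chart and its caption, a
    manual step and its screenshot) are never split into separate chunks.
    """
    doc_id: str
    section: str
    text: str
    image_paths: list[str]
    embedding: list[float] | None = None


def chunk_slide_deck(slides, embed) -> list[MultiModalChunk]:
    """Illustrative chunker: one chunk per slide, title plus body text plus
    every figure on that slide, embedded together for later retrieval.

    `slides` is assumed to be an iterable of (deck_id, index, title, body,
    image_paths) records produced by an upstream PDF/PPTX parser."""
    chunks = []
    for deck_id, idx, title, body, images in slides:
        chunk = MultiModalChunk(
            doc_id=deck_id,
            section=f"slide-{idx}",
            text=f"{title}\n{body}",
            image_paths=list(images),
        )
        # Embed the combined text; image embeddings could be stored
        # alongside in the same chunk rather than in a separate index.
        chunk.embedding = embed(chunk.text)
        chunks.append(chunk)
    return chunks
```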

Being able to retrieve multi-modal responses directly from the input document enables use cases in industries such as financial services, legal, and consulting, where both text and visual information are crucial for maintaining the full context of the retrieved data. In the example below, the charts on the financial deck are images while the title is text: a text-only extraction would capture only the slide title, and an OCR-only extraction would capture raw numbers from the chart, while multi-modal extraction and retrieval maintain the context of the charts by associating each revenue number with its exact timeframe and also showing a visual of the source slide.

Multi-AI - Multi-Agent Collaboration System

Current AI systems are largely single-player: one human and one AI interacting in each conversation. Current agent systems have evolved from linear reflex-agent workflows (e.g., Zapier-style if/then agents) to multi-tool systems that match user intent with different agents, each responsible for a single function (e.g., web browsing, routing agents), as sketched below.
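
Here is a toy sketch of that kind of intent-to-tool routing, where each tool handles a single function. The keyword matching, tool names, and messages are purely illustrative; production routers typically use an LLM or classifier rather than string matching.

```python
from typing import Callable


def route(user_message: str, tools: dict[str, Callable[[str], str]]) -> str:
    """Pick the first single-function agent whose keyword appears in the message."""
    for keyword, tool in tools.items():
        if keyword in user_message.lower():
            return tool(user_message)
    return "No matching tool; answering conversationally."


# Hypothetical single-function agents (names are illustrative).
tools = {
    "stock": lambda msg: "Fetching a single stock price...",   # transactional
    "browse": lambda msg: "Launching a web-browsing agent...",
}

print(route("What is the stock price of ACME?", tools))
```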

These systems are good at single-user collaborative tasks between a human and an AI agent, such as brainstorming, generating content, and analyzing data. Although multi-tool agent systems can perform different tasks, they are transactional in nature (e.g., fetching a single stock price), have limited access to memory, and are heavily limited in performing complex tasks.

Complex workflows in legal, strategy, and finance require multiple AI and human experts to collaborate toward a common goal. Each AI in a multi-agent workflow system can perform its tasks before passing the output to the next agent. In many cases, human input and guidance are also needed to steer the workflow toward the desired output.

In MODEL-3, we introduce a multi-AI, multi-human collaboration system in which each AI has its own independent memory and directive, in addition to a shared task memory, and can carry out multiple interactions with the end user before passing information to the next human or AI agent. This takes place in a channel setting, where multiple conversations are carried in a thread with both human and AI participants.

Each AI has its own long-term memory from internal sources (e.g., patients’ medical records) and expert knowledge from external sources (e.g., legal texts such as a code of civil procedure). Each AI also has its own directives (e.g., guidelines) and its own output action (e.g., producing a timeline of a patient’s medical history). In addition, the multi-AI system operates on transactive memory: a shared memory in which multiple participants collectively store and access knowledge, surpassing the individual memory of any single member. This transactive memory is represented in a conversation thread, where each AI that is mentioned operates on the continued memory of the previous interactions.
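
A minimal sketch of this channel model follows, assuming the thread itself serves as the transactive memory while each persona keeps its own private memory and directive. The Channel, Persona, and Message classes and the canned reply format are illustrative only, not Personal AI's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    author: str          # human name or AI persona name
    text: str


@dataclass
class Persona:
    """An AI participant with its own directive and private memory."""
    name: str
    directive: str                       # e.g. "summarize the police report"
    private_memory: list[str] = field(default_factory=list)

    def respond(self, thread: list[Message]) -> str:
        # Read the shared thread (transactive memory), remember what was
        # seen privately, and act according to this persona's directive.
        context = " | ".join(m.text for m in thread[-5:])
        self.private_memory.append(context)
        return f"[{self.name}] acting on '{self.directive}' given: {context}"


@dataclass
class Channel:
    """A conversation thread shared by humans and AI personas."""
    thread: list[Message] = field(default_factory=list)

    def post(self, author: str, text: str) -> None:
        self.thread.append(Message(author, text))

    def mention(self, persona: Persona) -> str:
        # A mentioned AI operates on the continued memory of the thread.
        reply = persona.respond(self.thread)
        self.post(persona.name, reply)
        return reply


channel = Channel()
channel.post("Client", "I was rear-ended last Tuesday and have a police report.")
intake = Persona("Intake Ivy", "collect the facts relevant to the client's case")
print(channel.mention(intake))
```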

Having humans in the loop also guarantees that results and decisions are reviewed at each key step, which is critical for industries such as legal, healthcare, and finance. In the example below, a multi-AI workflow for a law firm runs from client intake to drafting and involves multiple humans (client and attorney) and multiple AIs (Intake Ivy, Report Roy, Associate Alice). The flow is initiated by the client (human) making an inquiry; Intake Ivy (AI) asks a series of questions and remembers the responses relevant to the client’s case; the attorney receives the police report and Report Roy (AI) extracts and analyzes the relevant details it contains; and Associate Alice (AI) drafts the complaint by combining all the previous information from the client, the intake, the police report, the medical report, and the attorney’s own analysis.

Looking Beyond MODEL-3

Building on the MODEL-3 multi-AI foundation, we are continuously streamlining product usability and experience to simplify human and AI interaction. We’ll be introducing enhanced multi-agent workflows with actions defined in natural language and an advanced system to orchestrate the hierarchy of AI agents deployed in enterprise teams. 

To learn more about how MODEL-3 can transform your enterprise, schedule a demo with us today.

About Personal AI

Personal AI enables businesses to rapidly train and deploy their own AI Personas, delivering 10x the productivity at 1/10th the cost. Each Persona has a specific function, possesses deep proprietary knowledge in that area, and has agentic capabilities for end-to-end AI workflows. For any role in a company, Personas can be trained and managed securely within the Personal AI platform.
