Introduction to Generative AI

Generative AI (GenAI) has gained a lot of attention thanks to ChatGPT, but it remains a complex topic that many people find confusing. This guide will help you learn the basics of GenAI.

What is Generative AI?

Generative AI (GenAI) refers to deep-learning models that generate content such as text, images, and more based on their training data. These models create new data resembling their training examples. You might know LLMs or Multimodal Models—these are all forms of GenAI.


What are the use-cases of GenAI?

GenAI models can be used for many applications in different fields. Here are some examples:

  • Content Creation: These models can create new pieces of text, music, or artwork. For example, AI could make music for a video game, write a script for a movie, or generate articles or reports.

  • Chatbots, Virtual Assistants, Co-pilots: GenAI models can be used to make conversational agents that can talk to users, creating responses to user questions in a natural, human-like way.

  • Image Generation and Editing: Generative Adversarial Networks (GANs) can create realistic images, make graphics, or even change existing images in big ways, such as changing day to night or making a person's image in the style of a specific artist.

  • Data Augmentation: In situations where data is rare, GenAI models can be used to create fake data to add to real data for training other machine learning models.

  • Product Design: AI can be used to make new product designs or change existing ones, possibly making the design process faster and bringing new possibilities that human designers might not think of.

  • Medical Applications: GenAI can be used to make fake medical data, imitate patient conditions, or forecast the progress of diseases.

  • Personalized Recommendations: AI models can create personalized content or product suggestions based on user data.

  • Video Games: In the gaming industry, AI can be used to make new levels, characters, or whole environments. This can make games more varied and fun, as new content can be made on the spot.
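The "Data Augmentation" use-case above can be sketched in a few lines. This is a minimal illustration only: simple noise-based perturbation stands in for sampling from a trained generative model, and all names and values here are hypothetical.

```python
import random

def augment(samples, copies=2, noise=0.05, seed=42):
    """Create synthetic variants of numeric samples by adding small
    random noise -- a stand-in for sampling from a trained GenAI model."""
    rng = random.Random(seed)
    synthetic = []
    for sample in samples:
        for _ in range(copies):
            synthetic.append([x + rng.gauss(0, noise) for x in sample])
    return synthetic

real = [[1.0, 2.0], [3.0, 4.0]]
fake = augment(real)
print(len(fake))  # 4 synthetic samples generated from 2 real ones
```

The synthetic samples stay close to the real ones, so they can be mixed into the training set for another model when real data is scarce.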


What is an LLM?

Large Language Models (LLMs) are a kind of artificial intelligence model that uses machine learning to generate human-like text. Many providers offer LLMs; for example, GPT-4 is created by OpenAI.

LLMs are trained on huge amounts of text data and can generate sentences by guessing the probability of a word based on the previous words used in the text. They can be adjusted for different tasks, such as translation, question-answering, and writing help. These models are described as "Large" because they have a lot of parameters. For example, GPT-3 has 175 billion parameters. The large number of parameters lets these models learn a broad range of language patterns and nuances, but also makes them hard to train and use.
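The core idea of "guessing the probability of a word based on the previous words" can be sketched with a toy bigram model. This is only an illustration of the principle: real LLMs are transformer networks trained on billions of documents and operate on subword tokens, not single-word counts.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM is trained on billions of documents.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_probs(prev):
    """Probability of each candidate next word, given the previous word."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Generating text is then just repeatedly sampling a next word from these probabilities and appending it to the context.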

Some examples of LLMs are:

  • GPT-3 and GPT-4 by OpenAI

  • PaLM by Google

  • LLaMA by Meta

  • Claude by Anthropic

  • And more...


What is an Embedding?

An embedding is a way of representing complex data, such as words, sentences, or images, as numbers that computers can understand. Embeddings can capture the meaning and context of the data, and how it relates to other data. For example, word embeddings are numbers that show the meaning and usage of words in a language. Embeddings are often used in machine learning, such as natural language processing and computer vision, to train models that can perform tasks like translation, question-answering, or image recognition.
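To make this concrete, here is a minimal sketch of comparing embeddings with cosine similarity. The 4-dimensional vectors are hypothetical toy values; real embedding models produce vectors with hundreds or thousands of dimensions learned from data.

```python
import math

# Hypothetical toy embeddings; real models learn these from data
# and use far more dimensions.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: near 1.0 means very similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low
```

Because related words end up with similar vectors, arithmetic on embeddings lets a model reason about meaning rather than raw characters.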


What is a Vector Database?

A Vector Database (or vector store) is a database designed to store and search vector data, such as embeddings. Vector data is often used in fields like natural language processing and computer vision, where high-dimensional vectors represent complex data like words, sentences, or images. Vector stores are usually optimized for operations that are common in machine learning, like nearest neighbor search, which means finding the vectors in the store that are closest to a given query vector. This is very useful in tasks like recommendation systems, where you might want to find the items that are most similar to a given item.
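A minimal in-memory vector store with brute-force nearest-neighbor search illustrates the idea. The item ids and vectors below are made up for the example; production vector databases use approximate indexes (such as HNSW) to scale to millions of vectors.

```python
class VectorStore:
    """Toy vector store: exact nearest-neighbor search by brute force."""

    def __init__(self):
        self.items = {}  # id -> vector

    def add(self, item_id, vector):
        self.items[item_id] = vector

    def nearest(self, query, k=1):
        """Return the k item ids closest to the query (Euclidean distance)."""
        def distance(vector):
            return sum((q - v) ** 2 for q, v in zip(query, vector)) ** 0.5
        ranked = sorted(self.items, key=lambda i: distance(self.items[i]))
        return ranked[:k]

store = VectorStore()
store.add("article-1", [1.0, 0.0])
store.add("article-2", [0.9, 0.1])
store.add("article-3", [0.0, 1.0])
print(store.nearest([1.0, 0.05], k=2))  # ['article-1', 'article-2']
```

In a recommendation setting, the query vector would be the embedding of the item (or user) you want matches for, and the returned ids would be the most similar items.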


What is a Multimodal Model?

A Multimodal Model in artificial intelligence is a model that can handle and combine data from different kinds of media, for example text, images, audio, and video.

The main advantage of Multimodal Models is that they can leverage the strengths of different data types to make better predictions, or generate new content. For example, a model that takes both text and image data as input might be able to understand the context better than a model that only uses one or the other. Some examples of multimodal models are:

  • Multimodal Deep Learning: This is a machine learning subfield that aims to train AI models to process and find relationships between different types of data, such as images, video, audio, and text. Multimodal deep learning models often use complex neural network architectures, such as transformers, to fuse data from different modalities into a unified representation.

  • Generative Adversarial Networks (GANs): These are a type of GenAI model that can create realistic images, design graphics, or even modify existing images in significant ways, such as changing day to night or generating a person's image in the style of a specific artist. GANs consist of two competing neural networks, a generator and a discriminator, that learn from each other. GANs can also be extended to other modalities, such as text, audio, or video.

  • Multimodal Chatbots and Virtual Assistants: These are conversational agents that can interact with users using multiple modalities, such as speech, text, images, or gestures. Multimodal chatbots and virtual assistants can provide a more natural and engaging user experience, as they can understand and respond to user queries in a human-like manner. They can also perform tasks such as booking a flight, ordering food, or playing music.

  • Multimodal Recommendation Systems: These are systems that can generate personalized content or product recommendations based on user data from multiple modalities, such as text, images, audio, or video. Multimodal recommendation systems can improve user satisfaction and retention, as they can capture user preferences and interests more accurately and comprehensively.

  • Multimodal Video Analysis: This is a computer vision task that involves analyzing video data, which combines both visual and auditory data, for various purposes, such as object detection, face recognition, action recognition, scene understanding, or video summarization. Multimodal video analysis can benefit from the complementary information provided by different modalities, such as the sound of a car engine or the expression of a person's face.
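The "unified representation" mentioned above can be sketched in its simplest form: late fusion by concatenating per-modality embeddings. All vectors here are hypothetical, and real multimodal models typically learn the fusion, for example with cross-attention in a transformer, rather than simply concatenating.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so no modality dominates."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]

def fuse(text_embedding, image_embedding):
    """Late fusion by concatenation: join both modalities into one
    representation that a downstream model can consume."""
    return normalize(text_embedding) + normalize(image_embedding)

text_vec = [0.2, 0.8, 0.1]  # hypothetical text features
image_vec = [0.9, 0.3]      # hypothetical image features
fused = fuse(text_vec, image_vec)
print(len(fused))  # 5: dimensions of both modalities combined
```

A classifier trained on such fused vectors can exploit signals from both modalities at once, which is the core advantage described above.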


What is the memory of an LLM?

In the context of a Large Language Model (LLM), "memory" refers to how much of the preceding text the model can take into account when generating new text. Memory is a different concept from the training data: the training data determines what the model knows, while its memory (often called the context window) determines how much of the current conversation or document it can use.

When you chat with ChatGPT, the model answers your questions while taking the previous questions and answers into account. This "memory" of the model is crucial when dealing with long pieces of text or conversations, as it determines how much of the prior context the model can use to produce accurate and coherent responses.
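Because the context window is finite, chat applications typically trim old messages so the conversation still fits. Here is a minimal sketch of that idea; tokens are approximated as whitespace-separated words, whereas real LLMs use subword tokenizers, and the message strings are made up for the example.

```python
def fit_to_context(messages, max_tokens=50):
    """Keep the most recent messages that fit in the model's context
    window. Tokens are approximated as whitespace-separated words."""
    kept, used = [], 0
    for message in reversed(messages):  # newest first
        tokens = len(message.split())
        if used + tokens > max_tokens:
            break
        kept.append(message)
        used += tokens
    return list(reversed(kept))  # restore chronological order

history = [
    "User: hello there",
    "Assistant: hi, how can I help?",
    "User: explain embeddings in one sentence",
]
print(fit_to_context(history, max_tokens=12))  # drops the oldest message
```

Anything trimmed this way is effectively forgotten by the model, which is why very long conversations can lose track of early details.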
