
Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM

Introduction

In July 2023, Meta took a bold stance in the generative AI space by open-sourcing its large language model (LLM) Llama 2, making it available free of charge for research and commercial use (the main license restriction applies only to companies with more than 700 million monthly active users). In contrast, OpenAI’s GPT-n models, such as GPT-4, are proprietary: their inner workings are hidden from the public.

In this article, we’ll explore the differences between Meta’s Llama 2 and OpenAI’s GPT-4:

  • Model releases and architectures
  • LLM benchmarks
  • Model access methods
  • Deciding factors on which LLM to use for a project

Here’s a refresher on generative AI and LLMs.

Model Releases & Architectures

Dimension          | Llama 2 (Open-source)             | GPT-4 (Closed-source)
Model size         | 7B, 13B, and 70B parameters       | ~1.76T parameters (estimated)
Max context length | 4,096 tokens                      | 8,192 or 32,768 tokens
Model versions     | Llama 2, Llama 2 Chat, Code Llama | GPT-4
Modalities         | Text                              | Text & image

Llama 2 vs. GPT-4 summary comparison table. Source: Author

Llama 2

The Llama 2 model comes in three size variants (based on billions of parameters): 7B, 13B, and 70B. Recall that parameters, in machine learning, are the values a model learns during training, which together act as the model’s “knowledge bank.” The smaller variants run faster but produce lower-quality output.

The model has been pre-trained with publicly available online data sources, such as Common Crawl (an archive of billions of webpages), Wikipedia (a community-based online encyclopedia), and Project Gutenberg (a collection of public domain books). At the time of writing, the pre-training data has a cutoff date of September 2022, although some fine-tuning data is more recent, up to July 2023.

These training data sources were split into tokens (sequences of characters, words, subwords, or other segments of text or code) for processing. The model supports a maximum context length of 4,096 tokens, which can be likened to the model’s “attention span”: how much text it can consider at once.
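To see tokens in practice, here is a quick, hedged illustration using GPT-2’s openly available tokenizer from Hugging Face Transformers (Llama 2 ships its own tokenizer, so the exact splits differ, but the principle is the same):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is ungated and easy to try; Llama 2 uses its own,
# but both split text into subword tokens in the same spirit.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Context length caps how much text fits.")
print(tokens)       # subword pieces, e.g. ['Context', 'Ġlength', ...]
print(len(tokens))  # each piece counts against the 4,096-token window
```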

The model also comes in a chat version, Llama 2 Chat, for the three sizes: 7B-chat, 13B-chat, and 70B-chat. The chat version has been fine-tuned or optimized for dialogue use cases, similar to OpenAI’s ChatGPT. The fine-tuning utilizes publicly available instruction datasets and human evaluations via reinforcement learning from human feedback (RLHF).

The coding version, Code Llama, is built on top of Llama 2 and fine-tuned for programming tasks. It was released in three sizes: 7B, 13B, and 34B parameters. There are three variants: Code Llama (foundational code model), Code Llama - Python (specialized for Python), and Code Llama - Instruct (fine-tuned for understanding natural language instructions).

Architecture

Llama 2 is a single-modality LLM that accepts text input only. Like GPT-4, Llama 2 is based on an auto-regressive, or decoder-only, transformer with modifications. In an auto-regressive model, each output token is predicted from the tokens that came before it, using a uni-directional context (text is processed in one direction only, either forward or backward).
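In code, auto-regressive generation is simply a loop that feeds the model’s output back in as context. A minimal sketch, where `model` is a hypothetical stand-in for a trained next-token predictor:

```python
# Minimal sketch of auto-regressive (one-direction) text generation.
# `model` is a hypothetical function: given the tokens so far, it
# returns the most likely next token.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)  # prediction sees only prior tokens
        tokens.append(next_token)   # the output becomes part of the context
    return tokens
```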

 


Diagram of processing words in a forward direction context. Source: https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335


Diagram of processing words in a backward direction context. Source: https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335

One optimization in Llama 2’s larger variants is grouped-query attention (GQA), in which groups of query heads share a single key and value head. This shrinks the key-value cache and speeds up inference without a significant loss in quality.
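Below is a minimal PyTorch sketch of the GQA idea, written for illustration rather than taken from Meta’s implementation (the tensor shapes and function names are assumptions):

```python
import torch

def grouped_query_attention(q, k, v, num_kv_heads):
    """q: (batch, num_q_heads, seq, dim); k, v: (batch, num_kv_heads, seq, dim)."""
    _, num_q_heads, seq, dim = q.shape
    group_size = num_q_heads // num_kv_heads
    # Each group of query heads shares one key/value head: repeat K and V
    # so they line up with their group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / dim ** 0.5
    # Causal mask: each position attends only to itself and earlier tokens.
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 key/value heads (4 queries per group).
q = torch.randn(1, 8, 16, 64)
kv = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, kv, kv, num_kv_heads=2).shape)  # (1, 8, 16, 64)
```

Because only 2 key/value heads are stored instead of 8, the key-value cache kept around during generation is a quarter of the size, which is where the inference speedup comes from.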

GPT-4

The GPT-4 model offers variants based on maximum context length: 8K (8,192) and 32K (32,768) tokens. OpenAI has not disclosed GPT-4’s size; it is widely estimated at roughly 1.76 trillion parameters.

GPT-4 was pre-trained using publicly available data, including internet data and data licensed to OpenAI. According to OpenAI, the collection of data includes: “correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas.” At the time of writing, the training data is current up to September 2021.

OpenAI incorporated more human evaluations in the training of GPT-4, encompassing ChatGPT user feedback and AI safety and security expert feedback.

Architecture

GPT-4 is a multimodal transformer-based decoder-only LLM that is capable of accepting image and text inputs and generating text outputs. It can perform advanced reasoning tasks, including code and math generation.

LLM Benchmarks

 


Performance benchmarks for LLMs (higher scores indicate better performance). Source: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Speed and Efficiency

As GPT-4 is a closed-source model, its inner details are undisclosed. Still, comparing model sizes, Llama 2’s largest variant (70B) is only about 4% of GPT-4’s estimated 1.76T parameters (70 / 1,760 ≈ 0.04). Size isn’t the only factor impacting speed and efficiency, but it provides a general indication that Llama 2 may run faster and more efficiently than GPT-4.

Task Complexity

The Massive Multitask Language Understanding (MMLU) benchmark measures an LLM’s multi-task accuracy and natural language understanding (NLU). The test encompasses 57 tasks covering a breadth of topics, including hard sciences like mathematics and computer science and social sciences like history and law, at depths ranging from basic to advanced. To score high accuracy, a model must have comprehensive knowledge and proficient problem-solving ability.

Looking at the MMLU (5-shot) scores, where “5-shot” means the prompt includes five worked examples before the actual question, GPT-4 ranks higher at 86.4% compared to Llama 2 at 68.9%. Hence, GPT-4 handles complex tasks with higher accuracy than Llama 2.
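The “n-shot” terminology recurs in the benchmarks below, so here is a hedged sketch of how a few-shot prompt is assembled (the questions are made up, not actual MMLU items):

```python
# Build an n-shot prompt: n worked examples followed by the real question.
examples = [
    ("What is 2 + 2?", "4"),
    ("Which planet is closest to the sun?", "Mercury"),
    # ...in a true 5-shot setting, three more Q/A pairs go here
]
question = "In what year did World War II end?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)
```

A 0-shot benchmark, by contrast, poses the question with no worked examples at all.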

Coding

The HumanEval benchmark measures an LLM’s coding abilities. This test dataset by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The coding problems are written in Python, and the comments and docstrings contain natural-language text in English.
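To illustrate the format, here is a made-up problem in the HumanEval style (not an actual dataset item): the model receives the signature and docstring and must produce the body, which is then checked against unit tests.

```python
# A made-up problem in the HumanEval style (not from the actual dataset).
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Generated code is scored by whether unit tests like this one pass.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```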

Referencing the HumanEval (0-shot) scores, GPT-4 ranks higher at 67.0% compared to Llama 2 at 29.9%. Code Llama does considerably better than base Llama 2, scoring 53.0%, but GPT-4 still outperforms both Code Llama and Llama 2 in programming ability.

Math Reasoning

GSM8K is a dataset consisting of “8.5K high quality linguistically diverse grade school math word problems” released by OpenAI. It aims to evaluate an LLM’s capability to perform multi-step mathematical reasoning. The math problems involve 2-8 steps and require only basic arithmetic operations (+ - / *).
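To make that concrete, here is a made-up problem in the GSM8K style (not an actual dataset item), with the solution steps written out as arithmetic:

```python
# "A baker makes 3 trays of 12 muffins and sells 20. How many are left?"
muffins_baked = 3 * 12              # step 1: 36 muffins baked
muffins_left = muffins_baked - 20   # step 2: 16 muffins remain
assert muffins_left == 16
```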

The GSM8K (8-shot) scores show GPT-4 at 92.0% and Llama 2 at 56.8%. Thus, GPT-4 performs better than Llama 2 in math reasoning tasks.

Multilingual Support

According to the Llama 2 research paper, 89.7% of the model’s pre-training data is in English. Hence, the model will likely perform best for English use cases and may not be suitable for multilingual ones.

Pie chart of the language distribution in Llama 2’s pre-training data. Source: https://slator.com/meta-warns-large-language-model-may-not-be-suitable-non-english-use/

GPT-4 has more robust support for languages besides English. As a gauge, OpenAI translated the MMLU benchmark into a variety of languages, and GPT-4 achieved high scores across many of them.

GPT-4 three-shot MMLU accuracy across multiple languages. Source: https://openai.com/research/gpt-4

Overall, GPT-4 offers better multilingual support than Llama 2.

Model Access Methods

With Meta’s open-source Llama 2, inner details are disclosed to the public. You can read the research paper detailing the creation and training of the model, and the model weights are available for download, so you can inspect the actual code. The paper states that the pre-training dataset comes from publicly available online sources, but the exact dataset itself is unavailable. There are three main ways to run the model:

  • Locally, by installing libraries such as Hugging Face Transformers and downloading the weights to your own machine (check the storage and compute requirements beforehand); a minimal sketch follows this list.
  • On cloud infrastructure, such as Microsoft Azure or Amazon Web Services, via a machine learning hub like Hugging Face or Amazon SageMaker.
  • Through Hugging Face’s Inference API, which lets developers make API calls to over 100,000 shared models, including Llama 2.
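Here is a hedged sketch of the local route with Hugging Face Transformers. It assumes you have requested access to the gated meta-llama weights and authenticated with a Hugging Face token, and it glosses over hardware setup (even the 7B variant needs a capable GPU or ample RAM):

```python
from transformers import pipeline

# Smallest chat-tuned variant; larger ones need proportionally more memory.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
)
result = generator(
    "Explain grouped-query attention in one sentence.",
    max_new_tokens=100,
)
print(result[0]["generated_text"])
```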

With OpenAI’s closed-source GPT-4, access is available via the company’s official API. Researchers and businesses have the option to fine-tune some OpenAI models, but internal mechanisms remain a black box from the public’s point of view. As of September 2023, fine-tuning is available only for GPT-3.5 Turbo, not GPT-4.
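A hedged sketch of calling GPT-4 through the official openai Python library, using the v0.x interface that was current as of September 2023:

```python
import os
import openai

# Authenticate with an API key kept outside the source code.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Summarize Llama 2 in one sentence."},
    ],
)
print(response["choices"][0]["message"]["content"])
```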

Transparency and Community

By taking an open approach with Llama 2, Meta makes it substantially easier for other companies to develop AI-powered applications without being locked in to huge players like OpenAI and Google. A transparent, open-source stance also allows crowdsourcing to make generative AI safer and more helpful.

In comparison, with a closed-source model, OpenAI retains a competitive business advantage through proprietary technology. Innovation and experimentation happen in-house, which requires significant financial, computational, and human resources. OpenAI does, however, have a developer community around its API and plugin development.

Costs

For Llama 2, the model itself has no fees. The cost depends on the cloud infrastructure platform and usage of compute resources. For example, pricing on Hugging Face (as of September 2023):

Hugging Face Inference Endpoints pricing table. Source: https://huggingface.co/pricing

For GPT-4, the cost depends on the model variation and the token counts of the input and output. Note that there are additional charges for fine-tuning a GPT model. Here is OpenAI’s pricing information (as of September 2023), with a quick cost-estimation sketch after the table:

Model               | Input             | Output
GPT-4 (8K context)  | $0.03 / 1K tokens | $0.06 / 1K tokens
GPT-4 (32K context) | $0.06 / 1K tokens | $0.12 / 1K tokens

GPT-4 pricing table. Source: https://openai.com/pricing
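To make the pricing concrete, here is a small sketch that estimates spend from token counts for the 8K-context model (the traffic figures are made up for illustration):

```python
# September 2023 prices for GPT-4 with 8K context, from the table above.
INPUT_PRICE_PER_1K = 0.03   # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.06  # USD per 1K output tokens

def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# Hypothetical workload: 1M requests averaging 500 input / 250 output tokens.
print(f"${estimate_request_cost(500, 250) * 1_000_000:,.0f}")  # ≈ $30,000
```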

Data Privacy and Security

Meta has highlighted that no private or personal information has been used in the training of its Llama 2 model. In the research paper, it stated that the training data excludes data from Meta’s products or services and effort was made “to remove data from certain sites known to contain a high volume of personal information about private individuals.” The team also conducted red-teaming exercises for safety by simulating adversarial attacks. The data retention policy will depend on your cloud infrastructure provider and settings. For instance, Amazon SageMaker states that it “does not use or share customer models, training data, or algorithms … [You] maintain ownership of your content, and you select which AWS services can process, store, and host your content.”

On its privacy webpage, OpenAI states that it’s “committed to privacy and security” for its API Platform. The company asserts that its models don’t train on clients’ business data nor learn from clients’ usage. However, for abuse monitoring, OpenAI “may securely retain API inputs and outputs for up to 30 days.” You have the option of requesting zero data retention (ZDR) for eligible endpoints. OpenAI maintains security compliance measures such as data encryption and a SOC 2 Type 2 audit.

For both models, secure and authenticate your API calls by setting up and using API keys. OpenAI and cloud infrastructure providers have standard procedures to create and manage them.
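A minimal sketch of the usual pattern, reading the key from an environment variable instead of hardcoding it (OPENAI_API_KEY is OpenAI’s conventional variable name):

```python
import os

# Keep keys out of source control: load them from the environment
# (set via your shell, a .env loader, or a secrets manager).
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
```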

Deciding Factors on Which LLM to Use

Number of Requests (Price and Performance)

Based on the pricing structure presented above for Llama 2 and GPT-4, you can estimate the cost based on the anticipated amount of usage or requests.

If you project a large number of API calls, you will need more powerful computing hardware – for example, GPU over CPU, more processor cores, and more memory – for your cloud infrastructure hosting the open-source model, increasing your cost. Also, ensure that infrastructure autoscaling is set up to handle a rise in usage and maintain steady performance.

With the closed-source model, you need to be aware of API rate limits and usage quotas. For instance, OpenAI imposes rate limits in three ways: RPM (requests per minute), RPD (requests per day), and TPM (tokens per minute). According to OpenAI’s quota policy, “You’ll be granted an initial spend limit, or quota, and we’ll increase that limit over time as you build a track record with your application.” You can fill out a form requesting an increase in your rate limits or usage quota. You are relying on OpenAI to manage the aggregate load on its infrastructure and maintain smooth performance for all users.
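When a request does get throttled, the standard client-side remedy is to retry with exponential backoff. A hedged sketch, where call_api stands in for any rate-limited request (the names are illustrative):

```python
import random
import time

def with_backoff(call_api, max_retries=5):
    for attempt in range(max_retries):
        try:
            return call_api()
        except Exception:  # in practice, catch the library's rate-limit error
            # Wait 1s, 2s, 4s, ... plus random jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate-limited after retries.")
```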

An important thing to note is that if you become entirely dependent on a closed-source model, you are more susceptible to pricing changes set by the company. With an open-source model, there are more cloud infrastructure providers to choose from.

Complexity of Product/Service (Use Case)

As seen in the benchmark testing, GPT-4 scores higher for multi-task complexity than Llama 2.

Let’s say your use case is to build a chatbot for a business. For a small company whose chatbot logic has medium complexity, the Llama 2 model will suffice. For a large enterprise, where the chatbot needs to handle more complex conversations and provide professional responses, the more advanced GPT-4 model is suitable. Likewise, if the chatbot needs to converse in multiple languages, GPT-4 is the appropriate choice given its robust multilingual support.

Level of Risk (Accuracy and Performance)

GPT-4 has higher multi-task accuracy abilities than Llama 2, based on the MMLU benchmark. GPT-4 also outperforms Llama 2 in the areas of coding and math reasoning. Additionally, GPT-4 offers solid support for multiple languages.

If your project is business-critical and high-priority requiring advanced reasoning and problem-solving, then the GPT-4 model is a better fit – for instance, an AI assistant that deals with customer support before escalating to a human agent. You want to minimize risk with a higher-performing LLM, even though there may be tradeoffs in costs, speed, and efficiency.

A project that serves a lower-risk purpose can benefit from the Llama 2 model – for example, a tool that generates simple social media content for non-business accounts. You get moderate performance with potentially lower cost and higher speed and efficiency.

You need to evaluate your project’s risk tolerance before picking the appropriate LLM.

Summary

We have examined the LLM releases for Llama 2 and GPT-4 along the dimensions of model size, context length, modalities, and architecture. Comparing benchmarks, GPT-4 leads in most categories, including task complexity, coding, math reasoning, and multilingual support. The access methods differ between the open-source Llama 2 and the proprietary GPT-4, with implications for transparency, costs, data privacy, and security. Each model has its strengths and weaknesses; the choice between Llama 2 and GPT-4 ultimately depends on a project’s usage volume, complexity requirements, and specific use case.

Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.