
Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM

Introduction

In July 2023, Meta took a bold stance in the generative AI space by open-sourcing its large language model (LLM) Llama 2, making it available free of charge for research and commercial use (the main license restriction applies only to companies with more than 700 million monthly active users). In contrast, OpenAI’s GPT-n models, such as GPT-4, are proprietary: their inner workings are hidden from the public.

In this article, we’ll explore the differences between Meta’s Llama 2 and OpenAI’s GPT-4:

  • Model releases and architectures
  • LLM benchmarks
  • Model access methods
  • Deciding factors on which LLM to use for a project

Here’s a refresher on generative AI and LLMs.

Model Releases & Architectures

Dimension          | Llama 2 (Open-source)             | GPT-4 (Closed-source)
Model size         | 7B, 13B, and 70B parameters       | ~1.76T parameters (estimated)
Max context length | 4,096 tokens                      | 8,192 or 32,768 tokens
Model versions     | Llama 2, Llama 2 Chat, Code Llama | GPT-4
Modalities         | Text                              | Text & image

Llama 2 vs. GPT-4 summary comparison table. Source: Author

Llama 2

The Llama 2 model comes in three size variants (based on billions of parameters): 7B, 13B, and 70B. Recall that parameters, in machine learning, are the values a model learns during training, which together act as the model’s “knowledge bank.” The smaller variants run faster but produce lower-quality output.

The model has been pre-trained with publicly available online data sources, such as Common Crawl (an archive of billions of webpages), Wikipedia (a community-based online encyclopedia), and Project Gutenberg (a collection of public domain books). At the time of writing, the pre-training data has a cutoff date of September 2022, although some fine-tuning data is more recent, up to July 2023.

These training data sources were split into tokens (sequences of characters, words, subwords, or other segments of text or code) for processing. The model supports a maximum context length of 4,096 tokens, which can be likened to the model’s “attention span”: how much text it can consider at once.
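To see tokens in practice, here is a quick, hedged illustration using GPT-2’s openly available tokenizer from Hugging Face Transformers (Llama 2 ships its own tokenizer, so the exact splits differ, but the principle is the same):

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is ungated and easy to try; Llama 2 uses its own,
# but both split text into subword tokens in the same spirit.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("Context length caps how much text fits.")
print(tokens)       # subword pieces, e.g. ['Context', 'Ġlength', ...]
print(len(tokens))  # each piece counts against the 4,096-token window
```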

The model also comes in a chat version, Llama 2 Chat, for the three sizes: 7B-chat, 13B-chat, and 70B-chat. The chat version has been fine-tuned or optimized for dialogue use cases, similar to OpenAI’s ChatGPT. The fine-tuning utilizes publicly available instruction datasets and human evaluations via reinforcement learning from human feedback (RLHF).

The coding version, Code Llama, is built on top of Llama 2 and fine-tuned for programming tasks. It was released in three sizes: 7B, 13B, and 34B parameters. There are three variants: Code Llama (foundational code model), Code Llama - Python (specialized for Python), and Code Llama - Instruct (fine-tuned for understanding natural language instructions).

Architecture

Llama 2 is a single-modality LLM that accepts text input only. Like GPT-4, Llama 2 is based on an auto-regressive, or decoder-only, transformer with modifications. In an auto-regressive model, each output token is predicted from the tokens that came before it, using a uni-directional context (text is processed in one direction only, either forward or backward).
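In code, auto-regressive generation is simply a loop that feeds the model’s output back in as context. A minimal sketch, where `model` is a hypothetical stand-in for a trained next-token predictor:

```python
# Minimal sketch of auto-regressive (one-direction) text generation.
# `model` is a hypothetical function: given the tokens so far, it
# returns the most likely next token.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model(tokens)  # prediction sees only prior tokens
        tokens.append(next_token)   # the output becomes part of the context
    return tokens
```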

 


Diagram of processing words in a forward direction context. Source: https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335


Diagram of processing words in a backward direction context. Source: https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335

One optimization in Llama 2’s larger variants is grouped-query attention (GQA), in which groups of query heads share a single key and value head. This shrinks the key-value cache and speeds up inference without a significant loss in quality.
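Below is a minimal PyTorch sketch of the GQA idea, written for illustration rather than taken from Meta’s implementation (the tensor shapes and function names are assumptions):

```python
import torch

def grouped_query_attention(q, k, v, num_kv_heads):
    """q: (batch, num_q_heads, seq, dim); k, v: (batch, num_kv_heads, seq, dim)."""
    _, num_q_heads, seq, dim = q.shape
    group_size = num_q_heads // num_kv_heads
    # Each group of query heads shares one key/value head: repeat K and V
    # so they line up with their group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / dim ** 0.5
    # Causal mask: each position attends only to itself and earlier tokens.
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 8 query heads sharing 2 key/value heads (4 queries per group).
q = torch.randn(1, 8, 16, 64)
kv = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, kv, kv, num_kv_heads=2).shape)  # (1, 8, 16, 64)
```

Because only 2 key/value heads are stored instead of 8, the key-value cache kept around during generation is a quarter of the size, which is where the inference speedup comes from.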

GPT-4

The GPT-4 model offers variants based on maximum context length: 8K (8,192) and 32K (32,768) tokens. OpenAI has not disclosed GPT-4’s size; it is widely estimated at roughly 1.76 trillion parameters.

GPT-4 was pre-trained using publicly available data, including internet data and data licensed to OpenAI. According to OpenAI, the collection of data includes: “correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas.” At the time of writing, the training data is current up to September 2021.

OpenAI incorporated more human evaluations in the training of GPT-4, encompassing ChatGPT user feedback and AI safety and security expert feedback.

Architecture

GPT-4 is a multimodal transformer-based decoder-only LLM that is capable of accepting image and text inputs and generating text outputs. It can perform advanced reasoning tasks, including code and math generation.

LLM Benchmarks

 


Performance benchmarks for LLMs (higher scores indicate better performance). Source: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Speed and Efficiency

As GPT-4 is a closed-source model, its inner details are undisclosed. Still, comparing model sizes, Llama 2’s largest variant (70B) is only about 4% of GPT-4’s estimated 1.76T parameters (70 / 1,760 ≈ 0.04). Size isn’t the only factor impacting speed and efficiency, but it provides a general indication that Llama 2 may run faster and more efficiently than GPT-4.

Task Complexity

The Massive Multitask Language Understanding (MMLU) benchmark measures an LLM’s multi-task accuracy and natural language understanding (NLU). The test encompasses 57 tasks covering a breadth of topics, including hard sciences like mathematics and computer science and social sciences like history and law, at depths ranging from basic to advanced. To score high accuracy, a model must have comprehensive knowledge and proficient problem-solving ability.

Looking at the MMLU (5-shot) scores, where “5-shot” means the prompt includes five worked examples before the actual question, GPT-4 ranks higher at 86.4% compared to Llama 2 at 68.9%. Hence, GPT-4 handles complex tasks with higher accuracy than Llama 2.
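The “n-shot” terminology recurs in the benchmarks below, so here is a hedged sketch of how a few-shot prompt is assembled (the questions are made up, not actual MMLU items):

```python
# Build an n-shot prompt: n worked examples followed by the real question.
examples = [
    ("What is 2 + 2?", "4"),
    ("Which planet is closest to the sun?", "Mercury"),
    # ...in a true 5-shot setting, three more Q/A pairs go here
]
question = "In what year did World War II end?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {question}\nA:"
print(prompt)
```

A 0-shot benchmark, by contrast, poses the question with no worked examples at all.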

Coding

The HumanEval benchmark measures an LLM’s coding abilities. This test dataset by OpenAI includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The coding problems are written in Python, and the comments and docstrings contain natural-language text in English.
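To illustrate the format, here is a made-up problem in the HumanEval style (not an actual dataset item): the model receives the signature and docstring and must produce the body, which is then checked against unit tests.

```python
# A made-up problem in the HumanEval style (not from the actual dataset).
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[0..i]."""
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

# Generated code is scored by whether unit tests like this one pass.
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
```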

Referencing the HumanEval (0-shot) scores, GPT-4 ranks higher at 67.0% compared to Llama 2 at 29.9%. Code Llama does considerably better than base Llama 2, scoring 53.0%, but GPT-4 still outperforms both Code Llama and Llama 2 in programming ability.

Math Reasoning

GSM8K is a dataset consisting of “8.5K high quality linguistically diverse grade school math word problems” released by OpenAI. It aims to evaluate an LLM’s capability to perform multi-step mathematical reasoning. The math problems involve 2-8 steps and require only basic arithmetic operations (+ - / *).
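To make that concrete, here is a made-up problem in the GSM8K style (not an actual dataset item), with the solution steps written out as arithmetic:

```python
# "A baker makes 3 trays of 12 muffins and sells 20. How many are left?"
muffins_baked = 3 * 12              # step 1: 36 muffins baked
muffins_left = muffins_baked - 20   # step 2: 16 muffins remain
assert muffins_left == 16
```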

The GSM8K (8-shot) scores show GPT-4 at 92.0% and Llama 2 at 56.8%. Thus, GPT-4 performs better than Llama 2 in math reasoning tasks.

Multilingual Support

According to the Llama 2 research paper, 89.7% of the model’s pre-training data is in English. Hence, the model will likely perform best for English use cases and may not be suitable for multilingual ones.

Pie chart of the language distribution in Llama 2’s pre-training data. Source: https://slator.com/meta-warns-large-language-model-may-not-be-suitable-non-english-use/

GPT-4 has more robust support for languages besides English. As a gauge, OpenAI translated the MMLU benchmark into a variety of languages, and GPT-4 achieved high scores across many of them.

GPT-4 three-shot MMLU accuracy across multiple languages. Source: https://openai.com/research/gpt-4

Overall, GPT-4 offers better multilingual support than Llama 2.

Model Access Methods

With Meta’s open-source Llama 2, inner details are disclosed to the public. You can read the research paper detailing the creation and training of the model, and the model weights are available for download, so you can inspect the actual code. The paper states that the pre-training dataset comes from publicly available online sources, but the exact dataset itself is unavailable. There are three main ways to run the model:

  • Locally, by installing libraries such as Hugging Face Transformers and downloading the weights to your own machine (check the storage and compute requirements beforehand); a minimal sketch follows this list.
  • On cloud infrastructure, such as Microsoft Azure or Amazon Web Services, via a machine learning hub like Hugging Face or Amazon SageMaker.
  • Through Hugging Face’s Inference API, which lets developers make API calls to over 100,000 shared models, including Llama 2.
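Here is a hedged sketch of the local route with Hugging Face Transformers. It assumes you have requested access to the gated meta-llama weights and authenticated with a Hugging Face token, and it glosses over hardware setup (even the 7B variant needs a capable GPU or ample RAM):

```python
from transformers import pipeline

# Smallest chat-tuned variant; larger ones need proportionally more memory.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
)
result = generator(
    "Explain grouped-query attention in one sentence.",
    max_new_tokens=100,
)
print(result[0]["generated_text"])
```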

With OpenAI’s closed-source GPT-4, access is available via the company’s official API. Researchers and businesses have the option to fine-tune some OpenAI models, but internal mechanisms remain a black box from the public’s point of view. As of September 2023, fine-tuning is available only for GPT-3.5 Turbo, not GPT-4.
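A hedged sketch of calling GPT-4 through the official openai Python library, using the v0.x interface that was current as of September 2023:

```python
import os
import openai

# Authenticate with an API key kept outside the source code.
openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Summarize Llama 2 in one sentence."},
    ],
)
print(response["choices"][0]["message"]["content"])
```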

Transparency and Community

By taking an open approach with Llama 2, Meta makes it substantially easier for other companies to develop AI-powered applications without being locked in to huge players like OpenAI and Google. A transparent, open-source stance also allows crowdsourcing to make generative AI safer and more helpful.

In comparison, with a closed-source model, OpenAI retains a competitive business advantage through proprietary technology. Innovation and experimentation happen in-house, which requires significant financial, computational, and human resources. OpenAI does, however, have a developer community around its API and plugin development.

Costs

For Llama 2, the model itself has no fees. The cost depends on the cloud infrastructure platform and usage of compute resources. For example, pricing on Hugging Face (as of September 2023):

Hugging Face Inference Endpoints pricing table. Source: https://huggingface.co/pricing

For GPT-4, the cost depends on the model variation and the token counts of the input and output. Note that there are additional charges for fine-tuning a GPT model. Here is OpenAI’s pricing information (as of September 2023), with a quick cost-estimation sketch after the table:

Model               | Input             | Output
GPT-4 (8K context)  | $0.03 / 1K tokens | $0.06 / 1K tokens
GPT-4 (32K context) | $0.06 / 1K tokens | $0.12 / 1K tokens

GPT-4 pricing table. Source: https://openai.com/pricing
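To make the pricing concrete, here is a small sketch that estimates spend from token counts for the 8K-context model (the traffic figures are made up for illustration):

```python
# September 2023 prices for GPT-4 with 8K context, from the table above.
INPUT_PRICE_PER_1K = 0.03   # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.06  # USD per 1K output tokens

def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)

# Hypothetical workload: 1M requests averaging 500 input / 250 output tokens.
print(f"${estimate_request_cost(500, 250) * 1_000_000:,.0f}")  # ≈ $30,000
```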

Data Privacy and Security

Meta has highlighted that no private or personal information has been used in the training of its Llama 2 model. In the research paper, it stated that the training data excludes data from Meta’s products or services and effort was made “to remove data from certain sites known to contain a high volume of personal information about private individuals.” The team also conducted red-teaming exercises for safety by simulating adversarial attacks. The data retention policy will depend on your cloud infrastructure provider and settings. For instance, Amazon SageMaker states that it “does not use or share customer models, training data, or algorithms … [You] maintain ownership of your content, and you select which AWS services can process, store, and host your content.”

On its privacy webpage, OpenAI states that it’s “committed to privacy and security” for its API Platform. The company asserts that its models don’t train on clients’ business data nor learn from clients’ usage. However, for abuse monitoring, OpenAI “may securely retain API inputs and outputs for up to 30 days.” You have the option of requesting zero data retention (ZDR) for eligible endpoints. OpenAI maintains security compliance measures such as data encryption and a SOC 2 Type 2 audit.

For both models, secure and authenticate your API calls by setting up and using API keys. OpenAI and cloud infrastructure providers have standard procedures to create and manage them.
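A minimal sketch of the usual pattern, reading the key from an environment variable instead of hardcoding it (OPENAI_API_KEY is OpenAI’s conventional variable name):

```python
import os

# Keep keys out of source control: load them from the environment
# (set via your shell, a .env loader, or a secrets manager).
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
```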

Deciding Factors on Which LLM to Use

Number of Requests (Price and Performance)

Based on the pricing structure presented above for Llama 2 and GPT-4, you can estimate the cost based on the anticipated amount of usage or requests.

If you project a large number of API calls, you will need more powerful computing hardware – for example, GPU over CPU, more processor cores, and more memory – for your cloud infrastructure hosting the open-source model, increasing your cost. Also, ensure that infrastructure autoscaling is set up to handle a rise in usage and maintain steady performance.

With the closed-source model, you need to be aware of API rate limits and usage quotas. For instance, OpenAI imposes rate limits in three ways: RPM (requests per minute), RPD (requests per day), and TPM (tokens per minute). According to OpenAI’s quota policy, “You’ll be granted an initial spend limit, or quota, and we’ll increase that limit over time as you build a track record with your application.” You can fill out a form requesting an increase in your rate limits or usage quota. You are relying on OpenAI to manage the aggregate load on its infrastructure and maintain smooth performance for all users.
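When a request does get throttled, the standard client-side remedy is to retry with exponential backoff. A hedged sketch, where call_api stands in for any rate-limited request (the names are illustrative):

```python
import random
import time

def with_backoff(call_api, max_retries=5):
    for attempt in range(max_retries):
        try:
            return call_api()
        except Exception:  # in practice, catch the library's rate-limit error
            # Wait 1s, 2s, 4s, ... plus random jitter to avoid thundering herds.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("Still rate-limited after retries.")
```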

An important thing to note is that if you become entirely dependent on a closed-source model, you are more susceptible to pricing changes set by the company. With an open-source model, there are more cloud infrastructure providers to choose from.

Complexity of Product/Service (Use Case)

As seen in the benchmark testing, GPT-4 scores higher for multi-task complexity than Llama 2.

Let’s say your use case is to build a chatbot for a business. For a small company whose chatbot logic has medium complexity, the Llama 2 model will suffice. For a large enterprise, where the chatbot needs to handle more complex conversations and provide professional responses, the more advanced GPT-4 model is suitable. Likewise, if the chatbot needs to converse in multiple languages, GPT-4 is the appropriate choice given its robust multilingual support.

Level of Risk (Accuracy and Performance)

GPT-4 has higher multi-task accuracy abilities than Llama 2, based on the MMLU benchmark. GPT-4 also outperforms Llama 2 in the areas of coding and math reasoning. Additionally, GPT-4 offers solid support for multiple languages.

If your project is business-critical and high-priority requiring advanced reasoning and problem-solving, then the GPT-4 model is a better fit – for instance, an AI assistant that deals with customer support before escalating to a human agent. You want to minimize risk with a higher-performing LLM, even though there may be tradeoffs in costs, speed, and efficiency.

A project that serves a lower-risk purpose can benefit from the Llama 2 model – for example, a tool that generates simple social media content for non-business accounts. You get moderate performance with potentially lower cost and higher speed and efficiency.

You need to evaluate your project’s risk tolerance before picking the appropriate LLM.

Summary

We have examined the LLM releases for Llama 2 and GPT-4 along the dimensions of model size, context length, modalities, and architecture. Comparing benchmarks, GPT-4 leads in most categories, including task complexity, coding, math reasoning, and multilingual support. The access methods differ between the open-source Llama 2 and the proprietary GPT-4, with implications for transparency, costs, data privacy, and security. Each model has its strengths and weaknesses; the choice between Llama 2 and GPT-4 ultimately depends on a project’s usage volume, complexity requirements, and specific use case.

Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.