Cost-Effective Solutions for AI Infrastructure: Intel CPUs + Accelerators

Over the past six months, everyone has felt the AI boom set off by ChatGPT.

In less visible places, the data is quietly changing too: Stanford University’s “2023 AI Index Report” shows that the proportion of companies adopting AI in 2022 has more than doubled since 2017. These companies reported significant cost reductions and revenue increases after adopting AI.

Although the data for 2023 is not yet available, the popularity that ChatGPT has brought to the AIGC field makes it easy to guess that these figures will reach a new inflection point this year. AIGC has the potential to set off a fourth industrial revolution.

At the same time, these enterprises face new challenges in building AI infrastructure.

First, in terms of computing power, the conflict between surging demand and insufficient supply in the AI field has become particularly intense this year. Even OpenAI CEO Sam Altman has admitted that his company is plagued by computing power shortages, and there have been repeated complaints about the reliability and speed of its API. In addition, many companies face rising computing power costs as a result of this wave of demand.

Second, in terms of model selection, many companies have found that the most popular large models do not yet have a mature business model, and security problems remain. Take Samsung’s Device Solutions division as an example: within less than a month of allowing ChatGPT, three data leaks occurred, which discouraged companies that had planned to call the OpenAI API directly. Training and deploying a super-large model in-house is just as daunting: serving even a single request to a large model may require expensive GPU cards dedicated to that computation, which many enterprises cannot afford.

That said, is an “omniscient” super-large model like ChatGPT really necessary for enterprises? Does running a business with AI assistance mean expanding GPU capacity without limit? How are the companies that already use AI to improve efficiency doing it? After analyzing the best practices of several enterprises, we found some answers worth referencing.

Companies already using AI: the difficult trade-off between performance and cost

Any analysis of the industries that first applied artificial intelligence to improve efficiency has to include the Internet industry. Its typical workloads (recommendation systems, visual processing, natural language processing, and so on) all depend on AI for optimization. However, as business volume surges, these workloads face different challenges in performance and cost.

First, look at recommendation systems. They are widely used in e-commerce, social media, audio and video streaming, and many other fields. Take e-commerce as an example. During annual shopping peaks such as 618 and Double Eleven, leading e-commerce companies such as Alibaba face hundreds of millions of real-time requests from a large global customer base. They therefore need AI inference to meet their throughput and latency requirements while preserving inference accuracy and recommendation quality.

Next, look at visual processing. At Meituan alone, there are multiple application scenarios, such as intelligent image processing, recognition of merchant onboarding certificates, scanning codes to unlock bicycles, and scanning medicine boxes to buy medicine. AI has become an important part of its business landscape. However, as Meituan’s business and user base grow rapidly, more and more applications need visual AI to build intelligent processes. Meituan needs to raise the throughput of visual AI inference while preserving its accuracy, so that it can support more intelligent services.

Finally, look at natural language processing. Thanks to the popularity brought by ChatGPT, natural language processing is receiving unprecedented market attention and technical scrutiny. As a pioneer of NLP research in China, Baidu has built a complete product system and technology portfolio in this field. ERNIE 3.0, an important part of its PaddlePaddle Wenxin NLP large model family, performs well across NLP application scenarios, especially in Chinese natural language understanding and generation tasks. However, as NLP is commercialized in more industries, users have raised more specific requirements for ERNIE 3.0, such as higher processing efficiency and broader deployment scenarios.

Solving all of these problems requires large-scale infrastructure investment, and these enterprises share a common difficulty: although discrete GPUs can meet the performance requirements, the cost pressure is high, so blindly expanding GPU capacity is not the optimal choice.

Cost-Effective Solution: 4th Gen Intel® Xeon® Scalable Processors

There is a stereotype in the AI community that CPUs are not suited to AI tasks. A demonstration by Hugging Face Chief Evangelist Julien Simon challenges that stereotype. His company partnered with Intel to create a generative AI application called Q8-Chat, which provides a chat experience similar to ChatGPT but needs only a 32-core Intel® Xeon® processor to run.

As this example shows, using CPUs to carry AI tasks (especially inference tasks) is actually very common in the industry. Alibaba, Meituan, and Baidu have all used such solutions to ease their computing power problems.

Alibaba: Using CPUs to power the next-generation e-commerce recommendation system and cope with the Double Eleven peak load

As mentioned earlier, Alibaba’s e-commerce recommendation system is tested on AI throughput, latency, and inference accuracy all at once. To balance performance and cost, Alibaba chose to use CPUs for workloads such as AI inference.

So what kind of CPU can withstand all of these tests at the same time? The answer here is the 4th Gen Intel® Xeon® Scalable processor.

This processor was officially released at the beginning of this year. Beyond a series of microarchitecture innovations and specification upgrades, its stronger support for AI computing has attracted particular attention, especially the new built-in AI accelerator that Intel added in this generation: Intel® Advanced Matrix Extensions (Intel® AMX).

In real-world workloads, Intel® AMX supports both the BF16 and INT8 data types, which lets the CPU handle DNN workloads much as a high-end general-purpose graphics processing unit (GPGPU) would. BF16 has the same dynamic range as standard IEEE FP32 but lower precision. In most cases, model inference results in BF16 are as accurate as those in FP32, and because BF16 data is half the size of FP32 data, BF16 throughput is much higher and memory requirements are greatly reduced.
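The relationship between the two formats can be illustrated with a minimal sketch. It assumes the simplest possible conversion, keeping only the upper 16 bits of the FP32 bit pattern; real hardware and libraries usually round rather than truncate, and the function names here are made up for illustration.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits: 1 sign bit, 8 exponent bits, 7 mantissa bits.

    Assumption: plain truncation. Real implementations typically round
    to nearest even instead.
    """
    return fp32_bits(x) >> 16


def from_bf16_bits(bits: int) -> float:
    """Expand a BF16 pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def bf16_round_trip(x: float) -> float:
    """Value that survives an FP32 -> BF16 -> FP32 conversion."""
    return from_bf16_bits(to_bf16_bits(x))


if __name__ == "__main__":
    # Large and small magnitudes survive (same 8-bit exponent as FP32),
    # but only about two to three significant decimal digits are kept.
    for value in (3.14159265, 1.0e38, 1.0e-30):
        print(value, "->", bf16_round_trip(value))
```

Because the exponent field is unchanged, values near the top of the FP32 range remain representable, while the shortened mantissa is what costs precision and halves the storage size.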

The architecture of AMX itself is also designed to accelerate AI computing. It consists of two components: a set of 2D register files (TILE), which store larger blocks of data, and a Tile Matrix Multiplication unit (TMUL), an acceleration unit that operates on the tiles and can compute larger matrices in a single instruction.
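The idea of loading blocks into tiles and multiplying tile by tile can be sketched in plain Python. This is only a conceptual model under stated assumptions: the 2x2 tile size and the helper names are illustrative and do not reflect the real AMX tile dimensions or instructions.

```python
from typing import List

Matrix = List[List[int]]


def load_tile(m: Matrix, row: int, col: int, size: int) -> Matrix:
    """Copy a size x size block out of m, zero-padding past the edges."""
    rows, cols = len(m), len(m[0])
    return [
        [m[r][c] if r < rows and c < cols else 0 for c in range(col, col + size)]
        for r in range(row, row + size)
    ]


def tile_multiply_accumulate(acc: Matrix, a: Matrix, b: Matrix) -> None:
    """acc += a @ b on square tiles (the role TMUL plays in this sketch)."""
    size = len(acc)
    for i in range(size):
        for j in range(size):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(size))


def tiled_matmul(a: Matrix, b: Matrix, size: int = 2) -> Matrix:
    """Multiply a (n x k) by b (k x m) one output tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(0, n, size):
        for j in range(0, m, size):
            acc = [[0] * size for _ in range(size)]  # accumulator tile
            for p in range(0, k, size):
                tile_multiply_accumulate(
                    acc, load_tile(a, i, p, size), load_tile(b, p, j, size)
                )
            # Store the accumulator tile back, skipping the padded region.
            for r in range(size):
                for c in range(size):
                    if i + r < n and j + c < m:
                        out[i + r][j + c] = acc[r][c]
    return out


if __name__ == "__main__":
    print(tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

Working on whole blocks at a time is what lets a single hardware operation replace many scalar or vector multiply-accumulate steps.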

With this new architecture, Intel® AMX achieves a significant generational performance improvement. Compared with 3rd Gen Intel® Xeon® Scalable processors running Intel® Advanced Vector Extensions 512 Vector Neural Network Instructions (Intel® AVX-512 VNNI), 4th Gen Intel® Xeon® Scalable processors running Intel® AMX raise the number of INT8 operations per computing cycle from 256 to 2,048 and perform 1,024 BF16 operations per cycle, whereas 3rd Gen Intel® Xeon® Scalable processors perform only 64 FP32 operations per cycle.
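The generational ratios implied by these figures can be worked out directly. The short sketch below uses only the per-cycle numbers quoted above; the dictionary keys and the helper name are illustrative, and the results are theoretical peak ratios, not measured application speedups.

```python
# Operations per computing cycle, as quoted in the text above.
OPS_PER_CYCLE = {
    "3rd Gen, FP32": 64,
    "3rd Gen, INT8 (AVX-512 VNNI)": 256,
    "4th Gen, BF16 (AMX)": 1024,
    "4th Gen, INT8 (AMX)": 2048,
}


def peak_ratio(new_key: str, old_key: str) -> float:
    """Ratio of peak operations per cycle between two configurations."""
    return OPS_PER_CYCLE[new_key] / OPS_PER_CYCLE[old_key]


if __name__ == "__main__":
    print("INT8, AMX vs. VNNI:", peak_ratio("4th Gen, INT8 (AMX)", "3rd Gen, INT8 (AVX-512 VNNI)"))
    print("BF16 (AMX) vs. FP32:", peak_ratio("4th Gen, BF16 (AMX)", "3rd Gen, FP32"))
```

In other words, the quoted figures correspond to an 8x peak increase for INT8 and a 16x peak increase when moving from FP32 to BF16; end-to-end gains in real workloads are smaller because memory, caches, and software also matter.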

The advanced hardware features of Intel® AMX gave Alibaba’s core recommendation model a breakthrough in AI inference performance while preserving sufficient accuracy. In addition, Alibaba used the Intel® oneAPI Deep Neural Network Library (Intel® oneDNN) to tune the CPU to peak efficiency.

The figure below shows that, with AMX, BF16 mixed precision, 8-channel DDR5, a larger cache, more cores, efficient core-to-core communication, and software optimization, the mainstream 48-core 4th Gen Intel® Xeon® Scalable processor raises the throughput of the proxy model to 2.89 times that of the mainstream 32-core 3rd Gen Intel® Xeon® Scalable processor, while keeping latency strictly below 15 milliseconds and inference accuracy within requirements.

The optimized software and hardware have been deployed in Alibaba’s real business environment. They passed a series of verifications and met Alibaba’s production standards, including handling the peak load during Alibaba’s Double Eleven shopping festival.

Moreover, Alibaba found that upgrading to the 4th generation Intel® Xeon® Scalable processors brought performance benefits far higher than hardware costs, and the return on investment was very obvious.

Meituan: Using CPUs to carry low-traffic, long-tail visual AI inference and cut service costs by 70%

As mentioned earlier, Meituan faces the high cost of visual AI inference services as its business expands. In fact, this problem is not monolithic: some low-traffic, long-tail model inference services have relatively low load and latency requirements, and they can be carried by CPUs.

In multiple visual AI models, Meituan uses Intel® AMX acceleration technology to dynamically convert the model data type from FP32 to BF16, thereby increasing throughput and accelerating inference with acceptable loss of precision.

To verify the performance improvement after optimization, Meituan compared the inference performance of the BF16 model converted with Intel® AMX acceleration technology against the baseline FP32 model. As shown in the test data in the figure below, after the model is converted to BF16, inference performance improves by 3.38 to 4.13 times, and the Top-1 and Top-5 accuracy loss is mostly kept within 0.01% to 0.03%.
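A comparison of this kind needs two measurements: the throughput ratio and the change in Top-k accuracy. The following is a minimal sketch of how those two quantities could be computed; it is not Meituan’s evaluation code, and the function names and toy inputs are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def accuracy_loss_points(baseline_acc: float, converted_acc: float) -> float:
    """Accuracy drop in percentage points (positive means the converted model is worse)."""
    return (baseline_acc - converted_acc) * 100.0


def speedup(baseline_throughput: float, converted_throughput: float) -> float:
    """Throughput of the converted model relative to the baseline."""
    return converted_throughput / baseline_throughput


if __name__ == "__main__":
    # Toy class scores for three samples; real evaluations use a full validation set.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print("Top-1:", top_k_accuracy(scores, labels, 1))
    print("Top-2:", top_k_accuracy(scores, labels, 2))
```

Running the same validation set through the FP32 and BF16 models and feeding both sets of scores into `top_k_accuracy` gives the two accuracies that `accuracy_loss_points` compares.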

Thanks to the improved performance, Meituan can make fuller use of its existing infrastructure, reduce the high cost of GPU deployment, operation, and maintenance, and save 70% of service costs.

Baidu: Run the distilled model on the CPU to unlock more industries and scenarios

More layers and parameters mean a larger model, heavier demands on computing resources, and longer inference time. For users who are sensitive to response speed and build cost, this undoubtedly raises the barrier to adoption and use. Model miniaturization is therefore a common optimization direction in NLP.

Baidu took this approach as well, using model lightweighting techniques to distill and compress the ERNIE 3.0 large model and thereby extend it to more industries and scenarios. These lightweight models (ERNIE-Tiny) not only respond quickly but also have an important advantage: they can be deployed without expensive dedicated AI computing hardware. Introducing a stronger general-purpose computing platform and optimization scheme has therefore become another important way to help ERNIE-Tiny run more efficiently.
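For readers unfamiliar with distillation, the core idea is that a small student model is trained to match the softened output distribution of a large teacher. The sketch below shows the generic temperature-scaled soft-target loss; it is an assumption-laden illustration of the general technique, not the specific procedure Baidu used for ERNIE-Tiny, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes the logits are moderate enough that no student probability
    underflows to zero.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print("matching student:", distillation_loss(teacher, teacher))
    print("mismatched student:", distillation_loss(teacher, [0.5, 1.0, 4.0]))
```

In practice this term is usually combined with the ordinary supervised loss on the true labels, and the student has far fewer layers and parameters than the teacher.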

To this end, Baidu has launched in-depth technical cooperation with Intel. On the one hand, it introduced 4th Gen Intel® Xeon® Scalable processors into ERNIE-Tiny’s inference pipeline. On the other hand, it pursued a number of optimizations, such as calling Intel® AMX instructions through the Intel® oneAPI Deep Neural Network Library, so that ERNIE-Tiny can make fuller use of the acceleration AMX provides.

Comparative test data show that, relative to 3rd Gen Intel® Xeon® Scalable processors using Intel® AVX-512 VNNI for AI acceleration in single-socket and dual-socket configurations, ERNIE-Tiny’s overall performance improved by up to 2.66 times after upgrading to 4th Gen Intel® Xeon® Scalable processors, a satisfying result.

At present, the ERNIE-Tiny models have been deployed in the zero-threshold AI development platform EasyDL, the full-featured AI development platform BML, and ERNIEKit (flagship version). Working together with the other capabilities of these platforms and products, and running on Intel® Xeon® Scalable processor infrastructure, they provide users with capabilities such as text classification, relation extraction, text generation, and question answering.

The practical experience of Alibaba, Meituan, and Baidu shows that, in real production environments, the models doing most of the work are still AI models of modest scale. Mature reference solutions already exist for deploying them, and Intel® Xeon® CPUs together with the supporting software and hardware acceleration can deliver significant cost benefits.

Of course, with the strong rise of AIGC, many companies have also set their sights on larger models of that kind. But as discussed above, both calling a super-large model’s API and training and deploying one in-house have their own problems. Choosing an economical, efficient, and secure solution is a thorny problem for enterprises.

The AIGC era has arrived: how should enterprises respond?

Does embracing AIGC mean an enterprise must have an “omniscient” super-large model? The answer given by Boston Consulting Group (BCG) is no.

The solution BCG chose was to train an industry-specific model on its own data. The model may not be very large, but it provides insight into BCG’s highly confidential, proprietary data from the past 50+ years. At the same time, all AI training and inference fully comply with BCG’s security standards.

Behind this solution is an Intel AI supercomputer equipped with 4th Gen Intel® Xeon® Scalable processors and Habana® Gaudi2® AI hardware accelerators. The former delivers up to 10 times the PyTorch AI training performance of the previous generation. The latter outperformed the Nvidia A100 in computer vision (ResNet-50) and natural language processing (BERT fine-tuning), and was almost on par with the H100 in computer vision. The combination of the two provides BCG with a cost-effective AIGC solution.

Through a chatbot interface, BCG employees were able to use semantic search to retrieve, extract, and aggregate useful information from long lists of multi-page documents. BCG reported that, compared with its existing keyword search solution, this produced a 41% increase in user satisfaction, a 25% increase in result accuracy, and a 39% increase in job completion.
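Semantic search differs from keyword search in that documents are ranked by the similarity of their embedding vectors to the query rather than by shared words. The sketch below shows only that ranking step with cosine similarity; the tiny two-dimensional vectors and document names are made up for illustration, and this is not a description of BCG’s system, where the embeddings would come from a trained language model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_n: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_n (document id, similarity) pairs, best match first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Hypothetical embeddings; real ones have hundreds of dimensions.
    documents = {"report_a": [1.0, 0.0], "report_b": [0.0, 1.0], "report_c": [1.0, 1.0]}
    for doc_id, score in semantic_search([1.0, 0.1], documents):
        print(doc_id, round(score, 3))
```

A document can rank highly here even if it shares no literal keywords with the query, which is what makes this approach useful for long, varied document collections.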

Whether the workload is traditional small and medium-scale AI or a promising AIGC industry model, GPUs are not the only choice for AI acceleration. Whatever the size of the model, Intel offers a cost-effective combination of software and hardware.

For enterprises that want to apply AI to improve efficiency, there is no standard answer to what size of model to choose or what kind of software and hardware infrastructure to build. A so-called super-large model and a super-large GPU cluster may not be necessary. Choosing a technical solution that fits the characteristics of your own business is the key to reaching the best outcome.
