The amount of data is growing at an unprecedented rate, from terabytes to petabytes and even exabytes. Traditional on-premises analytics lack the scalability to handle data at this volume, and scaling them is prohibitively expensive. Enterprises need to pull data out of its many silos and consolidate it in a data lake, where analytics and machine learning can run on it directly.
Amid this data explosion, the challenge of extracting value from data has become increasingly obvious: exponential growth across structured, semi-structured, and unstructured data; increasingly complex usage scenarios; and the need for rapid, real-time decision-making.
"Not long ago it was normal for a data warehouse to run a report every few days. Now, changes across an enterprise's business and scenarios are forcing decision-making to speed up; many decisions are made within minutes, which means they must be made during real-time stream analysis," said Gu Fan, general manager of the Service Products Department at Amazon Cloud Technology in Greater China, in an interview with the author. Facing increasingly segmented application scenarios, the single, general-purpose data solutions on the market today compromise on performance and struggle to meet customers' real needs. Users urgently need a new generation of data management architecture that is easy to use, easy to scale, high-performance, purpose-built, secure, and intelligent.
On June 24, 2021, Amazon Cloud Technology continued its focus on data and analytics services by launching the "Smart Lake Warehouse" architecture for the future of big data. Around the Smart Lake Warehouse, Amazon Cloud Technology provides a family of data analytics services. At the bottom layer: DMS, which ingests data from databases into the data lake; Amazon Snowball, which moves data into the lake from environments with weak network connectivity; and the Amazon Kinesis family of real-time streaming data services.
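To make the streaming-ingestion idea concrete, here is a minimal pure-Python sketch of how a service like Amazon Kinesis assigns records to shards: the partition key is MD5-hashed into a 128-bit hash key space that is split across shards. This is an illustration only; a real producer would simply call the service API (for example, `put_record` in an AWS SDK) rather than computing shard placement itself, and the even shard split assumed here is a simplification.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Illustrative Kinesis-style routing: MD5-hash the partition key
    into the 128-bit hash key space, split evenly across shards."""
    hash_value = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    shard_size = 2 ** 128 // num_shards
    return min(hash_value // shard_size, num_shards - 1)

# Hypothetical clickstream events keyed by user, spread across 4 shards.
events = [{"user": f"user-{i}", "action": "click"} for i in range(6)]
routed = {shard_for_key(e["user"], 4) for e in events}
print(routed)  # some subset of {0, 1, 2, 3}
```

Hashing on a stable partition key (here, the user ID) keeps all of one user's events in order on the same shard while spreading load across shards.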
In the middle layer, Amazon S3 is the core component of the data lake, and ingested data lands in S3. It supports structured, semi-structured, and unstructured data at EB scale with high availability and high scalability, and S3 also offers a storage tier optimized for analytics.
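One reason S3 works well as an analytics-friendly lake is how data is laid out. A common (but here purely illustrative) convention is Hive-style partitioned object keys, which let query engines prune the files they scan; the bucket and field names below are hypothetical:

```python
from datetime import date

def lake_key(domain: str, table: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned object key, a common layout
    for analytics-oriented data lakes on Amazon S3."""
    return (f"{domain}/{table}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")

key = lake_key("sales", "orders", date(2021, 6, 24), "part-0000.parquet")
print(f"s3://my-data-lake/{key}")
# s3://my-data-lake/sales/orders/year=2021/month=06/day=24/part-0000.parquet
```

With this layout, a query filtered to one day touches only that day's prefix instead of the whole table.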
The upper layer is the actual data processing and consumption layer. For data analytics, different analysis engines serve different scenarios: Amazon Redshift, Amazon EMR, Amazon Athena, and so on. Beyond the processing and analysis engines, there are also business intelligence (BI) services such as Amazon QuickSight and a large number of machine learning services.
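As a sketch of what this consumption layer looks like in practice, the function below composes the kind of SQL an interactive engine such as Amazon Athena runs directly against files in the lake. The table and column names are hypothetical, and in a real workflow the string would be submitted through the service's API rather than printed:

```python
def athena_sales_query(table: str, year: int, region_col: str = "region") -> str:
    """Compose an illustrative analytics query; the partition predicate
    (year = ...) lets the engine prune which lake objects it scans."""
    return (
        f"SELECT {region_col}, SUM(amount) AS total_sales "
        f"FROM {table} "
        f"WHERE year = {year} "
        f"GROUP BY {region_col} "
        f"ORDER BY total_sales DESC"
    )

print(athena_sales_query("sales_orders", 2021))
```

The same query text could run on different engines; the architecture's point is that each engine is chosen for its scenario, not that the SQL changes.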
"The smart lake warehouse is not a product but an architecture, designed to solve customers' real challenges and handle complex scenarios." In Gu Fan's view, Amazon Cloud Technology designed its entire data analytics portfolio around three considerations. First, optimize for the cloud. Whether Amazon Aurora or Amazon Redshift, these are cloud-native databases and data warehouses, inherently elastic and capable of excellent linear scaling.
Second, purpose-built. Data analytics scenarios are becoming more diverse, and so are the people using them, so different analysis engines must be purpose-built for different workloads.
Finally, fully managed. This is a principle that runs through cloud computing and will never change: Amazon Cloud Technology takes on the undifferentiated heavy lifting, so customers need not repeatedly manage and build out the warehouse and the lake, and their data can move seamlessly.
In the smart lake warehouse concept, one aspect comes up again and again: the seamless movement of data.
In customers' business scenarios, data movement falls roughly into three categories. First, from the outside in: data enters the lake. For example, the Amazon Redshift data warehouse might run a query breaking down this year's sales by region. The results do not just stay in the warehouse; they are re-injected from the warehouse into the data lake. Because Amazon SageMaker connects directly to the data lake, the data then flows from the lake into SageMaker, which builds a model from the regional sales analysis. In other words, the warehouse completes the query first, the query results are imported into the lake, and machine learning then calls on the data.
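The warehouse-to-lake step described above can be sketched with a Redshift `UNLOAD` statement, which writes query results back into S3 as Parquet for downstream services such as Amazon SageMaker to read. `UNLOAD ... FORMAT AS PARQUET` is real Redshift syntax, but the bucket, role ARN, and table below are hypothetical, and this sketch only builds the statement text rather than executing it against a cluster:

```python
def unload_to_lake(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that exports query results
    to the data lake as Parquet files."""
    escaped = query.replace("'", "''")  # UNLOAD wraps the inner query in single quotes
    return (f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS PARQUET")

stmt = unload_to_lake(
    "SELECT region, SUM(amount) AS sales FROM orders GROUP BY region",
    "s3://my-data-lake/exports/sales_by_region/",
    "arn:aws:iam::123456789012:role/RedshiftUnloadRole",
)
print(stmt)
```

Once the Parquet files land under that prefix, anything that reads from the lake, including a SageMaker training job, can pick them up without touching the warehouse again.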
Second, from the inside out: data leaves the lake. For example, when a customer uses a real-time streaming service, the clickstream data from the customer's website is injected into the lake; the data is already in the lake, ready for other services to draw out.
Third, movement around the lake. Simply put, data does not only flow in and out from outside: databases, data warehouses, and the various analysis engines each maintain dedicated data stores for different purposes around the lake.
"It has been several years since Amazon Cloud Technology proposed the smart lake warehouse architecture. When we discussed how to iterate on it at re:Invent in 2020, we placed great emphasis on better supporting the data of the future," Gu Fan said.
From Amazon Cloud Technology's perspective, a smart lake warehouse architecture must: have a quickly built, scalable data lake, namely Amazon S3; surround that lake with a collection of purpose-built data analytics services, such as Amazon Redshift for complex queries over structured data and the Amazon Aurora transactional database; move data seamlessly among the lake, the warehouse, and the purpose-built services, with capabilities such as Amazon Glue and Amazon Glue Elastic Views; manage the security, access control, and auditing of data in the lake in a unified way; and finally, scale the system at low cost without compromising performance.
"To build a data lake we must have purpose-built data analytics services, and we must achieve seamless data movement, unified management, and low cost across the data, the lake, the warehouse, and those purpose-built services. That is how we define Amazon Cloud Technology's smart lake warehouse architecture." In Gu Fan's view, the smart lake warehouse is not just a connection between the lake and the warehouse, but an integrated connection of data services around both.
Closing thoughts
The advantages of Amazon Cloud Technology's Smart Lake Warehouse architecture show in five areas. First, elastic scaling, security, and reliability. The most important part of the architecture is its foundational component, the Amazon S3 data lake, which offers an unmatched eleven nines of durability. For availability it replicates data across three Availability Zones, and it can scale to the EB level. More importantly, even with that scalability and availability, the cost of the data lake remains well controlled.
Second, purpose-built, with extreme performance. Every technology has its strengths and weaknesses, so no single technology can yield one product that dominates in functionality, performance, and scalability all at once.
Third, data integration and unified governance. In the smart lake warehouse architecture, data will move among many points. Amazon Cloud Technology groups data movement into several methods: one is traditional ETL, that is, extract, transform, and load; another is visual data preparation. For example, Data Wrangler in Amazon SageMaker can quickly extract features from data.
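The traditional ETL pattern mentioned above can be sketched as three small stages. This toy, in-memory pipeline only illustrates the extract/transform/load shape; managed services such as the Glue family run the same pattern at scale against real sources and targets, and the field names here are invented:

```python
def extract(rows):
    """Extract: read raw records (an in-memory stand-in for a real source)."""
    return list(rows)

def transform(rows):
    """Transform: normalize fields and drop incomplete records."""
    return [
        {"region": r["region"].strip().upper(), "amount": float(r["amount"])}
        for r in rows
        if r.get("region", "").strip() and r.get("amount") is not None
    ]

def load(rows, target):
    """Load: append the cleaned records to the target store."""
    target.extend(rows)
    return len(rows)

raw = [{"region": " east ", "amount": "19.9"},
       {"region": "", "amount": "5"}]       # incomplete record, dropped
lake_table = []
loaded = load(transform(extract(raw)), lake_table)
print(loaded, lake_table)  # 1 [{'region': 'EAST', 'amount': 19.9}]
```

Keeping the three stages separate is what lets each one be swapped out, for example replacing hand-written transforms with a visual data-preparation step.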
Fourth, agile analytics and deep intelligence. Around data there are always three topics: how to modernize the data infrastructure and use cloud-native databases on the cloud; how to truly generate value from data; and how to use machine learning to better assist, and even drive, decision-making.
Accordingly, under the smart lake warehouse architecture, Amazon Cloud Technology's first integration is between Amazon SageMaker and the lake warehouse. Next comes the further expansion of machine learning: not only data scientists and machine learning engineers use it, but today's DBAs and data analysts are encouraged to use machine learning as well.
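One concrete example of putting machine learning in the hands of DBAs and analysts is Redshift ML, which lets a SageMaker-backed model be trained from SQL alone via a `CREATE MODEL` statement. The sketch below only assembles such a statement as text; the model, table, column, role, and bucket names are all hypothetical, and the exact option set supported by the service should be checked against its documentation:

```python
def create_model_sql(model: str, training_query: str, target: str,
                     fn: str, iam_role: str, s3_bucket: str) -> str:
    """Build an illustrative Redshift ML CREATE MODEL statement:
    train on the rows a SQL query returns, predict the target column,
    and expose the trained model as a SQL function."""
    return (
        f"CREATE MODEL {model} "
        f"FROM ({training_query}) "
        f"TARGET {target} "
        f"FUNCTION {fn} "
        f"IAM_ROLE '{iam_role}' "
        f"SETTINGS (S3_BUCKET '{s3_bucket}')"
    )

sql = create_model_sql(
    "churn_model",
    "SELECT age, plan, monthly_spend, churned FROM customers",
    "churned",
    "predict_churn",
    "arn:aws:iam::123456789012:role/RedshiftMLRole",
    "my-redshift-ml-bucket",
)
print(sql)
```

After training, the analyst would call the generated function (here `predict_churn`) inside ordinary SELECT statements, never leaving SQL.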
Fifth, embracing open source for open, win-win collaboration. Amazon EMR, Amazon Elasticsearch, and Amazon MSK all provide comprehensive support for, and compatibility with, open-source APIs.
It is worth mentioning that hundreds of thousands of customers currently use Amazon Cloud Technology's global services to build data lakes and run workloads such as data analytics and machine learning.
"Putting customers first has driven the continuous evolution of Amazon Cloud Technology's data architecture, and customer feedback is also a source of innovation: 90% of Amazon Cloud Technology's innovation comes from directly listening to customer suggestions. Going forward, Amazon Cloud Technology will continue to accelerate its business in China through technological and practical innovation, while helping customers easily handle massive business data and fully tap the value of that data," Gu Fan said.