Links to other parts
A discussion between a CxO and a senior Data Architect Part 1
A discussion between a CxO and a senior Data Architect Part 2
A discussion between a CxO and a senior Data Architect Part 3
A discussion between a CxO and a senior Data Architect Part 5
Background: We have been following a discussion between senior leadership and a data architect. Part 4 continues.
Discussion Follows:
Alison: Can you tell me about the most complex migration you have done to date?
Vasumat: Sure! We migrated workloads for a European engineering and construction firm. They operate from 85 offices across the globe with a staff of 13,000 (4,300 of them in IT), revenues over $6.5 billion, and about 0.4 million customers. We built the business case around two major challenges. A) Data management on expensive storage – they were generating enormous volumes of data and struggling to maintain and manage it and to handle disaster recovery at their data centers. B) Scalability – their systems could not scale to handle heavy workloads. We proposed a cloud solution that addressed these two challenges and projected Azure as the suitable target based on TCO, ROI, and compatibility (the majority of their workloads run on Microsoft products: .NET, Windows OS, SQL Server, and the Office suite including email and SharePoint).
Alison: Great! Can you summarize the migration?
Vasumat: Sure!
Alison: That’s amazing; I think we are looking at a similar background. Out of curiosity, how did you handle the legacy systems?
Vasumat: Legacy systems are always a big deal. Let me give you an example: this client had been using a simple application based on SQL Server 2000 running on Windows Server 2003. They were neither willing to retire that application nor ready to invest in refactoring it, which left us with only the option of moving it as-is to a cloud VM. We spent some time preparing a detailed analysis report of the likely problems and presented it to the business team. For example, SQL Server 2000 is a 32-bit application, so it has compatibility issues with newer operating systems and limitations in utilizing memory, and, most importantly, its product support expired years ago, so it cannot fit into the modern data world. As a simpler alternative, we suggested re-platforming SQL Server 2000 to Azure SQL Database: no large new investment, just some extra effort to migrate the schema and data. After weighing the feasibility, cost ratio, performance gain, and risk involved, we got the go-ahead for re-platforming and completed it successfully. Likewise, for migrating the Windows Server 2003 VMs to Azure:
Alison: Do you remember the TCO differences and improvements after the cloud migration? Approximate figures would suffice.
Vasumat: Yes, I do! After the migration, we monitored the systems for some time (a few weeks to months, depending on criticality), captured the crucial parameters, projected future values, and compared them with the on-premises numbers. A detailed KPI report was given to the business. The major advantage of the cloud is little or no capital expenditure (the money spent to purchase and maintain IT infrastructure). KPI values compared with on-premises:
Alison: Let’s suppose I need a machine with 32 cores, 64 GB RAM, and 16 TB of HDD. From a cost point of view, which is the better option: purchasing a physical server or an Azure VM / AWS EC2 instance?
Vasumat: For a physical server, we might spend approximately $9,000 as a one-time investment. An Azure VM with a similar configuration might cost around $900 per month, so by the second month we have paid $1,800, and by the tenth month we have paid $9,000 in total for the Azure VM.
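(As a quick break-even check on those approximate figures:)

\[
\text{break-even} \;\approx\; \frac{\$9{,}000 \ \text{(one-time physical server)}}{\$900 \ \text{per month (Azure VM)}} \;=\; 10 \ \text{months}
\]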
Alison: Which means owning a physical server is the winner from a cost point of view, isn’t it?
Vasumat: Certainly! But before concluding, we should consider the facts:
Alison: Got it. How long did it take to complete this migration project?
Vasumat: It took 24 months: 8 months for analysis and planning and 16 months for execution and optimization.
Alison: You must have worked with many clients by now. What is the highest business impact you have seen from unplanned downtime?
Vasumat: One manufacturing client projected the business impact of 4 hours of unplanned downtime (caused by a hardware failure) at $2.3 million.
Alison: Vasumat, you must have handled VLDB (Very Large Database) migration, right?
Vasumat: Yes! The largest single database I have migrated so far is 8.5 TB.
Alison: That’s a real VLDB. Can you describe a few tips for handling VLDB migration?
Vasumat: For a VLDB, there are three major operations to worry about: A) backup, B) consistency checks, and C) index maintenance. These operations may run from hours to days depending on the database size and the available resources. In our case, on a VM with 32 cores, 448 GB RAM, and 20,000 Mbps network bandwidth, a full backup took 12.3 hours.
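(For illustration only: a minimal T-SQL sketch of those three routine operations on a hypothetical VLDB. The database name, file paths, and index name below are placeholders, not the client's actual objects.)

```sql
-- Hypothetical VLDB [SalesDW]; all names and paths are illustrative.

-- A) Full backup, striped across several files and compressed to shorten the window.
BACKUP DATABASE [SalesDW]
TO  DISK = 'E:\Backup\SalesDW_1.bak',
    DISK = 'F:\Backup\SalesDW_2.bak',
    DISK = 'G:\Backup\SalesDW_3.bak',
    DISK = 'H:\Backup\SalesDW_4.bak'
WITH COMPRESSION, CHECKSUM, STATS = 5;

-- B) Full consistency check (logical and physical).
DBCC CHECKDB ([SalesDW]) WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- C) Targeted index maintenance: rebuild only heavily fragmented indexes.
ALTER INDEX [IX_FactSales_OrderDate] ON dbo.FactSales
REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);  -- ONLINE rebuild requires Enterprise edition
```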
Best practices for dealing with a VLDB include:
Alison: So you migrated that 8.5 TB database to an Azure VM, is that right?
Vasumat: Yes! We migrated it to an Azure VM, but the issues remained the same. A full backup was still taking half a day, and a differential backup (four days after the full) was taking 8 hours. Also, because the maximum size of a single backup blob is limited to about 195 GB, we had to split the backup across multiple blobs to stay under that limit. When I approached an expert engineer from the Azure team, he suggested the file-snapshot approach; we ran a successful POC and then implemented it for our production database.
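(The split backup he refers to looks roughly like the following T-SQL sketch; the storage account, container, SAS token, and database name are placeholders.)

```sql
-- SAS-based credential whose name matches the container URL (placeholder values).
CREATE CREDENTIAL [https://mystorageacct.blob.core.windows.net/sqlbackups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<SAS token without the leading ?>';

-- Stripe the backup across several block blobs so no single blob approaches the ~195 GB limit.
BACKUP DATABASE [SalesDW]
TO  URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_1.bak',
    URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_2.bak',
    URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_3.bak',
    URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_4.bak'
WITH COMPRESSION, CHECKSUM,
     MAXTRANSFERSIZE = 4194304,   -- 4 MB transfer size, commonly used for backup to block blobs
     BLOCKSIZE = 65536,
     STATS = 5;
```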
Alison: Can you explain how you did that?
Vasumat: There is a feature that lets us place SQL Server database files as page blobs in Azure Blob Storage. With this, we get benefits of Azure Storage that are not available in a traditional storage system. For example, consider Instant File Initialization (IFI): with traditional storage, if IFI is not enabled, the system performs a zeroing process (writing zeros to disk to avoid a security risk) whenever new space is requested, and IFI applies only to data files, not log files. With Azure Blob Storage, the zeroing is handled automatically without enabling IFI, and, most importantly, it applies to log files as well. When the Put Page API is called to clear a range, the storage service guarantees that reading from a cleared range always returns zeroes. We tried growing a log file from 50 MiB to 1 TiB and it took 1 second.
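(A minimal T-SQL sketch of that setup, assuming a SAS credential scoped to the container; the storage account, container, and database name are placeholders.)

```sql
-- Credential named after the container URL; SQL Server uses it for any file placed in that container.
CREATE CREDENTIAL [https://mystorageacct.blob.core.windows.net/sqldata]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<SAS token without the leading ?>';

-- Data and log files are created directly as page blobs in Azure Blob Storage.
CREATE DATABASE [SalesDW]
ON  (NAME = SalesDW_data,
     FILENAME = 'https://mystorageacct.blob.core.windows.net/sqldata/SalesDW_data.mdf')
LOG ON (NAME = SalesDW_log,
        FILENAME = 'https://mystorageacct.blob.core.windows.net/sqldata/SalesDW_log.ldf');
```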
Alison: But how did it resolve your backup issue?
Vasumat: I am coming to that. There is a feature called file-snapshot backup. Unlike a traditional streaming backup, a file-snapshot backup creates an Azure storage snapshot of each SQL Server database file. Can you believe that for the 8.5 TB production database, while it was live and running queries at 72% CPU utilization, the full backup completed in 2 minutes 30 seconds?
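(With the database files already in Blob Storage, the file-snapshot backup itself is a one-liner; again, the names below are placeholders.)

```sql
-- File-snapshot backup: requires the database files to reside in Azure Blob Storage (as set up above).
-- The backup file itself is tiny; it records pointers to the blob snapshots plus backup metadata.
BACKUP DATABASE [SalesDW]
TO URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_snapshot.bak'
WITH FILE_SNAPSHOT;

-- List the blob snapshots associated with the current database's files.
SELECT * FROM sys.fn_db_backup_file_snapshots (NULL);
```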
Alison: That’s amazing. Even at 10 Gbps you could hardly back up 1 TB in 2 minutes, so how did it manage 8.5 TB?
Vasumat: Because no data movement occurs: the backup simply stores pointers to those snapshots along with other backup metadata, so it is fast and still maintains backup consistency. As for data consistency checks (DBCC), we can quickly restore the latest file-snapshot backup onto another SQL Server instance (it took about 5 seconds) and run the DBCC commands there. That gives two benefits: A) we run a full consistency check (not physical-only or partial checks), and B) we check the latest copy without disturbing the production workload.
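(A sketch of that offload pattern, run on a secondary instance that has the same container credential; all names and URLs are placeholders, and each restored file must be directed to a new blob URL.)

```sql
-- Near-instant restore of the file-snapshot backup as a new database used only for checking.
RESTORE DATABASE [SalesDW_Check]
FROM URL = 'https://mystorageacct.blob.core.windows.net/sqlbackups/SalesDW_snapshot.bak'
WITH MOVE 'SalesDW_data' TO 'https://mystorageacct.blob.core.windows.net/sqldata/SalesDW_Check_data.mdf',
     MOVE 'SalesDW_log'  TO 'https://mystorageacct.blob.core.windows.net/sqldata/SalesDW_Check_log.ldf',
     RECOVERY, STATS = 5;

-- Full consistency check, without touching the production workload.
DBCC CHECKDB ([SalesDW_Check]) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```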
Alison: Okay! That sounds fascinating. So, do you prefer the same approach for all SQL workloads?
Vasumat: Certainly not! Every feature is designed for a specific purpose and has its limitations, so we use it only when the database fits within them. For example, this feature does not support FILESTREAM data, geo-replication for the storage account, In-Memory OLTP, or Always On failover cluster instances. Also, standard storage (with its 500 TB account limit) is generally unsuitable for performance reasons, while a premium storage account is limited to 35 TB.
Alison: I understand. I specifically asked you about VLDB as we have SAP systems running on-premises. Do you see any SAP offerings from Azure?
Vasumat: I don’t have hands-on experience migrating SAP workloads to the cloud. However, I participated in a case-study discussion at a customer site, where they explained how SAP systems with VLDBs were migrated to Azure. So I can at least say I know the theory.
Alison: Then I would like to know about your theory.
Vasumat: Sure! From a data point of view, SAP systems mainly require A) large storage and B) elastic compute capacity. To address these requirements, Azure offers two optimized solutions: A) SAP on Azure Virtual Machines and B) SAP HANA on Azure Large Instances. Both options are certified to host SAP workloads, and they give you the best of both worlds (SAP and the Azure cloud), because alongside SAP we can utilize the latest Azure capabilities, e.g. automation, high availability with guaranteed SLAs, security, scalability, and cost and operational efficiency. Azure supports SAP deployments including SAP NetWeaver, SAP S/4HANA, BW/4HANA, BI, and HANA, in both scale-up and scale-out scenarios.
Alison: That’s interesting. In the case study you are talking about, from the customer’s business point of view, what was the top benefit they gained by migrating SAP to Azure? Only if you remember, of course.
Vasumat: I may forget the process and approach, but I always try to remember the KPIs, because they tell us how successful a migration strategy was. There were two major factors: A) projected 5-year cost savings of 60 to 65%, and B) the time to provision and build new components for very large workloads improved drastically, from 1 month on-premises to 1 week on Azure.
Alison: Do you know how they migrated large SAP databases to the cloud?
Vasumat: Again, this is theory, as I haven’t tried it myself.
Alison: Fine, your theory is from a real use case. So, for me, it’s more than a theory. Please go ahead.
Vasumat: If it is a homogeneous migration (Oracle to Oracle, SQL Server to SQL Server), we can migrate with relatively low or near-zero downtime using the native HA and DR features. But for heterogeneous migrations we need downtime, and its duration depends on our planning strategy.
For the larger databases, they used AzCopy, R3Load (which exports and imports data in a platform-independent format), and Migmon (Migration Monitor). On-premises, they provisioned 4 physical servers with accelerated networking to run R3Load: server 1 to unload the top 5 tables, servers 2 and 3 for tables handled with table splits, and server 4 for all the remaining tables. Since R3Load exports in parallel, the sequence of the import is controlled via the signal file (SGN) that is generated automatically once all export packages are complete. I am describing this in very generic and simple terms, but the implementation is highly challenging and needs to consider a lot of factors:
Alison: That’s very informative, thanks for taking me into the SAP world. I liked that cloning thought.
Alison: I think it’s already late. Do you have 30 more min? We’ll quickly wrap it up.
Vasumat: Absolutely no issue.
Alison: Wonderful! We currently hold stakeholder financial portfolios, customer personal identities, and other sensitive information classified as confidential and restricted. When I think about moving to the cloud, the first thing that comes to mind is data security, since we would be storing our corporate data in a public cloud data center such as Azure or AWS. As the data owner, you need to convince me about the cloud migration by explaining the public cloud’s security capabilities. Assuming I have zero knowledge of cloud security, can you list the possible security risks and how cloud providers handle them?
In the final part, we will see how Vasumat deals with Cloud Data Security.