A discussion between a CxO and a senior Data Architect Part 4

Links to other parts

A discussion between a CxO and a senior Data Architect Part 1

A discussion between a CxO and a senior Data Architect Part 2

A discussion between a CxO and a senior Data Architect Part 3

A discussion between a CxO and a senior Data Architect Part 5

.

Background: We have been following a discussion between senior leadership and a data architect. Part 4 continues below.

.

Discussion Follows:

.

Alison: Can you tell me about the most complex migration you have done to date?

Vasumat: Sure! We migrated workloads for a European engineering and construction firm. They operate from 85 offices across the globe with a staff of 13,000 (4,300 in IT), revenues of over $6.5 billion, and roughly 400,000 customers. We built the business case around two major challenges. A) Data management on expensive storage: they were generating enormous volumes of data and struggling to maintain and manage it, and to run disaster recovery at their own data centers. B) Scalability: their systems could not scale to handle heavy workloads. We proposed a cloud solution that addressed these two challenges and positioned Azure as the most suitable target based on TCO, ROI, and compatibility (the majority of their workloads were built on Microsoft products: .NET, Windows OS, SQL Server, the Office suite including email, SharePoint, etc.).

.

Alison: Great! Can you summarize the migration?

Vasumat: Sure!

 Migration challenges included legacy systems (SQL Server 2000, Windows Server 2003), huge data sets, an Oracle to Azure SQL Database move (a customer requirement), different migration strategies for different applications (refactoring, re-platforming, lift-and-shift, phased), critical applications requiring near-zero downtime, a hybrid cloud setup for 5 applications (some components on-premises and some in Azure), heterogeneous database systems (SQL Server, Oracle, MySQL, MongoDB), and a diversified feature selection (PaaS, SaaS, IaaS, serverless).
 Tools used:
 Azure Migrate (VM & SQL Server migration: discovery & assessment, server migration)
 Azure Data Box (shipping over 40 TB of data on a physical device)
 Data Migration Assistant (DMA) & Database Migration Service (DMS) (assessing and migrating databases)
 Azure Data Factory (migrating huge datasets)
 AzCopy (migrating storage; see the sketch after the migration outcome below)
 Microsoft Virtual Machine Converter (MVMC) / Virtual Machine Manager (VMM) (converting VMs on VMware hosts or physical computers to VMs running on Microsoft Hyper-V)
 Azure Site Recovery (ASR), Azure's DRaaS (Disaster Recovery as a Service) for VMs; in some scenarios we also used it for VM migration
 Recovery Services vault (storing VM backups in Azure)
 VPN Gateway / ExpressRoute (establishing a proper communication channel between on-premises and Azure)
 Azure Synapse Pathway (converting DDL and DML statements to be compliant with Azure Synapse Analytics when migrating a data warehouse to Synapse)
 SQL Server Migration Assistant (SSMA) (migrating from other RDBMSs such as MySQL and Oracle to SQL Server and Synapse Analytics)
 Azure AD Connect (synchronizing on-premises Active Directory with Azure Active Directory)
 Migration Outcome:
 As part of the analysis, we identified unused/unnecessary applications and retired 180 on-premises VMs. We successfully migrated 260 applications and 1,400+ servers and VMs (1,100 Windows, 300+ Linux),
 650+ database instances (SQL Server, Oracle, MySQL, MongoDB),
 Project environments migrated – DEV, TEST, STAGE, PRE-PROD, PROD,
 38 file shares, 12000+ active users, 25+ domains, 30+ messaging services,
 Data warehouses, reporting, and ETL solutions (SQL, Informatica, SSIS, SSRS),
 1.2 petabytes (about 1,200 TB) of data migrated.
 Finally, we closed/decommissioned 97% of on-premises legacy data center resources.
 Auto-scaling cloud infrastructure helped them handle peak loads with zero issues. On-premises, they used to experience an average of 3 to 4 outages per month due to infrastructure issues; after the cloud migration, they had zero outages in the first 8 months and a single outage in the first 12 months.
 Also, based on access frequency, we leveraged the storage tiers (Hot, Cool & Archive), which saved the business a significant amount of money. Application development, management, and support also became faster and more efficient.
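
As for the AzCopy step referenced in the tools list, here is a minimal sketch of how such a copy can be scripted. The local share path, storage account, container, and SAS token are all placeholders rather than values from the actual project.

```python
import subprocess

# Placeholder source share and destination container; the SAS token is hypothetical.
source = r"\\fileserver\projects"
destination = "https://mystorageacct.blob.core.windows.net/projects?<sas-token>"

# "azcopy copy --recursive" walks the directory tree and uploads it to Blob Storage.
subprocess.run(["azcopy", "copy", source, destination, "--recursive"], check=True)
```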

.

Alison: That’s amazing, I think we are looking for a similar background. Out of curiosity, how did you handle the legacy systems?

Vasumat: Legacy systems are always a big deal. Let me give you an example: our client was using a simple application based on SQL Server 2000 running on Windows Server 2003. They were neither willing to retire that application nor ready to invest in refactoring it, so we were left with only one option: moving it as-is to a cloud VM. We spent some time preparing a detailed analysis report of the likely problems and presented it to the business team. For example, SQL Server 2000 is a 32-bit application, so it has OS compatibility issues and memory-utilization limits, and, most importantly, product support expired years ago, so it simply does not fit the modern data world. To keep things simple, we suggested re-platforming from SQL Server 2000 to Azure SQL Database: no major additional investment, just some extra effort to migrate the schema and data (a rough data-copy sketch follows the list below). After weighing the possibilities, cost ratio, performance gain, and risk involved, we got the go-ahead for re-platforming and completed it successfully. Likewise, for migrating the Windows Server 2003 VMs to Azure:

 Converted the Windows Server 2003 machine to a VMware VM using VMware tooling
 Converted the VMware VM to a Microsoft Hyper-V VM
 Migrated the Hyper-V VM to an Azure VM
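
Returning to the SQL Server 2000 re-platforming mentioned above, here is the rough data-copy sketch referenced there. It only illustrates the shape of a table-by-table copy into Azure SQL Database; the connection strings and table names are hypothetical, a legacy driver or DSN is assumed for the SQL Server 2000 source (modern ODBC drivers may not connect to it), and in practice the schema migration and bulk loading would typically be handled with the migration tooling listed earlier.

```python
import pyodbc

# Hypothetical connection strings and table list; a sketch of the data-copy step only.
src = pyodbc.connect("DSN=LegacySql2000;UID=migration;PWD=***")
dst = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=TargetDb;"
    "UID=migration;PWD=***;Encrypt=yes"
)

for table in ["dbo.Customers", "dbo.Orders"]:          # hypothetical tables
    rows = src.cursor().execute(f"SELECT * FROM {table}")
    cols = [c[0] for c in rows.description]
    insert = (f"INSERT INTO {table} ({', '.join(cols)}) "
              f"VALUES ({', '.join('?' * len(cols))})")

    cur = dst.cursor()
    cur.fast_executemany = True                        # bulk parameter arrays for Azure SQL
    while True:
        batch = rows.fetchmany(10_000)                 # copy in bounded batches
        if not batch:
            break
        cur.executemany(insert, [tuple(r) for r in batch])
    dst.commit()
```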

.

Alison: Do you remember the TCO differences and improvements after the cloud migration? Approximate figures would suffice.

Vasumat: Yes, I do! After the cloud migration, we monitored for some time (a few weeks or months, depending on criticality), captured the crucial parameters, predicted future values, and compared them with the on-premises numbers. A detailed KPI report was given to the business. The major advantage of the cloud is little or no capital expenditure (the money spent to purchase and maintain IT infrastructure). KPI values compared with on-premises:

 Unplanned downtime reduced by: 83.5%
 New equipment deployment time reduced by: 98%
 Software/application deployment time improved by: 35%
 Number of applications deployed per year increased by: 109%
 5-year TCO decreased by: 51% (e.g., on-prem cost: $1M | cloud cost: $0.49M)
 5-year ROI increased by: 220%
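
To make the last two figures concrete, here is the shape of that arithmetic as a small sketch. The $1M on-premises baseline and the 51% figure come from the list above; the 5-year benefit number is purely hypothetical, chosen only so the result reproduces the quoted ROI.

```python
# Illustrative arithmetic behind the 5-year TCO and ROI figures above.
on_prem_5yr_tco = 1_000_000                      # example baseline: $1M on-premises
cloud_5yr_tco = on_prem_5yr_tco * (1 - 0.51)     # 51% lower => $490,000

# ROI over the same 5-year window: (benefit - investment) / investment.
investment = cloud_5yr_tco
benefit = 1_568_000                              # hypothetical 5-year gain (savings + productivity)
roi_pct = (benefit - investment) / investment * 100

print(f"5-year cloud TCO: ${cloud_5yr_tco:,.0f}")   # $490,000
print(f"5-year ROI: {roi_pct:.0f}%")                # 220%
```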

.

Alison: Let’s suppose I need a machine with 32 cores, 64 GB RAM, and 16 TB of HDD. From a cost point of view, which is the better option: purchasing a physical server, or an Azure VM / AWS EC2 instance?

Vasumat: For the physical server, we may need to spend approximately $9,000 as a one-time investment. To get an Azure VM with a similar configuration, we may need to pay around $900 per month. By the second month we have spent $1,800, and by the tenth month we have paid $9,000 in total for the Azure VM.

Alison: Which means owning a physical server is the winner from a cost point of view, isn’t it?

Vasumat: It looks that way! But before concluding, we should consider a few facts:

 Owning a private data center: along with the server price, we must account for many other factors, including installation, real estate, internet connectivity, power and energy management, cooling equipment, networking and other hardware, labor and IT staff, repair and servicing, disaster recovery management, shipping backups offsite (out of the data center), software upgrades, and asset depreciation (equipment value and performance decline with time and become negligible at the end of the life span).
 Cloud: with reserved instances (when we know a VM will run for the next 1 or 3 years) we get a 30 to 50% discount; in the pay-as-you-go model we can switch off VMs when they are not required (for example, TEST and DEV machines over the weekend); the platform is highly scalable for peak workloads; and automatic backups, Disaster Recovery as a Service, modernization features, and serverless compute options make it a stable environment that improves productivity, and hence brings more business and more revenue.
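
A small sketch of the break-even arithmetic behind this exchange. The $9,000 and $900/month figures come from the example above; the reserved-instance discount and the DEV/TEST uptime fraction are illustrative assumptions.

```python
# Naive comparison: pay-as-you-go list price vs. one-time hardware purchase.
server_capex = 9_000        # one-time physical server price (compute only)
vm_paygo = 900              # Azure VM list price per month (pay-as-you-go)
breakeven_month = server_capex / vm_paygo                 # = 10 months

# Levers that change the picture on the cloud side (assumed values).
reserved_discount = 0.40    # within the 30-50% range for a 1- or 3-year reservation
dev_test_uptime = 0.60      # DEV/TEST VMs deallocated at nights and over weekends

vm_reserved = vm_paygo * (1 - reserved_discount)          # $540/month
vm_dev_test = vm_paygo * dev_test_uptime                  # $540/month effective

# The on-prem side is understated too: $9,000 ignores power, cooling, real estate,
# staff, disaster recovery, and refresh cycles, so its true monthly cost is far
# higher than the capex spread over the server's lifetime.
print(breakeven_month, vm_reserved, vm_dev_test)
```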

.

Alison: Got it. How long did this migration project take?

Vasumat: It took 24 months: 8 months for analysis and planning, and 16 months for execution and optimization.

.

Alison: You must have worked with many clients by now. What is the highest business impact you have seen from unplanned downtime?

Vasumat: One manufacturing client projected the business impact of 4 hours of unplanned downtime (caused by a hardware failure) at $2.3 million.

.

Alison: Vasumat, you must have handled VLDB (Very Large Database) migration, right?

Vasumat: Yes! The largest single database I have migrated so far is 8.5 TB.

Alison: That’s a real VLDB. Can you describe a few tips for handling VLDB migration?

Vasumat: For a VLDB, there are three major things to worry about: A) backups, B) consistency checks, and C) index maintenance. These operations can run from hours to days depending on the database size and the available resources. In our case, on a VM with 32 cores, 448 GB RAM, and 20,000 Mbps of network bandwidth, a full backup took 12.3 hours to complete.

Best practices for dealing with a VLDB are:

 Data cleansing: remove or archive all unnecessary, unused, and historical data.
 Index maintenance: do not rebuild or reorganize all indexes in one go. Instead, based on fragmentation statistics, divide them into three buckets (daily, weekly, and monthly) and schedule index maintenance accordingly (see the sketch after this list).
 Consistency checks: run only filegroup- or table-level checks with the PHYSICAL_ONLY option.
 Backups: leverage compression, use differential backups, check whether file-snapshot backups are an option, and always take split (striped) backups.
 Data files, filegroups & partitions: considering per-disk IOPS and throughput limits, partition large tables based on data distribution, map partitions to separate filegroups, add a separate data file per filegroup, and place those data files on separately managed disks (Azure VM disks). To keep it simple: Table > Partitions > Separate filegroup > Separate NDF > Separate VM disk.
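
For the index-maintenance bullet, here is the sketch referenced in the list. It assumes a hypothetical server and database and the commonly used 5%/30% fragmentation thresholds; a real schedule would spread this work across the daily, weekly, and monthly windows described above.

```python
import pyodbc

# Hypothetical connection; autocommit avoids wrapping maintenance in a transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myvm;DATABASE=BigDb;"
    "UID=dba;PWD=***;Encrypt=yes;TrustServerCertificate=yes",
    autocommit=True,
)

# Pull fragmentation statistics instead of blindly rebuilding every index.
frag_query = """
SELECT QUOTENAME(s.name) + '.' + QUOTENAME(o.name) AS table_name,
       QUOTENAME(i.name) AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
JOIN sys.objects AS o ON o.object_id = ips.object_id
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE i.name IS NOT NULL AND ips.avg_fragmentation_in_percent > 5
"""
rows = conn.cursor().execute(frag_query).fetchall()

maint = conn.cursor()
for table_name, index_name, frag in rows:
    if frag > 30:
        # Heavily fragmented: rebuild (ONLINE requires Enterprise edition or Azure SQL).
        maint.execute(f"ALTER INDEX {index_name} ON {table_name} REBUILD WITH (ONLINE = ON)")
    else:
        # Lightly fragmented: a cheaper reorganize is enough.
        maint.execute(f"ALTER INDEX {index_name} ON {table_name} REORGANIZE")
```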

.

Alison: So, you migrated that 8.5 TB database to an Azure VM, right?

Vasumat: Yes! We migrated it to an Azure VM, but the issues remained the same. A full backup was still taking half a day, and a differential backup (4 days after the full) was taking 8 hours. Also, the maximum blob size was limited to 195 GB, so we had to take split (striped) backups to make sure no single backup file hit that limit. When I approached an expert engineer from the Azure team, he suggested the file-snapshot approach; we ran a successful POC and then implemented it for our production database.
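
The split backup mentioned here is a backup striped across several blobs; below is a minimal sketch, assuming a hypothetical storage account, a recent SQL Server version, and a SAS credential already created for the container (so no secrets appear in the statement).

```python
import pyodbc

# BACKUP cannot run inside a user transaction, hence autocommit=True.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myvm;DATABASE=master;"
    "UID=dba;PWD=***;Encrypt=yes;TrustServerCertificate=yes",
    autocommit=True,
)

container = "https://mystorageacct.blob.core.windows.net/backups"

# Stripe the backup across several blobs so no single blob nears the size ceiling.
stripes = ",\n   ".join(f"URL = '{container}/bigdb_{i}.bak'" for i in range(1, 9))
backup_sql = f"""
BACKUP DATABASE BigDb
TO {stripes}
WITH COMPRESSION, CHECKSUM, STATS = 5;
"""

cur = conn.cursor()
cur.execute(backup_sql)
while cur.nextset():    # drain progress messages until the backup completes
    pass
```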

Alison: Can you explain how you did that?

Vasumat: There is a feature that lets us place SQL Server database files as page blobs in Azure Blob Storage. With this, we get benefits of Azure Storage that are not available with a traditional storage system. For example, consider Instant File Initialization (IFI): with traditional storage, if IFI is not enabled, the system zeroes out newly requested space (writing zeros to disk to avoid a security risk), and IFI applies only to data files, never to log files. With Azure Blob Storage, the zeroing is handled automatically without enabling IFI, and, most importantly, it effectively applies to log files as well: when the Put Page API is called, the range is simply cleared and reads from a cleared range are guaranteed to return zeros. We tried growing a log file from 50 MiB to 1 TiB and it took 1 second.
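
What is being described here is the "SQL Server Data Files in Azure" capability; a minimal setup sketch follows, with a hypothetical storage account, container, and SAS token.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myvm;DATABASE=master;"
    "UID=dba;PWD=***;Encrypt=yes;TrustServerCertificate=yes",
    autocommit=True,
)

container = "https://mystorageacct.blob.core.windows.net/data"

# A credential named after the container URL lets the instance create page blobs there.
conn.execute(f"""
CREATE CREDENTIAL [{container}]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<sas-token>';
""")

# The database files now live directly in Blob Storage as page blobs.
conn.execute(f"""
CREATE DATABASE BlobDb
ON (NAME = BlobDb_data, FILENAME = '{container}/BlobDb_data.mdf')
LOG ON (NAME = BlobDb_log, FILENAME = '{container}/BlobDb_log.ldf');
""")
```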

Alison: But how did it resolve your backup issue?

Vasumat: I am coming to that. There is a feature called file-snapshot backup. Unlike a traditional streaming backup, a file-snapshot backup creates an Azure storage snapshot of each SQL Server database file. Can you believe that for the 8.5 TB production database, while it was live and queries were running at 72% CPU utilization, the full backup completed in 2 minutes 30 seconds?
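
Once the database files live in Blob Storage, a file-snapshot backup is a single statement; a minimal sketch with hypothetical names:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=myvm;DATABASE=master;"
    "UID=dba;PWD=***;Encrypt=yes;TrustServerCertificate=yes",
    autocommit=True,
)

# The backup file holds pointers to Azure storage snapshots of each database file,
# so no data is streamed; it only works when every database file is in Blob Storage.
conn.execute("""
BACKUP DATABASE BlobDb
TO URL = 'https://mystorageacct.blob.core.windows.net/backups/BlobDb_full.bak'
WITH FILE_SNAPSHOT;
""")
```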

Alison: That’s amazing. Even at 10 Gbps, you could hardly back up 1 TB in 2 minutes. How did it handle 8.5 TB?

Vasumat: Because no data movement occurs: the backup simply stores pointers to those snapshots plus some backup metadata. That’s why it is fast while still being a consistent backup. And as for data consistency checks (DBCC), we can quickly restore the latest file-snapshot backup (it takes about 5 seconds) onto another SQL instance and run the DBCC commands there. Two benefits: A) we do a full consistency check (no PHYSICAL_ONLY or partial checks), and B) we check the latest copy without disturbing the production workload.
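
And a sketch of the second half of that workflow: restoring the latest file-snapshot backup as a copy on a second instance and running a full consistency check there. The instance name, database names, and URLs are hypothetical.

```python
import pyodbc

# Connect to the secondary instance used for consistency checks.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=checkvm;DATABASE=master;"
    "UID=dba;PWD=***;Encrypt=yes;TrustServerCertificate=yes",
    autocommit=True,
)

container = "https://mystorageacct.blob.core.windows.net"

# Snapshot-based restore: fast because it copies snapshots rather than streaming data.
cur = conn.cursor()
cur.execute(f"""
RESTORE DATABASE BlobDb_Check
FROM URL = '{container}/backups/BlobDb_full.bak'
WITH MOVE 'BlobDb_data' TO '{container}/data/BlobDb_Check_data.mdf',
     MOVE 'BlobDb_log' TO '{container}/data/BlobDb_Check_log.ldf',
     RECOVERY;
""")
while cur.nextset():
    pass

# Full check (no PHYSICAL_ONLY) on the restored copy, away from production.
conn.execute("DBCC CHECKDB (BlobDb_Check) WITH NO_INFOMSGS;")
```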

Alison: Okay! That sounds fascinating. So, do you prefer the same approach for all SQL workloads?

Vasumat: Certainly not! Every feature is designed for a specific purpose and has its limitations, and we use it only when our database fits the feature. For example, it does not support FILESTREAM data, geo-replication of the storage account, In-Memory OLTP, or Always On failover cluster instances. Also, standard storage (500 TB limit) cannot be used due to performance issues, while a premium storage account is limited to 35 TB.

.

Alison: I understand. I specifically asked you about VLDB as we have SAP systems running on-premises. Do you see any SAP offerings from Azure?

Vasumat: I don’t have hands-on experience migrating SAP workloads to the cloud. Nevertheless, I participated in a case-study discussion at a customer site, where they explained how SAP systems with VLDBs were migrated to Azure. Let’s say I know the theory.

Alison: Then I would like to hear your theory.

Vasumat: Sure! SAP systems mainly require (at least from a data point of view) A) large storage and B) elastic compute capacity. Addressing these requirements, Azure offers two optimized solutions: A) SAP on Azure Virtual Machines and B) SAP HANA on Azure Large Instances. Both options are certified to host SAP workloads, and it is the best of both worlds (SAP and the Azure cloud), because alongside SAP we can utilize the latest Azure features, for example automation, high availability with guaranteed SLAs, security, scalability, and cost and operational efficiency. Azure supports SAP deployments including SAP NetWeaver, S/4HANA, BW/4HANA, BI, and HANA in scale-up and scale-out scenarios.

 SAP on Azure Virtual Machines: on-demand VMs suitable for small to medium workloads, offering up to 4 TB of RAM and 128 vCPUs.
 SAP HANA on Azure Large Instances: purpose-built for very large SAP workloads. Azure runs the workload on non-shared, bare-metal hardware dedicated to us, meaning the instance has its own compute, networking, and storage infrastructure. It offers up to 24 TB of RAM and 480 Intel CPU cores.

.

Alison: That’s interesting. In the case study you are talking about, from the customer’s business point of view, what was the top benefit they gained by migrating their SAP systems to Azure? Only if you remember.

Vasumat: I may forget the process and approach, but I always try to remember the KPIs, because they tell us the success rate of our migration strategy. There were two major factors: A) projected 5-year cost savings of 60 to 65%, and B) the time to provision and build new components for very large workloads improved drastically, from one month (on-premises) to one week (Azure).

.

Alison: Do you know how they migrated large SAP databases to the cloud?

Vasumat: Again, it’s theory, as I haven’t tried it myself.

Alison: Fine, your theory is from a real use case. So, for me, it’s more than a theory. Please go ahead.

Vasumat: If it is a homogeneous migration (Oracle to Oracle, SQL Server to SQL Server), then we can migrate with relatively low or near-zero downtime using the native HA & DR features. But when it comes to heterogeneous systems, we need downtime, and the duration depends on our planning strategy.

For the larger databases, they used AzCopy, R3load (which exports/imports data in a platform-independent format), and Migmon (Migration Monitor). On-premises, they provisioned 4 physical servers with accelerated networking for running R3load: server 1 for exporting the top 5 tables, servers 2 and 3 for tables with table splits, and server 4 for all remaining tables. Since R3load exports in parallel, the import sequence is controlled via the signal files (SGN) that are generated automatically when all export packages are completed. I am describing this in very generic and simple terms, but the implementation is highly challenging and needs to consider a lot of factors:

 Optimizing the source system, e.g., data cleansing, network improvements, etc.
 Enabling jumbo frames between the source server and the R3load server to improve export performance. Jumbo frames are Ethernet frames with an MTU (Maximum Transmission Unit) larger than the typical 1500 bytes, e.g., 1998 bytes.
 Calculating the net impact (sorting, number of parallel exports/imports, table-split support, network bandwidth, etc.).
 Cloning the source system: create multiple copies of the source database and export different entities (tables) from each copy in parallel.
 Dealing with the transaction log on the target server: place it on the local SSD during the migration (so it does not consume the VM’s IOPS and bandwidth quota) and move it to a persistent disk after the migration.
 Dealing with heterogeneous systems, e.g., cross-OS or cross-DB migrations (Windows to Linux, SQL Server to Oracle, etc.).
 Dealing with failed parallel processes: truncate and restart performs better than row-by-row deletes.
 Special recommendations: for example, never use UNIX or virtualized servers for R3load exports, as it negatively impacts performance; do not decide on the number of parallel exports without calculating the net impact; and verify the network speed for AzCopy over the public internet and over ExpressRoute before concluding.
 With proper planning, prior testing, and these optimization techniques, they were able to achieve about 2.5 TiB per hour and successfully migrated the SAP workloads to Azure.

.

Alison: That’s very informative, thanks for taking me into the SAP world. I liked that cloning thought.

.

Alison: I think it’s already late. Do you have 30 more minutes? We’ll quickly wrap it up.

Vasumat: Absolutely no issue.

.

Alison: Wonderful! We currently hold stakeholder financial portfolios, customer personal identities, and other sensitive information classified as confidential and restricted. When I think about moving to the cloud, the first thing that comes to mind is data security: we would be storing our corporate data in a public cloud data center such as Azure or AWS. Since you are the data owner, you need to convince me about the cloud migration by explaining the public cloud’s security capabilities. Assuming I have zero knowledge of cloud security, can you list all the possible security risks and how cloud providers handle them?

.

In the final part, we will see how Vasumat deals with Cloud Data Security.
