Links to other parts
A discussion between a CxO and a senior Data Architect Part 1
A discussion between a CxO and a senior Data Architect Part 2
A discussion between a CxO and a senior Data Architect Part 4
A discussion between a CxO and a senior Data Architect Part 5
.
Background: We have been following a discussion between senior leadership and a senior data architect. They took a 45-minute lunch break, and Part 3 continues here.
.
Discussion Follows:
.
Alison: Hope you enjoyed your food.
Vasumat: Yup, more than that, I liked that elegant garden, especially the spectacular mountain view from the corner.
Alison: Hmm, yes, that’s a favorite coffee spot for everyone on the floor.
Alison: Coming back to business, I would now like to discuss your area of expertise. I believe you have extensive experience in migrating data platforms, is that right?
Vasumat: True!
.
Alison: Can you describe the steps involved in migrating database systems from on-premises to the cloud? Let me simplify my question: suppose we have heterogeneous database systems like Microsoft SQL Server, Oracle, etc. I would like to know how you handle the database migrations. Since we already discussed the phases, you can walk through the series of steps involved. I would appreciate it if you could explain it on the whiteboard. Take your time and go into as much detail as possible.
Vasumat: Sure! I will assume we already have an approved business case and the team has been assigned.
.
Migration Planning:
• Assess and Discover: Collect all available information and prepare a data-estate / data-sheet / metadata inventory covering data stores (file shares, network shares, data lakes, etc.) and database systems (SQL, NoSQL, etc.) in the source environment. It includes database instances (RDBMS, version, edition, license, etc.), instance type (database engine, reporting, analytics, operational, etc.), configurations (trace flags, collation, etc.), features (FileStream, PolyBase, etc.), host details (virtualization, physical servers, OS type, license, etc.), computing power (memory, disk, CPU, throughput, etc.), storage and data volume (data size, data growth rate, etc.), partition info (keys, distribution), data type (transactional, analytical, archive, warehouse, reporting, staging, etc.), environment (DEV, Testing, QA, Staging, Pre-Prod, Prod), tools (development, deployment, monitoring), networking, security (NSGs, firewalls, SSL, encryption, logins and passwords, service accounts, permissions, etc.), backup strategy, high availability, disaster recovery, load balancing, data dependencies, database IT operations (change request, incident, and problem management, etc.), key performance baselines and indicators (IO, CPU, memory, query timeouts, errors), etc. A small inventory sketch is shown after this list.
• Integration Matrix: A data entity can be a “data provider”, a “data consumer”, or both. We must prepare a matrix that projects all dependencies of our data entity/database. This means identifying the applications, services, and programs that use our databases. We also define and tag each application with priority (based on revenue impact), complexity (number of components involved), and route (source, destination, IP, port, port type, inbound/outbound, access credentials, etc.) information. It plays a vital role in determining the migration scope.
• Define the Migration Scope: Based on the assessment details, we classify the data entities into those moving to the cloud, those remaining on-premises, and those being retired/decommissioned.
• Define the Cloud solution: We need to identify:
◦ Cloud model: Public, private, Hybrid, or Multi-Cloud
◦ Cloud Service provider: Azure, AWS, Google Cloud, Oracle Cloud Infrastructure, etc.
◦ Cloud Service model: Platform-as-a-Service, Infrastructure-as-a-Service, Software-as-a-Service. From the DB side, we deal with PaaS and IaaS. We conclude this in the resource mapping step.
• Resource mapping: Map on-premises resources to cloud resources. If we plan to re-platform or refactor, check compatibility and perform a POC (Proof of Concept) in the cloud. Sizing the cloud resources must be handled by considering the expected performance and the allocated budget. In general, we allocate slightly more compute than required, which helps speed up the migration. Once the migration is done, we can scale the service tier back down to the exact required size.
◦ For example, map database systems to Azure SQL Database, Elastic Pool, Managed Instance, AWS RDS/Aurora, Azure SQL Serverless, AWS Aurora Serverless, Azure VM, AWS EC2, etc.; servers or virtual machines to Azure VM, AWS EC2; network drives and shared folders to Azure Files, AWS Elastic File System (EFS); web applications to Azure App Service, AWS Elastic Beanstalk; virtual server disks to Azure Managed Disks, AWS Elastic Block Store (EBS); storage solutions to Azure Blob Storage, Data Lake, AWS Simple Storage Service (S3); NoSQL to Azure Cosmos DB APIs, AWS DynamoDB, SimpleDB, DocumentDB; load balancers to Azure Load Balancer, AWS Network Load Balancer; programs to Elastic Jobs, Azure Functions; workflows and jobs to Logic Apps; ETL solutions to Data Factory, Databricks, Synapse pipelines, AWS Glue, etc.; data warehouses to Azure Synapse Analytics, AWS Redshift, etc.; CI/CD pipelines to Azure DevOps release pipelines, AWS CodePipeline, etc.; data shares to Azure Data Share, AWS Lake Formation, etc.
• Define the cloud migration approach and tools: Tag each component marked for migration with one of the cloud migration strategies (rehost, refactor, re-architect, etc.). From the database aspect, we mostly do either rehost (IaaS) or refactor (PaaS). Based on the migration requirement (online or offline), we need to select the right migration tool, e.g., Database Migration Service, Data Migration Assistant, backup & restore, replication, AlwaysOn, export/import, Visual Studio, bulk load, ETL tools, third-party tools, etc.
• Define the rollback plan: We need to prepare a rollback plan, typically for four scenarios.
◦ A) Migrated, application testing failed, rollback.
◦ B) Migrated, started using the system in the cloud (data changes, schema changes), realized something is not working or failing after a few days, rollback
◦ C) Rollback from PaaS deployment to On-premises
◦ D) Rollback from IaaS to On-premises.
◦ Summary: Scenario A is easy, as we simply need to point the connection string back to on-premises. Scenarios B and C require a lot of effort, as we need to compare the databases between source and target, identify the changes, and apply those changes to the on-premises database. Scenario D is a little easier than B and C, as we can perform a backup and restore from the cloud to on-premises, or configure an HA & DR solution between the cloud VM and the on-premises database instance.
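As a rough illustration of the Assess and Discover step above, here is a minimal T-SQL sketch for a SQL Server source (illustrative only; the chosen columns and output layout are an assumption, and the real inventory covers many more items) that can seed the data-estate sheet:

```sql
-- Illustrative inventory sketch for a SQL Server source: instance-level
-- properties plus per-database size, compatibility level, collation, and
-- recovery model for the data-estate sheet.
SELECT
    SERVERPROPERTY('MachineName')    AS host_name,
    SERVERPROPERTY('Edition')        AS edition,
    SERVERPROPERTY('ProductVersion') AS product_version,
    SERVERPROPERTY('ProductLevel')   AS product_level,
    SERVERPROPERTY('Collation')      AS server_collation;

SELECT
    d.name AS database_name,
    d.compatibility_level,
    d.collation_name,
    d.recovery_model_desc,
    CAST(SUM(mf.size) * 8.0 / 1024 AS DECIMAL(18, 2)) AS size_mb  -- size is in 8 KB pages
FROM sys.databases AS d
JOIN sys.master_files AS mf
  ON mf.database_id = d.database_id
GROUP BY d.name, d.compatibility_level, d.collation_name, d.recovery_model_desc
ORDER BY size_mb DESC;
```

Similar queries (or the assessment tools mentioned later) can be used to capture features, configurations, logins, agent jobs, and the rest of the inventory items.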
.
Pre-Migration Activity:
• Be ready with the “to-be” architecture: Considering the migration scope and resource mapping, we must have a clear path for starting the cloud journey.
• Identify and document the solutions: From the compatibility validations and POC results, identify the possible challenges, errors, issues, and solutions (short-term and long-term). Also, document all test plans, including smoke, functional, performance, and security tests.
• Establish cloud KPIs: Post-migration, we need to measure the performance of our applications. For that, we need to define key performance indicators (KPIs), so that we can compare these KPI values with the on-premises workloads and measure the success of the cloud migration from a performance perspective. Using Azure Event Hubs and Azure Functions, we can simulate the production load on Azure SQL Database and measure resource utilization. We can also perform load testing using Visual Studio or other third-party tools. Common KPIs include, for the database: compute usage (CPU, IO, memory, wait statistics, throughput, etc.), average query/SP response time, blocking, deadlocks, error rates, etc.; for the application: page load time, response time, lag, session length, error rates, etc.
• Establish performance baselines: Baselining is the process of measuring the current (pre-migration) performance of your application or service to determine whether its future (post-migration) performance is acceptable. Set a baseline metric for each KPI that we have decided to measure. Choosing the baseline period (how long we collect data) is a key aspect; we usually use the metrics captured during business peak hours. A small baselining sketch is shown after this list.
• Prioritize data entities (databases, file shares, etc.): As we already captured the revenue impact, the number of dependencies, and the data size, we can tag each database with a priority that determines the migration sequence. In general, we choose the easiest database/data entity first. Also, migrate non-production workloads before touching the production.
• Establish Timelines: We should carefully evaluate the available details (captured from the POC, migration assessment, etc.) and define the timelines. We need to consider the planned development release cycle, business (peak/low) hours, maximum allowed downtime, downtime impact, estimated migration times, the maintenance window in the cloud, etc.
• Perform Data Cleansing: Based on the data points collected during the assessment, we perform the cleanup accordingly. For example, historical data (where modifications are not allowed or required) can be archived and migrated separately. Remove or clean all unnecessary objects, including duplicate data, incomplete data, corrupted data, backup tables created during change request executions, staging tables, etc. Also, if we are collecting any specific data from code objects (stored procedures, agent jobs, etc.) into tables, e.g., error logs, audit logs, monitoring logs, etc., we may need to seek business approval and delete the older logs (e.g., keeping only the most recent 6 months). This step reduces the data volume to move and improves the speed of the migration process and overall performance.
• Go for the business approval: Present the data/database migration plan to the business owners and get all required approvals. We may need to address any concerns from the business and update the plan accordingly.
• Freeze the Database (schema, or both schema and data changes): Perform a health check on the source database and freeze schema change activities, which means we do not allow any change requests or new releases until the migration is completed. If it is an offline migration, we freeze both schema and data changes; for an online migration, schema changes are restricted but the database remains available for operations, and the data changes are replicated to the target database automatically using DMS, replication, AlwaysOn, etc.
• Prepare scripts and backups for the necessary objects: Back up encryption certificates, and prepare scripts for logins, users, roles, permissions, system database objects (if any), agent jobs, plus drop-and-recreate scripts for constraints, triggers, foreign keys, etc. The exact list varies based on the target instance type (single database, managed instance, etc.) and the type of migration (backup & restore, export/import, DMS, etc.).
• Prepare a backup copy: Before initiating the migration process, ensure that we have the latest backup available at a remote location. It is crucial for mitigating surprises.
• Create the target environment: It's time to create the cloud resources, e.g., Azure SQL Database, VMs, ExpressRoute or a VPN gateway to connect the on-premises network with the cloud, the DMS service, blob storage, etc.
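As a rough illustration of the baselining step above, a minimal T-SQL sketch (assuming a SQL Server source; how the snapshots are stored and how often they are taken is up to the team) that captures top waits and the top 5 costly queries during peak hours, so the same queries can be re-run post-migration and compared:

```sql
-- Illustrative baselining sketch: snapshot top waits and the top 5 costly
-- queries during peak hours; re-run after migration and compare the numbers.
SELECT TOP (10)
    wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

SELECT TOP (5)
    qs.execution_count,
    qs.total_worker_time  / qs.execution_count AS avg_cpu_microsec,
    qs.total_elapsed_time / qs.execution_count AS avg_duration_microsec,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset END
          - qs.statement_start_offset) / 2) + 1) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```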
.
Execute Migration Project:
• Perform refactoring on-premises (when applicable): For example, legacy products may not support a direct upgrade to the latest versions, so we perform the upgrade on-premises and then migrate, e.g., SQL Server 2000 to 2019. If we are not using backup-restore or HA-DR features in the migration, then no upgrade is required; we simply migrate the schema and data to the target using import/export, BACPAC, etc.
• Migrate Schema & Data: The process may vary based on the migration approach (online, offline, lift-and-shift, refactor, etc.), but these are the general steps involved in cloud data migration. Backup & restore is the simplest approach; with it, we do not need separate steps for schema and data migration.
• Migrate Encryption certificates (If applicable): If we are using TDE on-premises and using the backup and restore option to migrate the database to Azure SQL Managed Instance or SQL on Azure VM, we need to manually transfer/migrate (backup & upload) the encryption certificate to the cloud instance. If we are using the DMS service, this part will be taken care of by the DMS.
• Backup-Restore & synchronize data (if applicable): If it is an online migration, DMS takes care of everything. But if we need to use a native high-availability feature to migrate data, we copy the database backup to cloud storage (Azure Storage, AWS S3), restore it to the cloud instance (Azure SQL Managed Instance, SQL Server on Azure VM, Amazon RDS, SQL Server on AWS EC2), and configure the data sync option (e.g., replication, AlwaysOn, etc.).
• Schema migration: Migrate all relations (tables), views, stored procedures, user-defined functions, PL/SQL packages, user-defined data types, constraints, triggers, etc.
• Switch off / mute all constraints at the target: As a best practice, we disable all constraints (foreign keys, check constraints, triggers, etc.) and indexes in the target environment so that we can skip the validations while porting data from on-premises to the cloud. It also helps with parallel data processing. To keep it simple, we generate the drop-and-recreate (or disable/enable) scripts for all constraints and indexes on the source database itself and simply run them against the target database; a small generation sketch is shown after this list.
• Data Migration: On a successful schema migration, initiate the data transfer activity for the databases. Likewise, we can initiate the transfer from shared folders/network drives to cloud storage (Azure Storage or AWS S3), e.g., using AzCopy, Azure Storage Explorer, AWS DataSync, the S3 Browser tool, etc.
• Switch-on/enable all constraints and indexes: Since we already got the script to create the constraints and indexes, just run it and validate the output log.
• Create logins, users, and permissions: Since we already have the scripts ready, SQL Server logins, users, and permissions can be created quickly. For Windows authentication, we may need to map the accounts to Azure Active Directory, which requires some manual effort; a small sketch for enumerating users and role memberships is shown after this list.
• Data Migration cutover: Review migration logs, make sure all data is transferred, and initiate the cutover.
• Validate Data Quality and Accuracy / Database testing: We consider A) data quality/functionality, B) connectivity/integration, C) security, D) performance, and E) HA & DR:
◦ Data Quality: Compare the schema and data between the source and target environments. For example, we can compare row counts and metadata (sys.objects, sys.all_columns, sys.key_constraints, sys.check_constraints, sys.default_constraints, sys.tables, sys.foreign_keys, sys.foreign_key_columns, sys.indexes, sys.index_columns, sys.triggers, etc.), or we can use tools like the SQL Server Data Tools (SSDT) schema and data compare option, or third-party solutions like Redgate SQL Compare. Also, make sure the application functional test cases return the same data (type, count) as on-premises. A small row-count comparison sketch is shown after this list.
◦ Connectivity: Validate database/file share accessibility and permissions from all integrated components (applications, services, interfaces, programs, etc.) and make sure everything works as it did on-premises.
◦ Security: We need to identify any loose ends in the cloud architecture. For example, restrict access by default and enable access only for the required users and applications, and update firewalls and network security groups accordingly. Use strong passwords and enable multi-factor authentication wherever possible. Validate access to sensitive data, compliance requirements (such as data classification), etc.
◦ Performance: We need to test scalability and performance by comparing against the baselines and KPIs. We can monitor database performance manually (using pre-defined scripts with DMVs and system objects) or using the advanced analytics tools available in Azure (Azure Monitor SQL Insights). Compare the performance of the top 5 costly queries (as per the on-premises baseline).
◦ HA&DR: Validate to ensure backups and HA&DR are configured and working as per the SLA.
• Final cutover & Go-live: Switch workload traffic to Cloud. Ex: Map DNS to cloud environment etc.
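For the constraint switch-off/switch-on steps above, a minimal T-SQL sketch (illustrative; the generated statements should be reviewed before running, and index disable/rebuild is handled separately) that produces matching disable and re-enable statements per table on the target:

```sql
-- Illustrative sketch: generate per-table statements to disable foreign key /
-- check constraints and triggers before the bulk load, and to re-enable and
-- revalidate them afterwards (WITH CHECK CHECK keeps the constraints trusted).
SELECT
    'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name)
        + ' NOCHECK CONSTRAINT ALL; DISABLE TRIGGER ALL ON '
        + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name) + ';' AS disable_stmt,
    'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name)
        + ' WITH CHECK CHECK CONSTRAINT ALL; ENABLE TRIGGER ALL ON '
        + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name) + ';' AS enable_stmt
FROM sys.tables
WHERE is_ms_shipped = 0
ORDER BY name;
```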
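For the logins/users step, an illustrative sketch (an assumption about one convenient way to do it, not a prescribed script) that lists database users and their role memberships on the source, as input for preparing the equivalent CREATE USER / ALTER ROLE ... ADD MEMBER statements on the target; Azure AD mappings still need manual work:

```sql
-- Illustrative sketch: enumerate database users and role memberships on the
-- source database as input for recreating security on the target.
SELECT
    dp.name      AS user_name,
    dp.type_desc AS user_type,
    r.name       AS role_name
FROM sys.database_principals AS dp
LEFT JOIN sys.database_role_members AS drm
       ON drm.member_principal_id = dp.principal_id
LEFT JOIN sys.database_principals AS r
       ON r.principal_id = drm.role_principal_id
WHERE dp.type IN ('S', 'U', 'G')   -- SQL users, Windows users, Windows groups
  AND dp.name NOT IN ('dbo', 'guest', 'sys', 'INFORMATION_SCHEMA')
ORDER BY dp.name, r.name;
```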
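And for the data-quality validation, an illustrative row-count sketch: run it on both source and target and diff the output (schema and data compare tools such as SSDT or Redgate cover the deeper checks):

```sql
-- Illustrative data-quality sketch: per-table row counts. Only heap and
-- clustered-index partitions are counted to avoid double counting.
SELECT
    QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name) AS table_name,
    SUM(p.rows) AS row_count
FROM sys.tables AS t
JOIN sys.partitions AS p
  ON p.object_id = t.object_id
 AND p.index_id IN (0, 1)
GROUP BY t.schema_id, t.name
ORDER BY table_name;
```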
.
Post-Migration Activity:
• Documentation: This is a mandatory step that helps us improve subsequent migration activity. We document all the essentials, e.g., timelines, common challenges faced, failures, applied resolutions, logs, etc.
• Learn, improve, and repeat: We learn from the pre-production workload migrations and adjust the estimates for the production workloads accordingly. We repeat the activity until the last database within the scope is migrated to the cloud.
• Monitor & optimize: Monitor the resource utilization and adjust the sizing and other configurations if required. Also, seek automation opportunities and integration with the cloud features.
• Report to the business: We process and analyze the complete migration activity (logs, documents, etc.) and prepare reports with as much KPI information as possible, e.g., estimated vs. actual TCO and ROI; migration time taken per database, per instance, per application, and per business vertical; the success rate against the time and cost targets; failure cases; the average number of errors; data volume; etc. The reports must be presented to the business owners to obtain sign-off on the migration.
• Handover and Training: We hand over the cloud environment to the support team chosen by the customer, along with the documentation, and explain the SLAs (service level agreements), configurations, special functions, integrations, monitoring items, maintenance activities, etc.
• Final Sign-off & release: Since we have already presented the migration activity, we seek official approval from the customer. On the final sign-off, we call it “Migration Project Completion”.
.
This is the overall migration process from my experience; following it, I have successfully migrated four major enterprise database workloads to Azure and two workloads to AWS.
.
.
Alison: I can see your proficiency. But does “successfully migrated” mean the whole process went through without any issues?
Vasumat: Well, not exactly. It is almost impossible to execute a migration without dealing with errors or issues. What I meant was that we completed the data migration activity on time while maintaining data quality. While migrating, we faced issues, applied fixes, learned from them, noted the resolutions, and predicted and prevented those issues for the next set of workloads.
.
Alison: Great. Can you describe the most common challenges we may face during the migration? Perhaps talk about the top 2 or 3 issues.
Vasumat: We can prevent 90-95% of problems with a proper migration plan and execution. However, the most common challenges are:
• Bad planning: It is the most common cause of migration failures. We need to allocate quality time, define the scope clearly, research well, and properly map the source and target environments.
• Bypassing the 6R strategy: We must remember that not every application or piece of infrastructure is designed for the cloud. Sometimes we need to refactor or re-architect, and even then it may not fit the cloud. A simple example: if the government mandates that all sensitive data must reside in a specific region where no cloud data center is available, we cannot move that application and its dependent apps to the cloud. Likewise, some legacy apps require customized infrastructure that may not be possible in the cloud.
• Starting with big deals: I’ve seen customers rush to move all applications into the cloud in one go without proper testing, a POC, addressing the problems, and identifying the solutions. If that fails, the impact is huge. The gradual approach may need some extra time and cost, but it works well.
• Failing in cost and timeline estimations: Unrealistic timelines and cost estimations build a highway to cloud migration disaster. Improper sizing and resource allocation also hurt: with too few resources, the migration timeline obviously increases; if we over-provision (allocating more than required), we end up paying more. Also, most people miss the hidden costs (data transfer, extra backups, cross-region replication, etc.) during the planning phase.
• Not having a Plan B: We must anticipate the challenges and be ready with a risk mitigation plan. To achieve that, we need to check compatibility issues, prepare resolutions, perform a POC if required, learn from non-production loads, etc. We should have answers and solutions ready for the most anticipated issues, e.g., data corrupted during the migration (need a proper backup), unexpected downtime (a DR secondary on-premises), a network failure (our migration solution must retry automatically from the point where it left off), a partially or completely failed data migration (must have a rollback plan), etc.
• Lack of Cloud adoption & Skill Shortage: It’s a major threat to any cloud implementation. We are moving from a legacy IT model to modernized cloud systems; IT operations become decentralized, and roles change. Apart from our applications and databases, we must consider processes, operations, human resources, automation, technical support, training, education, etc.
• Lack of testing: We must remember that testing is not a one-time activity. It should be done at every stage of our migration plan, allowing us to spot potential issues at the earliest possible stage. The more we test and address problems early on, the smoother our migration will go.
• Common issues in data migration: Data migration depends on three components: A) the source system, B) the target system, and C) the communication channel (network). If any one component fails, the migration comes to a halt.
◦ Validation failures – Running parallel data import processes and not disabling the constraints may lead to validation failures.
◦ Network failures – A misconfigured communication channel heavily impacts data transfer speed; if there is no direct channel (AWS Direct Connect, Azure ExpressRoute) between source and target, there is no guarantee of consistent bandwidth, etc.
◦ Unexpected outages – power failures, region-wide outages in the cloud, etc.
◦ Resource max capacity reached – For example, if the data migration service has reached its maximum capacity for concurrently running tasks, new tasks are queued until capacity becomes available.
◦ Playing with large datasets – If we select more databases than the allowed limit per migration instance, the migration service throws an error. Also, we should migrate very large databases separately.
◦ Data Loss – Migrating wrong/corrupted/inconsistent data to the target instance can happen for many reasons, e.g., a missing encryption certificate, manually restarting a failed data transfer without proper precautions, datatype mismatches, collation issues, not choosing the right migration solution for incremental transactions (data changed at the source after the initial full data migration), applying the wrong data filters, etc.
◦ Compatibility issues – We are supposed to resolve all compatibility issues and showstoppers before initiating the migration process; missing this step can impact the entire migration activity.
◦ Security & Compliance – Not establishing a proper security layer can cause data leaks and put the enterprise at risk. Missing data classification and policies, and not following the required standards, can lead to non-compliance and audit failures.
.
Alison: That’s a huge list. Have you had to deal with all of those failure cases yourself?
Vasumat: Not exactly. As I mentioned before, I prefer spending quality time in the analysis and planning phases so that we can prevent the majority of the issues. Fixing the compatibility issues, and the issues that occur during the POC and non-production workload migrations, helps us anticipate and mitigate the risks in the production workloads.
.
Alison: So, Vasumat, it sounds like data migration requires a lot of attention and effort. Do we have any shortcuts, or can we make it simpler?
Vasumat: Let me put it this way: “Data migration is an extremely high-risk and critical activity in any migration.” I’ll add a data point to support that statement. A well-known global survey says that 83% of data migrations exceed their budgets and timelines or fail altogether, and most of those failures are because of bad planning. My experience has taught me the same.
Alison: What was the survey that you were talking about?
Vasumat: It was a Gartner survey; Gartner is a global research and advisory firm.
Alison: Yes, I know, it’s a leader in global research.
.
.
Alison: Can you tell me about the most complex migration that you have done to date?
.
In the next part, we will see how Vasumat describes a business case in which he faced a lot of challenges in migrating the workloads to the cloud.
.
.