ERP Data Cleanup: Essential Steps Before Any Migration
Data cleanup is the unglamorous foundation that determines whether your ERP migration succeeds or becomes an expensive exercise in moving garbage from one system to another. Every migration consultant will tell you that 30–40% of total project effort should go to data cleanup, yet most organizations rush through it because cleaning data feels like a detour from the main goal. The result is predictable: duplicate customer records causing confused sales teams, orphan transactions that break financial reports, and inconsistent product data that derails inventory management. This guide provides a systematic framework for cleaning your data before any ERP migration, regardless of your source or target system.
Key Takeaways
- Data cleanup should consume 30–40% of total migration timeline — plan for it explicitly in your project schedule
- Start with master data (customers, products, vendors) before transactional data — master data errors cascade
- Duplicate detection algorithms combining exact match, fuzzy match, and business rule match catch 95% of duplicates
- Orphan records (transactions referencing deleted master data) are a leading cause of import failures
- Data quality scoring gives objective metrics to track cleanup progress and define "done" criteria
- Archive rather than delete — you may need historical data for tax, compliance, or trend analysis
- Assign data owners per entity type — cleanup without ownership devolves into finger-pointing
Why Clean Data Matters More Than You Think
The cost of dirty data in a new ERP is not theoretical. Here are the concrete consequences:
Financial errors. Duplicate customer records mean duplicate invoices, split payment applications, and incorrect aging reports. A customer appears to owe $50,000 when they actually owe $25,000 across two records. Your collections team wastes time chasing phantom balances.
Inventory inaccuracy. Duplicate product records with slightly different names mean stock is split across records. Your system shows 10 units of "Widget Blue, Large" and 15 units of "Blue Widget - LG" when you actually have 25 units of the same product. Reorder points trigger incorrectly.
Broken automation. ERP automation rules reference specific records. A workflow that sends a payment reminder to customers with overdue invoices will send two reminders to customers with duplicate records. Automated reorder rules will trigger for each duplicate product.
Report distortion. Sales reports show inflated customer counts. Product reports show fragmented inventory. Financial reports double-count revenues or expenses associated with duplicate records.
User frustration. The quickest way to kill ERP adoption is for users to see dirty data in the new system. If a salesperson searches for a customer and finds three near-identical records, their confidence in the system — and the migration project — evaporates immediately.
Step 1: Duplicate Detection
Three Levels of Duplicate Detection
Level 1: Exact match. Records that are identical across key fields. Easy to detect, but catches only the most obvious duplicates.
- Same email address
- Same phone number (after normalizing format)
- Same tax ID / company registration number
- Same SKU / product code
Level 2: Fuzzy match. Records that are similar but not identical. Requires algorithms like Levenshtein distance, Soundex, or Jaro-Winkler similarity.
- "ECOSIRE Pvt Ltd" vs. "ECOSIRE Private Limited" vs. "Ecosire Pvt. Ltd."
- "123 Main Street" vs. "123 Main St." vs. "123 Main St, Suite 100"
- "Blue Widget (Large)" vs. "Widget - Blue, L" vs. "BLU-WDGT-LG"
Level 3: Business rule match. Records that look different but represent the same entity based on business context.
- Same company name + same city (likely the same customer even with different addresses)
- Same product dimensions + same material (likely the same product with different naming)
- Same vendor + same bank account (likely a duplicate vendor record)
Duplicate Detection Process
| Step | Action | Tool/Method |
|---|---|---|
| 1 | Export all records from the entity | CSV or API export |
| 2 | Normalize text fields (lowercase, remove punctuation, trim whitespace) | Script or ETL tool |
| 3 | Run exact match on unique identifiers (email, tax ID, SKU) | SQL GROUP BY + HAVING COUNT > 1 |
| 4 | Run fuzzy match on name + address combinations | Python (fuzzywuzzy, now maintained as thefuzz) or dedicated dedup tool |
| 5 | Apply business rules for context-based matching | Custom rules per entity type |
| 6 | Generate duplicate groups with confidence scores | Review queue for human decision |
| 7 | Merge or archive duplicates (never delete outright) | Merge tool or manual merge |
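Steps 2 and 3 of the table (normalize, then exact match) can be sketched in pure Python; the grouping is the same logic as `SQL GROUP BY ... HAVING COUNT(*) > 1`. Field names here are illustrative, not from any particular ERP export:

```python
from collections import defaultdict

def normalize_key(value: str) -> str:
    """Step 2: lowercase and strip punctuation/whitespace for comparison."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def exact_duplicate_groups(records: list[dict], key_field: str) -> list[list[dict]]:
    """Step 3: group records sharing the same normalized unique identifier."""
    groups = defaultdict(list)
    for rec in records:
        key = normalize_key(rec.get(key_field, ""))
        if key:  # blank identifiers are a completeness issue, not a duplicate
            groups[key].append(rec)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical customer export -- field names are illustrative only.
customers = [
    {"id": 1, "email": "Sales@Acme.com"},
    {"id": 2, "email": "sales@acme.com "},
    {"id": 3, "email": "info@other.com"},
]
dupes = exact_duplicate_groups(customers, "email")
print(dupes)  # one group containing customer ids 1 and 2
```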
Merge Rules by Entity Type
Customer merge rules:
- Keep the record with the most recent transaction activity
- Consolidate all addresses (mark primary, keep others as shipping/billing alternatives)
- Merge all contact persons under the surviving record
- Reassign all orders, invoices, and payments to the surviving record
- Preserve the oldest creation date (for customer tenure calculations)
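The customer rules above combine two fields from different records: the survivor is chosen by recency, but tenure comes from the oldest record. A minimal sketch, assuming ISO-format date strings and illustrative field names:

```python
def choose_survivor(dupes: list[dict]) -> dict:
    """Customer merge rule: keep the record with the most recent transaction,
    but preserve the oldest creation date for tenure calculations."""
    # ISO 8601 date strings sort correctly as plain strings.
    survivor = dict(max(dupes, key=lambda r: r["last_transaction"]))
    survivor["created"] = min(r["created"] for r in dupes)
    return survivor

# Hypothetical duplicate group -- field names are illustrative.
dupes = [
    {"id": 1, "created": "2015-03-01", "last_transaction": "2020-01-15"},
    {"id": 2, "created": "2019-06-01", "last_transaction": "2024-11-30"},
]
print(choose_survivor(dupes))  # id 2 survives, carrying the 2015 creation date
```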
Product merge rules:
- Keep the record with the active SKU that matches your catalog
- Consolidate stock quantities across duplicate records
- Reassign all order lines and invoice lines to the surviving record
- Archive the duplicate SKU with a note pointing to the surviving record
Vendor merge rules:
- Keep the record with current bank details and payment terms
- Merge all purchase orders and bills under the surviving record
- Consolidate vendor contacts
- Verify tax information is current on the surviving record
Step 2: Orphan Record Identification
Orphan records are transactions that reference master data that no longer exists or was incorrectly linked. Alongside duplicates, they are the most common cause of import failures.
Common Orphan Patterns
| Orphan Type | Example | Impact |
|---|---|---|
| Order without customer | Sales order references a customer ID that was deleted | Import fails or creates anonymous order |
| Invoice line without product | Invoice line references a product SKU that does not exist | Import fails or creates blank line item |
| Payment without invoice | Payment record references an invoice number that was deleted | Payment cannot be applied, distorts AR/AP |
| Employee without department | Employee references a department code that was removed | Employee record incomplete in new system |
| BOM without product | Bill of materials references a product that was discontinued | Manufacturing data incomplete |
| Timesheet without project | Timesheet entry references a project that was closed and deleted | Time data lost or unattributable |
Orphan Detection Query Pattern
For each transactional entity, run a cross-reference check against its parent master data:
For every sales order line:
→ Does the customer_id exist in the customers table?
→ Does the product_id exist in the products table?
→ Does the salesperson_id exist in the employees table?
For every invoice:
→ Does the customer_id exist in the customers table?
→ Does each line's product_id exist in the products table?
→ Does the payment_term reference exist in the payment terms table?
For every purchase order:
→ Does the vendor_id exist in the vendors table?
→ Does each line's product_id exist in the products table?
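The cross-reference checks above can be expressed as one generic routine: build a set of valid IDs per master entity, then flag any transaction whose foreign key is not in the set. This is a sketch with hypothetical IDs and field names:

```python
def find_orphans(transactions: list[dict], masters: dict[str, set]) -> list[dict]:
    """Return transactions whose foreign keys point at missing master records.
    `masters` maps a foreign-key field name to the set of valid IDs."""
    orphans = []
    for txn in transactions:
        missing = [fk for fk, valid_ids in masters.items()
                   if txn.get(fk) not in valid_ids]
        if missing:
            orphans.append({**txn, "missing_refs": missing})
    return orphans

# Hypothetical data -- IDs and field names are illustrative.
customer_ids = {"C001", "C002"}
product_ids = {"P100"}
order_lines = [
    {"order": 1, "customer_id": "C001", "product_id": "P100"},
    {"order": 2, "customer_id": "C999", "product_id": "P100"},  # deleted customer
]
orphans = find_orphans(order_lines, {"customer_id": customer_ids,
                                     "product_id": product_ids})
print(orphans)  # order 2 flagged with missing_refs ["customer_id"]
```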
Orphan Resolution Strategies
Strategy 1: Reconnect. If the master record was deleted but should exist, recreate it and link the orphan transactions. This is common for products that were discontinued but have historical orders.
Strategy 2: Reclassify. Assign orphan transactions to a catch-all master record. Create a "Legacy Customer" contact or an "Archived Product" record and reassign orphans there. This preserves financial totals while acknowledging the data quality issue.
Strategy 3: Archive. Move orphan transactions to an archive table outside the migration scope. Include them in a separate historical data export for reference but do not import them into the new ERP.
Step 3: Data Validation Rules
Field-Level Validation
Apply these validation rules to every record before export:
Text fields:
- No leading or trailing whitespace
- No double spaces within text
- Consistent capitalization (Title Case for names, UPPERCASE for codes)
- No special characters in fields that should be alphanumeric (SKUs, codes)
- Character encoding is consistent (UTF-8 throughout)
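A cleanup script for the text-field rules above might look like this. The `kind` parameter and the allowed-character set for codes (letters, digits, hyphens) are assumptions to adapt to your own SKU conventions:

```python
import re

def clean_text_field(value: str, kind: str = "name") -> str:
    """Apply the text-field rules: trim, collapse spaces, enforce case."""
    cleaned = re.sub(r"\s+", " ", value).strip()   # no leading/trailing/double spaces
    if kind == "name":
        return cleaned.title()                     # Title Case for names
    if kind == "code":
        code = cleaned.upper()                     # UPPERCASE for codes
        if not re.fullmatch(r"[A-Z0-9-]+", code):  # assumed: hyphens allowed in SKUs
            raise ValueError(f"non-alphanumeric characters in code: {code!r}")
        return code
    return cleaned

print(clean_text_field("  acme   corporation "))    # → 'Acme Corporation'
print(clean_text_field("wdg-001-bl", kind="code"))  # → 'WDG-001-BL'
```

Raising an error on invalid codes, rather than silently stripping characters, keeps bad SKUs visible in your exception report instead of hiding them.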
Email fields:
- Contains exactly one @ symbol
- Domain has at least one dot after @
- No spaces in the email address
- Lowercase (email addresses are case-insensitive in practice)
- Not a placeholder or dummy address
Phone fields:
- Consistent format (choose one: +1-555-123-4567 or +15551234567)
- Country code included for international numbers
- No letters or special characters other than +, -, (, )
- Valid length for the country
Date fields:
- Consistent format (ISO 8601: YYYY-MM-DD)
- No future dates where logically impossible (e.g., invoice date in 2030)
- No unreasonably old dates (e.g., order date of 1900-01-01, the default for many systems)
- Date ranges are logical (start date before end date)
Numeric fields:
- No text in numeric fields (commas as thousands separators cause import failures)
- Consistent decimal precision (2 places for currency, 4 places for unit prices with small values)
- No negative values where logically impossible (quantities, prices)
- Currency values in expected range (no $999,999,999 invoices unless you are Boeing)
Required fields:
- Customer name is never blank
- Product name and SKU are never blank
- Invoice number is never blank and never duplicated
- All foreign key references point to existing records
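Several of these field-level rules translate directly into validator functions. A minimal sketch for the email and date rules, where the placeholder local-part list is a hypothetical starter set to extend from your own data:

```python
import re
from datetime import date

PLACEHOLDER_LOCALPARTS = {"test", "noreply", "none"}  # assumed list -- extend as needed

def valid_email(addr: str) -> bool:
    """One @, dotted domain, no spaces, lowercase, not a placeholder."""
    if " " in addr or addr != addr.lower() or addr.count("@") != 1:
        return False
    local, domain = addr.split("@")
    if not local or "." not in domain:
        return False
    return local not in PLACEHOLDER_LOCALPARTS

def valid_invoice_date(iso_date: str) -> bool:
    """ISO 8601, not in the future, not a suspicious legacy default."""
    try:
        d = date.fromisoformat(iso_date)
    except ValueError:
        return False
    return date(1990, 1, 1) <= d <= date.today()

print(valid_email("sales@acme.com"))     # → True
print(valid_email("Sales@acme"))         # → False (uppercase, no dotted domain)
print(valid_invoice_date("1900-01-01"))  # → False (legacy system default)
```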
Cross-Record Validation
Beyond individual field checks, validate consistency across related records:
- Sum of invoice line amounts equals invoice total
- Sum of payments applied to an invoice does not exceed invoice total
- Inventory on-hand does not show negative quantities (unless system allows it)
- Employee start date is before any associated timesheet entries
- Product creation date is before any associated sales order lines
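The first cross-record check (line amounts sum to the invoice total) is a good candidate for automation. A sketch with illustrative field names, using a small tolerance to absorb currency-rounding differences:

```python
def check_invoice_totals(invoices: list[dict], tolerance: float = 0.01) -> list[str]:
    """Flag invoices whose line amounts do not sum to the stated header total."""
    errors = []
    for inv in invoices:
        line_sum = round(sum(line["amount"] for line in inv["lines"]), 2)
        if abs(line_sum - inv["total"]) > tolerance:
            errors.append(f"invoice {inv['number']}: lines sum to {line_sum}, "
                          f"header says {inv['total']}")
    return errors

# Hypothetical invoices -- field names are illustrative.
invoices = [
    {"number": "INV-001", "total": 150.00,
     "lines": [{"amount": 100.00}, {"amount": 50.00}]},
    {"number": "INV-002", "total": 200.00,
     "lines": [{"amount": 100.00}, {"amount": 90.00}]},
]
print(check_invoice_totals(invoices))  # only INV-002 is flagged
```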
Step 4: Archiving Strategy
Not all data needs to migrate. Define an archiving policy that balances compliance requirements, business needs, and migration complexity.
Archiving Decision Framework
| Data Type | Migrate to New ERP | Archive Outside ERP | Delete |
|---|---|---|---|
| Active customers (transaction in last 24 months) | Yes | — | — |
| Inactive customers (no transaction in 24+ months) | No (unless compliance requires) | Yes — CSV + secure storage | — |
| Open orders and invoices | Yes | — | — |
| Closed orders (last 24 months) | Yes | — | — |
| Closed orders (24+ months) | No | Yes | — |
| Current inventory levels | Yes | — | — |
| Historical inventory movements (24+ months) | No | Yes | — |
| Active products | Yes | — | — |
| Discontinued products (with order history) | Yes (as archived/inactive) | — | — |
| Discontinued products (no order history) | No | No | Yes |
| Employee records (active) | Yes | — | — |
| Employee records (terminated 7+ years ago) | No | Yes (legal retention) | — |
| Test/sample/dummy data | No | No | Yes |
| System audit logs | No | Yes (compliance) | — |
Archive Format Recommendations
For data you archive outside the ERP:
- Export to CSV with clear column headers and UTF-8 encoding
- Include a data dictionary that defines each column, its data type, and valid values
- Store in a versioned, immutable location (S3 with versioning, or encrypted backup)
- Set a retention schedule (7 years for financial data in most jurisdictions, longer for some industries)
- Document the archive in your compliance records, including contents, date range, and retention policy
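The first two recommendations (UTF-8 CSV plus a data dictionary) can be scripted together so every archive ships with its documentation. A minimal sketch; the output paths and record fields are illustrative, and a real dictionary would also document data types and valid values per column:

```python
import csv
from pathlib import Path

def archive_records(records: list[dict], out_dir: str, name: str) -> None:
    """Write a UTF-8 CSV plus a companion data-dictionary file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fields = list(records[0].keys())
    with open(out / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
    # Minimal data dictionary -- extend with column types and valid values.
    with open(out / f"{name}_dictionary.txt", "w", encoding="utf-8") as f:
        f.write(f"Archive: {name}\nColumns: {', '.join(fields)}\n"
                f"Rows: {len(records)}\n")

archive_records(
    [{"customer_id": "C001", "name": "Acme GmbH", "last_order": "2019-04-02"}],
    "archive", "inactive_customers",
)
```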
Step 5: Master Data Governance
Data cleanup is not a one-time event. Without governance, your shiny new ERP will accumulate the same data quality issues within 12–18 months.
Data Ownership Matrix
| Data Entity | Data Owner (Role) | Responsibilities |
|---|---|---|
| Customers | Sales Manager | Approve new customer creation, quarterly duplicate review, merge requests |
| Products | Product Manager | SKU standards, new product approval, discontinuation process |
| Vendors | Procurement Manager | Vendor onboarding standards, annual vendor review, duplicate prevention |
| Chart of Accounts | Finance Controller | Account creation approval, period-end review, structure changes |
| Employees | HR Manager | Employee data accuracy, lifecycle management (hire to termination) |
| Pricing | Commercial Director | Price list maintenance, discount authority matrix |
Data Entry Standards
Document and enforce standards for each entity:
Customer creation standards:
- Company name: Official legal name (verify against registration documents)
- Trading name: Stored separately if different from legal name
- Address: Use postal service format for the country
- Primary contact: Name + email + phone required
- Payment terms: Default set at creation, require approval to change
- Credit limit: Set by finance, not sales
Product creation standards:
- Product name: [Brand] [Product] [Variant] [Size] (e.g., "ECOSIRE Widget Blue Large")
- SKU: [Category]-[Sequence]-[Variant] (e.g., "WDG-001-BL")
- Description: Minimum 50 characters, no HTML formatting in descriptions
- Category: Must select from existing categories (no free-text categories)
- Unit of measure: Must use standard UoM from the approved list
- Images: Minimum one image, maximum dimensions 2048x2048, white background
Automated Data Quality Rules
Configure these rules in your new ERP to prevent dirty data from the start:
- Duplicate prevention: Warn on save if a record with the same email, phone, or tax ID already exists
- Required field enforcement: Block creation if mandatory fields are empty
- Format validation: Reject invalid email formats, phone formats, and date formats
- Approval workflows: New customer and vendor creation requires manager approval
- Periodic review: Automated reports highlighting records not updated in 12+ months
Step 6: Data Quality Scoring
Scoring Methodology
Score each data entity on four dimensions, each rated 1–5:
| Dimension | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Completeness | >30% of required fields blank | 10–30% blank | <5% blank |
| Consistency | No standards, wildly varying formats | Some standards, partial compliance | Clear standards, >95% compliance |
| Accuracy | >20% of sample records have errors | 5–20% errors | <2% errors (verified sample) |
| Uniqueness | >10% duplicate rate | 3–10% duplicates | <1% duplicates |
Scoring Process
- Sample: Random 5% of records (minimum 100, maximum 500)
- Check completeness: Count blank required fields as a percentage
- Check consistency: Review format compliance for text, date, phone, and email fields
- Check accuracy: Verify sampled records against external sources (website, registration databases, physical inventory count)
- Check uniqueness: Run duplicate detection on full dataset, calculate rate
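Turning measured rates into 1-5 scores can be mechanized. The bands below are taken from the scoring table, with scores 2 and 4 interpolated to fill the gaps (an assumption; adjust the bands to your own policy). Only completeness and uniqueness are shown, since they are the two dimensions computed directly from rates:

```python
def dimension_score(rate: float, thresholds: list[tuple[float, int]]) -> int:
    """Map a measured blank/duplicate rate to a 1-5 score.
    `thresholds` pairs an upper bound (inclusive) with the score it earns."""
    for bound, score in thresholds:
        if rate <= bound:
            return score
    return 1

# Bands from the scoring table; 2 and 4 interpolated to fill the gaps.
COMPLETENESS = [(0.05, 5), (0.10, 4), (0.30, 3), (0.40, 2)]
UNIQUENESS = [(0.01, 5), (0.03, 4), (0.10, 3), (0.15, 2)]

def entity_score(blank_rate: float, dup_rate: float) -> float:
    """Average of the dimensions measured here (completeness, uniqueness)."""
    return (dimension_score(blank_rate, COMPLETENESS)
            + dimension_score(dup_rate, UNIQUENESS)) / 2

print(entity_score(blank_rate=0.04, dup_rate=0.02))  # → 4.5, migration-ready
print(entity_score(blank_rate=0.35, dup_rate=0.12))  # → 2.0, below threshold
```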
Minimum Quality Thresholds for Migration
| Entity | Minimum Average Score | Recommended |
|---|---|---|
| Customers | 3.5 | 4.0+ |
| Products | 3.5 | 4.0+ |
| Vendors | 3.0 | 3.5+ |
| Chart of Accounts | 4.0 | 4.5+ |
| Open Orders | 3.5 | 4.0+ |
| Open Invoices | 4.0 | 4.5+ |
| Employees | 3.5 | 4.0+ |
Do not proceed with migration for any entity scoring below the minimum threshold. The cost of cleaning data after import is 3–5 times higher than cleaning before import.
Data Cleanup Timeline Template
| Week | Activity | Deliverable |
|---|---|---|
| 1 | Initial quality assessment and scoring | Quality score report per entity |
| 2 | Duplicate detection run + merge planning | Duplicate groups with proposed merge actions |
| 3 | Orphan record identification | Orphan report with resolution recommendations |
| 4 | Data owner assignment and standards documentation | Data governance document |
| 5–6 | Bulk cleanup: duplicates, orphans, format standardization | Cleaned master data exports |
| 7 | Validation rule execution and exception handling | Validation exceptions report |
| 8 | Re-scoring and certification | Final quality scores (all above thresholds) |
| 9 | Archive old data, document retention policies | Archive files + retention schedule |
| 10 | Final export for migration import | Clean, validated, migration-ready data files |
Tools and Resources
Open-Source Data Cleanup Tools
- OpenRefine: Powerful data cleaning tool for clustering, faceting, and transforming messy data
- dedupe.io: Machine learning-based deduplication library for Python
- Great Expectations: Data validation framework for automated quality checks
- pandas (Python): Flexible data manipulation for custom cleanup scripts
- csvkit: Command-line tools for CSV inspection and validation
Commercial Data Quality Platforms
- Informatica Data Quality: Enterprise-grade cleansing and matching
- Talend Data Quality: Profiling, cleansing, and standardization
- Melissa Data: Address verification, email validation, duplicate detection
- IBM InfoSphere QualityStage: Master data matching and standardization
Frequently Asked Questions
How long does data cleanup take?
For a mid-size business (5,000–50,000 customer records, 1,000–10,000 products), plan for 6–10 weeks of dedicated effort. This assumes one full-time data analyst plus part-time involvement from data owners in each department. Larger enterprises with hundreds of thousands of records or complex multi-system landscapes may need 12–16 weeks.
Should we clean data in the old system or in staging files?
Clean in staging files (exported CSVs or a staging database), not in the live system. This preserves your production data as a fallback, allows parallel cleanup by multiple people, and avoids disrupting daily operations. Your live system continues running untouched until the clean data is imported into the new ERP.
What if we cannot reach the minimum quality threshold?
If a specific entity cannot reach the minimum score, investigate the root cause. If it is a data volume problem (too many records to clean manually), consider importing only the most recent or most active subset and archiving the rest. If it is a structural problem (the data was never designed to support what the new ERP needs), you may need to enrich data from external sources or accept that some records will require manual attention post-migration.
Who should be responsible for data cleanup?
Data cleanup is a business responsibility, not an IT responsibility. IT provides the tools and infrastructure, but business users must make the decisions: which duplicate record to keep, whether an orphan order should be reconnected or archived, and what the correct product name format should be. Assign data owners from each department and hold them accountable for their entity quality scores.
Can we automate data cleanup?
Partially. Automated tools handle format standardization (phone numbers, addresses, dates), exact-match deduplication, and validation rule checking. But merging fuzzy-match duplicates, resolving orphan records, and verifying data accuracy require human judgment. Plan for 60% automated / 40% manual effort.
What if we discover data quality issues after migration?
Post-migration cleanup is 3–5 times more expensive than pre-migration cleanup because you are now dealing with a live system where changes affect active workflows. If you discover issues after go-live, prioritize by business impact: fix records that affect financial accuracy first, then customer-facing records, then internal operational records.
Does ECOSIRE help with data cleanup?
Yes. Data cleanup is a core component of ECOSIRE's migration services. We provide data profiling, automated deduplication, quality scoring, and cleanup scripting as part of every migration project. Our team works alongside your data owners to ensure business context drives every cleanup decision. Contact us to discuss your data quality challenges.
Start with a Data Quality Assessment
The first step in any migration is understanding the current state of your data. A data quality assessment takes 3–5 days and produces a detailed report showing duplicate rates, completeness scores, format inconsistencies, and orphan record counts for every major entity.
ECOSIRE offers complimentary data quality assessments as part of our migration planning services. We will analyze your current data, identify the highest-impact cleanup tasks, and provide a realistic timeline and effort estimate for achieving migration-ready quality.
Request your free data quality assessment and take the first step toward a clean, successful migration.
Written by
ECOSIRE Team, Technical Writing
The ECOSIRE technical writing team covers Odoo ERP, Shopify eCommerce, AI agents, Power BI analytics, GoHighLevel automation, and enterprise software best practices. Our guides help businesses make informed technology decisions.