ERP Data Cleanup: Essential Steps Before Any Migration
Data cleanup is the unglamorous foundation that determines whether your ERP migration succeeds or becomes an expensive exercise in moving garbage from one system to another. Every migration consultant will tell you that 30–40% of total project effort should go to data cleanup, yet most organizations rush through it because cleaning data feels like a detour from the main goal. The result is predictable: duplicate customer records causing confused sales teams, orphan transactions that break financial reports, and inconsistent product data that derails inventory management. This guide provides a systematic framework for cleaning your data before any ERP migration, regardless of your source or target system.
Key Takeaways
- Data cleanup should consume 30–40% of total migration timeline — plan for it explicitly in your project schedule
- Start with master data (customers, products, vendors) before transactional data — master data errors cascade
- Duplicate detection algorithms combining exact match, fuzzy match, and business rule match catch 95% of duplicates
- Orphan records (transactions referencing deleted master data) are a leading cause of import failures
- Data quality scoring gives objective metrics to track cleanup progress and define "done" criteria
- Archive rather than delete — you may need historical data for tax, compliance, or trend analysis
- Assign data owners per entity type — cleanup without ownership devolves into finger-pointing
Why Clean Data Matters More Than You Think
The cost of dirty data in a new ERP is not theoretical. Here are the concrete consequences:
Financial errors. Duplicate customer records mean duplicate invoices, split payment applications, and incorrect aging reports. A customer appears to owe $50,000 when they actually owe $25,000 across two records. Your collections team wastes time chasing phantom balances.
Inventory inaccuracy. Duplicate product records with slightly different names mean stock is split across records. Your system shows 10 units of "Widget Blue, Large" and 15 units of "Blue Widget - LG" when you actually have 25 units of the same product. Reorder points trigger incorrectly.
Broken automation. ERP automation rules reference specific records. A workflow that sends a payment reminder to customers with overdue invoices will send two reminders to customers with duplicate records. Automated reorder rules will trigger for each duplicate product.
Report distortion. Sales reports show inflated customer counts. Product reports show fragmented inventory. Financial reports double-count revenues or expenses associated with duplicate records.
User frustration. The quickest way to kill ERP adoption is for users to see dirty data in the new system. If a salesperson searches for a customer and finds three near-identical records, their confidence in the system — and the migration project — evaporates immediately.
Step 1: Duplicate Detection
Three Levels of Duplicate Detection
Level 1: Exact match. Records that are identical across key fields. Easy to detect, but catches only the most obvious duplicates.
- Same email address
- Same phone number (after normalizing format)
- Same tax ID / company registration number
- Same SKU / product code
Level 2: Fuzzy match. Records that are similar but not identical. Requires algorithms like Levenshtein distance, Soundex, or Jaro-Winkler similarity.
- "ECOSIRE Pvt Ltd" vs. "ECOSIRE Private Limited" vs. "Ecosire Pvt. Ltd."
- "123 Main Street" vs. "123 Main St." vs. "123 Main St, Suite 100"
- "Blue Widget (Large)" vs. "Widget - Blue, L" vs. "BLU-WDGT-LG"
Level 3: Business rule match. Records that look different but represent the same entity based on business context.
- Same company name + same city (likely the same customer even with different addresses)
- Same product dimensions + same material (likely the same product with different naming)
- Same vendor + same bank account (likely a duplicate vendor record)
Duplicate Detection Process
| Step | Action | Tool/Method |
|---|---|---|
| 1 | Export all records from the entity | CSV or API export |
| 2 | Normalize text fields (lowercase, remove punctuation, trim whitespace) | Script or ETL tool |
| 3 | Run exact match on unique identifiers (email, tax ID, SKU) | SQL GROUP BY + HAVING COUNT > 1 |
| 4 | Run fuzzy match on name + address combinations | Python (fuzzywuzzy, now maintained as thefuzz) or dedicated dedup tool |
| 5 | Apply business rules for context-based matching | Custom rules per entity type |
| 6 | Generate duplicate groups with confidence scores | Review queue for human decision |
| 7 | Merge or archive duplicates (never delete outright) | Merge tool or manual merge |
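Steps 2 and 3 of the table (normalize, then exact match) can be sketched in pure Python; the grouping is the same logic as `SQL GROUP BY ... HAVING COUNT(*) > 1`. Field names here are illustrative, not from any particular ERP export:

```python
from collections import defaultdict

def normalize_key(value: str) -> str:
    """Step 2: lowercase and strip punctuation/whitespace for comparison."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def exact_duplicate_groups(records: list[dict], key_field: str) -> list[list[dict]]:
    """Step 3: group records sharing the same normalized unique identifier."""
    groups = defaultdict(list)
    for rec in records:
        key = normalize_key(rec.get(key_field, ""))
        if key:  # blank identifiers are a completeness issue, not a duplicate
            groups[key].append(rec)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical customer export -- field names are illustrative only.
customers = [
    {"id": 1, "email": "Sales@Acme.com"},
    {"id": 2, "email": "sales@acme.com "},
    {"id": 3, "email": "info@other.com"},
]
dupes = exact_duplicate_groups(customers, "email")
print(dupes)  # one group containing customer ids 1 and 2
```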
Merge Rules by Entity Type
Customer merge rules:
- Keep the record with the most recent transaction activity
- Consolidate all addresses (mark primary, keep others as shipping/billing alternatives)
- Merge all contact persons under the surviving record
- Reassign all orders, invoices, and payments to the surviving record
- Preserve the oldest creation date (for customer tenure calculations)
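The customer rules above combine two fields from different records: the survivor is chosen by recency, but tenure comes from the oldest record. A minimal sketch, assuming ISO-format date strings and illustrative field names:

```python
def choose_survivor(dupes: list[dict]) -> dict:
    """Customer merge rule: keep the record with the most recent transaction,
    but preserve the oldest creation date for tenure calculations."""
    # ISO 8601 date strings sort correctly as plain strings.
    survivor = dict(max(dupes, key=lambda r: r["last_transaction"]))
    survivor["created"] = min(r["created"] for r in dupes)
    return survivor

# Hypothetical duplicate group -- field names are illustrative.
dupes = [
    {"id": 1, "created": "2015-03-01", "last_transaction": "2020-01-15"},
    {"id": 2, "created": "2019-06-01", "last_transaction": "2024-11-30"},
]
print(choose_survivor(dupes))  # id 2 survives, carrying the 2015 creation date
```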
Product merge rules:
- Keep the record with the active SKU that matches your catalog
- Consolidate stock quantities across duplicate records
- Reassign all order lines and invoice lines to the surviving record
- Archive the duplicate SKU with a note pointing to the surviving record
Vendor merge rules:
- Keep the record with current bank details and payment terms
- Merge all purchase orders and bills under the surviving record
- Consolidate vendor contacts
- Verify tax information is current on the surviving record
Step 2: Orphan Record Identification
Orphan records are transactions that reference master data that no longer exists or was incorrectly linked. Alongside duplicates, they are the most common cause of import failures.
Common Orphan Patterns
| Orphan Type | Example | Impact |
|---|---|---|
| Order without customer | Sales order references a customer ID that was deleted | Import fails or creates anonymous order |
| Invoice line without product | Invoice line references a product SKU that does not exist | Import fails or creates blank line item |
| Payment without invoice | Payment record references an invoice number that was deleted | Payment cannot be applied, distorts AR/AP |
| Employee without department | Employee references a department code that was removed | Employee record incomplete in new system |
| BOM without product | Bill of materials references a product that was discontinued | Manufacturing data incomplete |
| Timesheet without project | Timesheet entry references a project that was closed and deleted | Time data lost or unattributable |
Orphan Detection Query Pattern
For each transactional entity, run a cross-reference check against its parent master data:
For every sales order line:
→ Does the customer_id exist in the customers table?
→ Does the product_id exist in the products table?
→ Does the salesperson_id exist in the employees table?
For every invoice:
→ Does the customer_id exist in the customers table?
→ Does each line's product_id exist in the products table?
→ Does the payment_term reference exist in the payment terms table?
For every purchase order:
→ Does the vendor_id exist in the vendors table?
→ Does each line's product_id exist in the products table?
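The cross-reference checks above can be expressed as one generic routine: build a set of valid IDs per master entity, then flag any transaction whose foreign key is not in the set. This is a sketch with hypothetical IDs and field names:

```python
def find_orphans(transactions: list[dict], masters: dict[str, set]) -> list[dict]:
    """Return transactions whose foreign keys point at missing master records.
    `masters` maps a foreign-key field name to the set of valid IDs."""
    orphans = []
    for txn in transactions:
        missing = [fk for fk, valid_ids in masters.items()
                   if txn.get(fk) not in valid_ids]
        if missing:
            orphans.append({**txn, "missing_refs": missing})
    return orphans

# Hypothetical data -- IDs and field names are illustrative.
customer_ids = {"C001", "C002"}
product_ids = {"P100"}
order_lines = [
    {"order": 1, "customer_id": "C001", "product_id": "P100"},
    {"order": 2, "customer_id": "C999", "product_id": "P100"},  # deleted customer
]
orphans = find_orphans(order_lines, {"customer_id": customer_ids,
                                     "product_id": product_ids})
print(orphans)  # order 2 flagged with missing_refs ["customer_id"]
```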
Orphan Resolution Strategies
Strategy 1: Reconnect. If the master record was deleted but should exist, recreate it and link the orphan transactions. This is common for products that were discontinued but have historical orders.
Strategy 2: Reclassify. Assign orphan transactions to a catch-all master record. Create a "Legacy Customer" contact or an "Archived Product" record and reassign orphans there. This preserves financial totals while acknowledging the data quality issue.
Strategy 3: Archive. Move orphan transactions to an archive table outside the migration scope. Include them in a separate historical data export for reference but do not import them into the new ERP.
Step 3: Data Validation Rules
Field-Level Validation
Apply these validation rules to every record before export:
Text fields:
- No leading or trailing whitespace
- No double spaces within text
- Consistent capitalization (Title Case for names, UPPERCASE for codes)
- No special characters in fields that should be alphanumeric (SKUs, codes)
- Character encoding is consistent (UTF-8 throughout)
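A cleanup script for the text-field rules above might look like this. The `kind` parameter and the allowed-character set for codes (letters, digits, hyphens) are assumptions to adapt to your own SKU conventions:

```python
import re

def clean_text_field(value: str, kind: str = "name") -> str:
    """Apply the text-field rules: trim, collapse spaces, enforce case."""
    cleaned = re.sub(r"\s+", " ", value).strip()   # no leading/trailing/double spaces
    if kind == "name":
        return cleaned.title()                     # Title Case for names
    if kind == "code":
        code = cleaned.upper()                     # UPPERCASE for codes
        if not re.fullmatch(r"[A-Z0-9-]+", code):  # assumed: hyphens allowed in SKUs
            raise ValueError(f"non-alphanumeric characters in code: {code!r}")
        return code
    return cleaned

print(clean_text_field("  acme   corporation "))    # → 'Acme Corporation'
print(clean_text_field("wdg-001-bl", kind="code"))  # → 'WDG-001-BL'
```

Raising an error on invalid codes, rather than silently stripping characters, keeps bad SKUs visible in your exception report instead of hiding them.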
Email fields:
- Contains exactly one @ symbol
- Domain has at least one dot after @
- No spaces in the email address
- Lowercase (email addresses are case-insensitive in practice)
- Not a placeholder or dummy address
Phone fields:
- Consistent format (choose one: +1-555-123-4567 or +15551234567)
- Country code included for international numbers
- No letters or special characters other than +, -, (, )
- Valid length for the country
Date fields:
- Consistent format (ISO 8601: YYYY-MM-DD)
- No future dates where logically impossible (e.g., invoice date in 2030)
- No unreasonably old dates (e.g., order date of 1900-01-01, the default for many systems)
- Date ranges are logical (start date before end date)
Numeric fields:
- No text in numeric fields (commas as thousands separators cause import failures)
- Consistent decimal precision (2 places for currency, 4 places for unit prices with small values)
- No negative values where logically impossible (quantities, prices)
- Currency values in expected range (no $999,999,999 invoices unless you are Boeing)
Required fields:
- Customer name is never blank
- Product name and SKU are never blank
- Invoice number is never blank and never duplicated
- All foreign key references point to existing records
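Several of these field-level rules translate directly into validator functions. A minimal sketch for the email and date rules, where the placeholder local-part list is a hypothetical starter set to extend from your own data:

```python
import re
from datetime import date

PLACEHOLDER_LOCALPARTS = {"test", "noreply", "none"}  # assumed list -- extend as needed

def valid_email(addr: str) -> bool:
    """One @, dotted domain, no spaces, lowercase, not a placeholder."""
    if " " in addr or addr != addr.lower() or addr.count("@") != 1:
        return False
    local, domain = addr.split("@")
    if not local or "." not in domain:
        return False
    return local not in PLACEHOLDER_LOCALPARTS

def valid_invoice_date(iso_date: str) -> bool:
    """ISO 8601, not in the future, not a suspicious legacy default."""
    try:
        d = date.fromisoformat(iso_date)
    except ValueError:
        return False
    return date(1990, 1, 1) <= d <= date.today()

print(valid_email("sales@acme.com"))     # → True
print(valid_email("Sales@acme"))         # → False (uppercase, no dotted domain)
print(valid_invoice_date("1900-01-01"))  # → False (legacy system default)
```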
Cross-Record Validation
Beyond individual field checks, validate consistency across related records:
- Sum of invoice line amounts equals invoice total
- Sum of payments applied to an invoice does not exceed invoice total
- Inventory on-hand does not show negative quantities (unless system allows it)
- Employee start date is before any associated timesheet entries
- Product creation date is before any associated sales order lines
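The first cross-record check (line amounts sum to the invoice total) is a good candidate for automation. A sketch with illustrative field names, using a small tolerance to absorb currency-rounding differences:

```python
def check_invoice_totals(invoices: list[dict], tolerance: float = 0.01) -> list[str]:
    """Flag invoices whose line amounts do not sum to the stated header total."""
    errors = []
    for inv in invoices:
        line_sum = round(sum(line["amount"] for line in inv["lines"]), 2)
        if abs(line_sum - inv["total"]) > tolerance:
            errors.append(f"invoice {inv['number']}: lines sum to {line_sum}, "
                          f"header says {inv['total']}")
    return errors

# Hypothetical invoices -- field names are illustrative.
invoices = [
    {"number": "INV-001", "total": 150.00,
     "lines": [{"amount": 100.00}, {"amount": 50.00}]},
    {"number": "INV-002", "total": 200.00,
     "lines": [{"amount": 100.00}, {"amount": 90.00}]},
]
print(check_invoice_totals(invoices))  # only INV-002 is flagged
```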
Step 4: Archiving Strategy
Not all data needs to migrate. Define an archiving policy that balances compliance requirements, business needs, and migration complexity.
Archiving Decision Framework
| Data Type | Migrate to New ERP | Archive Outside ERP | Delete |
|---|---|---|---|
| Active customers (transaction in last 24 months) | Yes | — | — |
| Inactive customers (no transaction in 24+ months) | No (unless compliance requires) | Yes — CSV + secure storage | — |
| Open orders and invoices | Yes | — | — |
| Closed orders (last 24 months) | Yes | — | — |
| Closed orders (24+ months) | No | Yes | — |
| Current inventory levels | Yes | — | — |
| Historical inventory movements (24+ months) | No | Yes | — |
| Active products | Yes | — | — |
| Discontinued products (with order history) | Yes (as archived/inactive) | — | — |
| Discontinued products (no order history) | No | No | Yes |
| Employee records (active) | Yes | — | — |
| Employee records (terminated 7+ years ago) | No | Yes (legal retention) | — |
| Test/sample/dummy data | No | No | Yes |
| System audit logs | No | Yes (compliance) | — |
Archive Format Recommendations
For data you archive outside the ERP:
- Export to CSV with clear column headers and UTF-8 encoding
- Include a data dictionary that defines each column, its data type, and valid values
- Store in a versioned, immutable location (S3 with versioning, or encrypted backup)
- Set a retention schedule (7 years for financial data in most jurisdictions, longer for some industries)
- Document the archive in your compliance records, including contents, date range, and retention policy
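The first two recommendations (UTF-8 CSV plus a data dictionary) can be scripted together so every archive ships with its documentation. A minimal sketch; the output paths and record fields are illustrative, and a real dictionary would also document data types and valid values per column:

```python
import csv
from pathlib import Path

def archive_records(records: list[dict], out_dir: str, name: str) -> None:
    """Write a UTF-8 CSV plus a companion data-dictionary file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fields = list(records[0].keys())
    with open(out / f"{name}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
    # Minimal data dictionary -- extend with column types and valid values.
    with open(out / f"{name}_dictionary.txt", "w", encoding="utf-8") as f:
        f.write(f"Archive: {name}\nColumns: {', '.join(fields)}\n"
                f"Rows: {len(records)}\n")

archive_records(
    [{"customer_id": "C001", "name": "Acme GmbH", "last_order": "2019-04-02"}],
    "archive", "inactive_customers",
)
```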
Step 5: Master Data Governance
Data cleanup is not a one-time event. Without governance, your shiny new ERP will accumulate the same data quality issues within 12–18 months.
Data Ownership Matrix
| Data Entity | Data Owner (Role) | Responsibilities |
|---|---|---|
| Customers | Sales Manager | Approve new customer creation, quarterly duplicate review, merge requests |
| Products | Product Manager | SKU standards, new product approval, discontinuation process |
| Vendors | Procurement Manager | Vendor onboarding standards, annual vendor review, duplicate prevention |
| Chart of Accounts | Finance Controller | Account creation approval, period-end review, structure changes |
| Employees | HR Manager | Employee data accuracy, lifecycle management (hire to termination) |
| Pricing | Commercial Director | Price list maintenance, discount authority matrix |
Data Entry Standards
Document and enforce standards for each entity:
Customer creation standards:
- Company name: Official legal name (verify against registration documents)
- Trading name: Stored separately if different from legal name
- Address: Use postal service format for the country
- Primary contact: Name + email + phone required
- Payment terms: Default set at creation, require approval to change
- Credit limit: Set by finance, not sales
Product creation standards:
- Product name: [Brand] [Product] [Variant] [Size] (e.g., "ECOSIRE Widget Blue Large")
- SKU: [Category]-[Sequence]-[Variant] (e.g., "WDG-001-BL")
- Description: Minimum 50 characters, no HTML formatting in descriptions
- Category: Must select from existing categories (no free-text categories)
- Unit of measure: Must use standard UoM from the approved list
- Images: Minimum one image, maximum dimensions 2048x2048, white background
Automated Data Quality Rules
Configure these rules in your new ERP to prevent dirty data from the start:
- Duplicate prevention: Warn on save if a record with the same email, phone, or tax ID already exists
- Required field enforcement: Block creation if mandatory fields are empty
- Format validation: Reject invalid email formats, phone formats, and date formats
- Approval workflows: New customer and vendor creation requires manager approval
- Periodic review: Automated reports highlighting records not updated in 12+ months
Step 6: Data Quality Scoring
Scoring Methodology
Score each data entity on four dimensions, each rated 1–5:
| Dimension | Score 1 | Score 3 | Score 5 |
|---|---|---|---|
| Completeness | >30% of required fields blank | 10–30% blank | <5% blank |
| Consistency | No standards, wildly varying formats | Some standards, partial compliance | Clear standards, >95% compliance |
| Accuracy | >20% of sample records have errors | 5–20% errors | <2% errors (verified sample) |
| Uniqueness | >10% duplicate rate | 3–10% duplicates | <1% duplicates |
Scoring Process
- Sample: Random 5% of records (minimum 100, maximum 500)
- Check completeness: Count blank required fields as a percentage
- Check consistency: Review format compliance for text, date, phone, and email fields
- Check accuracy: Verify sampled records against external sources (website, registration databases, physical inventory count)
- Check uniqueness: Run duplicate detection on full dataset, calculate rate
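Turning measured rates into 1-5 scores can be mechanized. The bands below are taken from the scoring table, with scores 2 and 4 interpolated to fill the gaps (an assumption; adjust the bands to your own policy). Only completeness and uniqueness are shown, since they are the two dimensions computed directly from rates:

```python
def dimension_score(rate: float, thresholds: list[tuple[float, int]]) -> int:
    """Map a measured blank/duplicate rate to a 1-5 score.
    `thresholds` pairs an upper bound (inclusive) with the score it earns."""
    for bound, score in thresholds:
        if rate <= bound:
            return score
    return 1

# Bands from the scoring table; 2 and 4 interpolated to fill the gaps.
COMPLETENESS = [(0.05, 5), (0.10, 4), (0.30, 3), (0.40, 2)]
UNIQUENESS = [(0.01, 5), (0.03, 4), (0.10, 3), (0.15, 2)]

def entity_score(blank_rate: float, dup_rate: float) -> float:
    """Average of the dimensions measured here (completeness, uniqueness)."""
    return (dimension_score(blank_rate, COMPLETENESS)
            + dimension_score(dup_rate, UNIQUENESS)) / 2

print(entity_score(blank_rate=0.04, dup_rate=0.02))  # → 4.5, migration-ready
print(entity_score(blank_rate=0.35, dup_rate=0.12))  # → 2.0, below threshold
```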
Minimum Quality Thresholds for Migration
| Entity | Minimum Average Score | Recommended |
|---|---|---|
| Customers | 3.5 | 4.0+ |
| Products | 3.5 | 4.0+ |
| Vendors | 3.0 | 3.5+ |
| Chart of Accounts | 4.0 | 4.5+ |
| Open Orders | 3.5 | 4.0+ |
| Open Invoices | 4.0 | 4.5+ |
| Employees | 3.5 | 4.0+ |
Do not proceed with migration for any entity scoring below the minimum threshold. The cost of cleaning data after import is 3–5 times higher than cleaning before import.
Data Cleanup Timeline Template
| Week | Activity | Deliverable |
|---|---|---|
| 1 | Initial quality assessment and scoring | Quality score report per entity |
| 2 | Duplicate detection run + merge planning | Duplicate groups with proposed merge actions |
| 3 | Orphan record identification | Orphan report with resolution recommendations |
| 4 | Data owner assignment and standards documentation | Data governance document |
| 5–6 | Bulk cleanup: duplicates, orphans, format standardization | Cleaned master data exports |
| 7 | Validation rule execution and exception handling | Validation exceptions report |
| 8 | Re-scoring and certification | Final quality scores (all above thresholds) |
| 9 | Archive old data, document retention policies | Archive files + retention schedule |
| 10 | Final export for migration import | Clean, validated, migration-ready data files |
Tools and Resources
Open-Source Data Cleanup Tools
- OpenRefine: Powerful data cleaning tool for clustering, faceting, and transforming messy data
- dedupe.io: Machine learning-based deduplication library for Python
- Great Expectations: Data validation framework for automated quality checks
- pandas (Python): Flexible data manipulation for custom cleanup scripts
- csvkit: Command-line tools for CSV inspection and validation
Commercial Data Quality Platforms
- Informatica Data Quality: Enterprise-grade cleansing and matching
- Talend Data Quality: Profiling, cleansing, and standardization
- Melissa Data: Address verification, email validation, duplicate detection
- IBM InfoSphere QualityStage: Master data matching and standardization
Frequently Asked Questions
How long does data cleanup take?
For a mid-size business (5,000–50,000 customer records, 1,000–10,000 products), plan for 6–10 weeks of dedicated effort. This assumes one full-time data analyst plus part-time involvement from data owners in each department. Larger enterprises with hundreds of thousands of records or complex multi-system landscapes may need 12–16 weeks.
Should we clean data in the old system or in staging files?
Clean in staging files (exported CSVs or a staging database), not in the live system. This preserves your production data as a fallback, allows parallel cleanup by multiple people, and avoids disrupting daily operations. Your live system continues running untouched until the clean data is imported into the new ERP.
What if we cannot reach the minimum quality threshold?
If a specific entity cannot reach the minimum score, investigate the root cause. If it is a data volume problem (too many records to clean manually), consider importing only the most recent or most active subset and archiving the rest. If it is a structural problem (the data was never designed to support what the new ERP needs), you may need to enrich data from external sources or accept that some records will require manual attention post-migration.
Who should be responsible for data cleanup?
Data cleanup is a business responsibility, not an IT responsibility. IT provides the tools and infrastructure, but business users must make the decisions: which duplicate record to keep, whether an orphan order should be reconnected or archived, and what the correct product name format should be. Assign data owners from each department and hold them accountable for their entity quality scores.
Can we automate data cleanup?
Partially. Automated tools handle format standardization (phone numbers, addresses, dates), exact-match deduplication, and validation rule checking. But merging fuzzy-match duplicates, resolving orphan records, and verifying data accuracy require human judgment. Plan for 60% automated / 40% manual effort.
What if we discover data quality issues after migration?
Post-migration cleanup is 3–5 times more expensive than pre-migration cleanup because you are now dealing with a live system where changes affect active workflows. If you discover issues after go-live, prioritize by business impact: fix records that affect financial accuracy first, then customer-facing records, then internal operational records.
Does ECOSIRE help with data cleanup?
Yes. Data cleanup is a core component of ECOSIRE's migration services. We provide data profiling, automated deduplication, quality scoring, and cleanup scripting as part of every migration project. Our team works alongside your data owners to ensure business context drives every cleanup decision. Contact us to discuss your data quality challenges.
Start with a Data Quality Assessment
The first step in any migration is understanding the current state of your data. A data quality assessment takes 3–5 days and produces a detailed report showing duplicate rates, completeness scores, format inconsistencies, and orphan record counts for every major entity.
ECOSIRE offers complimentary data quality assessments as part of our migration planning services. We will analyze your current data, identify the highest-impact cleanup tasks, and provide a realistic timeline and effort estimate for achieving migration-ready quality.
Request your free data quality assessment and take the first step toward a clean, successful migration.
Written by
ECOSIRE Team, Technical Writing
The ECOSIRE technical writing team covers Odoo ERP, Shopify eCommerce, AI agents, Power BI analytics, GoHighLevel automation, and enterprise software best practices. Our guides help businesses make informed technology decisions.