Introduction

Clean data is essential for accurate analysis and processing. CSV files often contain inconsistencies, duplicates, and formatting issues. This guide covers techniques for cleaning CSV data effectively.

Common CSV Issues

1. Inconsistent Formatting

Name,Email,Age
John Doe,john@example.com,30
Jane Smith,  jane@example.com  ,25
Bob,BOB@EXAMPLE.COM,35

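Problems:

  • Extra whitespace (the email in the second row is padded with spaces)
  • Inconsistent capitalization (BOB@EXAMPLE.COM versus lowercase addresses)
  • Mixed formats (full names alongside a first name only)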

2. Missing Values

Name,Email,Age
John Doe,,30
Jane Smith,jane@example.com,
Bob,bob@example.com,35
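
Gaps like these are easier to handle once they are listed explicitly. The following is a minimal sketch, assuming the file has already been parsed into an array of string arrays with the header in the first row; `findMissing` and `sampleRows` are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: reports which columns are empty in each data row.
// Assumes rows[0] is the header and every row is an array of strings.
function findMissing(rows) {
  const [header, ...data] = rows;
  const report = [];
  data.forEach((row, i) => {
    // A cell counts as missing when it is absent or whitespace-only.
    const missing = header.filter((_, col) => (row[col] ?? "").trim() === "");
    if (missing.length > 0) {
      report.push({ line: i + 2, missing }); // +2: 1-based, after the header
    }
  });
  return report;
}

const sampleRows = [
  ["Name", "Email", "Age"],
  ["John Doe", "", "30"],
  ["Jane Smith", "jane@example.com", ""],
  ["Bob", "bob@example.com", "35"],
];

console.log(findMissing(sampleRows));
// [ { line: 2, missing: [ 'Email' ] }, { line: 3, missing: [ 'Age' ] } ]
```

Whether to drop, default, or flag such rows depends on the analysis; the sketch only locates them.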

3. Duplicates

Name,Email
Alice,alice@example.com
Bob,bob@example.com
Alice,alice@example.com

4. Encoding Issues

Name,Description
José,Español text
Müller,German text
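
Accented characters such as these become garbled when a file is decoded with the wrong encoding. One possible approach, sketched below under the assumption of a Node.js runtime, is to try strict UTF-8 first and fall back to Latin-1 if the bytes are not valid UTF-8. `decodeCsvBytes` is a hypothetical name, and Latin-1 as the fallback is an assumption; a real file may use a different legacy encoding.

```javascript
// Hypothetical helper: decodes raw CSV bytes to a string.
// Assumes Node.js (Buffer is Node-specific) and that non-UTF-8 input is Latin-1.
function decodeCsvBytes(bytes) {
  try {
    // fatal: true makes the decoder throw on invalid UTF-8 instead of
    // silently inserting replacement characters. A leading BOM is dropped.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (err) {
    // Fallback assumption: every byte maps to one Latin-1 character.
    return Buffer.from(bytes).toString("latin1");
  }
}

const utf8Bytes = Buffer.from("José,Español text", "utf8");
const latin1Bytes = Buffer.from("José,Español text", "latin1");

console.log(decodeCsvBytes(utf8Bytes) === decodeCsvBytes(latin1Bytes)); // true
```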

Cleaning Techniques

1. Trim Whitespace

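// Trims whitespace around every cell.
// Note: the plain comma split assumes no quoted fields that contain
// commas or line breaks.
function cleanCSV(csv) {
  return csv
    .split(/\r?\n/) // accept both LF and CRLF line endings
    .map((row) =>
      row
        .split(",")
        .map((cell) => cell.trim())
        .join(",")
    )
    .join("\n");
}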
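
Splitting on every comma breaks as soon as a value is quoted and contains a comma itself. The sketch below shows one way to split a single line while respecting double quotes and doubled-quote escapes; `splitCsvLine` is a hypothetical helper, and it assumes a field never spans multiple lines. For production data a dedicated CSV parser is likely the safer choice.

```javascript
// Hypothetical helper: splits one CSV line into trimmed cells.
// Handles "quoted, fields" and "" as an escaped quote.
// Assumption: no line breaks inside quoted fields.
// Note: whitespace inside quotes is trimmed as well.
function splitCsvLine(line) {
  const cells = [];
  let cell = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        cell += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false; // closing quote
      } else {
        cell += ch;
      }
    } else if (ch === '"') {
      inQuotes = true; // opening quote
    } else if (ch === ",") {
      cells.push(cell.trim());
      cell = "";
    } else {
      cell += ch;
    }
  }
  cells.push(cell.trim());
  return cells;
}

console.log(splitCsvLine('"Smith, Jane",  jane@example.com  ,25'));
// [ 'Smith, Jane', 'jane@example.com', '25' ]
```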

2. Remove Duplicates

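// `rows` is assumed to be an array of string arrays (one array per CSV row).
const seen = new Set();
const unique = rows.filter((row) => {
  // JSON.stringify keeps cell boundaries, so ["a,b", "c"] and ["a", "b,c"]
  // do not collapse into the same key the way join(",") would.
  const key = JSON.stringify(row);
  if (seen.has(key)) return false; // already seen: drop the duplicate
  seen.add(key);
  return true; // first occurrence: keep it
});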
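
Exact-match comparison misses rows that differ only in case or padding. As a sketch, the same Set-based idea can take a key function so that duplicates are detected on a normalized column. `dedupeBy` and `contacts` are hypothetical names, and treating the email column as the identity of a record is an assumption.

```javascript
// Hypothetical helper: keeps the first row for each key produced by keyFn.
function dedupeBy(rows, keyFn) {
  const seen = new Set();
  return rows.filter((row) => {
    const key = keyFn(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const contacts = [
  ["Alice", "alice@example.com"],
  ["Bob", "bob@example.com"],
  ["Alice", " ALICE@example.com "],
];

// Assumption: two rows describe the same contact when their emails match
// after trimming and lowercasing.
const uniqueContacts = dedupeBy(contacts, (row) => row[1].trim().toLowerCase());

console.log(uniqueContacts.length); // 2
```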

3. Normalize Text

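// Expects an object with string fields name, email, and age.
function normalize(row) {
  return {
    name: row.name.trim().toLowerCase(),
    email: row.email.trim().toLowerCase(),
    // Radix 10 makes the parse explicit; an empty age yields NaN.
    age: Number.parseInt(row.age, 10),
  };
}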
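
After normalizing, it helps to validate each record before using it. The sketch below assumes records in the shape returned by normalize (name, email, and a numeric age). `validateRow` is a hypothetical helper; the email pattern is a deliberately loose check, and the 0 to 150 age range is an arbitrary assumption.

```javascript
// Hypothetical helper: returns a list of problems found in one record.
// An empty list means the record passed every check.
function validateRow(row) {
  const errors = [];
  if (!row.name) {
    errors.push("name is empty");
  }
  // Loose shape check only: something@something.something, no spaces.
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(row.email)) {
    errors.push("email looks malformed");
  }
  // Number.isInteger rejects NaN, which parseInt returns for empty input.
  if (!Number.isInteger(row.age) || row.age < 0 || row.age > 150) {
    errors.push("age is missing or out of range");
  }
  return errors;
}

console.log(validateRow({ name: "bob", email: "bob@example.com", age: 35 }));
// []
console.log(validateRow({ name: "john doe", email: "", age: NaN }));
// [ 'email looks malformed', 'age is missing or out of range' ]
```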

Conclusion

CSV cleaning ensures:

  • Accurate analysis
  • Consistent data
  • Better processing
  • Fewer errors

Key steps:

  • Trim whitespace
  • Remove duplicates
  • Normalize values
  • Validate data
  • Handle encoding
