CSV files are everywhere to define structured data for software applications or data sets. Despite being ubiquitous, CSV manipulation is hard and often requires technical knowledge.
For simple manipulations, spreadsheets are enough. However, they are limited when it comes to:
- Consolidate CSV files using a unique identifier
- Remove duplicate rows from a CSV
- Merge CSV rows that have a similar identifier column
In this guide, we will focus on the last two manipulations: removing and/or merging duplicates from CSV files.
If you have several CSV files that you want to join together using a unique column, go to the join csv files guide.
Remove duplicates from CSV files
You have exported your data from an application in a CSV file and want to get rid of the duplicates? You have concatenated several CSV files into a big one and now need to clean it? Or maybe you have several sources of data no longer in sync and want to merge all those versions into a single listing. Microsoft Excel or Google Sheets will not help you with your problem because they don't allow you to set up a unique constraint on a column.
On a small CSV file, this task can be done manually. On bigger files, it would take hours to process the entries one by one. And you risk human errors.
What you need is a tool to automatically detect CSV entries with similar values in a specific column. Then depending on your settings, when several entries with the same unique value are detected, the entries are either dropped or merged.
Datablist is great to perform simple data operations that are not possible with spreadsheets.
In this guide, we will work with a single CSV file that contains thousands of entries. Not only we will remove duplicate entries, but we will merge data from the matching entries and consolidate them.
You can download the tutorial CSV file here!
The CSV file contains 4 columns:
Job Title. We want to merge entries that share the same email address. The process can be summed up with:
- Load your CSV file into a Datablist collection
- Create properties matching CSV columns
- Define a unique constraint on the email property
- Import the CSV file
Step 1: Load your CSV file into a Datablist collection
When creating a new collection, we can select data types with built-in properties (see also our article on common data models) or create an empty collection with no property.
For this example, we will select an empty collection so the properties will be created from the CSV columns.
Once the collection is created, click on the "Import CSV" button.
Step 2: Create properties for your CSV columns
With the CSV file now loaded, we can create properties for our collection using the CSV column names. Datablist lists every column found in the CSV so we can create a property for each of them.
Step 3: Define your unique constraint
Here is the important step! We want to remove duplicates entries from the CSV! In our example, duplicates are entries with the same email address. To add this constraint, we add the
Unique values option on the
Unique values constraint, Datablist will ensure two entries can't share the same
Step 4: Review and import
The last step is to select the merging mode. What should Datablist do when several entries have the same
The default mode is "Soft Merge". When an entry from the CSV has a value (
We click the "Import items" button and we are done! 💪 Duplicate CSV entries are merged and we have a clean collection! If needed, we are now able to import other CSV files into this collection.
Export to CSV if needed
If you need to use the result in another tool, click on the "Export" button to export the collection as another CSV file.
If you have any questions, please contact us. And if you want to join the beta, please fill in the form bellow 👇.