CSV files are everywhere to define structured data for software applications or data sets. Despite being ubiquitous, CSV manipulation is hard and often requires technical knowledge.
For simple manipulations, spreadsheets are enough. However, they are limited when it comes to:
- Consolidate CSV files using a unique identifier
- Dedupe rows in a CSV
In this guide, we will focus on the last manipulation: detecting and removing duplicates from CSV files.
If you have several CSV files that you want to join together using a unique column, go to the join csv files guide.
Merge duplicates from CSV files
You have exported your data from an application in a CSV file and want to get rid of the duplicates? You have concatenated several CSV files into a big one and now need to clean it? Or maybe you have several sources of data no longer in sync and want to merge all those versions into a single listing. Microsoft Excel or Google Sheets will not help you with your problem because they don't allow you to set up a unique constraint on a column.
On a small CSV file, this task can be done manually. On bigger files, it would take hours to process the entries one by one. And you risk human errors.
What you need is a tool to automatically detect CSV entries with similar values in one or several columns. Then, once the duplicates are found, you can edit or merge them to consolidate their data and remove the duplicates.
Datablist is great to perform simple data operations that are not possible with spreadsheets. Use it when you are looking for a great online CSV editor.
In this guide, we will work with 2 CSV files that contains thousands of entries. We will load them into a single collection and de-duplicate entries based on 1 of the 4 columns.
The CSV file contains 4 columns:
Job Title. We want to merge entries that share the same email address.
The process can be summed up with:
- Load your CSV files into a Datablist collection
- Create properties matching CSV columns
- Select the properties to perform the duplicates check
- Merge or delete the duplicates
Step 1: Load your CSV files into a Datablist collection
Create a new collection
The first step in our process is to load the CSV file into Datablist. Open Datablist (No signup required) to get started.
To create a new collection, click the "New collection" button with the +. Once the collection is created, give it a name and an icon.
Then click on the Import CSV button.
Create properties for your CSV file
With the CSV file now loaded, you can create properties for your collection using the CSV column names. Datablist lists every column found in the CSV so you can create a property for each of them.
CSV data is not typed. When reading a CSV file, everything is a text. To provide better filtering and sorting features, Datablist tries to detect data types by looking at the first 100 lines. For example, when it finds only numbers, it set the property data type to number. Same with date, email, checkboxes (true or false values), etc.
Review and import
In the review step, the CSV rows will be listed directly from the CSV file. Be sure the data is well formatted and is consistent. Then click the "Import items" button and you are done! 💪
Do it again for your other CSV files
Now you have a collection with properties configured, perform the "Import CSV/Excel" process to import your other CSV files or Excel files into the same collection.
Step 2: Find duplicates
After loading the CSV files, the second step is to search for duplicate values. In the collection data listing, click the "Duplicates" button in the header (top right).
Two modes are available:
- All Properties - In this mode, Datablist looks for items with similar values for all properties. Two items would be considered similar when all of their properties match.
- Selected Properties - For this second mode, you will select the properties to be used for similarity check. Two items would be considered similar when they have similar values on all the selected properties.
Once the analysis is done, Datablist lists all your duplicate items based on the
- Edit the item - Use values from incomplete items to merge the data into a single item.
- Merge duplicates - Merge values from secondary items to a selected primary item
- Delete the extra items - When duplicates don't bring value, just remove the duplicates
Step 3: Merge duplicates
Usually, you want to merge all your CSV duplicate rows into a single item and consolidate the data. Thus, merging duplicates without losing data in the process.
To merge duplicates, click on the "Merge Items" button on the left of every duplicates group.
It opens a merging tool. On the right, a "Primary Item" is shown and on the left the remaining duplicate items are called "Secondary Items". Datablist elects the item with the most data as "Primary item".
When possible, property values from secondary items are auto selected to be merged into the primary items. If several values conflict, you will have to make a decision and select which value to keep.
If the resulting "Primary item" suits you, click the Merge button to confirm the merge process. All the secondary items will be deleted to keep only one combined item.
Export to CSV if needed
Congrats, you have successfully deduped your CSV files! If you need to use the result in another tool, click on the "Export" button to export the collection as another CSV file.
If you have any questions, please contact us.