The CSV file format is a standard to exchange structured data between companies or different software applications. It is text-based and uses punctuation delimiters to separate columns.
Here is an example of CSV file data:
firstName, lastName, email Joe, Vaughn, email@example.com Bob, Cunnighan, firstname.lastname@example.org
CSV files can be exported from most tools dealing with data, including CRM, Order Management System, Spreadsheets (Google Sheets or Microsoft Excel), and financial products. In the future, we might have unified data models to transfer structured items between applications. In the meantime, we must rely on CSV files.
When it comes to manipulating CSV files, the go-to solution is spreadsheet. Loading a CSV file into Google Sheets or Microsoft Excel is straightforward. However, those tools are limited when it comes to two simple operations:
- Joining CSV files by a unique column
- Deduplicate and merge CSV entries on a column unique constraint
Spreadsheets cannot define a unique constraint on a column, therefore provide limited support for joining CSV or deduplication.
This guide contains 2 parts:
- Join CSV files on a common column with Datablist
- Join CSV files using Google Sheet (or Microsoft Excel)
In this tutorial, we use 2 demo CSV files:
Solution 1: Join CSV files on a common column with Datablist
Step1: Load your first CSV file
The first step is to create a collection to pour in all your CSV data. Datablist main idea is to use common data models to benefit from built-in properties and actions. If you only need to join CSV files, select "Start from scratch" to create a collection with no property.
Once your collection is created, go to the "Import CSV" section.
Note: Your CSV file first line must contain column names.
Drag and drop a CSV file or click to select a file from your computer. Once the file is loaded, check that the number of lines and columns preview displayed is correct before moving to the next step.
Map your CSV columns with collection properties or create new properties.
The last step is to click on the "Import" button to launch the import process. Your first CSV file is imported!
Step2: Define which column to use as a unique identifier
Now that you have imported your first CSV file, you can define a "unique values" constraint on a collection property. With this information, Datablist will merge new CSV imports to respect the constraint. Go to the columns configuration and edit the property that will be your unique identifier. Check the "Unique Values" attribute and save.
Step3: Load one or more CSV files
Once you have set a unique constraint on a collection property, import your other CSV files one by one in the existing collection. Create new collection properties if needed during the CSV mapping step.
When your collection has items and has a unique constraint, you can select a merging option during a CSV file import.
Select how data would be merged into the collection:
- Soft Merge: If data exists for the item property, don't update it. This is the default setting.
- Hard Merge: If data exists for the item property, update it
The "Skip item" option skips the row when an entry with the same identifier value is found in the collection, it must not be selected in order to join CSV files.
Step 4: Export to CSV if needed
Congratulations! 🎉 You've successfully combined CSV files using a common column! If you need to use the result in another tool, click on the "Export" button to export the collection as another CSV file.
Full step by step video to join CSV files with Datablist
In the video below, the property "unique values" settings is set directly during the property creation.
Solution 2: Join CSV files using Google Sheet (or Microsoft Excel)
Spreadsheet tools have limited support for merging CSV files with a common column. However, a cell formula can find a row in another table that matches a value. When applied to each row of a table, it can search in another table and return the value from any column in the matched row.
- With spreadsheets, one CSV file is used as a master table and must contain all the possible values for the join column.
- In all the secondary tables, the join column must be the first column.
Step 1: Load your CSV files
In this tutorial, we use Google Sheets (the formula VLOOKUP is similar in Microsoft Excel).
Among your CSVs, choose the one with the most values to be used as the master table. The other will be called secondary CSV files.
First load your master CSV file with
File -> Import and select your CSV (go to the
Upload tab to use a CSV file from your computer).
Import Location, select
Insert new sheet(s).
Load a CSV file in Google Sheets
Perform this operation for your secondary CSV file(s). Each CSV file must be uploaded to a dedicated table in the the spreadsheet.
CSV files imported in Google Sheets
Step 2: Create new columns in the master sheet
The table with your master CSV is your master table and it will receive values from other tables. In this master table, create new columns to store data from your other tables.
For this tutorial, we want to have the
Job Title data from our secondary table in our master table, so we add an empty
Job Title column.
New Job Title Column
Step 3: Move your unique column as the first column in your secondary tables
VLOOKUP formula works by searching the first column in the table it searches in. In all your secondary tables, (no need to do it on the master table), move the column to use as join column in the first position.
Identifier Column must be the first column
Step 3: Use the VLOOKUP formula
The final step is to use the
VLOOKUP formula to find rows in the other tables and to display a column from the matched row.
The formula takes 3 arguments:
- search_key - The value to search for. It will be the unique identifier row's value.;
- range - The range to consider for the search. The first column in the range is searched for the key specified in search_key. Define your whole secondary table as the range (see video).
- index - The column index in the row to be returned. The first column in the range is numbered 1.
- is_sorted - [TRUE by default] - Indicates whether the column to be searched (the first column of the specified range) is sorted. FALSE is recommended in most cases. If set to True and your data is not sorted, your results will be wrong!
VLOOKUP(search_key, range, index, [is_sorted])
See the video below to understand how to use
VLOOKUP to join data on a unique column:
Repeat the operation for every other column from your secondary tables 💪.