Changelog

New features, improvements and fixes to Datablist.

November 1st, 2023

Calculations

You can now run calculations on property values. Calculations are accessible from a property column menu.

Datablist runs the calculation on the "current view". It picks the items to process in this order of priority:

  • If you have selected items in your collection, it will process them.
  • If you have a filter or a full-text search term, it will process the filtered items.
  • Otherwise, it will process all your collection items.
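As a rough illustration, the priority order above can be sketched in Python. This is not Datablist's code: the function name, the `selected` list, and the `active_filter` callable are hypothetical names chosen for the example.

```python
from __future__ import annotations

from typing import Callable, Optional


def items_to_process(
    collection: list[dict],
    selected: Optional[list[dict]] = None,
    active_filter: Optional[Callable[[dict], bool]] = None,
) -> list[dict]:
    """Pick the items a calculation runs on (hypothetical helper)."""
    # 1. Selected items win over everything else.
    if selected:
        return selected
    # 2. Otherwise, use the items matching the filter or search term.
    if active_filter is not None:
        return [item for item in collection if active_filter(item)]
    # 3. Otherwise, use the whole collection.
    return collection
```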

Calculations available for all data types:

  • Count Empty - Number of items with an empty value for the property.
  • Count Filled - Number of items with a value for the property.

Other calculations depend on the property's data type, such as Text or Number.

Calculations available for text-based data types:

  • Characters count - Return the sum of all characters. Leading and trailing spaces are not counted. Spaces in between words are.
  • Words count - Return the number of words found in the texts.
  • Count distinct values - Return facets for a property with how many times each value appears. This is great for aggregation of limited choice values (countries, status, etc.).
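The three text calculations can be approximated with a few lines of Python. This is a minimal sketch, assuming values are plain strings, empty values are `None` or `""`, and words are separated by whitespace; the function names are made up for the example and Datablist's exact rules may differ.

```python
from collections import Counter


def characters_count(values):
    # Leading and trailing spaces are ignored, inner spaces are counted.
    return sum(len(v.strip()) for v in values if v)


def words_count(values):
    # Assumption: a word is any run of non-whitespace characters.
    return sum(len(v.split()) for v in values if v)


def count_distinct(values):
    # Map each non-empty value to the number of times it appears.
    return Counter(v for v in values if v)
```

For a "Country" property, `count_distinct(["France", "Spain", "France"])` gives 2 for "France" and 1 for "Spain".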

For number-based data types:

  • Min - Return the lowest value for the property.
  • Max - Return the highest value for the property.
  • Average - Return the sum of values divided by the number of non-empty values.
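The point worth noting in the number calculations is that empty values are left out of the average. A small sketch, assuming empty values are represented as `None` (the function name is hypothetical):

```python
def number_stats(values):
    # Keep only non-empty values; 0 is a real value and stays in.
    filled = [v for v in values if v is not None]
    if not filled:
        return None
    return {
        "min": min(filled),
        "max": max(filled),
        # Sum of values divided by the number of non-empty values.
        "average": sum(filled) / len(filled),
    }
```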

Check the calculations documentation to learn more.

Filter Groups

Data Filtering has been improved with "Filter Groups".

With Filter Groups, you can create complex filters with different filtering operations. Filtering operations define how filters are combined. With "AND", an item must pass all conditions. With "OR", an item passes once one of the filters returns true.
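To make the "AND" / "OR" semantics concrete, here is a small Python sketch. The `FilterGroup` class is hypothetical, and allowing a group to contain other groups is an assumption made for the example, not a description of how Datablist stores filters.

```python
from dataclasses import dataclass, field


@dataclass
class FilterGroup:
    operator: str  # "AND" or "OR"
    # Children are either conditions (callables) or nested FilterGroups.
    children: list = field(default_factory=list)

    def matches(self, item: dict) -> bool:
        results = (
            child.matches(item) if isinstance(child, FilterGroup) else child(item)
            for child in self.children
        )
        if self.operator == "AND":
            return all(results)  # every condition must pass
        if self.operator == "OR":
            return any(results)  # one passing condition is enough
        raise ValueError(f"Unknown operator: {self.operator}")


# Country is France AND (status is lead OR status is customer)
group = FilterGroup("AND", [
    lambda item: item["country"] == "France",
    FilterGroup("OR", [
        lambda item: item["status"] == "lead",
        lambda item: item["status"] == "customer",
    ]),
])
```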

Filter Groups are compatible with Saved Filters.

Duplicate Finder Improvements

Select a different algorithm for each property

Until now, a single data-matching algorithm was selected before the deduplication process. Internally, Datablist checked each property data type to apply the selected algorithm on compatible properties. And it fell back to Exact matching on the other properties (e.g. Date, Checkbox, Number).

Now, each property used for deduplication is listed in the data-matching algorithm step.

For each property, compatible algorithms are listed according to its data type, and the options apply only to that property.

For example, two properties might use a fuzzy matching algorithm and have different distance thresholds.

Ignore the case in the Exact algorithm

By default, Datablist Duplicates Finder is case-insensitive. But in some cases, you need to match duplicate values only when they have the same case.

A new option is available for the "Exact" Algorithm to be case-sensitive.

Master Item Rule selection

After the data matching step, an important part of deduplication is duplicate merging. With the auto-merge algorithm, Datablist selects a master item, merges the values from the other items in it, and deletes all but the master item.

By default, the elected master item is the one with the most data.

A new setting has been added in the auto-merging assistant to change this master item selection.

Two new rules are now available:

  • Last Updated - This rule chooses the item based on the newest modified date.
  • First Created - This rule chooses the item based on the oldest creation date.

During this development cycle, the "Most Complete" default rule has also been improved. Until now, the rule checked how many properties had data. When two items had the same number of properties with data, it took the last created item.

Now, for two items with the same number of properties with data, it also checks the text length.

For two items such as:

First Name | Last Name | Notes
John | Doe | A great man.
John | Doe | A great man. Remember to contact him.

The second one will be selected as the master item. The "Notes" text is longer for the second item.
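The three master item rules can be sketched in Python as follows. This is an illustration only: the helper names are invented, empty values are assumed to be `None` or `""`, and the `createdAt` / `updatedAt` keys are assumed to hold sortable dates.

```python
def _filled(item, properties):
    return [p for p in properties if item.get(p) not in (None, "")]


def most_complete(items, properties):
    # First compare the number of properties with data,
    # then break ties with the total text length.
    def score(item):
        filled = _filled(item, properties)
        return (len(filled), sum(len(str(item[p])) for p in filled))
    return max(items, key=score)


def last_updated(items):
    return max(items, key=lambda item: item["updatedAt"])


def first_created(items):
    return min(items, key=lambda item: item["createdAt"])
```

With the two "John Doe" items above, both have three properties with data, so the longer "Notes" text decides and the second item is elected.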

Normalize street names

In Data Cleaning, normalization ensures you have a uniform format across all your data. Normalization reduces errors during deduplication and you get a consistent view of your data.

I have several built-in normalizations in mind for later:

  • Company name normalization to remove suffixes such as "Inc." or "GmbH".
  • People name normalization to clean nicknames, deal with initials, etc.

Last month, I released the first normalization algorithm to deal with street names written in English.

The "Normalize Street Name" algorithm deals with abbreviations (St. == St == Street), directional words (N 45 == North 45), etc.
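A toy version of this kind of normalization, in Python, shows the idea. The abbreviation table below is a tiny hand-written sample and the lowercase output is an assumption; the rules Datablist actually applies are not documented here.

```python
import re

# Illustrative sample only, not Datablist's actual mapping.
ABBREVIATIONS = {
    "st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard",
    "n": "north", "s": "south", "e": "east", "w": "west",
}


def normalize_street_name(value: str) -> str:
    # Lowercase, drop punctuation, then expand known abbreviations.
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)
```

"12 Main St.", "12 Main St" and "12 Main Street" all normalize to the same text, as do "N 45" and "North 45". A real implementation needs more context than this lookup: "St" can also stand for "Saint", for example.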

Other Improvements & Fixes

  • Option to auto-generate column names during import for files without headers.
  • Fix Excel export in selected items (and duplicate groups download).
  • Fix auto merging on properties with punctuation differences.
  • Show how many duplicate groups have been merged during the auto-merge process.
  • Auto-update the disposable provider domain list and add Stop Forum Spam as a new source.
  • Fix anonymous collection import for collections with more than 10k items.
  • Auto-open the Datetime picker when editing a cell.
  • Show data loss warning every 48 hours for collections not synced to the cloud (anonymous, or free account with more than 1000 items per collection).


August 22, 2023

Datablist Extractor: Extract domains, email addresses, mentions, etc.

With Datablist Extractor, you can now extract the domains from a list of email addresses, or find all URLs in texts.

Domains, Emails, URLs, mentions (@xx), tags (#xx), etc. are structured entities to use later to enrich a company, a contact, or websites.

This ranked high among the requested features. And it will play nicely with future enrichments (see "Notes on enrichments" below).

For the first release, the following extractors are available:

  • Extract the domain from an email address
  • Extract the domain from a URL
  • Extract URL(s) from a text
  • Extract mentions (ex: @name) from a text
  • Extract tags (ex: #string) from a text
  • Extract emails from a text
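For readers who want to reproduce these extractions outside Datablist, a regex-based sketch in Python gives the general idea. The patterns are deliberately simple assumptions (they do not cover every valid email or URL) and are not the ones Datablist uses.

```python
import re
from urllib.parse import urlparse

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")
# The lookbehind keeps the "@acme" part of an email from counting as a mention.
MENTION_RE = re.compile(r"(?<![\w@])@(\w+)")
TAG_RE = re.compile(r"(?<!\w)#(\w+)")


def domain_from_email(email: str) -> str:
    return email.rsplit("@", 1)[1].lower()


def domain_from_url(url: str):
    # Returns None when the URL has no scheme (e.g. "acme.com/pricing").
    return urlparse(url).hostname


def extract_entities(text: str) -> dict:
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
        "tags": TAG_RE.findall(text),
    }
```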

Feel free to contact me if you need other extractors.

Datablist Extractor is available from the "Edit" button.

Deduplication with Fuzzy Matching

Datablist Duplicates Finder is getting better with fuzzy matching. Fuzzy comparisons work by calculating the similarity between two strings with a distance function. And a threshold lets you decide when the strings must be considered similar.

Fuzzy matching is perfect for finding duplicate leads with typos in people or company names, or items with the same postal address written with variations.

Datablist implements two distance algorithms.

The threshold ranges from 20 to 100, where 100 means an exact match. The default value is 80.
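To get a feel for how a similarity threshold behaves, here is a Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is an assumption for the example and not one of Datablist's distance algorithms; only the 20 to 100 range and the default of 80 come from the text above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Score between 0 and 100; 100 means the strings are identical
    # (ignoring case in this sketch).
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: int = 80) -> bool:
    if not 20 <= threshold <= 100:
        raise ValueError("threshold must be between 20 and 100")
    return similarity(a, b) >= threshold
```

With the default threshold, "Jonathan Smith" and "Jonathon Smith" are considered similar, while "Acme" and "Globex" are not. Lowering the threshold catches more typos but also produces more false positives.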

Apollo.io People and Company enrichments

This summer, I've added two enrichments connected to the Apollo.io API. One for people and the other for companies.

Apollo.io People Enrichment

The enrichment is connected to Apollo.io People Enrichment. With at least a name and a company domain (or email address), Apollo returns all the business data for your contacts.

Among the returned values, you find:

  • Email Address
  • Phone Number
  • Title
  • Seniority
  • LinkedIn Profile URL
  • Address (city, state, country)
  • Company name, website, LinkedIn URL

Apollo's free tier is generous for API calls: you get 600 enrichments per day using their API. Create an account on Apollo.io, and get an API Key at https://developer.apollo.io/keys/.

Apollo.io Company Enrichment

In addition to the Apollo.io People Enrichment, Datablist now has an enrichment for company data using the Apollo.io API.

It takes a company domain (or URL) and returns:

  • Company Name
  • Website
  • LinkedIn URL
  • Twitter URL
  • Facebook URL
  • Crunchbase URL
  • AngelList URL
  • Address/Country
  • Phone Number
  • Industry
  • Founded Year
  • Number of employees

Notes on enrichments

Datablist Enrichments will be my next focus. Now that the foundation for data cleaning and data consolidation is done, I can move to the next layer.

For enrichments, I first plan a revamp of the "Enrichment Runner" to make it simpler to use and to better handle errors. Datablist will get connected to more third-party APIs to enrich people, email addresses, and companies. Some native premium enrichments, to be used with the Datablist Credits System, will be added as well.

Each data provider has its specificities: some work with LinkedIn URLs, others with email addresses, and some are best suited for the USA or Europe. Costs add up when you have to subscribe to each provider. Datablist will help you save money with those integrations.

Contact me if you want to share ideas and/or suggest integrations.

Generate PDF for a list of URLs

This enrichment takes a URL, opens a headless Chrome browser, and triggers a print. The result is saved and the download link is returned for each URL.

You can specify the page orientation.

Improvements

New domain output for the Free Email Validator

Datablist's free email validation service now returns the domain of each email address.

Combined with the "Business Email" output (returns True if the domain is not from a generic email provider (Gmail, Yahoo, etc.)), you can get company data from your email list with the Apollo.io Company Enrichment.

Convert timestamp to Datetime

A new data type conversion is available to get a Datetime from a Unix timestamp. A timestamp represents a date as the number of seconds elapsed since the Unix Epoch (January 1st, 1970, at 00:00 UTC). Datablist detects timestamps in seconds or milliseconds and returns a formatted Datetime.
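The conversion itself is easy to reproduce in Python. The sketch below guesses the unit from the magnitude of the number: anything at or above 1e11 is treated as milliseconds. That cutoff is an assumption chosen for the example (1e11 seconds is thousands of years in the future), not necessarily the rule Datablist applies.

```python
from datetime import datetime, timezone


def timestamp_to_datetime(ts: float) -> datetime:
    # Heuristic (assumption): large values are milliseconds.
    seconds = ts / 1000 if abs(ts) >= 1e11 else ts
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Both `timestamp_to_datetime(1698796800)` and `timestamp_to_datetime(1698796800000)` return November 1st, 2023 at 00:00 UTC.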

Improvement with Copy-Pasting

In spreadsheet tools, pasting tabulated data overwrites the cell's values. With Datablist, and its structured data and items, pasting data creates new items.

This is what users are expecting 90% of the time (I think). And still, copy-pasting to edit multiple cell values in bulk is great.

Datablist should be able to do both. A first iteration has been deployed: when you paste tabulated data, you can now edit several cells instead of creating new items.

For now, it only works when the pasted data has a single column. Datablist shows a confirmation dialog asking whether it must create new items or edit the current cells.

Another change has been released to improve what text is set to the clipboard on a "copy" action. If you perform a copy to clipboard (ctrl+c) and get something that doesn't feel right, please tell me.

Other Improvements & Fixes

  • Show memory error notification. To get fast interactions, Datablist uses a local database that lives inside your web browser. When importing a CSV file, Datablist stores the data in this database and synchronizes it with Datablist servers (when Cloud Syncing is enabled). Web browsers may prevent Datablist from storing data. This happens during private browsing with some web browsers, or when your hard drive is full. Datablist now shows an error notification when it can't store data locally.
  • Improve value uniqueness processing. It now also runs after cell editing and copy-pasting.
  • Fix import for CSV files with multiple similar headers
  • Import TXT files with a single line and only comma-separated values
  • Skip deleted properties during full-text search


May 2023

Clean and enrich your data with ChatGPT

ChatGPT is amazing. It's cheap and it brings real value for data cleaning, segmentation, or summarisation. I'm still scratching the surface of its potential with Datablist.

In May, I added 2 new enrichments with ChatGPT: "Ask ChatGPT" and "Classification with ChatGPT".

I'm curious about how to integrate it more with Datablist. If you have ideas, please share them with me 🙂

Ask ChatGPT

The "Ask ChatGPT" enrichment is simple: write a prompt and select an input property. Datablist sends a request for each of your items with a message using the prompt and the text from your item.

It uses the GPT-3.5 Turbo model and handles retries on ChatGPT rate limit errors.

Text Classification with ChatGPT

My favorite use of ChatGPT is text classification. Given a text, ask ChatGPT to assign it to a label.

For job titles, ChatGPT performs well at segmenting them into tech, marketing, sales, and operations. For locations, ChatGPT can classify them by continent.

The "Classification with ChatGPT" enrichment brings two interesting improvements over "Ask ChatGPT":

  • First, you just need to write the list of labels separated with commas, and Datablist writes the prompt for you
  • Second, it has a cache on top of ChatGPT. If you run it on items with the same input texts, it saves you some processing time (ChatGPT is slow), and it saves you ChatGPT tokens.
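The two improvements (a prompt built from a label list, and a cache keyed on the input text) can be sketched in Python. Everything here is hypothetical: the prompt wording, the function names, and the `ask_model` callable that stands in for the actual ChatGPT request.

```python
def build_prompt(labels: str) -> str:
    # "tech, marketing, sales" -> a classification instruction.
    label_list = [label.strip() for label in labels.split(",") if label.strip()]
    return "Classify the text into exactly one of these labels: " + ", ".join(label_list) + "."


def classify_all(texts, labels, ask_model):
    prompt = build_prompt(labels)
    cache = {}  # input text -> label already returned by the model
    results = []
    for text in texts:
        if text not in cache:
            # Only texts never seen before trigger a (slow, paid) request.
            cache[text] = ask_model(prompt, text)
        results.append(cache[text])
    return results
```

On a list where many items share the same job title, the number of requests drops to the number of distinct titles.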

Improved Email Address Validation

In May, I worked on the Email Address Validation enrichment.

I noticed the disposable domains list was not exhaustive. I've added a lot of new temp email providers. The enrichment now compares each email domain with a list of more than 50k junk domains.

Also, I've added two new outputs:

  • Business Email - A checkbox that returns true if the email domain doesn't belong to generic email providers (such as Gmail, Yahoo, etc.)
  • Processed - A checkbox that is set to true once the validation algorithm has processed the item. This is useful to filter your email list and avoid re-validating email addresses.

Data Synchronization Improvements

I've improved the cloud synchronization process:

  • After an import, your data will be synchronized faster to Datablist Cloud API.
  • The number of saved items is now visible during the synchronization with Datablist Cloud API.
  • When you connect to Datablist on a new web browser, or if you are a new user, an initial synchronization occurs to fetch data from Datablist Cloud API. Before, an "empty collection" message was displayed until the end of the data fetching. On large collections, where fetching takes some time, the "empty collection" message did not feel right. From now on, the collection items are refreshed directly after the first items are fetched.
  • Several synchronization issues have been fixed, and save conflicts are better handled.
  • And other bugs with data syncing have been fixed.

Improvements

Export duplicate groups

A top requested feature: exporting the duplicate items in a CSV or Excel file!

For some use cases, removing or merging duplicates in Datablist doesn't make sense. When you want to have a list of item ids to remove them from an external system (a database, a CRM, etc.), you expect a CSV with the list of item ids to delete.

Copy data from one property to another

This is a new data manipulation action. It copies values from one property to another one with an option to prevent the copy if the destination property already contains data.

Improvements & Fixes

  • On large collections, Undo/Redo caused page crashes after hitting a memory limit. Datablist keeps the previous data values in memory on bulk edit actions to allow undo operations. On a collection with 1 million items, that means keeping the previous values for 1 million items in memory... To prevent this, the Undo/Redo manager discards old undo operations when they take too much memory. This is not perfect. At some point, a real revision system will be implemented.
  • Fix the "Export Ready" counter when exporting selected items

April 2023

Deduplicate items across collections

I use Datablist to create lists of prospects. I have lists of companies from LinkedIn, a list from my user base, lists from scraping, company databases, etc.

All those lists have different properties. So, it doesn't make sense to create a single list to manage all my prospects. I like to keep them in different collections.

Until now, I couldn't check for duplicate leads across all of my prospect collections. From all the feedback I received, I was not alone in having this issue.

In April, I made big changes to the Duplicates Finder. I enabled deduplication across multiple collections and I moved the Duplicates Finder from an exact match algorithm to a probabilistic one.

I'm very confident this feature will help you deal with your lists of contacts the way it helps me. It's great to find engaged leads who appear in several communities. And to cross-check it with your user base.

You can check our updated Duplicates Finder documentation to learn more.

Improved deduplication algorithm

Match duplicate items that have empty values

Building a deduplication algorithm is complex. A brute-force algorithm doesn't scale well. A list of 200 000 items generates 200 000*199 999/2 = 19 999 900 000 unique item pairs.
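The pair count follows the usual "n choose 2" formula, n*(n-1)/2, and is easy to check in Python:

```python
from math import comb


def unique_pairs(n: int) -> int:
    # Each of the n items is paired with the n - 1 others,
    # and every pair is counted once, not twice.
    return n * (n - 1) // 2


print(unique_pairs(200_000))  # 19999900000
print(comb(200_000, 2))       # same result with the standard library
```

Doubling the list size roughly quadruples the number of pairs, which is why comparing every pair stops being practical on large lists.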

The previous "Duplicates Finder" algorithm was fast but worked only for exact matches. If you had a collection with leads and you ran the algorithm on the "names", "email addresses" and "company websites", it found duplicate items that had the same values.

If a lead had an empty company website, or no email address, the lead was often ignored.

With the new deduplication algorithm, the Duplicates Finder finds duplicate items even with some empty values. It computes a similarity score between items that works with incomplete data.

You can check our updated Duplicates Finder documentation to learn more.

Probabilistic similarity score

As I said above, the Duplicates Finder now uses a similarity score to find duplicate items. Datablist takes two items and calculates the similarity between them.

It opens a lot of possibilities to compare items that are not 100% similar. I've released two new algorithms to find duplicate items with minor differences.

The first one is the "Smart Algorithm":

  • It removes all spaces and punctuation characters (before, after, between words)
  • It matches words in different orders
  • It removes the URL protocol for URL comparison

For example:

Item Id | Full Name | Company Website
00001 | James-Bond | https://www.acme.com
00002 | bond james | http://www.acme.com
00003 | james bond |

These three items would all pop up as duplicate items.
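One way to picture this kind of matching is a "comparison key": two values are treated as equal when their keys are equal. The Python sketch below is an approximation written for this changelog, with a hypothetical `smart_key` helper, and it does not cover how empty values are scored.

```python
import re


def smart_key(value: str) -> str:
    text = value.strip().lower()
    # Drop the URL protocol ("http://", "https://", ...).
    text = re.sub(r"^[a-z][a-z0-9+.-]*://", "", text)
    # Split on spaces and punctuation, then sort so word order is ignored.
    words = re.findall(r"\w+", text)
    return " ".join(sorted(words))
```

"James-Bond", "bond james" and "james bond" all produce the key "bond james", and the two website values produce the same key once the protocol is removed.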

The second algorithm uses the "Metaphone" phonetic algorithm. It converts texts to codes to match similar-sounding words.

For example:

Item Id | Full Name | Company
00001 | Filip Dupon | google
00002 | Dupont-Philip | GOOGL
00003 | Dupond philippe | gogle

These three items would be flagged as duplicate items.

You can check our updated Duplicates Finder documentation to learn more.

Optimized duplicate group listing and merging for large lists

And one more thing, I've improved the Duplicates Finder results page to scale with thousands of duplicate groups. Previously, the page could freeze when you had a lot of items flagged in duplicate groups.

The new page loads the items on demand so it scales up to thousands of items.

A new "Don't process" action was added. It removes the duplicate group from the results listing. Skipped groups are ignored during the "Auto Merge" action.

New enrichments

Name Parser

Return the gender, country, and all name parts (First Name, Last Name, Title, etc.) from a person's full name.

Extract the name from an email address

Use probabilistic analysis to parse an email address and extract a first name and a last name.

Location Lookup

Return the City, Country, Latitude, and Longitude for a location. Read our new guide to extract the City and Country from a list of addresses.

Improvements & Fixes

  • Fix data type auto-detection for numbers with more than 22 digits. They will now be imported as Text.
  • Fix the issue with running enrichments before the credits balance is loaded
  • Fix the issue with running enrichments before the enrichment options are loaded
  • Change Payment Method and Password directly in your Datablist account

March 2023

Move items between collections

In March, I released a new feature to move items between two collections. Moving items is useful to clean and segment your data. You can move items once they are enriched, or split your master collection into sub-collections.

Read our documentation to learn how to move items between collections.

JavaScript code

Save JavaScript code into your code library

Writing JavaScript code is both complex and powerful. You can write JavaScript code to fill a property using data from the other properties (for example to set a "valid" property based on the value of other properties). Or you can edit your data with complex operations that would be impossible with simple spreadsheet formulas.

But rewriting your JavaScript code every time is error-prone. With the new "Code Library" released in March, you can save your JavaScript code in your account and run it directly.

Read our documentation or contact me if you need help writing JavaScript.

Call APIs from your JavaScript code

I've disabled the limitations on JavaScript code for standard users. You can now write JavaScript code to interact with external APIs using the fetch interface.

Check our documentation or contact me to discuss your use case.

Datablist API for standard users

Another new release helps you build complex workflows on Datablist: the opening of the Datablist API. The Datablist API is restricted to standard users.

It works with "Personal API Keys" that let you get access tokens to interact with Datablist API.

Please check our Developers' Documentation and our Postman collection.

Enrichments improvements

Save enrichment configuration

Previously, you had to set the enrichment settings and configuration every time you opened the enrichment drawer.

This was not ideal for day-to-day use. And you could make mistakes during the mapping.

Your settings and properties mappings are now saved in your browser. When you open an enrichment, the configuration will be automatically filled based on your previous run.

Settings with text values can be sensitive. Some enrichments use settings to pass an API key, for example. To prevent your setting values from being accessed, they are encrypted with a 256-bit key.

This feature is enabled by default. You can disable it by clicking the setting icon at the bottom of the enrichment drawer.

Overwrite items with enrichment results

Another improvement with Enrichments is the "Overwrite value" option. By default, Datablist doesn't edit your cell if it already contains data.

With this option enabled, the enrichment results will overwrite existing values.

New enrichments

Moz.com

If you are managing company leads, you will like the new "Moz.com" integration. It lets you process domains to get domain authority, the number of backlinks, etc. from Moz.

Entities Extractor

Extract company names, person names, or locations from any text. This action uses machine learning to process your data automatically.

The model is trained on texts in Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese, and Chinese.

GPS Coordinates Finder

This enrichment uses Bing Maps API to get Latitude and Longitude coordinates from an address.

Improve export for large collections

Datablist has a 1.5 million rows limit for CSV files. But you can import big CSV files by splitting them and performing multiple imports. There is no hard limit on the number of items a collection can store. It depends on your browser database.

I improved the export mechanism to work with collections containing several million items. You will now see a progress notification showing how many items have been collected for the export file.

And two options have been added to deal with exports of large collections. You can now set a count and an offset parameter to export your collection into several files.
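To plan a multi-file export, you need one (offset, count) pair per file. A small Python helper, hypothetical and independent of Datablist, computes them:

```python
def export_slices(total_items: int, count: int):
    # One (offset, count) pair per export file; the last file may be smaller.
    if count <= 0:
        raise ValueError("count must be a positive number")
    return [
        (offset, min(count, total_items - offset))
        for offset in range(0, total_items, count)
    ]
```

For a collection of 2.5 million items exported in files of 1 million items, this gives three exports: offset 0, offset 1 000 000, and offset 2 000 000 (the last one with 500 000 items).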

Improvements & Fixes

  • Improve LinkedInProfileFinder and fix throttling errors
  • Show how many items are currently processing during an action/enrichment run
  • Fix copy-pasting when the drawer is open
  • New Number to Text conversion in "Clean -> Text <=> Number"
  • Add "Line Break" delimiter for "Merge Properties"
  • New mathematical operations for numbers in BulkEdit: Add, Subtract, Multiply, Divide.
  • Fix sorting on native collection properties "createdAt" and "updatedAt"
  • Prevent running JavaScript Code if the preview raises an error
  • New Search engine in the documentation
  • New documentation page for "Run JavaScript"
  • Fix filtering on equal DateTime comparison