Changelog

New features, improvements and fixes to Datablist.

March 13th, 2024

New Enrichments Experience

After building strong foundations for dealing with CSV files and data cleaning (deduplication, etc.), it's time to work on enrichments.

The vision is simple: the web is overwhelmed with external services to enrich companies/people, verify email addresses, guess the gender from a name, scrape URLs, etc. But it's a mess to combine all those services.

A way to do it is to use workflow automation tools (Zapier, Make, n8n) on top of spreadsheets. Yet, it is complex, error management is a mess, and it's mostly suited for event-based workflows.

Datablist aims to replace spreadsheet tools for list management (lead generation, lead scoring, product catalogs, customer management, company screening, etc.). A central hub with built-in enrichment integration.

Here are the latest developments to get enrichments as a first-class citizen in Datablist.

Enrichments Listing

Enrichments are listed in a new drawer. The top bar lets you filter between enrichments for "Companies", "People", "Translations", "Places", "AI (Artificial Intelligence)", and "URLs".

For each enrichment, the inputs and output properties are visible. The cost is displayed directly in the listing.

And a bookmark flag moves your favorite enrichments to the top.

Enrichment Runner

The "Enrichment Runner" is the screen to configure and run an enrichment. I've heard your feedback and the runner has been revamped.

Custom inputs with RichText editor

Imagine you have an enrichment with a "Full Name" input and you have "First Name" and "Last Name" in your collection. That's when you will be happy to use the new "Custom Input" feature.

You can write custom texts with variables from your properties. In the previous example, you would write "{{firstName}} {{lastName}}" to build "Full Name" input values.
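Under the hood, this kind of templating boils down to a simple placeholder substitution. Here is a minimal Python sketch (the `render_template` helper is illustrative, not Datablist's actual code):

```python
import re

def render_template(template: str, item: dict) -> str:
    """Replace {{propertyName}} placeholders with values from an item.
    Missing properties render as empty strings."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(item.get(m.group(1), "")),
        template,
    )

# Build a "Full Name" input value from two separate properties.
item = {"firstName": "Ada", "lastName": "Lovelace"}
full_name = render_template("{{firstName}} {{lastName}}", item)
# full_name == "Ada Lovelace"
```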

Auto-skip items with existing data

I want the default behavior to be the least risky. So you don't lose or overwrite data. With the new runner, the default behavior is to skip your items when there is already some data in the output properties.

For example, say you use a translation enrichment: a "Source" property holds the text to translate, and a "Target" property stores the translated text.

When you run the translation, it translates the source texts and populates the "Target" property.

Later, you add new items. The second time you run the translation enrichment, it skips all the items that already have text in the "Target" property, so only the new items are sent for translation.

This setting is available as the "Existing Data Rule". Other options let you edit only the empty cells, or overwrite existing data.
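The rule's behavior can be pictured as a small predicate deciding which items are sent to the enrichment (a hypothetical sketch; the function and rule names are mine, not Datablist's internals):

```python
def should_process(item: dict, output_props: list, rule: str = "skip") -> bool:
    """Decide whether an item is sent to the enrichment.

    "skip"      -> skip items that already have any output data (default)
    "fill"      -> process the item, writing only its empty cells
    "overwrite" -> always process and overwrite
    """
    if rule == "skip":
        return not any(item.get(prop) for prop in output_props)
    return True

items = [
    {"Source": "Bonjour", "Target": "Hello"},  # already translated
    {"Source": "Merci", "Target": None},       # new item
]
to_translate = [i for i in items if should_process(i, ["Target"])]
# Only the "Merci" item is sent to the translation service
```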

See new properties to be created

Understanding how enrichments work, with their "Settings", "Inputs", "Outputs", etc., can be complex.

The outputs section is even more complex. You can ignore an output, map it with an existing property, or create a new property to store the data.

With this new runner UI, I've made some visual changes to better understand what will happen with your enrichment outputs.

New properties are shown in green.

New properties are not created until you run the enrichment. You can change the output configuration without messing with your collection data structure.

Test on first 10 items

Enrichments can mess with your data, and some of them cost credits. You need to be sure the enrichment works as you expect.

Before running on all your current items, the enrichment will be run on the first 10 items. Once you have validated the results, it runs on the remaining items.

Better errors management

Dealing with external APIs can be a headache! You can get server errors (for example, throttling errors).

Datablist stops the enrichment when an error occurs, to prevent any collateral damage.

But once you have seen the error message, you might want to retry the enrichment on the remaining items. This is now possible. The runner keeps track of the item IDs that have been processed. A "Retry" button is available after an error happens. On Retry, Datablist will skip the already processed items.
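The retry logic can be sketched like this (illustrative Python; `run_with_retry` and the flaky `enrich` function are made up for the example):

```python
def run_with_retry(items, enrich, processed_ids):
    """Run an enrichment, recording each processed item ID so a retry
    after an error resumes where the previous run stopped."""
    for item in items:
        if item["id"] in processed_ids:
            continue  # already enriched in a previous attempt
        enrich(item)  # may raise (e.g. a throttling error)
        processed_ids.add(item["id"])

# Simulate an external API that fails once on the second item.
calls, state = [], {"failed": False}
def enrich(item):
    if item["id"] == 2 and not state["failed"]:
        state["failed"] = True
        raise RuntimeError("throttled")
    calls.append(item["id"])

items = [{"id": 1}, {"id": 2}, {"id": 3}]
done = set()
try:
    run_with_retry(items, enrich, done)
except RuntimeError:
    pass  # the user sees the error message...
run_with_retry(items, enrich, done)  # ...and hits "Retry": item 1 is skipped
# calls == [1, 2, 3] -- each item enriched exactly once
```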

Async Enrichments

Another big release with the asynchronous runner! Previously, enrichments could only run from the browser. This was enough for fast enrichments, or with a small number of items. But you had to keep your browser open to enrich a large collection. This prevented me from adding long-running enrichments such as email finder, email verification, scraping, etc.

Currently, it is not possible to choose to run a specific enrichment asynchronously. Some enrichments that take a long time to be processed have been configured to run asynchronously, and others are still triggered by the browser. In the coming weeks, you will be able to select how to run the enrichment.

This opens several future possibilities such as workflow building, etc.

Data Sources for Lead Generation

Data Sources are a new kind of enrichment! Classic enrichments run on each item to provide additional data. Whereas data sources create new items. This is perfect for lead generation.

Data Sources are available from the "Import" menu.

Start from a Google Search query

This one is self-explanatory. You write a Google Search query, and it returns the Google results as items. This source can scrape a maximum of 200 results with a free account, and up to 1000 results with the Standard plan.

Google is more powerful than you might think. With operators, it is possible to build complex queries and search on specific websites.

A search such as "saas human resources site:linkedin.com/company/*" will return SaaS companies in the HR space. You can search for LinkedIn profiles, job ads, etc.

Start from a Sitemap URL

This data source is technical but very powerful. Sitemaps are XML files listing all the webpage URLs a website has. For datablist.com, the sitemap lists all guides, blog posts, etc.

For company or people directories, job boards, blogs, etc., you can scrape the pages in a snap using the sitemap and the Bulk Scraper (or Links Scraper).

This data source plays nice with the "Unique value" setting available for a property. With the "Unique value" setting, you can detect a delta between two sitemap imports. Perfect for finding new job ads, newly published companies, or people.
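To picture how a sitemap delta works, here is a minimal sketch using only Python's standard library (the URLs and helper names are invented):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Extract all <loc> URLs from a sitemap XML document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)}

sitemap = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/job/1</loc></url>
  <url><loc>https://example.com/job/2</loc></url>
</urlset>"""

already_imported = {"https://example.com/job/1"}  # kept by "Unique value"
new_pages = sitemap_urls(sitemap) - already_imported
# new_pages contains only the new job ad
```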

If you need help implementing a Lead Generation workflow using sitemaps, just contact me.

New Enrichments

LinkedIn Profile Scraper

Extract public information from LinkedIn profiles. This enrichment loads and parses LinkedIn profile pages to get data. An option lets you either fetch profiles in real time only or allow cached profile data.

Email Verification Premium

Complete email verification service, from syntax validation to checking that the mailbox exists and can receive emails.

Email Finder

Find a professional email address using first name, last name, and company info (name or domain). Email addresses are verified and you pay only for the emails found.

Bulk Scraper

Scrape URLs with CSS selectors. Use the proxy option to scrape protected webpages, and configure multiple selectors to scrape several texts.

Links & Email Addresses Scraper

Scrape a page and search for LinkedIn member/company profile URLs, Email Addresses, and Instagram Profiles.

Apollo People Search

Search one or more profiles using Apollo.io. Define matching Job Titles, Seniority, and company domains and get profiles.

PeopleDataLabs Person Search

Search one or more profiles using PeopleDataLabs' powerful Elasticsearch query language. Use variables from your items to build complex queries.

Instagram Profile Scraper

Extract public information from Instagram profiles in bulk. This enrichment loads and scrapes Instagram Profile pages to get data.

Find Company domains from Company names

Return the domain matching the company name. When several domains match, the domain with the most traffic is returned.

Detect Language from a Text

Return the language code and name by analyzing a text.

Duplicates Finder

Two improvements have been made to the Duplicates Finder.

The first is a link to automatically combine or drop the remaining conflicting properties.

The links are available after a first "Auto-Merge" that returns the conflicting properties.

The second improvement is a new button to download the list of changes from duplicates merging. The change list contains the modification made to each item (updated or deleted), plus two columns for each property: "Previous {property name}" and "Destination {property name}".

Extract Menu

A new "Extract" menu has been added in the collection header. You can extract email addresses, tags, domains, etc. from texts.

Improved Splitting Property tool

The "Split Property" has been improved.

First, you no longer need to explicitly set the number of properties to create. Now, an "analysis" step scans your first 2000 items to detect the best number of properties to create.

Second, a new option is now available to group split terms by name.

New Filters

Startswith, Endswith, and RegEx filtering on texts

Startswith, Endswith, and RegEx filters are now available on texts.

RegEx expressions are powerful when you master them. Perfect for finding items that match a pattern (phone number validation, URLs).

Check Data Filtering documentation.

Relative filters on DateTime

You can now filter dates by comparing them to the current day.

You define three parts:

  • Next or Last
  • A number
  • A duration term: hours, days, months, years

For example: "Last 2 days".
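These three parts translate naturally into a date-window predicate. A minimal sketch (handling only timedelta-friendly units; months and years would need calendar arithmetic):

```python
from datetime import datetime, timedelta

def relative_filter(direction: str, amount: int, unit: str, now: datetime):
    """Build a predicate for relative date filters like "Last 2 days"."""
    delta = timedelta(**{unit: amount})
    if direction == "Last":
        return lambda dt: now - delta <= dt <= now
    return lambda dt: now <= dt <= now + delta  # "Next"

now = datetime(2024, 3, 13, 12, 0)
last_2_days = relative_filter("Last", 2, "days", now)
last_2_days(datetime(2024, 3, 12, 9, 0))   # True: within the window
last_2_days(datetime(2024, 3, 10, 9, 0))   # False: too old
```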

Map Extract and Convert results into an existing property

Previously, the extract and convert tools returned their results in new properties. So, after adding new items, you couldn't re-run the tools on them without creating yet more properties.

You can now select if the results go to a new property or an existing one. Only compatible properties are available. If you convert Text to DateTime, you can only map the result property to existing DateTime properties.

Misc

  • Show a tooltip on the preview cells with text overflow
  • Allow Number and Checkbox properties for RichText variables
  • Add "Sum" in Calculations
  • Create a new property with the keyword shortcut "p" on a collection page
  • Shortcuts to filter from the "Distinct Value" calculation
  • Convert DateTime to Text
  • Handle multiple date formats for Text to DateTime conversion
  • Shortcut to BulkEdit from the column menu
  • Allow Bulk Edit on DateTime properties

Bug Fixes

  • Fix error when editing a collection name and directly switching to another collection
  • Fix phone numbers (+XXXX) that were imported as numbers. CSV columns with texts in the format "+XXXXX" (plus sign and digits) with at least 8 digits are kept as Text.
  • Fix loading items issue when switching between collections quickly. A "loading" text was displayed and the items didn't load.


November 1st, 2023

Calculations

You can now run calculations on property values. Calculations are accessible from a property column menu.

Datablist runs the calculation in the "current view". It takes the items in this order:

  • If you have selected items in your collection, it will process them.
  • If you have a filter or a full-text search term, it will process the filtered items.
  • Otherwise, it will process all your collection items.

Calculations available for all data types:

  • Count Empty - How many items have an empty value for the property.
  • Count Filled - How many items have a value for the property.

Other calculations depend on the property data types such as Text or Number.

Calculation available for text-based data types:

  • Characters count - Return the total number of characters. Leading and trailing spaces are not counted; spaces between words are.
  • Words count - Return the number of words found in the texts.
  • Count distinct values - Return facets for a property with how many times each value appears. This is great for aggregating limited-choice values (countries, statuses, etc.).
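"Count distinct values" is essentially a frequency count. A sketch of the idea in Python:

```python
from collections import Counter

def count_distinct(items, prop):
    """Facet a property: each distinct value with how many times it
    appears, ignoring empty cells."""
    return Counter(item[prop] for item in items if item.get(prop))

items = [
    {"country": "FR"}, {"country": "US"},
    {"country": "FR"}, {"country": None},
]
facets = count_distinct(items, "country")
# facets == Counter({"FR": 2, "US": 1}) -- the empty cell is ignored
```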

For number-based data types:

  • Min - Return the lowest value for the property.
  • Max - Return the highest value for the property.
  • Average - Return the sum of values divided by the number of non-empty values.

Check the calculations documentation to know more.

Filter Groups

Data Filtering has been improved with "Filter Groups".

With Filter Groups, you can create complex filters with different filtering operations. Filtering operations define how filters are combined. With "AND", an item must pass all conditions. With "OR", an item passes once one of the filters returns true.
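The AND/OR combination can be pictured as composing predicates. A minimal sketch (the helper names are illustrative):

```python
def filter_group(filters, op="AND"):
    """Combine filter predicates: "AND" requires every filter to pass,
    "OR" passes as soon as one filter returns true."""
    combine = all if op == "AND" else any
    return lambda item: combine(f(item) for f in filters)

is_fr = lambda item: item.get("country") == "FR"
has_email = lambda item: bool(item.get("email"))

fr_with_email = filter_group([is_fr, has_email], op="AND")
fr_or_email = filter_group([is_fr, has_email], op="OR")

fr_or_email({"country": "FR"})    # True: one condition is enough
fr_with_email({"country": "FR"})  # False: the email filter fails
```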

Filter Groups are compatible with Saved Filters.

Duplicate Finder Improvements

Select a different algorithm for each property

Until now, a single data-matching algorithm was selected before the deduplication process. Internally, Datablist checked each property's data type to apply the selected algorithm on compatible properties, and fell back to Exact matching on the others (e.g. Date, Checkbox, Number).

Now, each property used for deduplication is listed in the data-matching algorithm step.

Compatible algorithms are listed according to their data type. And options only apply to the property.

For example, two properties might use a fuzzy matching algorithm and have different distance thresholds.

Ignore the case in the Exact algorithm

By default, Datablist Duplicates Finder is case-insensitive. But in some cases, you need to match duplicate values only when their case matches.

A new option is available for the "Exact" Algorithm to be case-sensitive.

Master Item Rule selection

After the data matching step, an important part of deduplication is duplicate merging. With the auto-merge algorithm, Datablist selects a master item, merges the values from the other items in it, and deletes all but the master item.

By default, the elected master item is the one with the most data.

A new setting has been added in the auto-merging assistant to change this master item selection.

Two new rules are now available:

  • Last Updated - This rule chooses the item based on the newest modified date.
  • First Created - This rule chooses the item based on the oldest creation date.

During this development cycle, the "Most Complete" default rule has also been improved. Until now, the rule checked how many properties had data. When two items had the same number of properties with data, it took the last created item.

Now, for two items with the same number of properties with data, it also checks the text length.

For two items such as:

First Name | Last Name | Notes

John | Doe | A great man.

John | Doe | A great man. Remember to contact him.

The second one will be selected as the master item. The "Notes" text is longer for the second item.
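The tie-breaking logic can be sketched as a two-part sort key (an illustrative sketch, not Datablist's actual rule implementation):

```python
def completeness_key(item: dict):
    """Sort key for a "Most Complete"-style rule: first the number of
    filled properties, then total text length as a tie-breaker."""
    filled = sum(1 for v in item.values() if v not in (None, ""))
    text_len = sum(len(v) for v in item.values() if isinstance(v, str))
    return (filled, text_len)

def pick_master(items):
    return max(items, key=completeness_key)

a = {"First Name": "John", "Last Name": "Doe", "Notes": "A great man."}
b = {"First Name": "John", "Last Name": "Doe",
     "Notes": "A great man. Remember to contact him."}
master = pick_master([a, b])
# master is b: same number of filled properties, but longer "Notes" text
```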

Normalize street names

In Data Cleaning, normalization ensures you have a uniform format across all your data. Normalization reduces errors during deduplication and you get a consistent view of your data.

I have several built-in normalizations in mind for later:

  • Company name normalization to remove suffixes such as "Inc." or "GmbH".
  • People name normalization to clean nicknames, deal with initials, etc.

Last month, I released the first normalization algorithm to deal with street names written in English.

The "Normalize Street Name" algorithm deals with abbreviations (St. == St == Street), directional words (N 45 == North 45), etc.
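The idea behind such a normalization can be sketched with a small abbreviation map (a tiny illustrative subset, not Datablist's actual rules):

```python
# Illustrative subset of street-name abbreviations.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "n": "north", "s": "south"}

def normalize_street(name: str) -> str:
    """Expand abbreviations so variants of the same street name
    normalize to one canonical string."""
    words = name.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

normalize_street("45 N Main St.")         # "45 north main street"
normalize_street("45 North Main Street")  # the same string
```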

Other Improvements & Fixes

  • Option to auto-generate column names during import for files without headers.
  • Fix Excel export of selected items (and duplicate group downloads).
  • Fix auto merging on properties with punctuation differences.
  • Show how many duplicate groups have been merged during the auto-merge process.
  • Auto-update the disposable email provider domain list, and add Stop Forum Spam as a new source.
  • Fix anonymous collection import for collections with more than 10k items.
  • Auto-open the DateTime picker on cell edit.
  • Show data loss warning every 48 hours for collections not synced to the cloud (anonymous, or free account with more than 1000 items per collection).


August 22, 2023

Datablist Extractor: Extract domains, email addresses, mentions, etc.

With Datablist Extractor, you can now extract the domains from a list of email addresses, or find all URLs in texts.

Domains, Emails, URLs, mentions (@xx), tags (#xx), etc. are structured entities to use later to enrich a company, a contact, or websites.

This was ranked high in the requested features. And it will play nice with future enrichments (see "Notes on enrichments" below).

For the first release, the following extractors are available:

  • Extract the domain from an email address
  • Extract the domain from a URL
  • Extract URL(s) from a text
  • Extract mentions (ex: @name) from a text
  • Extract tags (ex: #string) from a text
  • Extract emails from a text

Feel free to contact me if you need other extractors.

Datablist Extractor is available from the "Edit" button.

Deduplication with Fuzzy Matching

Datablist Duplicates Finder is getting better with fuzzy matching. Fuzzy comparisons work by calculating the similarity between two strings with a distance function; a threshold lets you decide when two strings should be considered similar.

Fuzzy matching is perfect to find duplicate leads with people or company name typos. Or to find items with the same postal addresses written with variations.

Datablist implements two distance algorithms.

The threshold goes from 20 to 100, where 100 means an exact match. The default value is 80.
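To picture how a threshold works, here is a sketch using Python's difflib as a stand-in distance function (Datablist's actual algorithms may differ):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Similarity between two strings on a 0-100 scale."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def is_fuzzy_match(a: str, b: str, threshold: int = 80) -> bool:
    return similarity(a, b) >= threshold

is_fuzzy_match("Acme Corporation", "Acme Corproation")  # True: just a typo
is_fuzzy_match("Acme Corporation", "Globex")            # False
```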

Apollo.io People and Company enrichments

This summer, I've added two enrichments connected to the Apollo.io API. One for people and the other for companies.

Apollo.io People Enrichment

The enrichment is connected to Apollo.io People Enrichment. With at least a name and a company domain (or email address), Apollo returns all the business data for your contacts.

Among the returned values, you find:

  • Email Address
  • Phone Number
  • Title
  • Seniority
  • LinkedIn Profile URL
  • Address (city, state, country)
  • Company name, website, LinkedIn URL

Apollo's free tier is generous for API calls: you get 600 enrichments per day. Create an account on Apollo.io, and get an API Key at https://developer.apollo.io/keys/.

Apollo.io Company Enrichment

In addition to the Apollo.io People Enrichment, Datablist now has an enrichment for company data using the Apollo.io API.

It takes a company domain (or URL) and returns:

  • Company Name
  • Website
  • LinkedIn URL
  • Twitter URL
  • Facebook URL
  • Crunchbase URL
  • AngelList URL
  • Address/Country
  • Phone Number
  • Industry
  • Founded Year
  • Number of employees

Notes on enrichments

Datablist Enrichments will be my next focus. Now that the foundation for data cleaning and data consolidation is done, I can move to the next layer.

For enrichments, first I see a revamp of the "Enrichment Runner" to make it simpler to use and to better handle errors. Datablist will get connected to more third-party APIs to enrich people, email addresses, and companies. As well as some native premium enrichments to be used with Datablist Credits System.

Each data provider has its specificities: some work with LinkedIn URLs, others with email addresses, and some are best suited for the USA or Europe. Costs add up when you have to subscribe to each provider. Datablist will help you save money with those integrations.

Contact me if you want to share ideas and/or suggest integrations.

Generate PDF for a list of URLs

This enrichment takes a URL, opens a headless Chrome browser, and triggers a print. The result is saved and a download link is returned for each URL.

You can specify the page orientation.

Improvements

New domain output for the Free Email Validator

Datablist's free email validation service now returns the domain for each address in your email list.

Combined with the "Business Email" output (returns True if the domain is not from a generic email provider (Gmail, Yahoo, etc.)), you can get company data from your email list with the Apollo.io Company Enrichment.

Convert timestamp to Datetime

A new data type conversion is available to get a Datetime from a Unix timestamp. A timestamp represents a date as the number of seconds since the Unix Epoch (January 1st, 1970, UTC). Datablist detects timestamps in seconds or milliseconds and returns a formatted Datetime.
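The seconds-vs-milliseconds detection can be sketched like this (the 13-digit heuristic is an assumption, not necessarily Datablist's exact rule):

```python
from datetime import datetime, timezone

def timestamp_to_datetime(ts: float) -> datetime:
    """Convert a Unix timestamp to an aware UTC datetime. Values that
    look like milliseconds (13+ digits) are scaled down to seconds."""
    if abs(ts) >= 1e12:  # heuristic: this is a millisecond timestamp
        ts /= 1000.0
    return datetime.fromtimestamp(ts, tz=timezone.utc)

timestamp_to_datetime(1710288000)     # seconds
timestamp_to_datetime(1710288000000)  # milliseconds -> the same instant
```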

Improvement with Copy-Pasting

In spreadsheet tools, pasting tabulated data overwrites the cell's values. With Datablist, and its structured data and items, pasting data creates new items.

This is what users are expecting 90% of the time (I think). And still, copy-pasting to edit multiple cell values in bulk is great.

Datablist should be able to do both. A first iteration has been deployed: when the pasted data contains a single column, Datablist shows a confirmation dialog asking whether to create new items or edit the current cells.

Another change has been released to improve what text is set to the clipboard on a "copy" action. If you perform a copy to clipboard (ctrl+c) and get something that doesn't feel right, please tell me.

Other Improvements & Fixes

  • Show memory error notification. To get fast interactions, Datablist uses a local database that lives inside your web browser. When importing a CSV file, Datablist stores the data in this database and synchronizes it with Datablist servers (when Cloud Syncing is enabled). Web browsers may prevent Datablist from storing data. This happens during private browsing with some web browsers, or when your hard drive is full. Datablist now shows an error notification when it can't store data locally.
  • Improve unique-value processing; it now also runs after cell edits and copy-pasting.
  • Fix import for CSV files with multiple similar headers
  • Import TXT files with a single line and only comma-separated values
  • Skip deleted properties during full-text search


May 2023

Clean and enrich your data with ChatGPT

ChatGPT is amazing. It's cheap and it brings real value for data cleaning, segmentation, or summarisation. I'm still scratching the surface of its potential with Datablist.

In May, I added 2 new enrichments with ChatGPT: "Ask ChatGPT" and "Classification with ChatGPT".

I'm curious about how to integrate it more with Datablist. If you have ideas, please share them with me 🙂

Ask ChatGPT

The "Ask ChatGPT" enrichment is simple: write a prompt and select an input property. Datablist sends a request for each of your items with a message combining the prompt and the text from your item.

It uses the GPT-3.5 Turbo model and handles retries on ChatGPT rate-limit errors.

Text Classification with ChatGPT

My favorite use of ChatGPT is text classification. Given a text, ask ChatGPT to assign it to a label.

For job titles, ChatGPT performs well at segmenting them between tech, marketing, sales, and operations. For locations, ChatGPT can classify them between continents.

The "Classification with ChatGPT" enrichment brings two interesting improvements over the "Ask ChatGPT":

  • First, you just need to write the list of labels separated by commas, and Datablist writes the prompt for you
  • Second, it has a cache on top of ChatGPT. If you run it on items with the same input texts, it saves you some processing time (ChatGPT is slow), and it saves you ChatGPT tokens.
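The caching idea can be sketched in a few lines (illustrative; `ask_model` stands in for the real ChatGPT call):

```python
def classify_with_cache(texts, labels, ask_model, cache=None):
    """Classify texts, calling the model only once per distinct input.
    `ask_model(text, labels)` stands in for the ChatGPT request."""
    cache = {} if cache is None else cache
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = ask_model(text, labels)
        results.append(cache[text])
    return results

calls = []
def fake_model(text, labels):
    calls.append(text)  # track how many real requests are made
    return labels[0]

classify_with_cache(["CTO", "CTO", "CMO"], ["tech", "marketing"], fake_model)
# Only 2 model calls for 3 items: the second "CTO" is answered from the cache
```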

Improved Email Address Validation

In May, I worked on the Email Address Validation enrichment.

I noticed the disposable domains list was not exhaustive. I've added a lot of new temp email providers. The enrichment now compares each email domain with a list of more than 50k junk domains.

Also, I've added two new outputs data:

  • Business Email - A checkbox that returns true if the email domain doesn't belong to generic email providers (such as Gmail, Yahoo, etc.)
  • Processed - A checkbox that is set to true once the validation algorithm has processed the item. This is useful for filtering your email list to avoid re-validating email addresses.

Data Synchronization Improvements

I've improved the cloud synchronization process:

  • After an import, your data will be synchronized faster to Datablist Cloud API.
  • The number of saved items is now visible during the synchronization with Datablist Cloud API (see image below).
  • When you connect to Datablist on a new web browser, or if you are a new user, an initial synchronization fetches your data from Datablist Cloud API. Before, an "empty collection" message was displayed until the end of the data fetching. On large collections, where fetching takes some time, the "empty collection" message felt wrong. From now on, the collection items are refreshed as soon as the first items are fetched.
  • Several synchronization issues have been fixed, and save conflicts are handled better.
  • Other bugs with data syncing have been fixed.

Improvements

Export duplicate groups

A top requested feature: exporting the duplicate items in a CSV or Excel file!

For some use cases, removing or merging duplicates in Datablist doesn't make sense. When you want to remove duplicates from an external system (a database, a CRM, etc.), you expect a CSV with the list of item IDs to delete.

Copy data from one property to another

This is a new data manipulation action. It copies values from one property to another one with an option to prevent the copy if the destination property already contains data.

Improvements & Fixes

  • On large collections, Undo/Redo caused page crashes due to a memory limit. Datablist keeps the previous data values in memory on bulk edit actions to allow the undo operation. On a 1 million item collection, that means keeping the previous values for 1 million items in memory... To prevent this, the Undo/Redo manager now discards old undo operations when they take too much memory. This is not perfect. At some point, a real revision system will be implemented.
  • Fix the "Export Ready" counter when exporting selected items

April 2023

Deduplicate items across collections

I use Datablist to create lists of prospects. I have lists of companies from LinkedIn, a list from my user base, lists from scraping, company databases, etc.

All those lists have different properties. So, it doesn't make sense to create a single list to manage all my prospects. I like to keep them in different collections.

Until now, I couldn't check duplicate leads across all of my prospect collections. From all the feedback I received, I was not alone in having this issue.

In April, I made big changes to the Duplicates Finder. I enabled deduplication across multiple collections and I moved the Duplicates Finder from an exact match algorithm to a probabilistic one.

I'm very confident this feature will help you deal with your lists of contacts the way it helps me. It's great to find engaged leads who appear in several communities. And to cross-check it with your user base.

You can check our updated Duplicates Finder documentation to learn more.

Improved deduplication algorithm

Match duplicate items that have empty values

Building a deduplication algorithm is complex. A brute-force algorithm doesn't scale well: a list of 200,000 items generates 200,000 × 199,999 / 2 = 19,999,900,000 unique item pairs.
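For reference, that pair count is the standard n(n-1)/2 formula:

```python
def pair_count(n: int) -> int:
    """Unique item pairs a brute-force comparison must check: n(n-1)/2."""
    return n * (n - 1) // 2

pair_count(200_000)  # 19_999_900_000
```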

The previous "Duplicates Finder" algorithm was fast but worked only for exact matches. If you had a collection of leads and ran the algorithm on "names", "email addresses", and "company websites", it found duplicate items that had exactly the same values.

If a lead had an empty company website, or no email address, the lead was often ignored.

With the new deduplication algorithm, the Duplicates Finder finds duplicate items even with some empty values. It computes a similarity score between items that works with incomplete data.

You can check our updated Duplicates Finder documentation to learn more.

Probabilistic similarity score

As I said above, the Duplicates Finder now uses a similarity score to find duplicate items. Datablist takes two items and calculates the similarity between them.

It opens a lot of possibilities to compare items that are not 100% similar. I've released two new algorithms to find duplicate items with minor differences.

The first one is the "Smart Algorithm":

  • It removes all spaces and punctuation characters (before, after, between words)
  • It matches words in different orders
  • It removes the URL protocol for URL comparison

For example:

Item Id | Full Name | Company Website
00001 | James-Bond | https://www.acme.com
00002 | bond james | http://www.acme.com
00003 | james bond |

Would all pop up as duplicate items.
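A sketch of such a normalization key (illustrative, not Datablist's actual implementation):

```python
import re
from urllib.parse import urlparse

def smart_key(value: str) -> str:
    """Normalize a value for comparison: strip the URL protocol and
    "www.", drop punctuation and spaces, and sort words so word order
    doesn't matter."""
    if value.startswith(("http://", "https://")):
        parsed = urlparse(value)
        value = parsed.netloc.removeprefix("www.") + parsed.path
    words = re.findall(r"[a-z0-9]+", value.lower())
    return "".join(sorted(words))

smart_key("James-Bond") == smart_key("bond james")                     # True
smart_key("https://www.acme.com") == smart_key("http://www.acme.com")  # True
```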

The second algorithm uses the "Metaphone" phonetic algorithm. It converts texts to codes to match similar-sounding words.

For example:

Item Id | Full Name | Company
00001 | Filip Dupon | google
00002 | Dupont-Philip | GOOGL
00003 | Dupond philippe | gogle

Would be flagged as duplicate items.

You can check our updated Duplicate Finder documentation to learn more.

Optimized duplicate group listing and merging for large lists

And one more thing: I've improved the Duplicates Finder results page to scale with thousands of duplicate groups. Before, the page could freeze when many items were flagged in duplicate groups.

The new page loads items on demand, so it scales to thousands of items.

A new "Don't process" action was added. It removes the duplicate group from the results listing. Skipped groups are ignored during the "Auto Merge" action.

New enrichments

Name Parser

Return the gender, country, and all name parts (First Name, Last Name, Title, etc.) from a person's full name.

Extract the name from an email address

Use probabilistic analysis to parse an email address and extract a first name and a last name.

Location Lookup

Return the City, Country, Latitude, and Longitude for a location. Read our new guide to extract the City and Country from a list of addresses.

Improvements & Fixes

  • Fix data type auto-detection for numbers with more than 22 digits. They are now imported as Text.
  • Fix the issue with running enrichments before the credits balance is loaded
  • Fix the issue with running enrichments before the enrichment options are loaded
  • Change Payment Method and Password directly in your Datablist account