Linking datasets
20 minutes


In this tutorial, we will go over how to link multiple datasets together using Dedupe.io.

New to Dedupe.io?
We recommend going through our Intro to Dedupe.io tutorial first.

Our linking approach

Dedupe.io takes the approach of starting with one dataset, optionally de-duplicating it, and then linking additional datasets to it, one at a time.

This means that for any dataset you’ve already uploaded to Dedupe.io, you can link additional datasets to it based on a common set of fields.

Currently, linking is additive in Dedupe.io, meaning when you link two datasets together the clusters that we find in both datasets will be merged. In future releases, we will support linking that checks one dataset against another without merging them.
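
To make "additive" concrete, here is a minimal sketch in Python. It is not Dedupe.io's implementation. The function name, the cluster ids, and the record ids are all hypothetical, and it assumes clusters can be represented as a simple mapping from cluster id to a list of record ids.

```python
from collections import defaultdict


def link_additively(clusters, links):
    """Merge linked records from a second dataset into existing clusters.

    clusters: dict mapping a cluster id to the record ids already in it
              (hypothetical representation of the first dataset).
    links:    iterable of (new_record_id, cluster_id) pairs found by linking.
    Returns a new dict; the input is left unchanged.
    """
    merged = defaultdict(list)
    for cluster_id, record_ids in clusters.items():
        merged[cluster_id].extend(record_ids)
    for record_id, cluster_id in links:
        # Additive: the new record joins the existing cluster.
        merged[cluster_id].append(record_id)
    return dict(merged)


clusters = {"c1": ["a-1", "a-2"], "c2": ["a-3"]}
links = [("b-7", "c1"), ("b-9", "c2")]
result = link_additively(clusters, links)
```

After linking, each cluster holds records from both datasets, which is why the first dataset's clusters grow as more datasets are linked to it.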

Uploading multiple datasets

To start a data linking session, upload the first dataset you want to link by clicking the ‘Upload a new dataset’ button.

You’ll be taken to the Upload data page. Fill out the name and description as you would for any dataset.

Uploading de-duplicated data

Dedupe.io supports uploading datasets that are already de-duplicated (one row for each unique record). If this is the case, you can check the ‘This data is already de-duped’ checkbox. Checking this box skips the de-duplication steps and marks your dataset as de-duplicated in the system.

If your data is not already de-duplicated, you can follow our tutorial on de-duplicating one dataset.

Chicago vacant buildings and violations

In the example below, I am going to link Chicago Vacant and Abandoned Buildings Reported from 311 Service Calls (already de-duplicated) to Vacant and Abandoned Buildings - Violations. The goal is to find the buildings that have been reported as abandoned and have also received violations and, perhaps more importantly, the ones that haven’t received violations.

Below, I am starting a new dataset and checking the ‘This data is already de-duped’ checkbox.

Uploading an already de-duplicated file

That will take a few minutes to process, as it is about 30,000 rows. When it’s done, I can upload another dataset: the violations. For this dataset, I select Chicago Vacant and Abandoned Buildings from the drop-down list to indicate that I want to link to it.

Linking violations to Chicago Vacant and Abandoned Buildings

Once I’ve uploaded the data and it has been processed, we’ll move on to the next step.

Aligning fields to compare

Next, we will identify the fields in each dataset that we want Dedupe.io to pay attention to when finding matches. You’ll be shown a drop-down list of column names to pick from.

Next to that, pick an option from ‘Compare as’ to tell Dedupe.io how to compare values in that column. The Default comparator compares values based on how similar they are, character by character. Address automatically splits addresses into separate components, making them much easier to compare. Name does the same for person and company names.

Dedupe.io has several other kinds of comparators, which you can read about on the field comparators page.
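
To give a feel for what a character-by-character comparison does, here is a small sketch using Python's standard difflib module. This is a stand-in for illustration only. Dedupe.io's Default comparator may use a different string distance, and the function name char_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings.

    Illustrative only: uses difflib's matching-blocks ratio, which is
    not necessarily the measure Dedupe.io uses internally.
    """
    # Normalize case and surrounding whitespace before comparing.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


close = char_similarity("123 N Main St", "123 N Main Street")
far = char_similarity("123 N Main St", "456 W Oak Ave")
```

In this sketch, close comes out higher than far, because the first pair of addresses shares most of its characters in the same order. A comparison like this treats an address as plain text, which is the limitation the Address comparator works around by splitting addresses into components.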

In the case of my vacant buildings data, I want to link on address. However, the address is all in one field in one dataset and split across five fields in the other. We can combine these five fields together, shown below, and use the Address comparator to properly compare them.

Aligning fields to compare
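
Conceptually, combining fields amounts to joining the pieces of each row into a single address string. The sketch below shows the idea in Python. The five column names are hypothetical placeholders, not the actual column names in the Chicago datasets, and Dedupe.io does this for you in the interface.

```python
# Hypothetical column names for an address split across five fields.
ADDRESS_FIELDS = ["number", "direction", "street", "suffix", "zip"]


def combine_address(row, fields=ADDRESS_FIELDS):
    """Join the given fields of a row into one space-separated string.

    Missing, None, and blank values are skipped so they do not leave
    stray spaces in the result.
    """
    parts = []
    for field in fields:
        value = row.get(field)
        text = "" if value is None else str(value).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


row = {"number": 1234, "direction": "W", "street": "Madison",
       "suffix": "ST", "zip": None}
combined = combine_address(row)
```

Here combined is "1234 W Madison ST", which can then be compared against the single address field in the other dataset.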

Dedupe.io can compare multiple fields, which is what we recommend whenever possible. Add additional fields by clicking the blue ‘Add field to compare’ button in the lower right corner. In this case, though, all we have to match on is address, so we’ll stick with that and move on.

Train model

Next, Dedupe.io will take a sample from each of your datasets and pick a pair of records for you to review.

For this pair of records, Dedupe.io asks us, ‘Do these records refer to the same thing?’ We will answer ‘Yes’, ‘No’, or ‘Unsure’. Once we mark these records, Dedupe.io will find another pair for us to review and we’ll repeat the process.

Training Dedupe.io

Dedupe.io uses these responses to refine its understanding of your data. The more training you provide, the better the linking results will be. At a minimum, we need 10 positive and 10 negative responses to proceed.

You are, however, welcome to provide as much training as you’d like. Just know that Dedupe.io will learn a little less for each additional training pair you mark. Stopping at 50 yes and 50 no responses would be more than enough.
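
The minimum described above can be written as a simple check. This is a sketch, not Dedupe.io's code. The names are hypothetical, and it assumes that ‘Unsure’ responses count toward neither the positive nor the negative total.

```python
from collections import Counter

MIN_YES = 10  # minimum positive responses stated in this tutorial
MIN_NO = 10   # minimum negative responses stated in this tutorial


def ready_to_proceed(responses):
    """Return True once there are enough 'yes' and 'no' responses.

    responses: iterable of "yes", "no", or "unsure" strings.
    Assumption: "unsure" answers are ignored when counting.
    """
    counts = Counter(responses)
    return counts["yes"] >= MIN_YES and counts["no"] >= MIN_NO


enough = ready_to_proceed(["yes"] * 10 + ["no"] * 10 + ["unsure"] * 3)
not_enough = ready_to_proceed(["yes"] * 12 + ["no"] * 9)
```

Note that extra ‘Yes’ responses do not make up for missing ‘No’ responses: both minimums have to be met.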

Add to clusters

Once you’re done training and click the ‘Next’ button, Dedupe.io will take some time to apply your training to the rest of your data. This can take anywhere from several minutes for datasets under 100,000 rows to several hours for datasets with millions of rows.

Now that Dedupe.io has a good idea of the clusters that are in your data, it looks through all the individual records that have not yet been added to a cluster.

When it is finished, you will be able to continue on to the next step. Here, we will review these records and match them to one or more clusters.

In this example, we have an unmatched record in yellow and two clusters that look like good matches. We’ll keep the checkbox next to each cluster checked and click the ‘Match record to cluster(s)’ button.

Matching records

Automatically accepting the rest

After reviewing each record, we’ll be shown another one until we’ve gone through the entire queue. If we’re confident enough, we have the option to ‘Automatically judge the rest’. Clicking this will have Dedupe.io judge the rest of the records automatically and skip us ahead to the next and final step.

Warning: use caution when automatically judging many records. The records Dedupe.io asks you to match here are often the most ambiguous ones. We recommend reviewing as many of these records as possible for the best results.

You’re done! Browse your results and download your data

When Dedupe.io finishes processing, you’ll have the ability to browse your data.

For detailed instructions, read the section on browsing your results in the Getting started with Dedupe.io tutorial.
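
For the vacant buildings example, the downloaded results can answer the original question: which reported buildings have violations, and which do not. The sketch below assumes each downloaded row carries a cluster id and a marker for the dataset it came from. The keys cluster_id and source and the values "reports" and "violations" are hypothetical, so adjust them to match the columns in your own download.

```python
def split_by_violations(rows):
    """Group rows by cluster and split clusters by whether they have violations.

    rows: iterable of dicts with hypothetical keys "cluster_id" and "source",
          where source is "reports" or "violations".
    Returns (with_violations, without_violations) as sorted lists of cluster ids.
    """
    sources_by_cluster = {}
    for row in rows:
        sources_by_cluster.setdefault(row["cluster_id"], set()).add(row["source"])

    # Reported buildings that also matched at least one violation record.
    with_violations = sorted(
        cid for cid, sources in sources_by_cluster.items()
        if {"reports", "violations"} <= sources
    )
    # Reported buildings with no matching violation record.
    without_violations = sorted(
        cid for cid, sources in sources_by_cluster.items()
        if sources == {"reports"}
    )
    return with_violations, without_violations


rows = [
    {"cluster_id": 1, "source": "reports"},
    {"cluster_id": 1, "source": "violations"},
    {"cluster_id": 2, "source": "reports"},
    {"cluster_id": 3, "source": "violations"},
]
with_violations, without_violations = split_by_violations(rows)
```

In this toy data, cluster 1 is a reported building with a violation, cluster 2 is a reported building without one, and cluster 3 is a violation with no matching report, so it appears in neither list.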

Questions?

If you have any questions about this tutorial or about Dedupe.io, or are interested in signing up for our private beta, get in touch at dedupe@datamade.us.