In this tutorial, we will go over how to link multiple datasets together using Dedupe.io.
New to Dedupe.io?
We recommend going through our Intro to Dedupe.io tutorial first.
Dedupe.io takes the approach of starting with one dataset, optionally de-duplicating it, and then linking additional datasets to it, one at a time.
This means that for any dataset you’ve already uploaded to Dedupe.io, you can link additional datasets to it based on a common set of fields.
Currently, linking is additive in Dedupe.io, meaning when you link two datasets together the clusters that we find in both datasets will be merged. In future releases, we will support linking that checks one dataset against another without merging them.
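To make "additive" concrete, here is a minimal sketch of what merging linked records into existing clusters means. It is only an illustration, not Dedupe.io's internal code: the function name `link_additively`, the cluster ids, and the record ids are all hypothetical.

```python
# Hypothetical illustration of additive linking; not Dedupe.io's implementation.
def link_additively(clusters, links):
    """Return a copy of `clusters` with linked records merged in.

    clusters: dict mapping cluster id -> set of record ids from the first dataset
    links:    iterable of (new_record_id, cluster_id) pairs found while linking
    """
    # Copy each member set so the original clusters are left untouched.
    merged = {cid: set(members) for cid, members in clusters.items()}
    for record_id, cluster_id in links:
        merged[cluster_id].add(record_id)
    return merged


# Made-up ids: two clusters of 311 reports and two linked violation records.
buildings = {"c1": {"311-0001", "311-0002"}, "c2": {"311-0003"}}
violation_links = [("viol-17", "c1"), ("viol-42", "c1")]
merged = link_additively(buildings, violation_links)
```

After the merge, cluster `c1` holds records from both datasets, while `c2` still contains only its original record.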
To start a data linking session, upload the dataset you want to link by clicking the ‘Upload a new dataset’ button.
You’ll be taken to the Upload data page. Fill out the name and description as you would for any dataset.
Uploading de-duplicated data
Dedupe.io supports uploading datasets that are already de-duplicated (one row for each unique record). If this is the case, you can check the ‘This data is already de-duped’ checkbox. Checking this will skip all the Dedupe.io steps and mark your dataset as de-duplicated in the system.
If your data is not already de-duplicated, you can follow our tutorial on de-duplicating one dataset.
Chicago vacant buildings and violations
In the example below, I am going to link Chicago Vacant and Abandoned Buildings Reported from 311 Service Calls (already de-duplicated) with Vacant and Abandoned Buildings - Violations. The goal is to find the buildings that have been reported as abandoned and have also received violations and, perhaps more importantly, to find the ones that haven’t received violations.
Below, I am starting a new dataset and checking the ‘This data is already de-duped’ checkbox.
Uploading an already de-duplicated file
That will take a few minutes to process, as it is about 30,000 rows. When it’s done, I can upload another dataset: the violations. For this dataset, I am going to indicate that I want to link it to Chicago Vacant and Abandoned Buildings from the drop down list.
Linking violations to Chicago Vacant and Abandoned Buildings
Once I’ve uploaded the data and it has been processed, we’ll move on to the next step.
Next, we will identify the fields in each dataset that we want Dedupe.io to pay attention to for finding duplicates. You’ll be shown a drop down list of column names to pick from.
Next to that, use the Compare as list to tell Dedupe.io how to compare values in that column. The Default comparator compares values based on how similar they are, character by character. Address automatically splits an address into separate components, making them much easier to compare. Name does the same for person and company names.
Dedupe.io has several other kinds of comparators, which you can read about on the field comparators page.
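As a rough intuition for character-by-character comparison, the sketch below scores string similarity with Python's standard `difflib` module. This is a stand-in for illustration only; Dedupe.io's Default comparator may use a different distance measure, and the helper name `character_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def character_similarity(a, b):
    """Return a score between 0 and 1 based on matching runs of characters."""
    # Lower-case both values so differences in case do not count as mismatches.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Made-up addresses: identical apart from case, abbreviated, and unrelated.
same = character_similarity("123 N Main St", "123 N MAIN ST")
close = character_similarity("123 N Main St", "123 North Main Street")
far = character_similarity("123 N Main St", "9800 S Halsted Ave")
```

The abbreviated address scores lower than the identical one even though both describe the same place. That gap is why a comparator that splits addresses into components can do better than plain character matching.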
In the case of my vacant buildings data, I want to link on address. However, the address is all in one field in one dataset and split across five fields in the other. We can combine these five fields, as shown below, and use the Address comparator to properly compare them.
Aligning fields to compare
Dedupe.io can compare multiple fields, which is what we recommend whenever possible. Add additional fields by clicking the blue ‘Add field to compare’ button in the lower right corner. In this case, though, all we have to match on is address, so we’ll stick with that and move on.
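Combining split address fields into a single value, as described above, could be sketched like this. The column names in `ADDRESS_FIELDS` are hypothetical placeholders, not the actual column names in the Chicago data, and the code only illustrates the idea of joining fields before comparison.

```python
# Hypothetical column names; substitute the real ones from your dataset.
ADDRESS_FIELDS = ["number", "direction", "street", "suffix", "zip"]


def combine_address(row, fields):
    """Join the non-empty address components of a row into one string."""
    # Treat missing or None values as empty, and trim stray whitespace.
    parts = (str(row.get(name) or "").strip() for name in fields)
    return " ".join(part for part in parts if part)


# A made-up row with an empty zip field.
row = {"number": "1234", "direction": "W", "street": " Fullerton",
       "suffix": "Ave", "zip": ""}
combined = combine_address(row, ADDRESS_FIELDS)
```

Skipping empty components avoids doubled spaces in the combined value, which would otherwise look like a difference to a character-based comparison.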
Next, Dedupe.io will take a sample from each of your datasets and pick two random records for you to review.
For this pair of records, Dedupe.io asks us, ‘Do these records refer to the same thing?’ We will answer ‘Yes’, ‘No’, or ‘Unsure’. Once we mark these records, Dedupe.io will find another pair for us to review and we’ll repeat the process.
Dedupe.io uses these responses to refine its understanding of your data. The more training you provide, the better the linking results will be. At a minimum, we need 10 positive and 10 negative responses to proceed.
You are, however, welcome to provide as much training as you’d like. Just know that Dedupe.io will learn a little less for each additional training pair you mark. Stopping at 50 yes and 50 no responses would be more than enough.
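The training thresholds above can be expressed as a small check. This is a sketch under stated assumptions: the response labels (`"yes"`, `"no"`, `"unsure"`) and the function `training_status` are hypothetical, and only the minimum of 10 per label comes from this tutorial.

```python
from collections import Counter

MIN_PER_LABEL = 10  # from the tutorial: at least 10 'Yes' and 10 'No' responses


def training_status(responses):
    """Summarize training responses and report whether the minimum is met."""
    counts = Counter(responses)
    yes, no = counts["yes"], counts["no"]
    return {
        "yes": yes,
        "no": no,
        "unsure": counts["unsure"],
        # 'Unsure' answers do not count toward either minimum.
        "can_proceed": yes >= MIN_PER_LABEL and no >= MIN_PER_LABEL,
    }


# Made-up session: one 'No' short of the minimum, then exactly at it.
almost = training_status(["yes"] * 10 + ["no"] * 9 + ["unsure"] * 3)
enough = training_status(["yes"] * 10 + ["no"] * 10 + ["unsure"] * 3)
```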
Once you’re done training and click the ‘Next’ button, Dedupe.io will take some time to apply your training to the rest of your data. This can take anywhere from several minutes for datasets under 100,000 rows to several hours for datasets with millions of rows.
Now that Dedupe.io has a good idea of the clusters in your data, it looks through all the individual records that have not yet been added to a cluster.
When it is finished, you will be able to continue on to the next step. Here, we will review these records and match them to one or more clusters.
In this example, we have an unmatched record in yellow and two clusters that look like good matches. We’ll keep the checkbox next to each cluster checked and click the ‘Match record to cluster(s)’ button.
Automatically accepting the rest
After reviewing each record, we’ll be shown another one until we’ve gone through the entire queue. If we’re confident enough, we have the option to ‘Automatically judge the rest’. Clicking this will have Dedupe.io judge the rest of the records automatically and skip us ahead to the next and final step.
Warning: use caution when automatically judging a large number of records. The records Dedupe.io asks you to match here are often the most ambiguous ones. We recommend reviewing as many of these records as possible for the best results.
When Dedupe.io finishes processing, you’ll have the ability to browse your data.
For detailed instructions, read the section on browsing your results in the Getting started with Dedupe.io tutorial.
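Once linking is done, the question from the Chicago example (which reported buildings have no violations) comes down to finding clusters that contain no record from the violations dataset. The sketch below assumes linked results are available as rows with hypothetical `cluster_id` and `source` fields. The actual layout of Dedupe.io's results may differ, so treat the field names and values as placeholders.

```python
def clusters_without_source(rows, required_source):
    """Return sorted ids of clusters with no record from `required_source`."""
    sources_by_cluster = {}
    for row in rows:
        # Collect the set of source datasets seen in each cluster.
        sources_by_cluster.setdefault(row["cluster_id"], set()).add(row["source"])
    return sorted(
        cluster_id
        for cluster_id, sources in sources_by_cluster.items()
        if required_source not in sources
    )


# Made-up linked rows; field names and values are placeholders.
linked_rows = [
    {"cluster_id": "c1", "source": "311_reports"},
    {"cluster_id": "c1", "source": "violations"},
    {"cluster_id": "c2", "source": "311_reports"},
    {"cluster_id": "c3", "source": "311_reports"},
    {"cluster_id": "c3", "source": "violations"},
]
no_violations = clusters_without_source(linked_rows, "violations")
```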