Should I use Dedupe.io or the dedupe Python library?


While you can use either Dedupe.io or the dedupe library to de-duplicate or link your data, there are some important differences to note when choosing which one to use.

Dedupe.io is built on top of an open source Python library called dedupe. This library provides core functionality for defining data models, capturing training data, and initial clustering of data.

While you can use either Dedupe.io or the dedupe library to de-duplicate or link your data, there are some important differences to note when choosing which one to use.

We broke down the high level differences here:

Dedupe.io dedupe library
Who is it for? Everyone Python programmers
What kind of accuracy are you looking for? up to 100% up to 90-95%
How quickly do you need your data de-duplicated? Within a few hours Days or weeks for custom programming
Do you need to review or audit the results? Supported Not supported
Where does your data live? Cloud hosted by Dedupe.io Self-hosted

Who is it for?

The dedupe library is built with Python programmers in mind. In order to use it, you will need to have experience with developing code in Python.

Dedupe.io is for everyone. No programming experience is required.

What kind of accuracy are you looking for?

Dedupe.io allows you to review and validate the clusters it finds, allowing you to get to 100% accuracy.

The Python library simply clusters records. It’s good at it, but not 100% accurate.

How quickly do you need your data de-duplicated?

Even for Python developers, the decision to use Dedupe.io or the dedupe library will ultimately come down to the time and resources you want to spend getting oriented on machine learning and probabilistic matching, and then re-implementing or manually doing some of the functionality Dedupe.io gives you out of the box.

If you need your data de-duplicated quickly (within a few hours or days), you should consider Dedupe.io.

Do you need to review or audit the results?

We’ve used the dedupe library hundreds of projects and found that it always takes multiple tries to get the perfect data model for the results we’re looking for. This takes time, as does manually reviewing the final results.

To address these challenges with the dedupe library, we built Dedupe.io to dramatically reduce the amount of manual review needed to get the most accurate results. We have created several steps in the Dedupe.io process to review the results and give you the opportunity to fine-tune how Dedupe.io de-duplicates your data.

Additionally, the Dedupe.io web interface allows for a team of reviewers to check records and clusters of records, speeding up the review process even more.

For achieving high accuracy on very large datasets, Dedupe.io is the choice we recommend.

Does your data change over time?

If you have a dataset or group of datasets that are constantly changing and growing, keeping your data de-duplicated on an ongoing basis can become its own engineering project.

Dedupe.io provides a web API for matching new records against an already de-duplicated project, allowing you to clean up your data as it grows. Dedupe.io also has a method for adding more training to your project, allowing Dedupe.io’s model of your data to be updated as your data changes.

This kind of continuous matching is possible with the dedupe library, but Dedupe.io has already solved many aspects of this challenge.

Where does your data live?

Dedupe.io is a software as a service (SaaS) tool, so using it means uploading your data to our servers. Our privacy policy states that we neither rent nor sell any of the data you provide to us to anyone.

In some cases, you may not have permission or want to move your data outside of your internal network. In this case, we would recommend using the dedupe library. Dedupe.io provides consulting services and support if you need assistance with this. Email us to get started.

Final thoughts

While all of the above trade-offs should be considered in your decision, generally the dedupe library is built for advanced Python developers in mind, while Dedupe.io is intended for everyone. If you have Python developers in-house, strict data sharing restrictions, and are okay with matching results that good but not 100% accurate, then the Python library is for you. In just about all other cases, we recommend using Dedupe.io.

Questions on if Dedupe.io is right for you? Drop us a line at info@datamade.us