Should I use Dedupe.io or the dedupe Python library?


While you can use either Dedupe.io or the dedupe library to de-duplicate or link your data, there are some important differences to note when choosing which one to use.

Dedupe.io is built on top of an open source Python library called dedupe. This library provides the core functionality for defining data models, capturing training data, and performing the initial clustering of data.

Are you a Python programmer?

The dedupe library is built with Python programmers in mind. To use it, you need experience writing Python code.

Dedupe.io is for everyone.

What kind of accuracy are you looking for?

Dedupe.io lets you review and validate the clusters it finds, so you can get to 100% accuracy.

The Python library simply clusters records. It’s good at it, but not 100% accurate.
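
To make that limitation concrete, here is a minimal sketch of threshold-based clustering. It is not the dedupe library's API: it uses the standard library's difflib as a stand-in for a trained matching model, and the names similarity and cluster, the sample records, and the thresholds are all hypothetical. It only illustrates why unreviewed clusters can miss or merge records.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (a stand-in for a learned model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(records: dict[int, str], threshold: float) -> list[set[int]]:
    """Group record ids that are linked by pairs scoring at or above threshold."""
    parent = {rid: rid for rid in records}

    def find(x: int) -> int:
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair of records and merge the groups of matching pairs.
    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[int, set[int]] = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=min)


# Hypothetical sample data: records 1 and 2 refer to the same company.
records = {
    1: "Acme Corporation",
    2: "ACME Corp.",
    3: "Zenith Industries",
}

loose = cluster(records, threshold=0.6)   # records 1 and 2 end up together
strict = cluster(records, threshold=0.7)  # the same duplicate is missed
```

In this toy example, a small change in the threshold decides whether a true duplicate is found. That kind of borderline case is what a human review step is meant to catch.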

How quickly do you need your data de-duplicated?

Even for Python developers, the choice between Dedupe.io and the dedupe library ultimately comes down to how much time and how many resources you want to spend getting oriented on machine learning and probabilistic matching, and then re-implementing or manually performing functionality that Dedupe.io gives you out of the box.

If you need your data de-duplicated quickly, you should definitely consider Dedupe.io.

How much manual review are you willing to do?

We’ve used the dedupe library for quite a few projects and found that it always takes multiple tries to get the perfect data model for the results we’re looking for. This takes time, as does manually reviewing the final results.

To address these challenges with the dedupe library, we built Dedupe.io to dramatically reduce the amount of manual review needed to get the most accurate results. We have created several steps in the Dedupe.io process to review the results and give you the opportunity to fine-tune how Dedupe.io de-duplicates your data.

Additionally, the Dedupe.io web interface lets multiple users review records and clusters of records, speeding up the review process even more.

For achieving high accuracy on very large datasets, Dedupe.io is the obvious choice.

Are you OK sharing data with a third party?

Dedupe.io is a software as a service (SaaS) tool, so using it means uploading your data to our servers. In some cases, you may not have permission to move your data outside of your internal network, or you may not want to. In those cases, we recommend using the dedupe library.

If you do upload your data to Dedupe.io, know that our privacy policy states that we neither rent nor sell any of the data you provide to us to anyone.

Final thoughts

While all of the above trade-offs should be considered in your decision, generally the dedupe library is built with advanced Python developers in mind, while Dedupe.io is intended for everyone. If you have Python developers in-house, have strict data sharing restrictions, and are okay with matching results that are good but not 100% accurate, then the Python library is for you. In just about all other cases, we recommend using Dedupe.io.

Questions about whether Dedupe.io is right for you? Drop us a line at dedupe@datamade.us