Dedupe.io was shut down Jan 31, 2023.
The Dedupe.io team has decided to dedicate our focus to our consulting practice at DataMade and work on projects more aligned with our mission to support our clients in working toward democracy, justice, and equity.
We are continuing our consulting practice around the open source dedupe library and would be happy to consult with you on setting up a solution based on it. Contact us to get started.
Instructions and tips on formatting and processing files for upload.
The file must contain one row for every record, with the first row giving the name of each column.
You can import files of these types:
Spreadsheets (.xls, .xlsx)
For spreadsheets, only the first sheet will be imported. Special formatting and formulas will not be imported.
Comma-separated text (.csv)
Comma-separated value (.csv) files are a standard way to represent tabular data. A cell value that contains a comma, a newline, or a double quote must be enclosed in double quotes, and each double quote inside the value must be escaped by doubling it. For example, the value 17" LCD display is written as:
"17"" LCD display"
By convention, the first row acts as the header and defines the names of the columns in the data.
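The quoting rules for CSV files described above can be illustrated with a minimal sketch using Python's standard `csv` module. The sample rows and variable names are made up for illustration and are not part of Dedupe.io; the point is only to show what correctly quoted output looks like.

```python
import csv
import io

# Hypothetical sample data; the first row is the header.
rows = [
    ["id", "description"],
    ["1", '17" LCD display'],     # contains a double quote
    ["2", "Cable, 6 ft"],         # contains a comma
    ["3", "Line one\nLine two"],  # contains a newline
]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that need it and doubles embedded quotes.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back recovers the original values unchanged.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Most spreadsheet programs apply the same quoting automatically when exporting to CSV, so this mainly matters for files that are generated by scripts or edited by hand.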
If your spreadsheet contains more than one header row, columns with duplicate names, or columns with no field name, you will need to edit your file to correct these issues before uploading.
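As a rough pre-upload check for two of the header problems listed above (duplicate names and unnamed columns), something like the following sketch could be used. The function name `find_header_problems` and the sample data are hypothetical. Extra header rows cannot be detected reliably this way and still need a manual look.

```python
import csv
import io
from collections import Counter

def find_header_problems(csv_text):
    """Return a list of problems found in the first (header) row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    names = [name.strip() for name in header]

    problems = []
    # Columns whose header cell is empty or whitespace only.
    for position, name in enumerate(names, start=1):
        if not name:
            problems.append(f"column {position} has no name")
    # Column names that occur more than once.
    counts = Counter(name for name in names if name)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"column name {name!r} appears {count} times")
    return problems

# Hypothetical file with one unnamed column and one repeated name.
sample = "name,address,,name\nAcme,1 Main St,x,Acme Inc\n"
problems = find_header_problems(sample)
```

An empty result means the header row passed both checks.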
Once you upload your data to Dedupe.io, you will not be able to edit it. We take the approach of dealing with data in its original messy state and simply identifying which records to cluster together.
However, there are some cases where editing your data before uploading to Dedupe.io is a good idea. If a column has a lot of blank values, that’s ok. Dedupe.io will know how to ignore them appropriately. However, if your data contains placeholder text like “Null” or “n/a”, it would be a good idea to clear it out. We recommend using tools like Excel or, for larger spreadsheets, OpenRefine to make these kinds of changes.
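If you would rather script this cleanup than do it in Excel or OpenRefine, a minimal sketch along these lines may help. The `PLACEHOLDERS` set and the function name `blank_placeholders` are assumptions for illustration; extend the set only with strings you are sure never carry real meaning in your data.

```python
import csv
import io

# Assumed placeholder strings (compared case-insensitively); adjust as needed.
PLACEHOLDERS = {"null", "n/a"}

def blank_placeholders(csv_text):
    """Return CSV text with placeholder cells replaced by empty strings."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    # Copy the header row through untouched.
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)

    for row in reader:
        writer.writerow(
            ["" if cell.strip().lower() in PLACEHOLDERS else cell for cell in row]
        )
    return out.getvalue()

# Hypothetical input with two placeholder values in the phone column.
sample = "name,phone\nAcme,Null\nBeta Co,n/a\nGamma,555-0100\n"
cleaned = blank_placeholders(sample)
```

Only whole-cell matches are blanked, so a value such as "Null Island Cafe" is left alone.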
Dedupe.io works best on spreadsheets that have 100 rows or more. When spreadsheets have fewer than 100 rows, Dedupe.io will have trouble getting a good sample of the data to work with. (It’s usually faster to clean such small spreadsheets by hand, anyway.) For these reasons, we prevent uploads of spreadsheets with fewer than 100 rows.
If you’d like to test the service and you don’t have a large enough messy dataset on hand, you can use our example spreadsheet of early childhood education centers in Chicago:
Download Chicago Early Childhood Locations (800 rows)
Dedupe.io is set up to handle a wide variety of file encodings and will usually be able to determine how to read and process your CSV or Excel spreadsheet.
However, some files have encoding issues which can cause warning messages like this when you try to upload to Dedupe.io:
Encoding warning in Dedupe.io
In these cases, we recommend opening your file in Excel (or the free alternative LibreOffice) and re-saving it as a CSV file with the UTF-8 character encoding.
Saving a file as CSV in LibreOffice
Encoding as UTF-8 in LibreOffice
Then, upload your new file to Dedupe.io.
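The same re-encoding can also be done without a spreadsheet program. The sketch below assumes you already know the file's current encoding; `cp1252` is used here only as an example of an encoding that Windows tools often produce, and `reencode_to_utf8` is a hypothetical helper name. If the guess is wrong, decoding may fail or produce garbled characters, so check the result before uploading.

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")

# Hypothetical file contents saved in Windows-1252 (cp1252).
original = "name\nCafé Olé\n".encode("cp1252")
converted = reencode_to_utf8(original, "cp1252")
```

For a file on disk, the same idea applies: read it with `open(path, "rb")`, convert the bytes, and write the result to a new file so the original stays untouched.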