Formatting files for upload


Instructions and tips on formatting and processing files for upload.

Supported file types

You can import files of the following types:

  • Spreadsheets (.xls, .xlsx)
  • Comma-separated text (.csv)

The file must contain one row for every record, with the first row indicating the name of each column.

Spreadsheets (.xls, .xlsx)
For spreadsheets, only the first sheet will be imported. Special formatting and formulas will not be imported.

Comma-separated text (.csv)
Comma-separated value (.csv) files are a standard way to represent tabular data. Cell values that contain a comma, a newline, or a double quote must be enclosed in double quotes, and any double quote inside the value must be escaped by doubling it. For example: "17"" LCD display".
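If you generate your CSV with a script, a standard CSV library will normally apply these quoting rules for you. The following is a minimal sketch using Python's built-in csv module; the function name and sample rows are made up for illustration.

```python
import csv
import io


def to_csv_text(rows):
    """Serialize rows (lists of strings) to CSV text with standard quoting."""
    buffer = io.StringIO()
    # The default dialect quotes only the cells that need it and doubles
    # any embedded double quotes.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data, including a quote and a comma inside cells.
rows = [
    ["name", "description"],
    ["Monitor", '17" LCD display'],
    ["Cable", "HDMI, 2 m"],
]

csv_text = to_csv_text(rows)
print(csv_text)

# Reading the text back recovers the original cell values.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The second line of the printed output is Monitor,"17"" LCD display", which matches the escaping described above.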

By convention, the first row acts as a header and defines the names of the columns in the data.

Pre-processing your data

If your spreadsheet contains more than one header row, columns with duplicate names, or columns with no field name, you will need to edit your file to correct these issues before uploading.
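For CSV files, you can check the header row for two of these problems before uploading. The sketch below assumes the file has already been read into a string. The function name check_header is hypothetical and is not part of Dedupe.io. It reports blank and duplicate column names only; an extra header row still has to be spotted by eye.

```python
import csv
import io
from collections import Counter


def check_header(csv_text):
    """Return (blank_positions, duplicate_names) for the first row of csv_text.

    blank_positions are 1-based column positions whose name is empty.
    duplicate_names are the column names that appear more than once.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    names = [name.strip() for name in header]
    blank_positions = [i for i, name in enumerate(names, start=1) if not name]
    counts = Counter(name for name in names if name)
    duplicate_names = sorted(name for name, n in counts.items() if n > 1)
    return blank_positions, duplicate_names


# Hypothetical example: the third column has no name and "name" is repeated.
sample = "name,address,,name\nAcme,1 Main St,x,Acme Inc\n"
blank_positions, duplicate_names = check_header(sample)
print(blank_positions, duplicate_names)
```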

Once you upload your data to Dedupe.io, you will not be able to edit it. We take the approach of dealing with data in its original messy state and simply identifying which records to cluster together.

However, there are some cases where editing your data before uploading to Dedupe.io is a good idea. If a column has a lot of blank values, that’s ok: Dedupe.io will know how to ignore them appropriately. But if your data uses placeholder text like “Null” or “n/a” for missing values, it is a good idea to clear those cells out. We recommend using a tool like Excel, or OpenRefine for larger spreadsheets, to make these kinds of changes.
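If you prefer to script this cleanup for a CSV file, one possible approach is sketched below. It assumes that the only placeholders are "Null" and "n/a" in any letter case; adjust the set to match your own data. The function name is hypothetical.

```python
import csv
import io

# Assumption: these are the placeholder strings used in your data.
PLACEHOLDERS = {"null", "n/a"}


def clear_placeholders(csv_text, placeholders=PLACEHOLDERS):
    """Return csv_text with placeholder cells replaced by empty cells."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row_number, row in enumerate(reader):
        if row_number == 0:
            # Leave the header row untouched.
            writer.writerow(row)
            continue
        writer.writerow(
            ["" if cell.strip().lower() in placeholders else cell for cell in row]
        )
    return out.getvalue()


cleaned = clear_placeholders("name,phone\nAcme,Null\nn/a,555-1234\n")
print(cleaned)
```

Only cells whose entire value is a placeholder are cleared, so a name such as "Nullmeyer" is left alone.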

Row limits

Dedupe.io works best on spreadsheets that have 100 rows or more. When spreadsheets have fewer than 100 rows, Dedupe.io will have trouble getting a good sample of the data to work with. (It’s usually faster to clean such small spreadsheets by hand, anyway.) For these reasons, we prevent uploads of spreadsheets with fewer than 100 rows.
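To check a CSV file against this limit before uploading, you can count its records with a CSV parser rather than counting lines, since a quoted cell may itself contain newlines. The sketch below rests on two assumptions that may not match how Dedupe.io counts: the header row is not counted, and completely empty rows are ignored.

```python
import csv
import io

MIN_ROWS = 100  # the minimum stated above


def count_data_rows(csv_text):
    """Count non-empty records after the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed not to count)
    return sum(1 for row in reader if any(cell.strip() for cell in row))


# Hypothetical sample: two records, one containing a newline inside quotes.
sample = 'name,notes\nAcme,"line one\nline two"\nBolt,ok\n'
n = count_data_rows(sample)
print(n, "rows;", "large enough" if n >= MIN_ROWS else "too small")
```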

If you’d like to test the service and you don’t have a large enough messy dataset on hand, you can use our example spreadsheet of early childhood education centers in Chicago:

Download Chicago Early Childhood Locations (800 rows)

Troubleshooting file encoding

Dedupe.io is set up to handle a wide variety of file encodings and will usually be able to determine how to read and process your CSV or Excel spreadsheet.

However, some files have encoding issues which can cause warning messages like this when you try to upload to Dedupe.io:

Encoding warning in Dedupe.io

In these cases, we recommend opening your file in Excel (or the free alternative, LibreOffice) and re-saving it as a CSV file with the UTF-8 character encoding.

Saving a file as CSV in LibreOffice

Encoding as UTF-8 in LibreOffice

Then, upload your new file to Dedupe.io.
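If you would rather convert a CSV file with a script than re-save it by hand, the sketch below shows one way to do it. It is only a guess-and-check approach, not a description of how Dedupe.io detects encodings: it tries a short list of candidate encodings in order and uses the first one that decodes without error. The candidate list is an assumption. Because latin-1 accepts any byte sequence, a wrong guess produces garbled characters rather than an error, so inspect the result before uploading.

```python
def reencode_to_utf8(raw_bytes, candidate_encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Return (utf8_bytes, source_encoding) for the first candidate that decodes.

    "utf-8-sig" also accepts plain UTF-8 and drops a leading byte order mark.
    """
    for encoding in candidate_encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), encoding
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical example: the bytes for "café" as saved by a Windows-1252 editor.
utf8_bytes, guessed = reencode_to_utf8(b"caf\xe9")
print(guessed, utf8_bytes.decode("utf-8"))
```

To convert a real file, read it in binary mode with open(path, "rb"), pass the bytes to the function, and write the returned bytes to a new file opened with "wb".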