Comparing CSV files with Python pandas

In this article I want to show you a few lines of Python code that can help you to save a lot of time when comparing CSV and other files with eachother.

In literally every data migration project you will need to compare the actual outcome of the migration with the expected outcome. While there is a plethora of dedicated software tools (e.g. Redgate’s SQL Compare) a DIY approach can take you quite far.

Let us assume that we have two datasets in CSV format:

Dataset 1 (“expected”):

A;B;C;D;Key
1;A;1;J;1
1;A;2;J;2
1;A;3;J;3
1;A;4;J;4
1;A;5;J;5

Dataset 2 (“actual”), that has been altered once per column

A;B;C;D;Key
1;A;1;J;7
1;B;2;N;2
1;A;3;J;3
2;A;6;J;4
1;A;5;J;5

We have both datasets as files on the disk, let us now derive some Python code to compare them.

We use the following import statements to load a few helpful Python modules:

What is the purpose of these modules?

  • sys provides us with the exit command, allowing us to break the program if we encounter a state that prevents us from continuing
  • collections has a nice contruct, Counter, that we can use to compare the columns of both datasets
  • pandas is the famous data wrangling library we use to mess with the data

Let’s load the data and compare if the number of columns in both datasets match. If this is not the case, all further effort to compare them will be wasted so it is safe to exit the program with an error message.

If the columns match we can proceed with the comparison. What we will implement next is the following:

  • We will combine both dataset via outer join using a key column. In contrast to an inner join, that leaves only those rows in the combined dataset whose key value exists in both, the outer join leaves all data in the combined set.
  • To be able to distinguish between the expected and the actual (“true”) dataset we will add the suffixes “_e” and “_t” to each of their resp. columns.
  • For each set of “_e” and “_t” columns we will add another column, the “_c” column, with the help of pandas this column will then receive True & False values for each comparison.
  • The columns will finally be reordered so that we can inspect them later more easily.

“comparison” is the dataframe that contains our results now:

As you can see we have the expected “_e” column, the actual “_t” column and the comparison “_c” columns that show if the values for a specific set of columns and rows match.

This is the result we will write to disk, but for larger datasets the inspection of the result will become quite laborious, so we need to create a summary of the found comparisons. For this to work we will:

  • create a list of the “comparison” columns
  • initialize an empty “log” dataframe that will contain the summary
  • create the sums of True & False values for each “comparion” column
  • add those sums to the “log” dataframe
  • save a transposed version to the disk
  • and print it on the screen

You can find the complete code in a single GIST here: https://gist.github.com/UweZiegenhagen/8451df524a078da118346ae2cab202d6.js if you want to adapt it to your needs.

Please note that this code works fine for individual comparisons, if you have more comparisons a few additional lines of Python code can save a lot of time. I will probably write about this in another article, please comment if you are interested.

From Cologne, Germany, working in the financial industry as a specialist for data-related topics. Hobbies include LaTeX typesetting, electronics and programming