Friends of Sinn Féin: the methodology

Upload, clean, check: the steps taken to analyse 14,879 data entries

Download full spreadsheet of donors.

Irish Times Data has digitised 20 years of filings made by Friends of Sinn Féin (FoSF), the party's US fundraising arm, allowing for analysis of cumulative donations made by donors.

The filings are publicly available on the US Department of Justice website in PDF format.

As with most information held in PDF form, the data had to be converted into a format that would allow us to bring the filings into a spreadsheet, in order to calculate totals for individual donors and the total amount raised in particular US states.


To do this we used optical character recognition (OCR) software and manual inputting (in the case of handwritten notes/illegible entries).

OCR software can have difficulties identifying certain characters. For example, it may read the number 8 as a capital B. Elsewhere, passages of text or amounts had been obscured, meaning that, although they were legible to the human eye, they could not be picked up by the software.
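To illustrate the kind of misreading involved, the following is a minimal Python sketch, not the tooling used by the Irish Times Data team. It assumes a hypothetical table of letters that OCR commonly confuses with digits (`OCR_DIGIT_FIXES`) and a hypothetical helper (`parse_amount`) that returns nothing when a value still needs to be checked by hand.

```python
import re
from decimal import Decimal

# Hypothetical table of letters that OCR often confuses with digits.
OCR_DIGIT_FIXES = str.maketrans({"B": "8", "O": "0", "l": "1", "I": "1", "S": "5"})

# After correction, an amount should be digits with an optional decimal part.
AMOUNT_PATTERN = re.compile(r"\d+(\.\d{1,2})?")


def parse_amount(raw):
    """Return the amount as a Decimal, or None if it needs a manual check."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    cleaned = cleaned.translate(OCR_DIGIT_FIXES)
    if AMOUNT_PATTERN.fullmatch(cleaned):
        return Decimal(cleaned)
    return None  # leave for a human to read against the original filing
```

A substitution table like this only makes sense for columns that should contain nothing but numbers. Applied to a name column it would corrupt valid text, which is one reason a manual check of every cell remains necessary.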

Therefore every row and column, 14,879 entries (or almost 60,000 individual cells), was manually checked by the Irish Times Data team.

The total donations for each filing period as recorded by The Irish Times were then calculated and checked against the totals reported by Friends of Sinn Féin in its (mostly) biannual filings. Where the figures did not match, multiple checks were carried out for accuracy.

However, in some periods the total donations as tallied by The Irish Times did not match the amounts filed by Friends of Sinn Féin. In response to a query about these discrepancies, the president of Friends of Sinn Féin, Jim Cullen, said the organisation includes revenue from non-donation sources (eg bank interest and credit card rebates) in its filings. Income sources which are "not derived from fundraising" are therefore included in the filed totals but not listed as individual donations.
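The comparison of tallied and reported totals can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the team's actual workflow: the function name `reconcile`, the shape of the inputs and the period labels in the usage example are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(entries, reported_totals):
    """Compare tallied donations with the totals reported in each filing.

    entries: iterable of (period, amount) pairs, one per donation row.
    reported_totals: dict mapping period -> total stated in the filing.
    Returns {period: (tallied, reported)} for periods that do not match.
    """
    tallied = defaultdict(Decimal)  # Decimal() starts each period at 0
    for period, amount in entries:
        tallied[period] += amount

    mismatches = {}
    for period, reported in reported_totals.items():
        ours = tallied.get(period, Decimal(0))
        if ours != reported:
            mismatches[period] = (ours, reported)
    return mismatches
```

For example, with invented figures:

```python
rows = [("2010-H1", Decimal("100.00")), ("2010-H1", Decimal("50.00")),
        ("2010-H2", Decimal("75.00"))]
filed = {"2010-H1": Decimal("150.00"), "2010-H2": Decimal("80.25")}
reconcile(rows, filed)
```

Only the second period would be flagged here. A gap of this kind could reflect either a transcription error or non-donation income such as bank interest, so each flagged period still has to be examined individually.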

As is the case with any large data set, errors can exist, both in the original source data and as a result of importing the data. To clean the data we used OpenRefine, a software package that aids the clean-up of large data sets.

There were three stages to this part of the clean-up.

States: In a number of cases the wrong US state had been inputted. As there is a finite number of states, this could be corrected with relative ease. Where it was unclear in which state a particular city/town was located, a Google search was employed to identify the state.

City/town/other location: Using OpenRefine we checked town names for simple spelling mistakes. For example, Cincinnati was spelled in several different ways, something which could be easily rectified using the software. Where a place name within the same state was spelled differently in two separate cells, a Google search was carried out to ascertain the correct spelling and any other variations were replaced.

Names: To allow for the correct categorisation of donors it was necessary to make minor typographical changes to the name column, including alterations to correct inconsistent capitalisation, inconsistent naming conventions (eg in some filings a name was listed as “John Smith” but in others as “Smith, John”), obvious spelling mistakes and inconsistent abbreviations.

In a small number of cases where the same business name appeared differently in multiple columns the contents were amalgamated to reflect the diversity of entries using forward slashes.
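Grouping variants of the same name or place can be illustrated with a short Python sketch. It is loosely modelled on the "fingerprint" style of key that OpenRefine documents for its clustering feature, and is not necessarily the method used here. The function names `fingerprint` and `cluster` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, accents, punctuation and word order."""
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, replace punctuation with spaces, then split into tokens.
    tokens = re.sub(r"[^\w\s]", " ", ascii_text.lower()).split()
    # Sort the unique tokens so "Smith, John" and "John Smith" collide.
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Return groups of distinct spellings that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}
```

A key of this kind catches differences in capitalisation, punctuation and word order. It does not catch genuine misspellings, such as a dropped letter in Cincinnati, which need a similarity measure or a human decision. In either case the groups are only candidates for merging, since two different donors can share a name.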

At the end of the process, randomly selected entries in the database were checked against the original filings made by Friends of Sinn Féin.
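Choosing rows for such a check can be sketched as follows. This is a minimal illustration that assumes rows are numbered from zero. The function name `pick_spot_checks`, the sample size and the seed are hypothetical and do not reflect how many entries were actually checked.

```python
import random


def pick_spot_checks(total_rows, sample_size, seed=None):
    """Return a sorted list of distinct row numbers to check by hand."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return sorted(rng.sample(range(total_rows), sample_size))


# Example: choose 50 of the 14,879 entries for comparison with the PDFs.
rows_to_check = pick_spot_checks(14879, 50, seed=1)
```

Recording the seed means the same selection can be regenerated later if the checks need to be repeated or audited.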

It is the policy of The Irish Times to correct errors at the earliest opportunity. Anyone wishing to draw attention to a potential error should email data@irishtimes.com.