Open source software for easily processing and reconciling data in bulk into Wikidata. OpenRefineOpenRefine is a tool which can be used to clean up data. It is an open source software used to edit spreadsheets and upload data easily to Wikidata. Access OpenRefineThis is a video from Emma Carroll, Data Visualisation student intern at the University of Edinburgh, on how to use OpenRefine using the Survey of Scottish Witchcraft database as her example dataset. The above tutorial by Emma Carroll covers the basics of using OpenRefine including: opening and uploading a spreadsheet onto OpenRefine, reconciling data, building a schema and uploading data from the spreadsheet onto Wikidata. View media transcript OpenRefine is a useful open-source software which can be used to work with different spreadsheets. It is also used as a tool for uploading data from spreadsheets directly onto Wikidata. Today I'm going to run a tutorial showing how to reconcile data and add it onto Wikidata. You can download OpenRefine online for free following the link shown here. You can then open OpenRefine using Google Chrome or Internet Explorer. Once OpenRefine is opened, a spreadsheet can then be uploaded. This can be done either by uploading directly from your computer, using a web address, a clipboard feature, uploading a database, or using a link from Google data. I've decided for this example to work with data from Google Sheets. And the example that I'm using is from the Scottish witchcraft database, where I will be looking at the ethnicity of different witches recorded in the database. So I will copy the link and then paste it directly into the textbox. From there, I am then given a preview of the OpenRefine data and I'm able to then give the project a name. Then on OpenRefine, you're then able to see all of the columns that you've used on your original database. And for this, I then have the most important information that I'm going to look at and that is the name of the accused witch and the ethnic origin. I also then have the reference which is added at the end, which will be a reference for each point that I then upload onto Wikidata. If you follow the reference, it will then take you onto a page on the survey of Scottish witchcraft database, where you will be able to see more information about each witch, not only their ethnicity. Firstly, the spreadsheet on OpenRefine can be neatened up so that there are not columns featured in this spreadsheet which are not needed. This can be done by clicking on the arrow at the top of the column and then going on 'Edit column', and then clicking on 'Remove this column'. This will then remove this column from the spreadsheet. This can be done for any columns that are not needed within the spreadsheet. Once this is done, you're then ready to reconcile the data. To reconcile your data you must click on the arrow at the header of the column and scroll down to reconcile and click on 'Start reconciling'. From there, you want to pick the Wikidata service and be ready to start reconciling. OpenRefine gives you suggestions of different items that you can reconcile your data against to have a more accurate result. You can either pick from this list or you're able to search your own. So for ethnicity, ethnic group seems like a better Q-number to reconcile against. From there OpenRefine can then reconcile all of the data that you need. Once OpenRefine has finished reconciling your data it will then have a green bar at the top of the header column, which then indicates the amount of your data, which has then been correctly reconciled. As you can see from the 17 rows within this spreadsheet, there are 15 matched and 2 not matched and OpenRefine split these into two separate groups to make it easier to look at them. So we then know that for Irish, we're wanting to go with the Irish people rather than Irish Americans and Open Refine is doing this by giving a different Wikidata suggestions. If I click on the single tick, it will match this one item to this cell. However, if I click on the double tick, it will then match all of the cells with the same information to the same Wikidata point. Once all of your points have been matched, we then have a full dataset for ethnic origin and as a result, there is a 100% match rate with a full green bar. Next we can then reconcile other items within the spreadsheet which will be used along with this already reconciled column to upload data onto Wikidata. We can then reconcile other parts of the dataset which would be useful to use in connection with the ethnic origin in order to upload this data onto Wikidata. And so the next step is to reconcile the name of the witches in a similar way to what we done for the ethnic data. For the name of the witches we can then relate this to Q5, which is for humans, and this time Wikidata and OpenRefine have given us the correct list. As names are more difficult to be able to find a connection with, we can then also reconcile against another column within the spreadsheet. And for this example, accuse reference would be the most useful as each witch has a accuse reference which is also uploaded onto Wikidata. Accuse witch ID. And then we're ready to also start reconciling for this dataset. You can then reconcile as many different columns as you wish. However, in this example two columns is enough. The next step in order to upload data from OpenRefine onto Wikidata is to build a schema which can then be uploaded onto Wikidata. This is used using the Wikidata extension and clicking on 'Edit Wikidata schema'. From there, you're then able to build a schema which is made up of different items and properties. By clicking on 'Add item' we're then able to drag a different reconciled column into this box. And as I want to add data about each ethnic origin directly on to the name of each witch I then want to have my main item as the witch, which is then called 'Full Name' in this instance. I then want to add a statement about the witch. And by doing this, I will click on add statement. I can find the property that I want to add about each accused witch by searching in this search bar here where I can type to get suggestions about the different properties. I can then drag another reconciled item about the witch and so I'm going to drag in ethnic origin. This is then telling us the ethnic group for each accused witch that is mentioned in this spreadsheet. The next step is then to add a reference. And this can be done by clicking on the 'Add Reference' box here. There can either be one reference for every single item, which is going to be uploaded onto Wikidata or each can have a singular reference. This can be done by typing in URL reference. And either you can put in one URL directly into this box if every single statement has the same reference. Or else in this instance, I'm going to put in one of the columns taken from the spreadsheet directly into this box by dragging it. And this is because each one of these points that I want to add, to Wikidata has a different reference. Once I have a reference, I'm then able to go into the next column to see if there are any issues with my edits. OpenRefine tends to flag up any issues that it sees in your edits so that they are as accurate as possible. As you can see in this example, there are no issues. And so the next tab then shows you the preview of how it will look for each one. For example, we can see that John Neill is part of the ethnic group of English people and we have a reference below, which will then take us on to more information about each witch. [...] Once you're happy with each one of the points and are confident that the data that you're going to upload into Wikidata is correct, you are then able to export the data. This can be done either using 'Upload edits to Wikidata', which will upload your edits directly onto Wikidata and is the best approach for smaller datasets under a thousand points. Or else if you have a very large dataset, you can then also 'Export to QuickStatements'. And this then makes your data into the correct format to be uploaded directly into QuickStatements, which is another program, which can be used to upload data directly into Wikidata. In this example I'm going to upload the data directly into Wikidata. And it already tells me that there are no issues with my edits and I'm ready to upload. I then type in an edit summary and can click on 'Upload edits'. And through OpenRefine, I'm then able to upload my edits directly onto Wikidata. Once this is complete, I'm then able to go on the name of one of the accused witches, which will then take me directly to their Wikidata page. And if I go down the page, I'm then able to see that there The next tutorial (below) from Maggie Lin, our second Data Visualisation intern, takes things a step further and covers bulk data item creation using OpenRefine. Bulk data item creation using OpenRefine - tutorial by Maggie Lin. View media transcript OpenRefine is a great tool for data cleanup and transformation, and when used with Wikidata, it is a very useful application for bulk editing and item creation. Today, I'm going to take you through a little example on how to bulk create items with OpenRefine. When getting started with OpenRefine, you want to visit their website and click the download link depending on which operating system you're using. Once it is downloaded, click on the application, and it should open in your browser like so. To create a project, you want to import data to OpenRefine, and there are many different ways you can do this. Either locate a file from your computer, from a 'Web Address', paste from your 'Clipboard', 'Database', or from 'Google Data'. Today, I'll be using 'Google Data' in the form of a Google sheet. I like using Google sheets due to the Wikipedia and Wikidata tools that you can download as an add on. This is the sheet I've prepared for this video tutorial, and a function I like to use is the Wikidata lookup function, which takes a property, and this is a property P-number, 'P4478', which is the 'Accused Witch ID' and an 'identifier' and returns a ''QID''. What we are trying to do today is create witchcraft investigations, which link to an Accused Witch. These witches have already been created in Wikidata. However, the case of witchcraft investigations linked to them have not. So here we use the Wikidata lookup function to find the 'QID' as these witches have an external identifier, which is the Accused Witch ID here, 'P4478'. And if I just want to drag this down, I have quickly found the 'QID's for all the Accused Witches I want to have created a case investigation for. So now I'm going to click on this sheet that I want to use today, and it takes me down here, which gives you options on how you want to change your data. As I am happy with how the spreadsheet looks, I'm going to go ahead and create a project. I want to start off with reconciling. This is the process of matching your dataset with an external source, in this case, Wikidata. As all the Accused Witches are already in Wikidata, we want to create the cases of which has investigation using the data we have here. As we want to link back to the Accused Witches, it means we should reconcile the Accused Witches. So here if I just click 'Reconcile' here, 'Start Reconciling'. And this is where you pick which type in Wikidata, you want to reconcile against. You can– Well, all the witches are 'human', so I'm going to stick with this, but you can reconcile against any type. I can just put like a 'dog', or 'reconcile against no particular type'. So it just reconciles against the whole of Wikidata. However, they are all 'human', so I'm going to stick with 'human'. Another way we can do this is using the 'Names of the Accused' and 'Reconcile'— with 'relevant details from other columns'. So the property we want to use is the 'Accused Witch ID'. 'Survey of Scottish Witchcraft - Accused witch ID'. And hopefully these columns will be the same. Perfect. Now, as we want to create new items, we should look at this label column, and start reconciling. They will all be type 'witchcraft investigation', but I'm switching (unclear) 'Reconcile against no particular type' as they shouldn't– they won't be created yet. As we are wanted to create new items, you click 'Action' and you click 'Create a new item for each cell'. Now we are finished with this spreadsheet. To start structuring our new items, we want to look at the 'Edit Wikibase Schema' option. And we want to 'add item' as we're creating new items and use the label as a column, which creates the names of these items. We also want to add the terms ('add terms'). So, 'Aliases', 'Descriptions', and also 'Labels'. Don't forget to add the labels again as, if you don't, the names are not going to show up properly, and you want to put your 'language', which for us is 'English', and then drag the columns to match. Perfect. And now we want to add the statements ('add statements'), which will be the information that is in the items. So we want to have the 'Survey of Scottish Witchcraft - 'Case ID', and then we'll use the 'Case ID' for that. Okay, so that's all the 'Statements' added in. So, 'main subject', is the 'Names of the accused', which we reconcile before, and this means that the cases of 'Witchcraft Investigation' that we create will link back to the 'accused'. And now for 'references', to add a 'reference', just pop in the 'reference URL' here, and then drag the 'Reference'. All the references for each case are different, so we're going to take the 'reference' column. But if all the references were the same, you could just put an URL in here and it would work. And a nice trick that I like to use is to copy this and then paste a reference ('paste reference') for both of these. Perfect. Now, if you're happy with your schema, you can save it here ('save schema'), and then look at the 'Issues'. The issues are 'This edit batch will create new Wikidata items', which is perfect because that's what we're wanting to do. And then look at the 'Preview' to see what the items are going to look like. And if we scroll down, it is looking like how we want the cases to be structured. Once we're happy with this, we want to click the 'Wikidata' button up here, and either 'Upload edits to Wikibase' or 'Export to QuickStatements'. If you click 'Export to QuickStatements', it gives you all the edits in quick statements, syntax. Further resourcesWikidata page on OpenRefine.© Ewan McAndrew and Hannah Rothmann, University of Edinburgh, 2020, CC BY-SA 4.0, unless otherwise indicated. This article was published on 2024-10-08