Instructions for the IHCC Data Steward
The data steward has three main responsibilities:
- Working with the data dictionary maintainer to incorporate a new data dictionary
- Ensuring the quality and consistency of existing mappings over time
- Updating data dictionaries in response to changes to the schema or the data dictionaries themselves
Incorporating a new data dictionary
Initial Communication
- Point the data dictionary maintainer to the instructions here.
- Wait for the submission email containing the Excel Workbook described in the instructions above. Before you start with the next steps, make sure you have:
  - A valid email connected with a Google account for the submitter. If the data submitter is unable to provide a Google account email, please see here
  - A short capital letter ID for their data dictionary
Prepare data dictionary automated mapping
- Go to https://droid.ontodev.com/IHCC (see below for some general DROID tips).
- Make sure you are logged in. If you are, you will see your name in the top left corner; if you are not, you will see the “Login via GitHub” link instead.
- If you were redirected to the “Available Projects” Overview, click on the IHCC project.
- In the IHCC `Branches` overview, click “Refresh”, and then do one of the following two options:
  - If you want to continue working on a data dictionary you have previously created, click on the `Checkout` button next to it.
  - If you want to create a new data dictionary, click “Create new”. Then, in the `Branch from` window, select `master`, and then enter a `Branch name`, ideally corresponding to the name of the data dictionary (it is good practice for IHCC to use lower-case branch names only). Then click `Create`. In this example, we are creating the fictional data dictionary `example`.
- In the branches view, you will see your new branch along with a red `Delete` button next to it. This means that the branch is available and ready to be viewed. Click on the name of the branch (in the running example, that is `kort`). This will bring you to the IHCC branch control page.
- Click on `Upload cohort data`, the first step in the IHCC Data Workflow.
- Fill in the form that pops up in a new window.
- Click `Submit`. Once the upload is complete, you will see a page with two links (`Open Google Sheet` and `Back`). Close this page for now and go back to the branch control page with the workflow. Note: You are only able to do this once in your branch. If you feel like you made a mistake, simply go back to the start, delete the branch you created and start over.
- At this point, it makes sense to look at the Google sheet once to catch obvious errors (bad-looking labels, incomplete dictionary, etc.). There is a fair bit of quality control running, but it makes sense to err on the cautious side.
- Next, click on `Run automated mapping for new data dictionary`. This will trigger the automated mapping process. In the console below, you will see the progress of the mapping. When the process is finished, the data dictionary will have been uploaded, along with the mapping suggestions, into the Google sheet. After refreshing a few times, you will hopefully see a green confirmation announcing `Success`. Otherwise, you will have to read the log file and look out for typical errors.
- You can now share the data dictionary with the submitter by clicking `Share Google Sheet with submitter`.
Supporting the mapping phase
- When you receive word from the data dictionary maintainer that they have finished their mappings, you can `Run automated validation`. If everything is correct, you should, after a while, see the green `Success` message.
  - Watch out: `Success` does not mean that there are no errors! It only means that the validation process ran successfully. Read the short log file for potential problems, and communicate them back to the data dictionary providers. DROID will have added validation errors into the Google sheet itself, so they can be fixed by the data dictionary maintainers.
  - Tip: There is a tiny delay between edits performed on a Google sheet and the sheets being synced properly. It is generally a good idea to leave around 10 seconds between an edit to a Google sheet and running a `DROID` command.
- This phase may require a bit of back and forth with the data dictionary providers until the validation is passed.
Finalizing the mapping
- If everything went well with the previous step and the validation has passed, click `Prepare data dictionary for build`.
- If the preparation terminated correctly, you can now `Build data dictionary`. This will produce the OWL files and other related datasets (including the updated dataset for the IHCC browser!).
- You are now ready to `Commit` the changes and `Push` the changes to GitHub. If asked to create a pull request, do that.
- Ask a colleague to review your pull request on GitHub. If you receive a positive review, merge the pull request and you are done.
- Don’t forget to delete the branch on GitHub, and perhaps even on DROID!
Ensuring the quality and consistency of mappings over time
From time to time, it makes sense to run a health check on existing mappings, because they can diverge over time. For example, a new data dictionary could be mapping the same term against a different GECKO category, either because of a difference in opinion, or because the GECKO term did not exist when one of the two data dictionaries was mapped. To do a QC analysis:
- Go to https://droid.ontodev.com/IHCC (see below for some general DROID tips).
- Same as before, create a new branch, naming it something like `qc-oct2020` or similar.
- Click on the branch, and then, in the section `IHCC Data Admin Tasks`, click on `Run all mappings (quality control)`.
. - If everything went well, this process will result in a success. If not, follow the instructions in the log - perhaps a few of the previous mappings are now outdated and need to be changed (see next section, “Updating an existing data dictionary”).
Updating an existing data dictionary
There are currently no tools to support edits to data dictionaries easily. Please follow the steps below (you may have to ask for help from your IHCC tech go-to person):
- `git clone` the data harmonization repository.
- The source of truth for any data dictionary and its mappings is in the `templates` directory, for example `templates/ge.tsv`. Edit this file using, for example, Excel.
- In your console, `cd` to the `data-harmonization` project directory, then type `make all`. This will rebuild all data dictionaries, including your recent changes.
- Inspect the diff, commit the changes to a branch, push and create a pull request.
- Ask for a review from another IHCC member, then proceed as usual. If you know how, make sure editing the TSV with Excel causes only a small diff, reflecting your changes. Otherwise, it is possible that Excel saved the file with the wrong settings.
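The "small diff" check from the last step can be sketched in shell. This is a minimal, self-contained illustration that uses a throwaway local repository in place of the data harmonization repository. The file contents, the branch name `update-ge`, and the commit identity are assumptions made up for the example, and the `make all` step is left out because it needs the real project.

```shell
set -eu

# Throwaway repository standing in for the real one (assumption).
repo="$(mktemp -d)"
cd "$repo"
git init -q
mkdir templates
printf 'ID\tLabel\nGE:1\tAge\nGE:2\tSex\n' > templates/ge.tsv
git add templates/ge.tsv
git -c user.name="Example" -c user.email="example@example.org" \
    commit -q -m "Add example template"

# Work on a branch, then simulate editing a single row of the TSV.
git checkout -q -b update-ge
printf 'ID\tLabel\nGE:1\tAge at enrollment\nGE:2\tSex\n' > templates/ge.tsv

# --numstat prints "<added> <deleted> <path>", separated by tabs.
# A one-row edit should show one added and one deleted line; much larger
# numbers suggest the editor rewrote the whole file (quoting, line endings).
changed="$(git diff --numstat -- templates/ge.tsv)"
added="$(printf '%s\n' "$changed" | cut -f1)"
deleted="$(printf '%s\n' "$changed" | cut -f2)"
echo "lines added: $added, lines deleted: $deleted"

# Commit only once the diff looks as expected.
git add templates/ge.tsv
git -c user.name="Example" -c user.email="example@example.org" \
    commit -q -m "Update label for GE:1"
```

In the real repository, the same `git diff --numstat` call can be run after saving the file from Excel and before committing, followed by `git push` and a pull request as described in the steps.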
General DROID tips
`DROID` is the tool you will be using to manage most of the workflows around incorporating new data dictionary mappings. `DROID` is essentially a thin layer that allows us to execute commands and run typical `git` commands in a simple-to-use interface.
- General tips:
  - Remember to hit the `refresh` button when you have kicked off a process to determine if the process is still ongoing.
  - No process you execute as part of the IHCC pipeline should ever take more than 5 min. It is unlikely that a process gets stuck, but if one does, you can `Cancel` it when you see no progress after waiting 5 min.
- Nothing you do in the DROID UI except for the red Push button on the left can actually do anything to the GitHub repository that holds our data. The `delete` button in the branches overview, for example, only deletes a copy of a branch, not the actual branch on GitHub.
- If clicking on a link does not open the tabs as documented by the process above, try switching off any ad blockers you have running in your browser.
- A list of typical errors that can occur during the whole process can be found here.
- One of the central components underlying DROID is GNU Make, a system to, in essence, build files in a systematic way. One typical message that often confuses new users of `make`, but one we would not expect you to encounter using DROID, is `nothing to be done for X`, where `X` is usually the name of a file. This means that the file and all its dependencies have already been built correctly and you can proceed to the next step. If you use `make` outside of DROID in the command line, you can add a `-B` at the end of your command to invalidate all previous attempts to build the file and force rebuilding it (and all its dependencies), for example `make test.txt -B`.
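The effect of `-B` can be tried out in a scratch directory. This is a minimal sketch, assuming GNU Make is installed; the Makefile, `input.txt`, and `build.log` are hypothetical and exist only to count how often the recipe actually runs.

```shell
set -eu

# Scratch directory with a tiny, made-up Makefile (assumption).
workdir="$(mktemp -d)"
cd "$workdir"
printf 'test.txt: input.txt\n\tcp input.txt test.txt\n\techo built >> build.log\n' > Makefile
echo "hello" > input.txt

make test.txt       # first build: the recipe runs
make test.txt       # target is already up to date: the recipe is skipped
make test.txt -B    # -B forces the recipe to run again

# Each recipe run appended one line to build.log.
runs="$(wc -l < build.log | tr -d ' ')"
echo "recipe ran $runs times"
```

The second call is the situation in which `make` reports that there is nothing to do, and the third call shows that `-B` rebuilds the file regardless.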