J. Elisabeth Wells, John Pearson, Chris Frampton, May 2010
The more carefully a research project is planned before it commences, the more smoothly it will run. These comments provide a very brief overview, supplemented by links to other sites (we did not want to reinvent the wheel with our advice).
Design and analysis
When you are designing your project you should seek biostatistical advice to ensure that your design is the most appropriate for the aims of your project. In addition you may need assistance with determining an appropriate sample size. It is very important to work out the main analyses which will be required, in relation to your aims. Do not do the study and then try to work out how to analyse it. The study design and the main analysis plan need to be worked through before you seek ethics approval or submit a grant application. Leave plenty of time for biostatistics consultation at this stage. In some projects, pilot work will be required to determine the most useful conditions to study. For example, in experimental work, pilot work may be needed to establish useful doses and durations or to see the replicate variability.
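As a rough illustration of what a sample size calculation involves, the sketch below uses the standard normal-approximation formula for comparing the means of two equal-sized groups, n = 2 * ((z_alpha + z_beta) * sd / delta)^2 per group. The function name and the numbers (a difference of 5 units, a standard deviation of 10) are hypothetical, and the formula covers only this one simple design, so treat it as a starting point for a conversation with a biostatistician rather than a substitute for one.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two means.

    delta: smallest difference in means worth detecting (hypothetical input)
    sd:    assumed common standard deviation (e.g. from pilot work)
    Uses the normal approximation, so it slightly underestimates the
    t-test-based answer for small samples.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired power
    n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2
    return math.ceil(n)


# Hypothetical example: detect a difference of 5 units when the SD is 10.
print(n_per_group(delta=5.0, sd=10.0))
```

Note how sensitive the answer is to the assumed standard deviation and difference: halving the difference to be detected roughly quadruples the sample size, which is one reason pilot estimates of variability matter.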
While working on the design and the main analysis plan, you should also begin to make decisions about data management (see below) and the procedure for analysis. If you are a student, you will have to analyse your own data (unless you report in your thesis that more complex analyses were carried out by a biostatistician, which should occur only in exceptional circumstances). Therefore students need to make decisions about which statistical package they will use for analysis. SPSS is provided free here, and R is freeware. SAS is also available. Biostatisticians will use whichever software they prefer, or that which is most suitable for the project.
Data management involves setting up a system for:
- recording data, checking and cleaning data entry
- recording meta-data (the data about what was measured, by whom, in what units and under what protocol, together with all the other information which you know during a project but which needs to be documented, both because you yourself will otherwise forget it and because your collaborators also need access to it).
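One simple way to record meta-data is a codebook with one row per variable, kept alongside the data. The sketch below is a minimal example under assumed conventions: the variable names, the codebook fields and the helper functions are all hypothetical, not a prescribed format.

```python
import csv
import io

# Hypothetical codebook: one row per variable in the data file.
CODEBOOK = [
    {"variable": "id", "label": "Participant identifier", "units": "",
     "measured_by": "study nurse", "protocol": "assigned at enrolment"},
    {"variable": "sbp_mmhg", "label": "Systolic blood pressure", "units": "mmHg",
     "measured_by": "study nurse", "protocol": "seated, mean of two readings"},
]


def write_codebook(rows, handle):
    """Write the codebook as CSV so it can be stored with the data."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def undocumented_columns(data_columns, codebook):
    """Return the data columns that have no entry in the codebook."""
    documented = {row["variable"] for row in codebook}
    return [name for name in data_columns if name not in documented]


# Write to an in-memory buffer here; in practice this would be a file.
buffer = io.StringIO()
write_codebook(CODEBOOK, buffer)
print(buffer.getvalue())

# A column that was added to the data but never documented is reported.
print(undocumented_columns(["id", "sbp_mmhg", "weight_kg"], CODEBOOK))
```

Running the check each time a column is added makes it harder for undocumented variables to accumulate.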
We recommend using a database, or direct entry into a statistical package which requires disciplined data entry (SPSS is one such package). Researchers commonly use Excel to store their data. Excel is available to all and very flexible, but it is that flexibility which means that the data entered may not be analysable without a great deal of time-consuming cleaning and adjusting. If you do use Excel, first read the University of Reading guide (URL below) called “Disciplined use of spreadsheets for data entry”, or the other two links given below.
The University of Reading Statistical Services Centre has produced a series of guides to design, project management (including data management), analysis, and the presentation of graphs and tables. These can all be downloaded free from: reading.ac.uk
The most relevant for the studies we see here are:
- Data management for experimental projects
This contains good overall advice for any research project, particularly more complex ones.
- Disciplined use of spreadsheets for data entry
This is a very thorough discussion which includes all sorts of clever ways of auditing data. However, if you are going to import data into a statistics package and use that as your archived master copy, then you will do the cleaning in the statistics package. Also, to import data into a statistics package, you will need to have your data on one sheet and your metadata on another sheet within the same workbook, not both on the same sheet as in this document. For a more basic introduction see the links below.
- Good graphs in Excel
This is clear and useful.
- Excel for statistics: tips and warnings
This provides strong reasons for using a proper statistics package to analyse your data. SAS, R or SPSS are all available.
The two links below cover preparing data in Excel for import into SPSS. However, most of the advice is relevant regardless of the statistics package or spreadsheet used. These two links are an excellent place to start, perhaps reading the second one first.
This is a clear one-page set of instructions with technical details on allowable names and the appropriate format for inputting dates. There is even an example Excel file which you can download and try importing.
This is a slightly humorous set of instructions which makes some good background points that are only implicit in the advice in ‘TUTORIAL-SPSS-Prepare-Data_Excel.htm’: for example, ‘Arrange the data in a rectangular grid’ and ‘Don’t mix strings and numbers’. It also has several screenshots showing the process of importing data into SPSS.
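The two pieces of advice quoted above, a rectangular grid and no mixing of strings and numbers within a column, can be checked mechanically before import. The sketch below assumes the spreadsheet has been exported as CSV with variable names in the first row; the function names and the sample data are hypothetical.

```python
import csv
import io


def is_number(text):
    """True if the cell can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_grid(handle):
    """Report rows of the wrong length and columns mixing strings and numbers."""
    reader = csv.reader(handle)
    header = next(reader)
    kinds = {name: set() for name in header}
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(f"row {line_no}: {len(row)} cells, expected {len(header)}")
            continue
        for name, cell in zip(header, row):
            if cell.strip() == "":
                continue  # blank cells are treated as missing values
            kinds[name].add("number" if is_number(cell) else "string")
    for name in header:
        if len(kinds[name]) > 1:
            problems.append(f"column {name}: mixes strings and numbers")
    return problems


# Hypothetical export with two typical problems.
SAMPLE = "id,age,sex\n1,34,F\n2,unknown,M\n3,51\n"
for problem in check_grid(io.StringIO(SAMPLE)):
    print(problem)
```

Here the word "unknown" typed into a numeric column and the short final row are both reported; in a real file the first would usually be better left blank, with the reason recorded in the meta-data.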
Data entry – checking and cleaning
Do set up a system for checking data entry. The best system is double entry followed by file comparison. Otherwise you will need to compare the electronic data with the originals, preferably with one person reading the original data and another looking at the electronic data. The least effective method is to check the electronic data against the originals yourself, but doing that is much better than not checking data entry at all.
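Double entry followed by file comparison can be sketched in a few lines. The example below assumes both entries have been saved as CSV files with the same columns and a unique identifier column called `id`; the function name, the column names and the sample values are hypothetical, and statistics packages and databases offer their own comparison tools.

```python
import csv
import io


def compare_entries(first, second, key="id"):
    """List every disagreement between two independently entered files.

    Returns tuples of (record id, field, value in first, value in second).
    """
    a = {row[key]: row for row in csv.DictReader(first)}
    b = {row[key]: row for row in csv.DictReader(second)}
    differences = []
    for record_id in sorted(set(a) | set(b)):
        if record_id not in a:
            differences.append((record_id, "<record>", "missing", "present"))
        elif record_id not in b:
            differences.append((record_id, "<record>", "present", "missing"))
        else:
            for field, value_a in a[record_id].items():
                value_b = b[record_id].get(field)
                if value_a != value_b:
                    differences.append((record_id, field, value_a, value_b))
    return differences


# Hypothetical first and second entry of the same paper forms.
ENTRY_A = "id,age,weight_kg\n1,34,70.5\n2,45,82.0\n3,51,64.2\n"
ENTRY_B = "id,age,weight_kg\n1,34,70.5\n2,54,82.0\n4,29,77.1\n"

for difference in compare_entries(io.StringIO(ENTRY_A), io.StringIO(ENTRY_B)):
    print(difference)
```

Each reported difference is then resolved by going back to the original record, not by guessing which entry is right.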
Carry out data cleaning after checking data entry. At a minimum, look at the distribution of each variable for extreme or invalid values. The book we recommend on SPSS, which is available in the Canterbury Medical Library, has good chapters on preparing a codebook (chapter 2), data entry (chapter 4) and data cleaning (chapter 5). Researchers tend to skimp on these aspects as they rush to find the results they are interested in. They then have to redo their analyses later when errors become apparent or, worse still, they continue in ignorance of the errors in their data.
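The minimum cleaning step, looking for extreme or invalid values, might be sketched as follows. The plausible ranges, variable names and records are hypothetical and would in practice come from your own codebook; any statistics package can produce the same summaries.

```python
from statistics import median

# Hypothetical plausible ranges, one (low, high) pair per variable.
LIMITS = {"age": (18, 100), "sbp_mmhg": (60, 260)}

# Hypothetical records containing two entry errors.
RECORDS = [
    {"id": 1, "age": 34, "sbp_mmhg": 120},
    {"id": 2, "age": 340, "sbp_mmhg": 118},
    {"id": 3, "age": 51, "sbp_mmhg": -1},
]


def summarise(records, variable):
    """Return (minimum, median, maximum) of the non-missing values."""
    values = [r[variable] for r in records if r.get(variable) is not None]
    return min(values), median(values), max(values)


def out_of_range(records, limits):
    """Return (id, variable, value) for every value outside its plausible range."""
    flagged = []
    for record in records:
        for variable, (low, high) in limits.items():
            value = record.get(variable)
            if value is not None and not (low <= value <= high):
                flagged.append((record["id"], variable, value))
    return flagged


for variable in LIMITS:
    print(variable, summarise(RECORDS, variable))
for flag in out_of_range(RECORDS, LIMITS):
    print("check against original:", flag)
```

A flagged value is not necessarily wrong; it is a prompt to go back to the original record before deciding whether to correct it.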
Pallant, J. SPSS Survival Manual (3rd edition). Crows Nest, NSW, Australia: Allen & Unwin, 2007