兔子先生

Skip to Main Content

Useful Apps for Statistics

Most students collect and manage their data in Google Sheets or Excel. These spreadsheet programs can be useful in calculating descriptives statistics (averages, range, standard deviation) and can also be used to calculate some basic inferential statistics (t-test, chi-square, etc.). However, the overall functionality is somewhat limited. Therefore, it is worth knowing about some alternatives: 

  • - a free webpage with many statistical tests that you can run. Very user friendly. Provides examples of how to report data for many tests and also tests assumptions (normality). 
  • Link to 兔子先生 University software downloads- 
    • Microsoft Excel, along with all Microsoft Office 365 apps (Powerpoint, MS Word), is available free for all 兔子先生 students.
    • - this is a large install file and SPSS is a major statistical program. We only recommend downloading if you are planning on using advanced statistics regularly. 
- Simply known as “R”, this is the industry standard for running statistics. It is somewhat complex to download and utilize because it mostly requires coding of statistical hypotheses. Nevertheless, R is worth knowing about and may be worth learning if you are envisioning a career in academia or statistics.

Getting Started With Spreadsheets and Descriptive Statistics

As you get started, you want to organize your data in your spreadsheet and report your descriptive statistics. 

Video Resources 

  • (7:01) - Using Google Sheets to calculate Sum, Average, Max, and Min.
  • (9:00) - Remove duplicates, Remove white space, Find and replace, and concatenate functions.
  • (7:26)- Pivot tables are critical when you have a large dataset with dozens or hundred of rows. Knowing how to use a pivot table can be a very useful skill. 
  • (7:57) - Details how to freeze your top row in a dataset with many rows, paste special-transposed (rotating data), and more. 
  • (2:32) - Using spreadsheets to caculate basic measures including Mean, Median, and Mode.
  • (8:24) - How to use speadsheest to caculate Range and Standard Deviation; Also demonstrates the math behind Standard Deviation. 
  • (1:10) - Simple explanation of how to write up descriptive statistics in your research paper. 

What if I feel I have outliers in my data?

Did you have a couple data points that had a large influence on your result? Are these data points indicative of the data (should be included) or are they true outliers (inaccurate or not indicative of your data)? If you have a valid reason for doing so, you can remove outliers but you must explain that you did this and for what reasons.

Inferential Statistics, Interpreting P-Values, and What Test to Use

When it comes to generating a P-value, one of the most challenging aspects is determining what statistical test to use. This often relates to your hypothesis and goals, but also the structure of the data you collected. The following video resources can help you identify what statistical test is appropriate for your data. 

Video Resources 

  • (17:36) - Start here if you are unclear on what a p-value is or why it is important.
  • (11:06) - Walks through how to look at your data and determine what statistical test may be most appropriate. 
  • (4:19) - Understanding the types of variables (categorical, numerical, etc.) is critical to determining what statistical test you should use.
  • (3:51) - A common question is “How many samples should I take?” This video provides some insights on how to assess that question for the data you are collecting.

Comparing Two (Or More) Groups with Numerical Data

When comparing two groups, you are most likely going to be using a t-test (or non-parametric alternative) or a chi-square test. Below are links with details on how to use these tests. 

If your data looks like the table below, with two groups and numerical data as replicates, you will likely use a T-Test:

Table 1. Example data table for which you would likely use a T-Test. Note that you may have many rows of data and you do not have to have an equivalent number of rows in each column (it would be ok to have 7 rows of data under Group 1 and 9 rows of data under Group 2, for example).  

Group 1

Group 2

12

8

5

3

6.4

12

Etc. 

etc.


Realworld examples: 1) You are comparing bee abundance on native versus non-native flowers;  You are evaluating a Pre- and Post-test of student scores or participant opinions; 2) You are assessing the number of trash items on beaches (Category 1) compared to in parks (Category 2). 

How to run a paired t-test if my data are not normally distributed?

One option is to use a “Signed ranks test”, also called a “Wilcoxon Signed Ranks Test”. See details below on how to conduct this test.

Should I use a “one-tail” or “two-tail” T-test?

A 2-tailed test is for when you think there will be a difference but you are not sure in what direction (group 1 might be greater than group 2 OR group 2 might be greater than group 1). A 1-tailed test is for when you have reason to believe that one group will be greater than the other (e.g., you believe period 4 will have higher scores than period 7 for a specific reason). You get to decide whether to use the 2-tail or 1-tail. This is your call as experimenter based on your hypothesis.

Checking Normality of your Data

Before using a t-test or ANOVA, you are expected to check the normality of your data. In statistics, “normality” refers to the degree to which your data has most values centered around the mean, with fewer extreme high or low values. This is also known as a “bell-curve” histogram, where most values in the dataset are near the mean and fewer values are at the upper or lower ranges. Technically, you should only use a t-test or ANOVA if your data is normally distributed although some statisticians content that both tests are robust to slight deviations from normality. 

How do I test my data for normality?

You can plot a histogram of the data and see if it generally looks like a bell curve. However, this is a bit subjective. A more common method is to use the Shapiro-Wilks test which evaluates if a dataset significantly deviates from a normal distribution. 

Video Resources 

  • (5:08) - Example uses fabricated pre and post test student scores. The hypothesis is that student scores increased. 

What if my data is not normally distributed? One option is to uniformly transform the data so it more closely resembles a normal distribution. For example, you could take the log10 of each value in your dataset and then reevaluate normality. 

Video Resources

  • (3:15) - How to apply a uniform mathematical transformation to all data points to possibly improve normality.  
  • (11:07) - How to create a histogram of your data to visually assess normality. 

Non-parametric Tests (for when your data is not normally distributed)

If transformations do not work, you can use a non-parametric statistical test. Below is a table with the non-parametric alternatives of a few statistical tests. 

Table 2. Examples of parametric tests and their non-parametric alternatives. 

Parametric test (use if data is normally distributed)

Non-parametric alternative (use when data is not normally distributed)

Unpaired T-test

Mann-Whitney Test

Paired T-test

Wilcoxon Sign-Rank Test

ANOVA

Kruskal-Wallis Test

Video Resources 

  • (7:05) - Example demonstrating t-test variations using Google Sheets.
  • (5:08) - Example using fabricated pre and post test student scores. 
  • (4:35) - Examples shows how to run a Wilcoxon sign rank test (non-parametric alternative to t-test) because the data is not normally distributed. 
  • (4:06) - Example uses a Mann-Whitney test (non-parametric alternative to t-test) because the data were not normally distributed.

In running a Shapiro-Wilks test, do I need to run each column of data through the calculator individually? E.g. calculate for one category, then for the second category, then for the third category. Or should I put all three columns of data into the calculator at once to check for normal distribution?

Yes, you check each numerical variable individually. If one deviates from normality, you can try a transformation or you can shift to a non-parametric alternative. Another option is to go ahead with a parametric test as some people contend they are robust to some deviations from normality.

How do I analyze pre-post test score data? I have students taking a test before and after an inquiry lesson and want to see if there is a significant difference.

If your subjects are the same (same students take the pre and post test) you could do a "paired t-test". The way to set this up in excel is to have your first column be student names, your second column be scores from first test, and your third column be scores from second test. You then type "=t.test" in a blank cell and a few options will come up (this formula "t.test" may be slightly different depending on what version of excel you have). You want to choose the "arrays" which just means highlighting scores for test 1 for array 1, and scores for test 2 for array 2. Then you have to choose "1-tail" or "2-tails". Choose 1-tail if you have a hypothesis that scores will specfically go up or will go down. Choose 2-tails if you are not sure if they will go up or go down, you just think they will be different in either direction. In most cases, you will probably choose 1-tailed because you probably thought the scores would go up which is why you are testing this. Finally, you choose "1" for the type of T-test. 1 should equal "paired" t-test. A paired t-test just means that the test realizes that each row is linked or paired (in this case they are linked because they are the same student in each row). Thus, one student could go from a 65 average to a 75 and that would be a nice increase. Another student could go from a 90 to 100 and that is the same increase for that student.

It wont make sense until you try it! :-) Putting fake data in and testing this is the way to understand. The formula will provide a p-value which is the probability that any difference is due to chance alone. The lower the p-value the more likely it is that there is a true difference in test scores. If the p-value is less than 0.05 you can reject your null hypothesis (no difference in test scores) and say that the difference is significant.

How should I report a t-test? I do not see the “t-stat” in Excel or Google Sheets, just the p-value.

It does not seem like Microsoft Excel or Google Sheets report the t-statistic and it is not critical that you report this in every paper. However, for those that are interested, it is fairly easy to calculate this yourself. In the numerator you have the mean of the first group minus the mean of the second group. In the denominator you have the square root of (the standard deviation of the first group divided by the sample size of the first group) + (the standard deviation of the second group divided by the sample size of the second group)). You can visualize the formula here - 

Before computers, folks would calculate the "t-statistic" using the above formula, then they would go to a large table in the back of a statistics textbook and look for their t-statistic given the sample size to see if their data was significantly different or not. With the advent of computing it was possible to just directly get the p-value so t-stat is less important but it is still common practice to report it.

You also want to report the sample size of both groups and you can do this in a few ways. You could say something like- “Although scores were higher for group A (Mean = XX; Standard deviation = XX, N = XX) than group B (Mean = XX; Standard deviation = XX, N = XX), the difference between these groups was not significant (t = XX, p = 0.XX).

What if my data are not “significantly different”?

This is common! First, it is important to realize that something can fail to be “statistically significant” but still be “biologically significant”. For instance, initially a species may be declining in population number but it may be such a slow decline with so much variation that it is not “significant”. However the “decreasing trend” is still very biologically relevant. Also, if you have a p-value that is greater than 0.05 but still lower than say 0.15 some researchers would discuss this as being “statistically insignificant” but still “suggestive of a trend” that might be significant if you had a larger sample size or controlled for some additional confounding variables. Instead of simply declaring  "there was not a significant difference (p = 0.11)" it might be more appropriate to say "Given the available sample size and variation, it was not possible to detect a significant difference (p = 0.11)."

Also, sometimes, we have low "statistical power" which means we have a low likelihood of detecting a significant difference even if one exists. Low statistical power can result from low sample size and large variation in the groups (high standard deviation). It is possible to calculate statistical power using fancier statistical programs. The main “take home point” is that a p-value >0.05 really indicates no significant difference that we could detect with the available data. A larger sample size might result in a significant difference. 

Finally, if you find no significant difference despite being satisfied with your statistical power (you had good sample size, not much variation), you should not just give up. Now you can consider other hypotheses. This is what science and inquiry is all about-- it is a continual investigation where one piece of evidence leads to another question.

What if I have three (or more) groups? If you have a third Group (or more), and your data is organized as shown in Table 1 above (but with more columns), you can use an ANOVA (Analysis of Variance) or the non-parametric alternative, a Kruskal-Wallis Test.

Video Resources 

  • (10:49)- Understanding the term “omnibus test” and how to run an ANOVA using statskingdom.com. Example compares CO2 emissions per capita compared in countries classified as “Low”, “Medium”, and “High” gross domestic product.
  • (2:16) - There are style guides such as APA that provide examples of how to write up research results. However, sometimes the best guide is going to the published primary literature. 

Comparing Two (or more) Groups with Categorical Data

If your data looks like the table below, where you have at least two groups and two categories and have counted up the frequency of occurrences in each cell, you will likely use a chi-square test.

Table 3. Example contingency data table for which you woud use a chi-square test. 

Category 1

Category 2

Group 1

35

22

Group 2

11

6


Realworld examples: 1) You are interested in comparing the percentage of dogs with aversive vs. non-aversive collars. You randomly observe 100 dogs and note that 86 of 100 are wearing non-aversive collars. You compare that to an “expected” equal percentage of 50 aversive and 50 non-aversive to see if 86 out of 100 is significantly different than the null hypothesis of no difference. 2) You are comparing the racial demographics of those participate in a citizen science program with the racial demographics of the city where the program is located. 

What if I have three (or more) groups or categories? The chi-square test can accommodate more than two groups and more than two categories. The above table is known as a Contingency Table. Because it has two categories and two groups, it is known as a 2 x 2 contingency table. However, depending on your data, you might have a 2 x 3 contingency table or other sizes (3 x 2, 3 x 3, 2 x 4, etc.). 

Video Resources 

  • (5:36) - Example looks at number of people interacting with nature, versus not directly interacting with nature, in different city parks. 
  • (14:40) - By Craig Beals, Dragonfly graduate and current instructor, this video provides an overview of the chi-square test for determining if observed frequencies significantly differ from expected frequencies.
  • (13:22) - Dragonfly graduate, Katie Dell, discusses how to setup a Pearson’s chi-square test of independence. 

I need to analyze data from a survey with many questions. Should I analyze each question separately or take an average of all questions?

This is up to you as the researcher. You could create one or more metrics of by combining responses from similar question types and then test if those metrics change as a whole; you could also take a single average of all responses. Or you could test each question separately. This really depends on your hypothesis-- do you think individual responses to different questions will change in isolation of each other or do you think all will increase together?

Important-- the more tests you run to see if there is significance, the more likelihood that one will just be significant by chance. If you think about it, we accept a p-value <0.05 as being "significant" because a 5% chance that the results are different due to chance seems low. However, each time you run a statistical test you inflate the chance that a result will appear significant (p<0.05) simply due to chance. Think about it like this-- if you roll a 20 sided die, you have a 5% chance of landing a 1. However, if you roll the 20 sided die twice, you now have a 10% chance of landing at least one 1. This may be confusing (or may not) but the main idea is to be careful and skeptical of low p-values if you run many tests. 

I did an observational study on the types of dog collars people place on their dogs. My hypothesis was more dogs would be equipped with non-aversive collars than aversive. I figured out I needed to have 384 data points in order to have a confidence level of 95% while accepting a margin of error of 0.05. The comparison revealed a higher percentage of dogs (86% [n=331} vs. 14% [n=53] were observed wearing non-aversive collars. From this point, I am not sure how to do the analysis. Are you able to assist?

You could do a chi-square test on this data. If you assume that the random proportion of aversive vs. non-aversive would be roughly 50/50 (this is your null hypothesis), you can compare your observed data 331/53 to what would be expected if all was random (192/192). To set this up, go into excel or Google Sheets and enter 192 in cell 1A, 192 in cell 1B, 331 in cell 2A, and 53 in cell 2B. Then, in any blank cell, type "=CHISQ.TEST" and highlight each column (different versions of excel may have slightly different names for this test). You will be given a p-value (probability) of the observed difference being different than that expected by random chance alone. Looking at your data, it seems highly likely that you will have a very low p-value and it will be significant.

What do statistics tell us about causality?

Usually nothing. Statistics typically just identify a pattern. It is up to you to then discuss if this is due to specific mechanism or not. For instance, a significant difference between the number of Cownose Rays in sandy versus rocky habitat DOES NOT prove that it is due to better camouflage in the sandy habitat. It just tells you there is a difference. The causal factor for the difference could be food availability, predator avoidance, microclimate factors, or other factors.

Comparing species abundance, richness, and diversity

A relatively common question in ecological research is how the number of species differ through time or across land use types. Common measures of species include:

  • Species abundance - the number of individuals of a species in an area
  • Species richness - the number of different species in an area
  • Species diversity - a measure of the balance or evenness of different species abundances in an area

Before you start your study, it is important to specify which measure(s) you are most interested in evaluating. The videos below clarifies the differences between abundance, richness, and diversity. They also demonstrate how to calculate the Shannon-Weiner Index of the diversity, a useful statistic when comparing the biodiversity of two populations. 

Video Resources 

  • (4:22) - Shows the differences in these measures and how to calculate Shanon-Weiner Index. 
  • (8:33) - Goes into more detail on how to perform Shannon's diversity index using excel. 

I am comparing biodiversity of fish in two bodies of water. What is a good metric to use?

You can compare abundance, species richness, and/or Shannon-Weiner Diversity for the two water bodies (see video above for differences in measuring each of those metrics. Note that this will give you one measure for each water body so you will not be able to use a t-test or other inferential statistics. To analyzed inferential statistics on the two bodies of water, you would need replicates at each site (multiple samples at different times) or more lakes to compare. That can be a lot of work! 

Data Visualization (Making Professional Figures and Graphs)

Once you have run your statistical tests, you need to create visuals to demonstrate the findings. Excel and Google Sheets have numerous chart making capabilities. Creating a clear and professional graph or figure is a key skill to develop. The Purdue OWL (Online Writing Lab) including Tables and Figures. The video resources below also will be helpful as you consider the best Figure and Table options for your work. 

Video Resources 

  • (part 1) (3:10) - Discusses differences between Figures and Tables.
  • (part 2) (5:31) - How to write clear, professional captions. 
  • (2:42) - Be creative and embrace multi-panel Figures. You can create these in powerpoint or google slides. 
  • (4:49) - These simple, small improvements can make your data table look much more professional. 
  • (7:09) - This is useful if you ran an ANOVA test and also a posthoc pairwise test (e.g., Tukey’s test) to see what pairs were significantly different. 
  • (3:40) - When and when not to use a pie-chart; The importance of scaling on correlation scatterplots. 

Dragonfly Introduction to Statistics - Free Online Course

Need more guidance? We have developed a free and entirely optional that all Dragonfly students and alumni can self-enroll in. The course is open year round and you can start at anytime. There currently are three Modules: 

  • Module 1: Descriptive Statistics - General principles of experimental design, when to use median vs. mean, how to calculate standard deviation, how to write up descriptive statistics in your papers. 
  • Module 2: T-Tests, Normality, and P-Values - includes paired vs. unpaired t-tests, Mann-Whitney Test, Wilcoxon Sign Rank Test and more. 
  • Module 3: Beyond the T-Test, Other Inferential Statistics - includes one-way ANOVA, chi-square tests, correlation/regression and more. 

You can go work at your own pace following along with short video tutorials to make the calculations in your own spreadsheet. We estimate that each Module will take 1.5-3 hours to complete. The modules provide an introduction only and is designed to generally improve or refresh your knowledge about statistics. As a small incentive, successful completion of each module results in a digital Badge for that Module. :-) 

To get started, you can self-enroll here: 

Additional Reading

Pfannkuch, M., Ben-Zvi, D. "Chapter 31: Developing Teachers' Statistical Thinking." Teaching Statistics in School Mathematics - Challenges for Teaching and Teacher Education. A joint ICMI/IASE Study, The 18th ICMI Study, Springer, 2011.

Gotelli and Ellison. "Chapter 6: Designing Successful Field Studies", A Primer of Ecological Statistics. Sunderland, Massachusetts: Sinauer Associates, 2004.

Hurlbert, S.H. "Pseudoreplication and the Design of Ecological Field Experiments." Ecological Monographs. 54.2 (1984): 187-211. ***Classic paper, Worth a read if you plan on doing ecological surveys!***

Platt, J.R. "Strong Inference: Certain systematic methods of scientific thinking pay produce much more rapid progress than others." Science. 146 (1964): 347-353. ***This is a classic paper, Worth a read!***

The Research Methods Knowledge Base was created in 2006 by William M.K. Trochim, a Professor in the Department of Policy Analysis and Management at Cornell University. It includes helpful introductory information on social research methods including measurements, data analysis, sampling and more: http://www.socialresearchmethods.net

 Qualitative Research: Studying how things work by Robert Stake. The Guilford Press, 2010. ***This book makes the bold (and refreshing) assertion that all research is both qualitative and quantitative, whether we admit it or not. It also gives one of the clearest explanations of the different approaches and the general concept of epistemology. There is also a chapter on Action Research. The great thing about this book, and many others, is that it is freely available as an ebook from 兔子先生 library. Just search for it at <http://www.lib.miamioh.edu/> ****

Patton, M. Chapter 3: Qualitative Research & Evaluation Methods. Sage Publications Inc., 2002.

Bateson, P. & Martin, P. (2007). Chapter 5. In Measuring behaviour (pp. 48-61). New York: Cambridge University Press.

Chamberlin, T.C. "The Method of Multiple Working Hypotheses: With this method the dangers of parental affection for a favorite theory can be circumvented." Science. 148 (1965): 754-759.***Classic paper, Worth a read!

*Cobb, P., Confrey, J., diSessa, A., Lehrer, R., and Schauble, L. "Design experiments in educational research." Educational Researcher. 32.1 (2003): 9-13.

Cox, G.W. (1985a). Exercise 10. In Laboratory manual of general ecology (pp. 60-68). Dubuque, IA: Wm. C. Brown Publishers.

Cox, G.W. (1985c). Exercise 29. In Laboratory manual of general ecology (pp. 163-167). Dubuque, IA: Wm. C. Brown Publishers.

Henderson, P.A. (2003a). Chapter 6. In Practical methods in ecology (pp. 83-94). Malden, MA: Blackwell Science.

Henderson, P.A. (2003b). Chapter 7. In Practical methods in ecology (pp. 95-99). Malden, MA: Blackwell Science.

McDonald, J.H. (2014). Types of biological variables. In Handbook of biological sciences (3rd ed.) (pp. 6-13). Baltimore, MD: Sparky House Publishing. Retrieved from http://www.biostathandbook.com/variabletypes.html.

McDonald, J.H. (2014). Choosing a statistical test. In Handbook of biological sciences (3rd ed.) (pp. 293-296). Baltimore, MD: Sparky House Publishing. Retrieved from http://www.biostathandbook.com/testchoice.html. 

Provenzo Jr, E., Ameen, E., Bengochea, A. "Photography and Oral History as a Means of Chronicling the Homeless in 兔子先生: The StreetWays Project." Educational Studies. 47.5 (2011): 419-435.

Thomson, G., J. Hoffman and S. Staniforth. Measuring the Success of Environmental Education 兔子先生. Sierra Club of Canada, Canadian Parks and Wilderness Society, Global Environmental and Outdoor Education Council, 2010.

MEERA website (My Environmental Education Research Assistant),an initiative out of the University of Wisconsin-Stevens Point, which for quite awhile was about the only environmental education organization even talking about evaluation. This site offers some great basics and leads you through conducting an evaluation. http://learningstore.uwex.edu/assets/pdfs/G3658-2.PDF NAAEE

Measuring Environmental Education Outcomes

Contact Project Dragonfly

111 Upham Hall
Oxford, Ohio
Est. 1994