Homework 3

Author

Sami Gibson

Published

March 4, 2005

Set-up

suppressPackageStartupMessages({
  library(tidyverse) # reads in tidyverse package
  library(here) # reads in here package
  library(janitor) # reads in janitor package
  library(readxl) # reads in readxl package
  library(performance)}) # reads in performance package
# Stores the salinity data as an object called salinity
salinity <- read.csv(here('data',"salinity-pickleweed.csv")) 
# Stores my personal data as an object called my_data
my_data <- read.csv(here('data',"193DS data - stress (5).csv"))

Here is my GitHub Repository

Problems

Problem 1. Slough soil salinity

You are working at a restoration site where you are managing planting of California pickleweed (Salicornia virginica) along a brackish slough (i.e. there is a mixture of fresh water and salt water).

You decide to measure plant growth for individual pickleweed plants by plucking an individual out of the ground and measuring the biomass (in g). You also measure salinity (as electrical conductivity in units of millisiemens per centimeter, or mS/cm) at the location in which the individual was growing. Admittedly, this isn’t a perfect study, but it’s what you can do with the time and resources you have!

a. An appropriate test

In 1-3 sentences, name the appropriate test(s) to determine the strength of the relationship between salinity and California pickleweed biomass (hint: there are two). Describe the differences between the two tests.

Be specific in your response to demonstrate your understanding of the variables in this question.

An appropriate parametric test is Pearson’s correlation coefficient (r), which assesses the strength and direction of a linear relationship between two continuous variables (soil salinity (mS/cm) and California pickleweed biomass (g)) assuming approximately normal distributions and independent observations. A non-parametric alternative is Spearman’s rank correlation (\(\rho\)), which evaluates the strength of a monotonic relationship using ranked values and does not require normality. These tests quantify how strongly variation in salinity is associated with variation in individual pickleweed biomass across the restoration site.
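To make the difference between the two coefficients concrete, here is a minimal sketch using made-up numbers (not the pickleweed data). The vectors `x` and `y` are hypothetical and have a perfectly monotonic but curved relationship. Because Spearman's rho is computed on ranks, it should equal 1 here, while Pearson's r is pulled below 1 by the curvature.

```r
# Hypothetical toy data (NOT the salinity data): y always increases with x, but not linearly
x <- 1:10
y <- exp(x)

# Pearson's r measures linear association on the raw values
pearson_r <- cor(x, y, method = "pearson")

# Spearman's rho measures monotonic association
spearman_rho <- cor(x, y, method = "spearman")

# Spearman's rho is the same as Pearson's r computed on the ranks
rho_from_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(pearson = pearson_r, spearman = spearman_rho, from_ranks = rho_from_ranks))
```

In this toy case the two coefficients disagree because the relationship is monotonic without being linear, which is the situation where Spearman's rho is the safer summary.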

b. Create a visualization

Create a visualization that would be appropriate for showing the relationship between soil salinity (in mS/cm) and California pickleweed biomass (in g).

In addition to using the correct geometries, be sure to:

relabel the x- and y-axes and include units
use different colors from the ggplot() defaults
use a different theme from the ggplot() default

ggplot(data = salinity, # uses salinity data
       aes(x = salinity_mS_cm, # x is soil salinity
           y = pickleweed)) + # y is pickleweed biomass
  geom_point(color = 'magenta') + # changes color from default
  labs(title = "California pickleweed biomass vs. soil salinity", # creates title
    x = "Soil salinity (mS/cm)", # relabels x-axis
    y = "CA pickleweed biomass (g)") +  # relabels y-axis
  theme_minimal() # changes from default theme

c. Check your assumptions and run your test.

In the order that is appropriate, create separate sections using subheaders to:

check your assumptions
run your test

In each section, write the code to check your assumptions and run your test as you see fit.

In the section in which you check your assumptions, write 1-3 sentences describing:

which assumptions you checked
how you checked your assumptions
your assessment of your assumption checks

Part 1: Checking My Assumptions

# Assumptions:
#   1. Linear relationship between variables
ggplot(data = salinity, # uses salinity data set
       aes(x = salinity_mS_cm, # x-axis is salinity
           y = pickleweed)) + # y-axis is pickleweed biomass
  geom_point(color = 'magenta4', # sets point color
             alpha = 0.8, # sets point transparency
             size = 2) + # sets point size
  labs(x = "Soil salinity (mS/cm)", # creates x-axis label
       y = "Pickleweed biomass (g)", # creates y-axis label
       title = "Pickleweed biomass vs. soil salinity") + # creates title
  theme_minimal() # cleaner theme

#   2. Variables are continuous
str(salinity) # confirm salinity_mS_cm and pickleweed are numeric (continuous)

'data.frame':   23 obs. of  2 variables:
 $ salinity_mS_cm: num  6.58 9.23 4.25 9.26 0.34 6.59 2.26 3.31 1.24 4.91 ...
 $ pickleweed    : num  25.65 38.44 11.84 19.88 9.97 ...

#   3. Variables are normally distributed (this is up for debate)
ggplot(data = salinity, # uses salinity data frame
       aes(sample = salinity_mS_cm)) + # uses salinity_mS_cm variable
  geom_qq_line(color = "green") + # adds green theoretical normal reference line
  geom_qq() + # adds points 
  labs(x = "Theoretical quantiles (standard normal)", # creates x-axis label
       y = "Sample quantiles of soil salinity (mS/cm)", # creates y-axis label
       title = "Normal Q-Q plot of soil salinity") + # creates title
  theme_minimal() # cleaner theme

ggplot(data = salinity, # uses salinity data frame
       aes(sample = pickleweed)) + # uses pickleweed variable
  geom_qq_line(color = "green") + # adds green theoretical normal reference line
  geom_qq() + # adds points 
  labs(x = "Theoretical quantiles (standard normal)", # creates x-axis label 
       y = "Sample quantiles of pickleweed biomass (g)", # creates y-axis label
       title = "Normal Q-Q plot of pickleweed biomass") + # creates title
  theme_minimal() # cleaner theme

#   4. Independent observations: assumed because each observation 
#      represents a different plant sampled once (no repeated measures)

I visually checked the assumption of a linear relationship between the variables using a scatterplot (the relationship appeared generally linear). I checked the Pearson correlation assumptions of continuous variables using the str() function (both variables are of type numeric) and of normally distributed variables using Q–Q plots (points fall close to the reference line). Independent observations are assumed based on the sampling design (each plant was measured once, at a distinct location), so I concluded that the assumptions for Pearson’s r are reasonably met.
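As an optional numerical complement to the Q-Q plots, a Shapiro-Wilk test could be run on each variable. This was not part of the checks described above. The sketch below uses simulated stand-in vectors (`sim_salinity` and `sim_biomass` are hypothetical names with invented values, not the real data) only to show the call; with the real data, `salinity$salinity_mS_cm` and `salinity$pickleweed` would take their place.

```r
set.seed(193) # arbitrary seed so the simulated example is reproducible

# Simulated stand-ins for the two variables (23 observations, like the data set);
# the means and standard deviations are invented for illustration
sim_salinity <- rnorm(23, mean = 5, sd = 3)
sim_biomass  <- rnorm(23, mean = 20, sd = 8)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution
sw_salinity <- shapiro.test(sim_salinity)
sw_biomass  <- shapiro.test(sim_biomass)

# A large p-value means no evidence against normality (it does not prove normality)
print(c(salinity_p = sw_salinity$p.value, biomass_p = sw_biomass$p.value))
```

With only 23 observations the test has limited power, so it is best read alongside the Q-Q plots rather than instead of them.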

Part 2: Running My Test

# Parametric Pearson's correlation test
cor.test(salinity$pickleweed, # pickleweed biomass variable from salinity df 
         salinity$salinity_mS_cm, # salinity variable from salinity df
         method = "pearson") # specifies Pearson (linear assumption met)


    Pearson's product-moment correlation

data:  salinity$pickleweed and salinity$salinity_mS_cm
t = 2.8979, df = 21, p-value = 0.008605
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1568265 0.7757682
sample estimates:
      cor 
0.5344778
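
The numbers in this output are linked by standard formulas, which can be checked by hand. The sketch below recomputes the t statistic, the p-value, and the confidence interval from the reported r and df. It assumes the usual t-test for a correlation and the Fisher z-transformation interval described in the `cor.test()` documentation for the Pearson method.

```r
# Values copied from the cor.test() output above
r  <- 0.5344778 # sample Pearson correlation
df <- 21        # degrees of freedom (n - 2)
n  <- df + 2    # number of plants (23)

# Test statistic: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval via Fisher's z-transformation
z  <- atanh(r)        # transform r to the z scale
se <- 1 / sqrt(n - 3) # standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se) # back-transform to the r scale

print(round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4))
```

The recomputed values should agree with the printed output up to rounding, which is a quick way to confirm that the summary in the write-up (r, t, df, p) was copied correctly.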

d. Write about your methods and results.

In 1-3 sentences each, write about:

which test you used, and why

I used Pearson’s product–moment correlation to assess the strength and direction of the linear relationship between soil salinity (mS/cm) and pickleweed biomass (g), since both variables are continuous. After checking my assumptions, I determined that the Pearson assumptions of a linear relationship between variables, continuous variables, normally distributed variables, and independent observations were all reasonably met. This made Pearson’s correlation appropriate for evaluating this linear relationship.

your interpretation of your test (along with the appropriate summary of the test in parentheses)

I found a moderate, positive linear relationship between soil salinity (mS/cm) and California pickleweed biomass (g) (Pearson’s r = 0.53, t(21) = 2.9, p = 0.009, \(\alpha\) = 0.05). This indicates that pickleweed biomass tends to increase as salinity increases, suggesting that pickleweed individuals may perform better under higher salinity conditions within the range sampled.

e. Write about the implications of your test.

You’re working on a team of people at this restoration site who are also concerned about pickleweed planting. In 2-3 sentences, write what you would communicate to them about the results of this test and what it means for pickleweed planting success at your site.

Be cognizant of your audience as you are writing: what would they need to know to take action?

My analysis indicated a moderate positive relationship between soil salinity and pickleweed biomass (Pearson’s r = 0.53, t(21) = 2.9, p = 0.009, \(\alpha\) = 0.05), meaning that plants in areas with higher soil salinity tended to be larger. This suggests that pickleweed is performing well under the more saline conditions typical of the brackish portions of the slough. Prioritizing planting in areas with moderate to higher salinity may improve establishment and growth.

f. Double check your own work.

In part a, you outlined two potential tests to answer this question about the strength of the relationship between soil salinity and pickleweed biomass. In part c, you chose a test, checked your assumptions, and ran one.

Try running the other test you listed in part a. Include the annotated code and output.

# Non-Parametric Spearman's correlation test
cor.test(salinity$pickleweed, # pickleweed biomass variable from salinity df 
         salinity$salinity_mS_cm, # salinity variable from salinity df
         method = "spearman") # specifies Spearman (no linear assumption)


    Spearman's rank correlation rho

data:  salinity$pickleweed and salinity$salinity_mS_cm
S = 824, p-value = 0.003426
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5928854
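
The S statistic and rho in this output are two views of the same quantity. As a quick check, the sketch below recovers rho from S using the standard formula, assuming there are no tied values in either variable (n = 23 comes from the str() output earlier).

```r
# Values copied from the cor.test() output above
S <- 824 # sum of squared differences between the two sets of ranks
n <- 23  # number of plants

# Without ties, Spearman's rho follows from S as: rho = 1 - 6 * S / (n * (n^2 - 1))
rho <- 1 - 6 * S / (n * (n^2 - 1))

print(round(rho, 7))
```

A smaller S means the two rankings agree more closely, so rho moves toward 1.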

In 1-3 sentences, describe whether or not the two tests would have led you to make the same decision (about the null hypothesis) and interpret the results the same way (about the relationship between soil salinity and pickleweed biomass).

In your description, be specific about the tests, their components, and their relation to the variables.

Yes, both Pearson’s r and Spearman’s \(\rho\) lead me to reject the null hypothesis of no association (Pearson’s r = 0.53, t(21) = 2.9, p = 0.009, \(\alpha\) = 0.05; Spearman’s \(\rho\) = 0.59, S = 824, p = 0.003, \(\alpha\) = 0.05). Both tests indicate a moderate positive relationship between soil salinity (mS/cm) and pickleweed biomass (g), meaning that higher salinity is associated with higher biomass. The Spearman estimate is slightly larger, which is consistent with a relationship that is monotonic but not perfectly linear, since Spearman’s \(\rho\) is rank-based and therefore less sensitive to outliers and to mild violations of the Pearson assumptions.

Problem 2. Personal data

a. Updating your visualizations

Revisit the visualizations you created in homework 2.

Provide the code and output for updated plots with your most recent observations.

Note: if you think that a different plot type or different variables would be more interesting to visualize, then change your plots!

NOTE: I am changing my response variable to time slept and my main predictor to sleep efficiency, with time spent stressed remaining as an additional potential predictor.

You should have the annotated code and output for two plots.

For each plot, be sure to:

label the x- and y- axes and provide units
include the date of the most recent observation as a subtitle
clean up the visual clutter (e.g. grids, backgrounds)
use colors that are different from the ggplot() defaults

Visualization 1 (categorical predictor variable)

# Clean data
my_data_clean <- my_data |> # starts with the raw personal dataset
  clean_names() |>  # standardizes column names (lowercase + underscores)
  # converts day_of_week to an ordered factor + display days in order (Mon–Sun)
  mutate(day_of_week = factor(day_of_week, 
                              levels = c("Monday","Tuesday","Wednesday","Thursday",
                                         "Friday","Saturday","Sunday"))) |>
  # convert the column to Date format with the correct format string
  mutate(date = as.Date(date_mm_dd_yyyy, format = "%m/%d/%Y")) |>
  # converts sleep efficiency to a numeric proportion (0-1)
  mutate(sleep_efficiency = parse_number(as.character(sleep_efficiency)) / 100)


# Collect the most recent date for the subtitle
most_recent_date <- my_data_clean |> # uses cleaned data set     
  summarise(max_date = max(date)) |> # computes most recent obs date
  pull(max_date) # extracts most recent obs date

# Defines a named vector of colors, one for each day of the week in the plot
week_colors <- c(
  "Monday"    = 'coral',
  "Tuesday"   = 'steelblue',
  "Wednesday" = 'lavender',
  "Thursday"  = "pink",
  "Friday"    = 'limegreen',
  "Saturday"  = "yellow",
  "Sunday"    = "lightblue") 
# ggplot base layer
ggplot(my_data_clean, # uses cleaned data
       aes(x = day_of_week, # categorical predictor (day of week) on x-axis
           y = time_slept_minutes)) + # response (time slept (min)) on y-axis 
  geom_boxplot( # creates boxplots by day
    aes(fill = day_of_week), # fills color by day of week
    width = 0.65, # sets box width
    alpha = 0.50, # sets transparency so points are visible
    outlier.shape = NA) + # hide default outlier dots 
  geom_jitter( # adds jitter points of individual observations
    aes(color = day_of_week), # designates point color by day of week
    width = 0.12, # jitters horizontally to reduce overlap
    height = 0, # no vertical jitter 
    alpha = 0.65, # sets transparency for readability
    size = 2) + # sets point size
  # applies the custom colors and hides the redundant legends
  scale_fill_manual(values = week_colors, # applies custom fill colors
                    guide = "none") + # hides legend
  scale_color_manual(values = week_colors, # applies custom point colors 
                     guide = "none") + # hides legend
  labs(title = "Time slept varies across days of the week", # creates title
       subtitle = paste("Most recent observation:", # creates subtitle
                        format(most_recent_date, "%b %d, %Y")), # reformat date
       x = "Day of week", # creates x-axis label
       y = "Time slept (minutes)") + # creates y-axis label + units
  theme_classic(base_size = 12) + # cleaner theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # rotate x-axis labels
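
The date handling used for the subtitle can be seen in isolation in the short sketch below. The date strings are invented for illustration (they are not from my data), and `raw_dates`, `parsed`, and `latest` are hypothetical names; only base R is used.

```r
# Hypothetical date strings in mm/dd/yyyy form (invented, not from my data)
raw_dates <- c("01/15/2026", "02/03/2026", "01/28/2026")

# Parse the strings into Date objects using the matching format string
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# The most recent observation is the maximum date
latest <- max(parsed)

# Reformat for display; the month abbreviation from "%b" depends on the locale
print(format(latest, "%b %d, %Y"))
```

If the format string did not match the raw text, `as.Date()` would return `NA` and the subtitle would silently show a missing date, so it is worth checking the parsed column once after cleaning.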

Visualization 2 (continuous predictor variable)

ggplot(data = my_data_clean, # uses cleaned data
       aes(x = sleep_efficiency, # continuous predictor (sleep efficiency) on x-axis
           y = time_slept_minutes)) + # response (time slept (min)) on y-axis 
  geom_point( # plots individual days
    color = "magenta4", # changes color to purple
    alpha = 0.7, # sets transparency so points are visible
    size = 2) + # sets size of points 
  labs(title = "Sleep duration vs. sleep efficiency", # creates title
       subtitle = paste("Most recent observation:", # creates subtitle
                        format(most_recent_date, "%b %d, %Y")), # reformat date
       x = "Sleep Efficiency (range 0-1)", # creates x-axis label
       y = "Time slept (minutes)") + # creates y-axis label + units                  
  theme_classic(base_size = 12) # cleaner theme

b. Captions

In text (not in code), write captions for both your figures.

Figure 1. Time slept varies across days of the week. Colored boxplots represent the distribution of daily time slept (minutes) for each day of the week (Monday–Sunday), each shown in a different color (Monday is coral, Tuesday is steelblue, Wednesday is lavender, Thursday is pink, Friday is limegreen, Saturday is yellow, Sunday is lightblue). Boxes display the interquartile range (IQR), horizontal lines indicate medians, and whiskers extend to the most extreme observations within 1.5 times the IQR of the box edges. Semi-transparent colored points represent individual daily observations, which are jittered horizontally within each day to reduce overlap. The subtitle indicates the most recent observation included in the dataset. Data represent personal daily tracking records collected between January and February 2026 using an Oura ring.

Figure 2. Sleep duration is associated with sleep efficiency (as measured by the Oura ring). Magenta circles represent individual daily observations of time slept (minutes) plotted against sleep efficiency (ranging from 0 to 1, with 1 indicating 100% efficiency and 0 indicating 0% efficiency). Each point corresponds to a single recorded day. Sleep efficiency, as defined by Oura, is the percentage of time in bed that was spent asleep. The subtitle indicates the most recent observation included in the dataset. Data represent personal daily tracking records collected between January and February 2026 using an Oura ring.

Problem 3. Affective visualization

In this problem, you will create an affective visualization using your personal data in preparation for workshops during weeks 9 and 10.

a. Describe in words what an affective visualization could look like for your personal data (3-5 sentences).

Inspired by warming stripes and quilt art, I could create a visualization where each day is represented by a horizontal strip, with strips arranged chronologically from top to bottom. The hue of each strip would represent time stressed, determined by the number of minutes spent stressed. The opacity of each strip would represent sleep efficiency (measured as a percentage), with more opaque colors indicating higher efficiency. The length of each strip would represent time slept, so longer strips indicate more sleep. The design would mirror the strips across the center of a square layout, creating a quilt-like pattern that highlights changes and patterns in stress and sleep over time.

b. Create a sketch (on paper) of your idea.

Include a photo of this sketch in your document.

# includes sketch
knitr::include_graphics(here('data',"sketch.jpg"))

c. Make a draft of your visualization.

Feel free to be creative with this! The one rule is that you may not use any code to create your visualization.

# includes draft
knitr::include_graphics(here('data',"draft.jpeg"))

d. Write an artist statement.

An artist statement gives the audience context to understand your work. For each of the following points, write 1-3 sentences to address:

the content of your piece (what are you showing?)

This piece visualizes my daily sleep duration and time spent stressed data arranged chronologically to reveal patterns, fluctuations, and rhythms over time. Each day is represented as a horizontal strip of color, where the hue of the color reflects the amount of time I felt stressed. The opacity of each strip represents sleep efficiency, with higher efficiency shown as more opaque colors, while the length of each strip represents the total time slept that night.

the influences (what techniques/artists/etc. did you find influential in creating your work?)

I was influenced by warming stripes temperature visualizations and quilt art, which use simple color gradients to communicate change over time in an intuitive and emotional way. I was also inspired by data-art projects that allow trends to be felt rather than just measured. My initial inspiration came from temperature-color blankets shared on the r/dataisbeautiful Reddit page, which I saw over the past summer and which really stuck with me.

the form of your work (watercolor, oil painting, crocheted object, etc.)

My visualization will be created digitally using the Sketchbook app on my iPad.

your process (how did you create your work?)

I plan to use a grid background (which I will later delete) to keep the spacing and lengths of the strips consistent. The opacity of each strip will match the sleep efficiency percentage, with days above the mean appearing more opaque and days below the mean appearing more transparent. Each block on the grid will indicate an additional 10 minutes slept, and the hue of each strip will be determined directly by minutes stressed.

Problem 4. Statistical critique

At this point, you have seen and created a lot of figures for this class. Revisit the paper you chose for your critique and your homework 2, where you described figures or tables in the text. Address the following in full sentences (3-4 sentences each).

For this section of your homework, you will be evaluated on the logic, conciseness, and nuance of your critique.

a. Revisit and summarize

What are the statistical tests the authors are using to address their main research question? (Note: you have already written about this in homework 2! Find that text and provide it again here!)

The statistical test in this paper is a Kruskal–Wallis test (to compare the three treatments), followed by pairwise Mann–Whitney U-tests for post-hoc comparisons. The response variable is periphyton (primary producer) net primary production and the predictor variable is Mediterranean barbel (small endangered predator fish species) density.
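
For reference, this two-step workflow (an omnibus Kruskal–Wallis test followed by pairwise Mann–Whitney U-tests) can be sketched in R as follows. The data here are simulated and are NOT the paper's data; the treatment labels, group sizes, values, and the Bonferroni adjustment are all assumptions made for illustration, since I do not know which correction (if any) the authors applied.

```r
set.seed(42) # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data (NOT the paper's data): 3 hypothetical treatments, 6 replicates each
sim <- data.frame(
  treatment = factor(rep(c("absent", "low", "high"), each = 6),
                     levels = c("absent", "low", "high")),
  npp = c(rnorm(6, mean = 10, sd = 2), # invented values for illustration
          rnorm(6, mean = 14, sd = 2),
          rnorm(6, mean = 15, sd = 2))
)

# Omnibus rank-based test: do the three groups share the same distribution?
kw <- kruskal.test(npp ~ treatment, data = sim)

# Post-hoc pairwise Mann-Whitney U (Wilcoxon rank-sum) tests;
# the Bonferroni adjustment is an assumption, not necessarily what the authors used
pw <- pairwise.wilcox.test(sim$npp, sim$treatment, p.adjust.method = "bonferroni")

print(kw)
print(pw)
```

The omnibus test only says that at least one group differs; the pairwise table is what supports letter groupings like those shown in the paper's figure.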

Insert the figure or table you described in Homework 2 here.

# includes figure
knitr::include_graphics(here('code',"pone.0117630.g005.png"))

b. Visual clarity

In 2-4 sentences, answer the question.

How clearly did the authors visually represent their statistics in figures? For example, are the x- and y-axes in a logical position? Do they show summary statistics (means and SE, for example) and/or model predictions, and if so, do they show the underlying data?

The authors clearly present the primary comparison by placing predator density categories on the x-axis and chlorophyll-a concentration on the y-axis, which creates a logical and intuitive layout. The figure shows summary statistics using mean values with error bars representing standard error, and uses letter groupings (“a” = absent, “b” = barbels) to indicate statistically significant differences among treatments. While this approach communicates statistical comparisons efficiently, the figure only displays summary values and does not show the underlying data points. As a result, it is difficult to assess the distribution of observations, sample size, or the presence of potential outliers.

c. Aesthetic clarity

In 2-4 sentences, answer the question.

How well did the authors handle “visual clutter”? How would you describe the data:ink ratio?

The figure is visually straightforward and mostly uncluttered, with minimal gridlines and clear labeling. The data-to-ink ratio is fairly high because most visual elements (bars, error bars, and treatment labels) directly represent data rather than decorative elements. However, the large filled bars dominate the visual space and emphasize area rather than the mean values themselves. The bold primary colors also create strong contrast without conveying additional information, which makes the figure slightly visually distracting despite its otherwise simple design.

d. Recommendations

In 2-4 sentences, outline what recommendations would you make to make the figure or table better. What would you take out, add, or change? Provide explanations/justifications for each of your recommendations.

I would recommend replacing the bar chart with a dot plot or boxplot that includes the underlying data points. Showing the raw observations would allow readers to assess variability, sample size, and potential outliers rather than only viewing summary statistics. Additionally, using slightly more muted colors would reduce unnecessary visual emphasis and improve accessibility without changing the meaning of the figure. Finally, explicitly labeling the y-axis or legend with what the error bars represent (SE) would make the statistical summary clearer without requiring readers to refer to the caption.
