--- title: "Cutflower_bioassay_analysis" author: "Sonja Glasser" date: '2025-01-09' output: html_document --- Data analysis for pollen diet on Crithidia count and bee death. Pollen types are A-E, which were collected from honeybee colonies, sorted by color and analyzed for pollen composition using DNA metabarcoding analysis. ## Metadata * Here are values for the different variables in the dataframe Individual sample * "Bee_ID" = unique bee ID Predictor variables * "Treatment" = type of pollen diet * "Wing_MM" = Wing size marginal cell * "pollen_consumed" = amount of pollen consumed Response variables * "Count" = Crithidia count / hemocytometer (5 squares) * "Dead" = Bee dead (1=yes/0=no) Random/Secondary Predictor variables * "Colony" = colony * "Inoc_type" = Crithidia inoculum from source colony or cultured inoculum / only using samples inoculated with source colony inoculum. * "Inoc_Date" = Date of inoculation * "Dissect_Date" = Dissection date * "Death_date" = Date found bee dead Extra columns * "Notes" = notes * "data_entered_by" = person who entered data * "date_entered" = date data entered For reference I used https://github.com/llf44/Asteraceae -pollen as an example for how to analyze data ```{r setup, include=FALSE} require(here)#telling R where to look for a file require(lme4) # generalized mixed models library(tidyverse) # language & ggplot library(car) # Anova library(emmeans) #extract the estimated marginal means from the model library(effects)#allEffects() library(MuMIn)#dredge () model selection library(performance) #check_collinearity library(AICcmodavg) #compare models #library(GGally) # pairwise plots library(lattice) # another graphing package library("glmmTMB") library(multcomp) library(broom) #tidy() function library(DHARMa) library(survival) library(survminer) library(coxme) ``` ## Prepare dataframe call in data and reclassify! ```{r} o <-read.csv(here("data", "count_cons_rdataset.csv")) cols<- c("Bee_ID","Colony","Treatment","Inoc_type","Dead") o[cols] <- lapply(o[cols], factor) o$Wing_MM<-as.numeric(o$Wing_MM) o$Inoc_Date<-as.Date(o$Inoc_Date, "%m/%d/%Y") o$Dissect_Date<-as.Date(o$Dissect_Date, "%m/%d/%Y") o$Death_date<-as.Date(o$Death_date, "%m/%d/%Y") o$Treatment<-factor(o$Treatment, levels=c("BW","A", "B", "C", "D","E", "SUN")) o<-o%>% filter(pollen_consumed != "NA") %>% # remove data without pollen consumption. mutate(pollen_consumed = if_else(pollen_consumed < 0, 0, pollen_consumed))#change pollen consumption summary(o) ``` I will make a data frame without the bee death. ```{r} o_no_d<-o%>% filter(Dead!=1, Count!="NA") o_no_d%>% group_by(Treatment)%>% count() ``` ## Data visualization Here are a boxplot of the patterns. ```{r} summary(o_no_d$Treatment) boxplot(Count ~ Treatment, data = o_no_d) boxplot(Count ~ Colony, data = o_no_d) boxplot(Count ~ Inoc_Date, data = o_no_d) plot(Count ~ pollen_consumed, data = o_no_d) #there is one data point, that could be an outlier. I will remove that data point and see if pollen_consumed become less problematic in the models. bee id 443 hist(o$pollen_consumed, breaks = 50) ``` This is a nice sanity check. there doesn't appear to be a huge difference between pollen diets and consumption, but I will need to test it anyways. ## Pollen consumption model We are interested in how pollen diet treatment effects infection, but we know that if one diet is less attractive than another and is eaten less, then the treatment effect could actually be a starvation effect which reduces infection levels. If treatment doesn't effect consumption, that would be great, since how much is eaten won't confound the effect of treatment. But we need to confirm that. ```{r} histogram(o$pollen_consumed) summary(o$pollen_consumed) ``` ### Visualize pollen consumption vs. predictor variables ```{r} boxplot(pollen_consumed ~ Treatment, data = o) boxplot(pollen_consumed ~ Colony, data = o) ggplot(o_no_out, aes(Wing_MM, pollen_consumed)) + geom_point()+ geom_smooth(method=lm)+ theme_classic() ``` It seems that across all treatments, bee size and parent colony may influcence how much pollen is consumed. We need to keep bee size and colony in the model to control for effect of physiological and genetic differences. Let's take a closer look at the difference between treatment and consumption. ```{r} ave_cons_treat<-o%>% group_by(Treatment)%>% summarize( mean_con = mean(pollen_consumed, na.rm = TRUE), # Mean, removing NAs median_con = median(pollen_consumed, na.rm = TRUE), # Median sd_con = sd(pollen_consumed, na.rm = TRUE), # Standard deviation min_con = min(pollen_consumed, na.rm = TRUE), # Minimum max_con = max(pollen_consumed, na.rm = TRUE), # Maximum n_obs = n() # Number of observations in the group ) ave_cons_treat bartlett.test(pollen_consumed ~ Treatment, data = o) #very significantly different. ``` It seems bees consumed more sunflower pollen on average and pollen diet B had the lowest average pollen consumed. The standard deviation for sunflower pollen is also higher, so there may be more variance, but I think in a model that accounts for more variance, that should be fine. There is one data point, 443, that seems a problematic, it "ate" much more pollen than any other bee. Although there is no note associated with this data point, it is likely that pollen could have been knocked out of this container. I will remove 443 from the dataset to test for effects of treatment on pollen consumption, I should also see if we should remove 443 from the whole dataset. ```{r} o_no_out<-o%>% filter(Bee_ID!="443") histogram(o_no_out$pollen_consumed) # It isn't much different and I don't think I should take it out. ``` It is still pretty skewed. I may want to do a log transformation to the pollen consumption data. Or do a Gamma distribution. But for now, let's see how a gaussian model will fit. ### Alternative model ```{r} mod_pollen <- lmer(pollen_consumed ~ Treatment + Wing_MM + Colony +(1|Inoc_Date), data=o) check_collinearity(mod_pollen) sim_pollen <- simulateResiduals(mod_pollen) plot(sim_pollen) qqnorm(residuals(mod_pollen)) qqline(residuals(mod_pollen)) plot(mod_pollen) ``` ### Reduced model ```{r} mod_pol_red <- lmer(pollen_consumed ~ Wing_MM + Colony +(1|Inoc_Date), data=o) ``` Does the reduced model have to have a good fit? ### Hypothesis testing ```{r} anova(mod_pollen, mod_pol_red) ``` Retaining Treatment significantly increases model fit. Including treatment creates a better fitting model. ```{r} Anova(mod_pollen, type=2) summary(mod_pollen) ``` All predictors significantly explain the variance of the data. ### Pollen consumption model interpretation ```{r} summary(glht(mod_pollen,linfct=mcp(Treatment="Tukey"))) ``` ```{r} emm_trt <- emmeans(mod_pollen, ~ Treatment) emm_trt pairs(emm_trt, adjust = "tukey") sig_letters<-cld(emm_trt, Letters = "abcdefg") eff_size(emm_trt, sigma = sigma(mod_pollen), edf = df.residual(mod_pollen)) plot(emm_trt) + ylab("Adjusted pollen consumed") sig_letters<-cld(emmT, Letters = "abcdefg") ``` Pollen diet treat did have a significant effect on how much pollen was consumed. In this case, according to the tukey test of pairwise comparison, sunflower pollen had a higher consumption than the other pollens. ## Infection Model To understand how infection is affected by diet I start with the full model of pollen treatment and the additive variables of amount of pollen consumed, colony (since there were only 4 colonies it would be better to leave it as a fixed factor) and the covariate size of bee which has been shown to affect infection intensity within an individual. For the zero inflated portion of the model I will use the null since I don't have a strong hypothesis for which variable could account for structural zeros in this system. glmmTMB(Count ~ Treatment + pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~Treatment+pollen_consumed, data=o_no_d, family=nbinom1) The model presented below states that the: Crithida count = (𝜂𝑖) is affected by the pollen diet treatment = (𝛽Treatment𝑋Treatment𝑖) and the parent colony (𝛽colony𝑋colony𝑖) the additive covariates amount of pollen consumed (𝛽pollen_consumed𝑋pollen_consumed𝑖) and size of bumblebee = (Wing_mm𝑋Wing_mm𝑖) The variability between trials is also accounted for by using the random factor of Inoculation date = (1|Inoc_Date). The random factor inoculation date only assumes that the mean of the response variable are affected by site. The 𝛽0 = When all continuous explanatory variables (cut flower abundance, julian.date and wing_mm) are set to to 0. 𝛽Treatment = is the contrast in 𝜂 between groups, (pollen diet treatment) 𝛽pollen_consumed = change in 𝜂 with a unit increase in pollen_consumed. 𝛽Colony = is the contrast in 𝜂 between groups, (parent colonies) 𝛽Wing_mm = change in 𝜂 with a unit increase in wing marginal cell size (proxy for bee size). Y Inoculation date = is the contrast in 𝜂 between groups (inoculation dates) #### Linear predictor: *Count:* 𝜂𝑖 = 𝛽0 + 𝛽Treatment𝑋Treatment𝑖 + 𝛽pollen_consumed𝑋pollen_consumed𝑖 + 𝛽Wing_mm𝑋Wing_mm𝑖+ 𝛽Colony𝑋Colony𝑖 + YInoculation dateZInoculation date𝑖 *Binomial:* ~ 1 #### Error distribution: *Count process:* 𝑒𝑖~ NB(r, k) *Binary process:* 𝑒𝑖~ Binomial(𝑛, 𝑝𝑖), n = 1 #### Link function *Count process:* log link ln(u) = n *Binary process:* logit link ui= e^ni/1+e^ni ### Model distribution selection ```{r} mean.crit<- mean(o_no_d$Count) var.crit<- var(o_no_d$Count) thet.crit <- mean.crit^2 / (var.crit - mean.crit) hist(o_no_d$Count,main="Crithidia cell counts distribution",xlab="crithidia cell counts", right=TRUE,freq=FALSE,col=adjustcolor("darkred",0.5)) lines(0:600, # range of the observed counts dpois(0:600,lambda=mean.crit),lwd=2) lines(0:600,dnbinom(0:600,mu=mean.crit, size = thet.crit),lty=2) ``` The mean of the crithidia count was 15.6 where as the variance was calculated as 1623.5. The variance is a LOT HIGHER than the mean. This is definitely a zero-inflated negative binomial model. The line and the dotted line represent the poisson distribution and the negative binomial distribution, respectively. These distributions were simulated using sample parameters calculated from the Crithidia cell counts. As we can see a poisson distribution is inappropriate for this data set while the negative binomial distribution looks like a better fit. But, the dotted line does not fit the spread of the data perfectly, therefore, a zero inflated negative binomial model would be the most appropriate model. (this sentence is crazy... I don't think this is a good justification at all.) ### Check collinearity Let's check the variance inflation (VIF) EXPLORE WHAT VIF DOES ```{r} check_collinearity(glm(Count ~ Treatment + Wing_MM + pollen_consumed + Inoc_Date + Colony , data=o_no_d, family="poisson")) ``` It looks like collinearity won't be a problem since all variables are less than 5. colony and inoc date would be collinear... I will check collinearity after running the full model. ### Check negative binomial distribution When fitting a glmmTMB, with zero inflation, we need gto make she we are defining the family correctly. One way to do this is by plotting the model residuals against the observed values, if the relationship is quadractic use nbinom2 if it linear use nbinom1. For our data it is usually quadratic. ```{r} mod_1 = glmmTMB(Count ~ Treatment + pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~1, data=o_no_d, family=nbinom1) mod_2 = glmmTMB(Count ~ Treatment + pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~1, data=o_no_d, family=nbinom2) ``` ```{r} residuals_squared_nbinom1 <- (resid(mod_1))^2 residuals_squared_nbinom2 <- (resid(mod_2))^2 # Plot observed values against squared residuals plot(o_no_d$Count, residuals_squared_nbinom1, xlab = "Observed Values", ylab = "Squared Residuals") # looks more linear than quadratic :o plot(o_no_d$Count, residuals_squared_nbinom2, xlab = "Observed Values", ylab = "Squared Residuals") # looks very strange ``` It seems like the nbinom1 will be the better family since it looks slightly more linear than quadractic. In addition when doing the dharma model diagnostic comparisons, the models with the nbinom1 seem to fit better. To understand how infection is affected by diet I start with the full model of pollen treatment and the additive variables of amount of pollen consumed, of colony (since there were only 4 colonies it would be better to leave it as a fixed factor) and size of bee which has been shown to affect infection intensity within an individual. For the zero inflated portion of the model I will set it to the null. The null hypothesis is the model without treatment since all the other factors are know to affect infection outcomes, how much pollen is consumed, susceptibility differences across colonies and bee size. Intercept only ZI proportion: ```{r} #Alt hypothesis mod_full = glmmTMB(Count ~ Treatment + pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~1, data=o_no_d, family=nbinom1) #Null hypothesis mod_red = glmmTMB(Count ~ pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~1, data=o_no_d, family=nbinom1) ``` Treatment in ZI - test for effects of pollen diet treatment on infection prevalence: ```{r} #Alt hypothesis ZI treat mod_full_T = glmmTMB(Count ~ Treatment + pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), zi=~Treatment, data=o_no_d, family=nbinom1) ``` ## Model validation: Check model fits: ```{r} check_collinearity(mod_full) simoutbin500<-simulateResiduals(fittedModel=mod_full_T, n=500) plot(simoutbin500) ``` Check model fits: ```{r} check_collinearity(mod_full_T) simoutbin500<-simulateResiduals(fittedModel=mod_full_T, n=500) plot(simoutbin500) ``` Model fit looks good and the collinearity looks good as well. VIF <2, so that is good too! ## Hypothesis testing ```{r} anova(mod_full, mod_red) ``` ```{r} anova(mod_full_T, mod_red) ``` The full model which includes treatment significantly increases model fit. ## Model interpretation ```{r} summary(mod_full) Anova(mod_full) summary(glht(mod_full,linfct=mcp(Treatment="Tukey"))) ``` ```{r} summary(mod_full_T) Anova(mod_full_T) ``` Zero inflation was modeled with an intercept-only term because allowing zero inflation to vary by treatment led to unstable and non-estimable parameter estimates, indicating that treatment did not meaningfully explain the presence of excess zeros. The variance in the model is significantly explained by the colony, size of bee and pollen treatment. According to the Tukey test all pollen diets are significantly different from the negative control expect pollen diet "B". ```{r} emmeans(mod_full, ~ Treatment, type = "response") |> contrast(method = "trt.vs.ctrl", ref = "SUN") ``` I didn't end up using emmeans, but here is some code exploring some of the model predicted terms. I found this other guide to using emmeans. I am using this guide for emmeans model result interpretations. https://rvlenth.github.io/emmeans/articles/AQuickStart.html Here are the estimated marginal means for each factor then the comparisons / contrasts. But I can't get the emmeans dataframe like in the code in the block below. ```{r} emmT<- emmeans(mod_final, ~ Treatment) sig_letters<-cld(emmT, Letters = "abcdefg") d<-as.data.frame(pairs(emmT, adjust="tukey")) #Here it is for colony as well as an attempt for Wing_mm emmC<- emmeans(mod_final, ~ Colony) pairs(emmC, adjust="tukey") #Wing_MM is a continuous numerical predictor seq(min(o_no_d$Wing_MM), max(o_no_d$Wing_MM), length.out = 5) emmw<- emmeans(mod_final, ~ Wing_MM, at = list(Wing_MM = c(1.53000, 1.86425, 2.19850, 2.53275, 2.86700))) ``` Calculations of percent of effect sizes: % change=(eβ−1)×100 Treatmentβ estimate exp(β) % change vs BW A -1.8321 0.160 −84.0% B− -0.6259 0.535 −46.5% C− -2.6637 0.070 −93.0% D− -1.9200 0.147 −85.3% E− -1.5063 0.222 −77.8% SUN− -1.6252 0.197 −80.3% ## Plots of trends using raw data ### I need the sig letters not numbers. I used these plots and assign letters above which is significant. ```{r} #make treatment types as non-aster vs other aster vs. sun o_no_d <- o_no_d %>% mutate(Flower_family = ifelse(Treatment == "SUN", "Sunflower", ifelse(Treatment == "A" | Treatment == "B" | Treatment == "C" | Treatment == "D" | Treatment == "E", "Pollen Diets", ifelse(Treatment == "BW", "Buckwheat", NA)))) o_no_d $Flower_family<-factor(o_no_d $Flower_family, levels=c("Buckwheat","Pollen Diets","Sunflower")) o_no_d_summary<-o_no_d%>% rename(count=Count)%>% group_by(Treatment, Flower_family)%>% summarise(Count=mean(count), sd =sd(count), n = n(), se=sd/sqrt(n)) o_no_d_summary plot.df<-left_join(o_no_d_summary,sig_letters,by="Treatment") plot.df.2 <- data.frame( Treatment = c("BW", "A", "B", "C", "D", "E", "SUN"), .group = c("b", "a", "b", "a", "a", "a", "a"), y_position = c(100, 20, 30, 25, 30, 25, 40) # choose values above your boxplots ) # I needed to make a dataframe with the signifcant letter and "positions" on the y axis to get the sig letters on the figure. the "plot.df" dataframe was not actually helpful. #here is another way to do it, but more automated based on values mean_y <- mean(o_no_d$Count, na.rm = TRUE) #and here is another way to do it... too keep all sig letters at the same level except for buckwheat. plot.df.3 <- o_no_d %>% group_by(Treatment) %>% summarise() %>% # empty summary—just keeps unique Treatment names mutate( .group = c("b", "a", "b", "a", "a", "a", "a"), y_position = if_else(Treatment == "BW", mean_y + 100, mean_y + 15) ) #and even ANOTHER way to do it. plot.df.3.5 <- o_no_d %>% group_by(Treatment) %>% summarise(y_position = mean(Count, na.rm = TRUE) + 10) %>% # buffer = 10 mean(Count, na.rm = TRUE) + 10 mutate(.group = c("b", "a", "b", "a", "a", "a", "a")) # this would be nice if there wasn't so much scatter in the data, there are a lot of zeros and then very high values in each... so I think the manual placement is actually better for now. my_y_title_obs <- expression(paste("Obs. No. of ", italic("C. bombi"), " cells in 0.02", mu, "l")) Treat_vs_infection<-o_no_d%>% ggplot(aes(x=Treatment, y=Count, color=Flower_family))+ geom_boxplot(lwd=.5, outlier.alpha = 0.3)+ geom_text( data = plot.df.3, aes(x = Treatment, y = y_position, label = .group), inherit.aes = FALSE, # <--- important, but I don't remember why. size = 5, fontface = "bold", color = "black", vjust = -2 )+ theme_classic()+ theme(plot.title = element_text(size=20, face="bold", hjust = 0.5), axis.title.x = element_text( size=15, face="bold"), axis.title.y = element_text( size=15, face="bold"), axis.text.x = element_text(face="bold", size=15), axis.text.y = element_text(face="bold", size=15), legend.text = element_text(face="bold", size=11), legend.key.size = unit(2,"line"), legend.title = element_blank())+ labs(y=my_y_title_obs, x="Pollen diet treatment")+ scale_color_manual(values=c( "black","#E69F00", "#D55E00")) Treat_vs_infection ggsave("bioassay_count_sept2025.jpg", width = 7, height = 4) ``` ```{r} emmodel<-emmeans(mod_full, specs=pairwise ~ Treatment,type="response") emmodel_frame<-as.data.frame(emmodel$emmeans) pairs(emmodel, adjust = "tukey") my_y_title_obs <- expression(paste("Observed No. of ", italic("C. bombi"), " cells in 0.02", expression(mu), "l")) my_y_title_pred <- expression(paste("Predicted No. of ", italic("C. bombi"), " cells in 0.02" , mu, "l")) emmodel.plot.treatment<-emmodel_frame%>% ggplot(aes(x=Treatment, y=response))+ geom_line(aes(group = Treatment),size=2)+ geom_point(size=3)+ geom_errorbar(aes(ymin=response-SE, ymax=response+SE), width=.3, size=1.0)+ theme_classic()+ labs(y=my_y_title_pred, x="Pollen diet treatment") #theme(labs=my_y_title_pred, plot.title = element_text(size=14, face="bold", hjust = 0.5), # axis.title.x = element_text( size=12, face="bold"), #axis.title.y = element_text( size=12, face="bold"), #axis.text.x = element_text(face="bold", # size=10), # axis.text.y = element_text(face="bold", #size=10)) emmodel.plot.treatment ggsave("bioassay_count_Nov2024.jpg", width = 7, height = 5) ``` ## Survival Model Binomial survival model ```{r} table(o$Treatment,o$Dead) table(o$Dead)# 99 dead / 381 o$Dead <- as.numeric(o$Dead) ``` Let's look at some realtionships using ggplot_point (scatter plot) with regression lines included. ```{r} ggplot(o, aes(pollen_consumed, Days_alive, colour = factor(Colony))) + geom_point()+ geom_smooth(method=lm)+ facet_wrap(~Colony)+ theme_classic() ggplot(o, aes(Days_alive, Wing_MM, colour = factor(Colony))) + geom_point()+ geom_smooth(method=lm)+ facet_wrap(~Colony)+ theme_classic() ggplot(o, aes(pollen_consumed, Days_alive, colour = factor(Treatment))) + geom_point()+ geom_smooth(method=lm)+ facet_wrap(~Treatment)+ theme_classic() ``` The Cox model interprets effects per 1 whole unit, not per 0.01 or per 0.001, which is problematic when looking at our data since we have pollen consumed from 0-0.11, so a whole unit is a huge extrapolation in this case. I will multiply the pollen consumed by *100 to make it a 1% change. Another approach would be to zscale the factor, which I have done in the field assay models. I wonder if that means that I should scale the bee size variable as well... I don't think I will. Basically if I leave it as is raw (0–0.13 mg) = HR per 1 mg pollen (nonsense scale) ×100 = HR per 0.01 mg pollen scale() = HR per 1 SD pollen change (best for interpretation) Within in 1 standard deviation makes the most sense... I will use the z_pollen_consumed variable. ```{r} o$pollen_01 <- o$pollen_consumed * 100 o$z_pollen_consumed <- scale(o$pollen_consumed) ``` Since pollen consumption, bee size and genotype is hypothesized to influence probability of death, to see if their is a treatment effect beyond these known mediators is important. It is another aspect of how diet effects pollinators. How diets influence infection. What about nutrition and forage and supporting pollinators. If there is a diet that kills the bees but also clears infection, then death would be an inconvenient framework. If the diets are not good for the bees to live on. How fast do they die. Metric of how good or bad, or how quickly they die. If treatment doesn't effect consumption, that would be great! coxmefull<- coxme(Surv(Days_alive,Dead) ~ Treatment + z_pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), data = o) This is the full model, and it asks if treatment changes the instantaneous risk of dying during the 7-day observation window. It is really saying, among bees that consumed the same amount of pollen, were the same size, and came from the same colony and inoculation date, does treatment change the instantaneous risk of dying during the 7-day observation window? Because pollen consumed is a mediator, mediating the effect of the treatment, including it takes away from the treatment effect. ```{r} surv_obj<- Surv(o$Days_alive, o$Dead) attr(surv_obj, "type") coxmefull<- coxme(surv_obj ~ Treatment + z_pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), data = o) coxme_red <- coxme(surv_obj ~ z_pollen_consumed + Colony + Wing_MM + (1|Inoc_Date), data = o) ``` ### model diagnostics Can't use the dharma package to check, so I am using a work around by looking at just the fixed effects ```{r} # Fit a Cox model with fixed effects only cox_fixed <- coxph(Surv(Days_alive, Dead) ~ Treatment + z_pollen_consumed + Colony + Wing_MM, data = o) # Test proportional hazards ph_test <- cox.zph(cox_fixed) ph_test # Optional: plot residuals to visualize plot(ph_test) ``` ```{r} anova(coxmefull,coxme_red) ``` ```{r} Anova(coxmefull) summary(coxmefull) ``` ```{r} # Fixed-effects Cox model cox_fixed <- coxph(Surv(Days_alive, Dead) ~ z_pollen_consumed + Colony + Wing_MM, data = o) # Choose three biologically meaningful pollen levels pollen_levels <- quantile(o$z_pollen_consumed, probs = c(.1, .5, .9), na.rm = TRUE) newdat <- data.frame( z_pollen_consumed = pollen_levels, Colony = names(sort(table(o$Colony), decreasing = TRUE))[1], Wing_MM = mean(o$Wing_MM, na.rm = TRUE) ) sf <- survfit(cox_fixed, newdata = newdat) # <<< THIS LINE is what fixes your error ggsurvplot(sf, data = newdat, legend.labs = c("Low pollen", "Medium pollen", "High pollen"), legend.title = "Pollen consumption", xlab = "Days post inoculation", ylab = "Predicted survival probability", conf.int = TRUE) ``` this shows with less pollen consumption, bees die more, not surprising.