# Regressions using completly fake data!

# Why can’t I just have all of the data?

I have been updating some code that uses data that I don’t have access to. I used one of the data sets from the `survival`

package but it annoyed me that the variable names were not what I wanted them to be and I couldn’t be bothered changing them. I was killing time on a train the other day and discovered the `simstudy`

package and decided to give it a go. It worked beautifully and made updating my code so much easier. It also made me realise how much R I have learnt in the last year so I wanted to show the difference between the old and the new code with you.

# Creating data

Start by defining the variables you want in your data set and create the survival part of the data and before creating the full data set.

```
def <- defData(varname = "v1_6", dist = "binary", formula = 0.6, id = "idnum", link = "logit")
def <- defData(def, varname = "v1_5", dist = "uniformInt", formula = "18;110")
def <- defData(def, varname = "v2_1_1", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_2", dist = "binary", formula = 0.6)
def <- defData(def, varname = "v2_1_3", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_1_4", dist = "binary", formula = 0.55)
def <- defData(def, varname = "v2_1_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v2_5", dist = "binary", formula = 0.5)
def <- defData(def, varname = "v7_4", dist = "binary", formula = 0.5)
def <- defData(def, varname = "event", dist = "binary", formula = 0.1)
def <- defData(def, varname = "died", dist = "binary", formula = 0.5)
def <- defData(def, varname = "NIHSS_15", dist = "uniformInt", formula = "0;42", link="identity")
def <- defData(def, varname = "age_group", dist = "categorical", formula = "0.5;0.2;0.3", link="identity")
# Define survival data part
sdef <- defSurv(varname = "survTime", formula = "v1_5/2", scale = 20,
shape = "v2_5*1 + (1-v2_5)*1.5")
sdef <- defSurv(sdef, varname = "censorTime", scale = 80, shape = 1)
# Create data
df <- genData(5000, def)
df <- genSurv(df, sdef)
```

What a lovely data set I created!

# Descriptive statistics

Now that I have data I want to compare the group that experienced the event to the group that didn’t. I use CreateTableOne from the `tableone`

package. I need to specify which variables I want to include (or all variables will be included) and which ones I want to be factors (will be shown as %) or nonnormal variables (will be shown as median).

```
table_var<- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
median_var <- median_var<- c("v1_5", "NIHSS_15", "v2_2", "v7_4")
factor_var <- c("v1_6","v2_5","v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5","age_group")
table1<- CreateTableOne(table_var, data = df, strata = "event", factorVars = factor_var,
testNonNormal = kruskal.test, argsNonNormal = median_var)
table1_print<-print(table1, nonnormal = median_var, contDigits=0,
exact = "LOC", printToggle = TRUE, factorVars = factor_var)
```

```
## Stratified by event
## 0 1 p test
## n 4483 517
## v1_5 (median [IQR]) 63 [40, 86] 63 [42, 85] NA nonnorm
## v1_6 = 1 (%) 2939 (65.6) 329 (63.6) 0.412
## v2_1_1 = 1 (%) 2192 (48.9) 262 (50.7) 0.471
## v2_1_2 = 1 (%) 2723 (60.7) 310 (60.0) 0.767
## v2_1_3 = 1 (%) 2260 (50.4) 250 (48.4) 0.401
## v2_1_4 = 1 (%) 2438 (54.4) 276 (53.4) 0.700
## v2_1_5 = 1 (%) 2254 (50.3) 259 (50.1) 0.974
## v2_5 = 1 (%) 2235 (49.9) 265 (51.3) 0.577
## v7_4 (median [IQR]) 1 [0, 1] 0 [0, 1] NA nonnorm
## age_group (%) 0.247
## 1 2187 (48.8) 252 (48.7)
## 2 892 (19.9) 117 (22.6)
## 3 1404 (31.3) 148 (28.6)
```

You might want to save the table in a different format such as a csv file. This code should help you do that. Note that the print options are slightly different.

```
table1_print <- print(table1, nonnormal = median_var, contDigits=0,
exact = "LOC", quote = FALSE, noSpaces = TRUE, printToggle = FALSE, factorVars = factor_var)
## Save to a CSV file
write.csv(table1_print, file = "my_table1.csv")
```

# Regressions

Time to move on to the regressions! I wrote the first version of this code about 9 months ago and it looked a bit like this but with more models:

```
surv_data <- Surv(time=df$survTime, event=df$event, type="right")
coxph.fit.v1.6 <- coxph(surv_data ~ v1_6 , data=df, method="breslow")
coxph.fit.v1.5 <- coxph(surv_data ~ v1_5 , data=df, method="breslow")
coxph.fit.v2.5 <- coxph(surv_data ~ v2_5 , data=df, method="breslow")
```

As you can imagine this made my code really long and included a lot of copying and pasting.

As I learnt more and more R (and discovered the power of `purrr`

after an R ladies event where Charlotte Wickham presented) I decided to update the code before giving to the person who will run it. I start by putting all of my predictors in a vector. I specify all the predictors I care about and use `map`

to fit the model for each one.

```
predictors <- c("v1_5", "v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
surv_data <- Surv(time=df$survTime, event=df$event, type="right")
unadjusted <- map(predictors, ~coxph(as.formula(paste("surv_data", .x, sep="~")), data=df))
```

I refuse to copy numbers off the screen because I know I will get it wrong so I use the `texreg`

package to create nice regression tables. I need to override the coefficients as I am interested in exp(coef) and this is how I did it 9 months ago. I spent most of my time terrified that I would accidentally use coefficients from the wrong model and adding new models took forever.

```
screenreg(list(coxph.fit.v1.6, coxph.fit.v1.5, coxph.fit.v2.5 ),
single.row= TRUE , ci.force= TRUE, include.aic = TRUE, include.rsquared = TRUE, include.maxrs= FALSE,
ci.test = 1,
override.coef = list(summary(coxph.fit.v1.6)$coef[,2],
summary(coxph.fit.v1.5)$coef[,2],
summary(coxph.fit.v2.5)$coef[,2])
)
```

Once again the `map`

function saved me. I create lists with the coefficients and upper and lower confidence intervals. This way I don’t have to worry about using the wrong coefficients and it is real easy to add more models.

```
unadjusted_coef <- map(unadjusted, summary ) %>%
map("coefficients") %>%
map(as.data.frame) %>%
map("exp(coef)")
unadjusted_lower_conf <- map(unadjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("lower .95")
unadjusted_upper_conf <- map(unadjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("upper .95")
screenreg(unadjusted, single.row= FALSE, ci.force= TRUE, ci.test=1,
override.coef = unadjusted_coef,
override.ci.low=unadjusted_lower_conf,
overide.ci.up=unadjusted_upper_conf)
```

```
##
## =======================================================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 8 Model 9 Model 10
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## v1_5 1.00
## [1.00; 1.00]
## v1_6 0.92
## [0.74; 1.10]
## v2_1_1 1.07
## [0.89; 1.24]
## v2_1_2 0.97
## [0.79; 1.14]
## v2_1_3 0.93
## [0.76; 1.10]
## v2_1_4 0.96
## [0.79; 1.13]
## v2_1_5 0.99
## [0.82; 1.16]
## v2_5 1.04
## [0.87; 1.21]
## v7_4 0.86
## [0.68; 1.03]
## age_group 0.97
## [0.87; 1.06]
## -------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC 8712.31 8711.94 8712.19 8712.58 8712.01 8712.49 8712.70 8712.52 8709.54 8712.23
## R^2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Max. R^2 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
## Num. events 517 517 517 517 517 517 517 517 517 517
## Num. obs. 5000 5000 5000 5000 5000 5000 5000 5000 5000 5000
## Missings 0 0 0 0 0 0 0 0 0 0
## PH test 0.98 0.60 0.68 0.78 0.59 0.06 0.38 0.46 0.25 0.24
## =======================================================================================================================================================
## * 1 outside the confidence interval
```

It is now super easy to add/remove variables to my models. I already know that I want to include two variables - v1_5 and v1_6 - in my full model so I add them to the model. I remove v1_5 from my predictors or I will have two models for just v1_5 and v1_6.

```
predictors <- c("v1_6", "v2_1_1", "v2_1_2", "v2_1_3", "v2_1_4", "v2_1_5", "v2_5","v7_4", "age_group")
adjusted <- map(predictors, ~coxph(as.formula(paste("surv_data ~ v1_5 + v1_6", .x, sep="+")),
data=df))
adjusted_coef <- map(adjusted, summary ) %>%
map("coefficients") %>%
map(as.data.frame) %>%
map("exp(coef)")
adjusted_lower_conf <- map(adjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("lower .95")
adjusted_upper_conf <- map(adjusted, summary) %>%
map("conf.int") %>%
map(as.data.frame) %>%
map("upper .95")
screenreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1,
override.coef = adjusted_coef,
override.ci.low=adjusted_lower_conf,
overide.ci.up=adjusted_upper_conf)
```

```
##
## =========================================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6 Model 7 Model 8 Model 9
## -----------------------------------------------------------------------------------------------------------------------------------------
## v1_5 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
## [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00] [1.00; 1.00]
## v1_6 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92
## [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10] [0.74; 1.10]
## v2_1_1 1.06
## [0.89; 1.24]
## v2_1_2 0.97
## [0.79; 1.15]
## v2_1_3 0.93
## [0.76; 1.10]
## v2_1_4 0.96
## [0.79; 1.13]
## v2_1_5 0.99
## [0.82; 1.16]
## v2_5 1.04
## [0.86; 1.21]
## v7_4 0.85
## [0.68; 1.03]
## age_group 0.96
## [0.86; 1.06]
## -----------------------------------------------------------------------------------------------------------------------------------------
## AIC 8713.55 8715.05 8715.43 8714.86 8715.34 8715.54 8715.37 8712.38 8715.02
## R^2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## Max. R^2 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82 0.82
## Num. events 517 517 517 517 517 517 517 517 517
## Num. obs. 5000 5000 5000 5000 5000 5000 5000 5000 5000
## Missings 0 0 0 0 0 0 0 0 0
## PH test 0.87 0.93 0.95 0.91 0.27 0.80 0.85 0.66 0.65
## =========================================================================================================================================
## * 1 outside the confidence interval
```

If you want to save the models in a word document this is how to do it.

```
htmlreg(adjusted, single.row= FALSE, ci.force= TRUE, ci.test=1, type="html", file="My_regressions.doc",
override.coef = adjusted_coef,
override.ci.low=adjusted_lower_conf,
overide.ci.up=adjusted_upper_conf)
```

`## The table was written to the file 'My_regressions.doc'.`

# Packages I used

```
library("tidyverse")
library("survival")
library("simstudy")
library("texreg")
library("tableone")
```