These supplementary materials contain all results reported in the printed version of the study. Additional information includes detailed descriptive statistics and the values of all nodes and edges within both networks. Code can be displayed by clicking on the optional fields.
To analyze the networks we used the packages qgraph (Epskamp et al. 2012), igraph (Csárdi and Nepusz 2006), bootnet (Epskamp and Fried 2018) and NetworkComparisonTest (Van Borkulo et al. 2017). For data preparation we used dplyr (Wickham and Henry 2017), tidyr (Wickham et al. 2017) and labelled (Larmarange 2019). For descriptive statistics, psych (Revelle 2018), lsr (Navarro 2015) and corrgram (Wright 2018) were used. The package DT (Xie, Cheng, and Tan 2019) provided download-ready data tables, while kableExtra (Zhu 2019) was used to render tables. Additional graphics were created with the packages ggplot2 (Wickham 2009) and cowplot (Wilke 2019).
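
For reference, a sketch of the corresponding library calls assumed throughout this document:

# network estimation and comparison
library(qgraph)
library(igraph)
library(bootnet)
library(NetworkComparisonTest)
# data preparation
library(dplyr)
library(tidyr)
library(labelled)
# descriptive statistics
library(psych)
library(lsr)
library(corrgram)
# tables and graphics
library(DT)
library(kableExtra)
library(ggplot2)
library(cowplot)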

Primary data preparation

This part covers the data preparation. Due to personal data protection, the data themselves are not openly accessible.

Import data

In addition to both data sets, a file containing the variable labels is loaded.

# loading Dataset 1
load("dataset1.rda")
# loading Dataset 2
load("dataset2.rda")
# loading item labels (the original labels are in German)
.labels <- read.csv2("SOMS7 Itemlabels.csv")

Changing class of variables

Value labels for sex/gender.

# item labels as character
.labels <- .labels %>% mutate_at(vars(German_Label, Englisch_Label), function(x) {as.character(x)})
# let's give some meaning to the numbers.
dataset1 %>% mutate(sex = recode_factor(sex, "1" = "female", "2" = "male")) -> dataset1
dataset2 %>% mutate(sex = recode_factor(sex, "0" = "male", "1" = "female")) -> dataset2
# id-variables should be character variables
dataset1 %>% mutate(Patienten_Code = as.character(Patienten_Code)) -> dataset1
dataset2 %>% mutate(Code = as.character(Code)) -> dataset2

Renaming

Shortening the names of the SOMS-7T items (Rief, Hiller, and Heuser 1997).

# let's make the names shorter for both datasets
names(dataset1) <- gsub("^B_SOMS7T", "S", names(dataset1))
names(dataset1)[grepl("S_[[:digit:]]$", names(dataset1))] <- paste("S_0", seq(1,9,1), sep = "")

# put the variables in the same order to match both data frames
dataset2 <- dataset2 %>% dplyr::select(Code, alter, sex, everything())
names(dataset2) <- names(dataset1)

Data selection

No cases were removed from the records. The imported data sets are stored in (masked) objects, so the sociodemographic variables remain archived. Since the analyses were performed without gender-specific items, items 48-53 are removed from the data sets.

# making masked objects, if needed
.dataset1 <- dataset1
.dataset2 <- dataset2

# remove all demographic items and gender-specific items
dataset1 %>%  dplyr::select(S_01 : S_47) -> dataset1
dataset2 %>%  dplyr::select(S_01 : S_47) -> dataset2

Item labels

Gender-unspecific

The following table displays all 47 gender-unspecific items of the SOMS-7T. Since the SOMS-7T is a German questionnaire, the items were translated; the table shows each original item alongside its English translation.

.labels %>% 
  dplyr::select(Item, German_Label, Englisch_Label) %>%
  mutate(Item = gsub("S_", "S", Item)) %>% 
  .[1:47,] %>% 
  kable(., "html",
        caption = "Table: Itemlabels of the gender-unspecific questions of the SOMS-7",
        row.names = F,
        col.names = c("Item name", "German Label", "Englisch label")) %>% 
  kable_styling("striped")
Table: Itemlabels of the gender-unspecific questions of the SOMS-7
Item name German label English label
S01 Kopf- oder Gesichtsschmerzen Headache and face pain
S02 Schmerzen im Bauch oder in der Magengegend Abdominal pain
S03 Rückenschmerzen Back pain
S04 Gelenkschmerzen Joint pain
S05 Schmerzen in den Armen oder Beinen Pain in legs and/or arms
S06 Brustschmerzen Chest pain
S07 Schmerzen im Enddarm Anal pain
S08 Schmerzen beim Geschlechtsverkehr Pain during sexual intercourse
S09 Schmerzen beim Wasserlassen Pain during urination
S10 Übelkeit Nausea
S11 Völlegefühl (sich aufgebläht fühlen) Bloating
S12 Druckgefühl, Kribbeln oder Unruhe im Bauch Discomfort in and around the precordium
S13 Erbrechen (außerhalb der Schwangerschaft) Vomiting (excluding pregnancy)
S14 Vermehrtes Aufstoßen (in der Speiseröhre) Regurgitation of food
S15 Luftschlucken, Schluckauf oder Brennen im Brust- oder Magenbereich Hiccoughing or burning sensation in chest or stomach
S16 Unverträglichkeit von verschiedenen Speisen Food intolerance
S17 Appetitverlust Loss of appetite
S18 Schlechter Geschmack im Mund oder stark belegte Zunge Bad taste in mouth or excessively coated tongue
S19 Mundtrockenheit Dry mouth
S20 Häufiger Durchfall Frequent diarrhea
S21 Flüssigkeitsaustritt aus dem Darm Discharge of fluid from anus
S22 Häufiges Wasserlassen Frequent urination
S23 Häufiger Stuhlgang Frequent bowel movements
S24 Herzrasen oder Herzstolpern Heart palpitation
S25 Druckgefühl in der Herzgegend Feeling pressure in cardiac region
S26 Schweißausbrüche (heiß oder kalt) Sweating
S27 Hitzewallungen oder Erröten Flushing or blushing
S28 Atemnot (außer bei Anstrengung) Breathlessness without exertion
S29 Übermäßig schnelles Ein- und Ausatmen Painful breathing or hyperventilation
S30 Außergewöhnliche Müdigkeit bei leichter Anstrengung Excessive tiredness upon mild exertion
S31 Flecken oder Farbänderungen der Haut Blotchiness or discoloration of skin
S32 Sexuelle Gleichgültigkeit Sexual indifference (loss of libido)
S33 Unangenehme Empfindungen im oder am Genitalbereich Unpleasant sensations in or around the genitalia
S34 Koordinations- oder Gleichgewichtsstörungen Impaired coordination or balance
S35 Lähmung oder Stimmverlust Paralysis or localized weakness
S36 Schwierigkeiten beim Schlucken oder Kloßgefühl Difficulty swallowing or lump in the throat
S37 Flüsterstimme oder Stimmverlust Aphonia (loss of voice)
S38 Harnverhaltung oder Schwierigkeiten beim Wasserlassen Urinary retention
S39 Sinnestäuschungen Hallucinations
S40 Verlust von Berührungs- oder Schmerzempfinden Loss of touch or pain sensations
S41 Unangenehme Kribbelempfindungen Unpleasant numbness or tingling sensations
S42 Sehen von Doppelbildern Double vision
S43 Blindheit Blindness
S44 Verlust des Hörvermögens Deafness
S45 Krampfanfälle Seizures
S46 Gedächtnisverlust Amnesia (loss of memory)
S47 Bewusstlosigkeit Loss of consciousness

Gender-specific

Items S48 to S52 are women-specific symptoms, while item S53 is men-specific. These symptoms were not considered in the network analysis and are therefore not mentioned further in the appendices.

.labels %>% 
  dplyr::select(Item, German_Label, Englisch_Label) %>%
  mutate(Item = gsub("S_", "S", Item)) %>% 
  .[48:53,] %>% 
  kable(., "html",
        caption = "Table: Itemlabels of the gender-specific questions of the SOMS-7",
        row.names = F,
        col.names = c("Item name", "German Label", "Englisch label")) %>% 
  kable_styling("striped")
Table: Itemlabels of the gender-specific questions of the SOMS-7
Item name German label English label
S48 Schmerzhafte Regelblutungen Painful menstruation
S49 Unregelmäßige Regelblutungen Inconsistent menstruation
S50 Übermäßige Regelblutungen Excessive menstruation
S51 Erbrechen während der gesamten Schwangerschaft Vomiting throughout pregnancy
S52 Ungewöhnlicher oder verstärkter Ausfluss aus der Scheide Unusual or increased vaginal discharge
S53 Impotenz oder Störungen des Samenergusses Impotence or ejaculatory disorders

Missing data

Information about the frequency and pattern of missing values is displayed following the recommendations of McKnight (2007).

rbind(.dataset1, .dataset2) %>% 
  mutate(sample = factor(c(rep("Dataset 1", nrow(dataset1)), rep("Dataset 2", nrow(dataset2))))) %>%
  # calculate sum of missings for all 47 gender unspecific items (exclude id and sample variables)
  mutate(missings = apply(.[,grepl("S_[[:digit:]]$|[[:digit:]][[:digit:]]$", names(.))], 1, function(x) {sum(is.na(x)) })) -> .missing

merge(.missing %>% group_by(sample) %>% summarise(N = length(sample), miss = sum(missings), Max = max(missings)),
      .missing %>% group_by(sample) %>% filter(missings > 0) %>% summarise(n = length(missings)),
      by = "sample") %>% 
  # average missings...
  ## ...total sample
  mutate(amn = miss/N) %>% 
  ## ...per participant
  mutate(amp = miss/n) %>% 
  ## ...per item
  mutate(ami = miss/47) -> .descriptive_missing_table

# adding more infos
.descriptive_missing_table %>% 
  ## % of participants with missings in the total sample
  mutate(pn = round(n/N*100,2)) %>% 
  ## % of cells with missing 
  mutate(pc = round(miss/(N*47)*100,2)) %>% 
  dplyr::select(sample, N, miss, pc, n, Max, pn, everything())-> .descriptive_missing_table

Dataset 1

.missing %>% 
  filter(sample == "Dataset 1") %>% 
  count(missings) %>% 
  mutate(p = round(n/sum(n)*100,2)) %>% 
  ggplot(., aes(x = factor(missings), y = n)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(p, "%", sep = "")), hjust = -0.2) +
  scale_y_continuous(limits = c(0, 250)) +
  labs(x = "Frequency of missing values",
       y = "Number of cases with missing values") +
  theme_classic() +
  theme(legend.position = "top",
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 14, color = "black"),
        legend.text = element_text(size = 12, color = "black"),
        panel.grid.major.x = element_line(color = "lightgrey", linetype = "dotted")) +
  coord_flip()


Figure. Barplot of the frequency of missing values. The x-axis displays the number of missing values and the y-axis the number of cases with that many missing values. A maximum of 47 missing values per case was possible across all gender-unspecific items.

Global Structure
In Dataset 1, 49 of 11,938 possible cells (254 cases × 47 gender-unspecific items) contained missing values (0.41%). The structure of missing values for each case is shown in the figure below.

.missing %>% 
  filter(sample == "Dataset 1") %>%
  dplyr::select(Patienten_Code, S_01 : S_47) %>% 
  mutate_at(vars(-Patienten_Code), function(x) {ifelse(is.na(x), 1, 0)}) %>% 
  gather(., item, val, -Patienten_Code) %>% 
  ggplot(., aes(x = item, y = Patienten_Code, fill = factor(val))) +
  geom_tile() +
  scale_fill_manual(values = c("black", "grey90"),
                    labels = c("not missing", "missing")) +
  labs(x = "Item",
       y = "Case (not shown)",
       fill = "") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, color = "black", angle = 90),
        axis.text.y = element_blank())


Figure. Dataset 1: structure of missing values. Case = each individual participant. S_01-S_47 = the 47 gender-unspecific items of the SOMS-7T.

# data preparation for pattern detection:
# each item is assigned a unique power of two, so every distinct
# combination of missing items maps to a unique row sum (the pattern ID)
.missing %>% 
  filter(sample == "Dataset 1") %>%
  dplyr::select(S_01 : S_47) %>% 
  mutate_all(., function(x) { ifelse(is.na(x), 1, 0) }) %>% 
  data.frame(.) * 2^(mapply(rep, 1:47, nrow(.missing %>% filter( sample == "Dataset 1")))) %>% 
  as.data.frame() -> .missing_ds1
.missing_ds1 %>% 
  mutate(pattern = apply(., 1, sum)) %>% 
  dplyr::select(pattern) -> .pattern_ds1



Pattern
Following McKnight (2007), we examined how “messy” the data are. Missing values may not be random if the pattern shows a structure, e.g. all participants show missing values on specific items. Many different patterns of missing data can indicate “messy” data, while very few patterns, or only one, indicate “clean” missing data. The ratio of missing-data patterns (the number of distinct patterns divided by the number of participants with missing data) was used as an indicator of messiness. A coefficient of 1 indicates that every participant has their own pattern. High messiness indicates that the data are more likely missing completely at random (MCAR), while a clean structure is more likely if the data are missing not at random (MNAR). If there are no missing values, this coefficient cannot be calculated. Excluding the patterns without missings, there were 29 different patterns of missing data in Dataset 1. The ratio of missing data patterns was 0.85. This indicates that most of the participants with missing data had their own pattern. We conclude that the structure of missing data can be viewed as MAR.
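
A minimal sketch of this coefficient (a hypothetical helper, not part of the original pipeline; it reuses the pattern IDs computed above):

# ratio of missing-data patterns (McKnight 2007):
# number of distinct patterns divided by the number of cases with missing data
pattern_ratio <- function(pattern) {
  with_missing <- pattern[pattern > 0]  # pattern ID 0 = complete case
  length(unique(with_missing)) / length(with_missing)
}
pattern_ratio(.pattern_ds1$pattern)  # should reproduce the 0.85 reported above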

ggplot(.pattern_ds1 %>% filter(pattern > 0), aes(x = factor(pattern))) +
  geom_bar() +
  labs(x = "Pattern-ID (not shown)",
       y = "Number of occurence") +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13),
        axis.text.x = element_blank())


Figure. Occurrence of all missing-value patterns in Dataset 1. The pattern IDs have no further meaning. The plot includes all patterns with at least one missing value.

Item

.missing %>% 
  filter(sample == "Dataset 1") %>% 
  dplyr::select(S_01 : S_47) %>% 
  mutate_all(., function(x) { ifelse(is.na(x), 1, 0) }) %>% 
  gather(., item, miss) %>% 
  group_by(item) %>% 
  summarise(n = sum(miss)) %>% 
  mutate(p = n / nrow(.dataset1)*100) -> .missing_ds1_itemdata

The mean frequency of missing values per item in Dataset 1 was 0.41% (SD = 0.52). Eight items lay more than 1 SD above this mean; none lay more than 1 SD below. The item with the highest amount of missing values in Dataset 1 was S_08 (“Pain during sexual intercourse”).
The figure below shows the percentage of missing values for each item (left) and per item separated by gender (right). Overall, the frequency of missing values is slightly higher for women. Again, S_08 showed the largest difference in missing values between the genders.

ggplot(.missing_ds1_itemdata, aes(x = item, y = p)) +
  geom_point(size = 2) +
  geom_rect(aes(ymin = 0,            # mean - 1 SD, clamped at zero
                ymax = 0.41 + 0.52,  # mean + 1 SD
                xmin = 0, xmax = 48),
            alpha = .5, fill = "lightgrey") +
  geom_hline(aes(yintercept = mean(p)), color = "black") +
  geom_point(size = 2) +
  geom_segment(aes(x = item, xend = item,
                   y = mean(p), yend = p)) +
  labs(x = "SOMS7-Item (only gender unspecific)",
       y = "% of missing values",
       title = "Total percentage of missing values per item") +
  scale_y_continuous(limits = c(0, 5), breaks = seq(0,5,1)) +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13)) +
  coord_flip() -> .missing_plot1_ds1
  

.missing %>% 
  filter(sample == "Dataset 1") %>% 
  dplyr::select(sex, S_01 : S_47) %>%
  mutate_at(vars(-sex), function(x) { ifelse(is.na(x), 1, 0) }) %>%
  # remove missings in gender variable
  na.omit(.) %>% 
  gather(., item, miss, -sex) %>% 
  group_by(item, sex) %>% 
  summarise(n = sum(miss)) %>% 
  mutate( p = ifelse(sex == "female", n/nrow(.dataset1 %>% filter(sex == "female"))*100, 
                       n/nrow(.dataset1 %>% filter(sex == "male"))*100)) %>% 
  ggplot(., aes(x = item, y = p, color = sex)) +
  geom_point() +
  geom_rect(aes(ymin = 0,            # mean - 1 SD, clamped at zero
                ymax = 0.41 + 0.52,  # mean + 1 SD
                xmin = 0, xmax = 48),
            alpha = .5, fill = "lightgrey", color = "lightgrey") +
  geom_hline(aes(yintercept = mean(.missing_ds1_itemdata$p)), color = "black") +
  geom_point(size = 2) +
  geom_line(aes(group = sex)) +
  labs(x = "SOMS7-Item (only gender unspecific)",
       y = "% of missing values",
       title = "Percentage of missing values for both genders per item") +
  scale_y_continuous(limits = c(0, 5), breaks = seq(0,5,1)) +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "right")+ 
  coord_flip() -> .missing_plot2_ds1

plot_grid(.missing_plot1_ds1, .missing_plot2_ds1)


Figure. Frequency of missing values for each item (total) and separated by gender. The vertical line marks the mean frequency. The grey area covers values between -1 and +1 SD around the mean.

Dataset 2

.missing %>% 
  filter(sample == "Dataset 2") %>% 
  count(missings) %>% 
  mutate(p = round(n/sum(n)*100,2)) %>% 
  ggplot(., aes(x = factor(missings), y = n)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(p, "%", sep = "")), hjust = -0.2) +
  scale_y_continuous(limits = c(0, 425)) +
  labs(x = "Frequency of missing values",
       y = "Number of cases with missing values") +
  theme_classic() +
  theme(legend.position = "top",
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 14, color = "black"),
        legend.text = element_text(size = 12, color = "black"),
        panel.grid.major.x = element_line(color = "lightgrey", linetype = "dotted")) +
  coord_flip()


Figure. Barplot of the frequency of missing values. The x-axis displays the number of missing values and the y-axis the number of cases with that many missing values. A maximum of 47 missing values per case was possible across all gender-unspecific items.

Global Structure
In Dataset 2, 1,375 of 26,978 possible cells (574 cases × 47 gender-unspecific items) contained missing values (5.1%). The structure of missing values for each case is shown in the figure below.

.missing %>% 
  filter(sample == "Dataset 2") %>%
  dplyr::select(Patienten_Code, S_01 : S_47) %>% 
  mutate_at(vars(-Patienten_Code), function(x) {ifelse(is.na(x), 1, 0)}) %>% 
  gather(., item, val, -Patienten_Code) %>% 
  ggplot(., aes(x = item, y = Patienten_Code, fill = factor(val))) +
  geom_tile() +
  scale_fill_manual(values = c("black", "grey90"),
                    labels = c("not missing", "missing")) +
  labs(x = "Item",
       y = "Case (not shown)",
       fill = "") +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12, color = "black", angle = 90),
        axis.text.y = element_blank())


Figure. Dataset 2: structure of missing values. Case = each individual participant. S_01-S_47 = the 47 gender-unspecific items of the SOMS-7T.

# data preparation for pattern detection (same power-of-two coding as for Dataset 1)
.missing %>% 
  filter(sample == "Dataset 2") %>%
  dplyr::select(S_01 : S_47) %>% 
  mutate_all(., function(x) { ifelse(is.na(x), 1, 0) }) %>% 
  data.frame(.) * 2^(mapply(rep, 1:47, nrow(.missing %>% filter( sample == "Dataset 2")))) %>% 
  as.data.frame() -> .missing_ds2
.missing_ds2 %>% 
  mutate(pattern = apply(., 1, sum)) %>% 
  dplyr::select(pattern) -> .pattern_ds2

Pattern
Excluding the patterns without missings, there were 132 different patterns of missing data in Dataset 2. The ratio of missing data patterns was 0.73. This indicates that most of the individuals with missing data had their own pattern. We conclude that the structure of missing data can be viewed as MAR.

ggplot(.pattern_ds2 %>% filter(pattern > 0), aes(x = factor(pattern))) +
  geom_bar() +
  labs(x = "Pattern-ID (not shown)",
       y = "Number of occurence") +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13),
        axis.text.x = element_blank())


Figure. Occurrence of all missing-value patterns in Dataset 2. The pattern IDs have no further meaning. The plot includes all patterns with at least one missing value.

Item

.missing %>% 
  filter(sample == "Dataset 2") %>% 
  dplyr::select(S_01 : S_47) %>% 
  mutate_all(., function(x) { ifelse(is.na(x), 1, 0) }) %>% 
  gather(., item, miss) %>% 
  group_by(item) %>% 
  summarise(n = sum(miss)) %>% 
  mutate(p = n / nrow(.dataset2)*100) -> .missing_ds2_itemdata

The mean frequency of missing values per item in Dataset 2 was 5.1% (SD = 1.31). Five items lay more than 1 SD above this mean; five lay more than 1 SD below. The item with the highest amount of missing values in Dataset 2 was S_08.
The figure below shows the percentage of missing values for each item (left) and for each item separated by gender (right). Overall, the frequency of missing values is slightly lower for men. Again, S_08 showed the largest difference in missing values between the genders.

ggplot(.missing_ds2_itemdata, aes(x = item, y = p)) +
  geom_point(size = 2) +
  geom_rect(aes(ymin = 5.10 - 1.31,
                ymax = 5.10 + 1.31,
                xmin = 0, xmax = 48),
            alpha = .5, fill = "lightgrey") +
  geom_hline(aes(yintercept = mean(p)), color = "black") +
  geom_point(size = 2) +
  geom_segment(aes(x = item, xend = item,
                   y = mean(p), yend = p)) +
  labs(x = "SOMS7-Item (only gender unspecific)",
       y = "% of missing values",
       title = "Total percentage of missing values per item") +
  scale_y_continuous(limits = c(0, 15), breaks = seq(0,15,1)) +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13)) +
  coord_flip() -> .missing_plot1_ds2
  

.missing %>% 
  filter(sample == "Dataset 2") %>% 
  dplyr::select(sex, S_01 : S_47) %>%
  mutate_at(vars(-sex), function(x) { ifelse(is.na(x), 1, 0) }) %>%
  # remove missings in gender
  na.omit(.) %>% 
  gather(., item, miss, -sex) %>% 
  group_by(item, sex) %>% 
  summarise(n = sum(miss)) %>% 
  mutate( p = ifelse(sex == "female", n/nrow(.dataset2 %>% filter(sex == "female"))*100, 
                       n/nrow(.dataset2 %>% filter(sex == "male"))*100)) %>% 
  ggplot(., aes(x = item, y = p, color = sex)) +
  geom_point() +
  geom_rect(aes(ymin = 5.10 - 1.31,
                ymax = 5.10 + 1.31,
                xmin = 0, xmax = 48),
            alpha = .5, fill = "lightgrey", color = "lightgrey") +
  geom_hline(aes(yintercept = mean(.missing_ds2_itemdata$p)), color = "black") +
  geom_point(size = 2) +
  geom_line(aes(group = sex)) +
  labs(x = "SOMS7-Item (only gender unspecific)",
       y = "% of missing values",
       title = "Percentage of missing values for both genders per item") +
  scale_y_continuous(limits = c(0, 15), breaks = seq(0,15,1)) +
  theme_classic() +
  theme(axis.text = element_text(color = "black", size = 12),
        axis.title = element_text(color = "black", size = 13),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "right")+ 
  coord_flip() -> .missing_plot2_ds2

plot_grid(.missing_plot1_ds2, .missing_plot2_ds2)


Figure. Frequency of missing values for each item (total) and separated by gender. The vertical line marks the mean frequency. The grey area covers values between -1 and +1 SD around the mean.

Descriptive statistics

Age & Gender

# table for age

rbind(
  .dataset1 %>% 
    dplyr::select(Alter) %>%  # keep NAs so the missing count below is meaningful
    summarise(Missing = sum(is.na(Alter)),
              M = mean(Alter, na.rm = T), 
              SD = sd(Alter, na.rm = T), 
              min = min(Alter, na.rm = T), 
              max = max(Alter, na.rm = T)),
  .dataset2 %>% 
    dplyr::select(Alter) %>%  # keep NAs so the missing count below is meaningful
    summarise(Missing = sum(is.na(Alter)),
              M = mean(Alter, na.rm = T), 
              SD = sd(Alter, na.rm = T), 
              min = min(Alter, na.rm = T), 
              max = max(Alter, na.rm = T))
) %>% 
  mutate(sample = c("Dataset 1", "Dataset 2")) %>% 
  dplyr::select(sample, everything()) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Number of missings, mean, standard deviation, minimum and maximum for age") %>% 
  kable_styling("striped", full_width = F)
Table: Number of missings, mean, standard deviation, minimum and maximum for age
sample Missing M SD min max
Dataset 1 0 48.03 12.86 22 75
Dataset 2 0 43.30 11.74 16 73


# table for gender
merge(
  .dataset1 %>% 
    dplyr::select(sex) %>% 
    mutate(sex = forcats::fct_explicit_na(sex, na_level = "missing")) %>% 
    count(sex) %>% 
    mutate(p = n/nrow(dataset1)*100), 
  .dataset2 %>% 
    dplyr::select(sex) %>%
    mutate(sex = forcats::fct_explicit_na(sex, na_level = "missing")) %>% 
    count(sex) %>% 
    mutate(p = n/nrow(dataset2)*100),
  by = "sex", all.y = T
) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Frequencies for gender",
        col.names = c("gender","n", "%", "n", "%")) %>% 
  kable_styling(full_width = F) %>% 
  add_header_above(., c(" " = 1, "Dataset 1" = 2, "Dataset 2" = 2))
Table: Frequencies for gender
Dataset 1
Dataset 2
gender n % n %
female 163 64.17 371 64.63
male 91 35.83 202 35.19
missing NA NA 1 0.17

Item statistics

The following table shows the number of valid cases, the frequency of missing values, mean, standard deviation, minimum and maximum, as well as skewness and kurtosis for each of the 47 gender-unspecific SOMS-7T items.

rbind(dataset1, dataset2) %>% 
  mutate(sample = c(rep("Dataset 1", nrow(dataset1)), rep("Dataset 2", nrow(dataset2)))) %>% 
  gather(., item, value, - sample) %>% 
  mutate(item = as.numeric(gsub("^S_", "", item))) %>% 
  group_by(item, sample) %>% 
  summarise(n = length(na.omit(value)),
            miss = sum(is.na(value)),
            M = mean(value, na.rm = T),
            SD = sd(value, na.rm = T),
            Min = min(value, na.rm = T),
            Max = max(value, na.rm = T),
            Skew = skew(value, na.rm = T),
            Kurtosis = kurtosi(value, na.rm = T)) %>% 
  # kable(., "html",
  #       digits = 2, row.names = F,
  #       caption = "Table: Descriptive item statistics for both samples",
  #       col.names = c("Item-no.", "sample", "n", "missing", "M", "SD", "Min", "Max", "Skew", "Kurtosis")) %>% 
  # kable_styling("striped")
DT::datatable(., rownames = F, 
              filter= "top", 
              caption = "Table: Descriptive item statistics for both samples",
              colnames = c("Item-no.", "sample", "n", "missing", "M", "SD", "Min", "Max", "Skew", "Kurtosis"),
              extensions = "Buttons",
              options = list(dom = "Bfrtip", pageLength = 20,
                             buttons = c("csv", "pdf"))) %>% 
  formatRound(c("M", "SD", "Min", "Max", "Skew", "Kurtosis"), 3)

Distributions

rbind(dataset1, dataset2) %>% 
  mutate(sample = c(rep("Dataset 1", nrow(dataset1)), rep("Dataset 2", nrow(dataset2)))) %>% 
  gather(., item, value, - sample) %>% 
  mutate(item = gsub("^S_", "S", item)) %>%
  mutate(item = factor(item, c("S1","S2","S3","S4","S5","S6","S7","S8","S9","S10","S11","S12","S13","S14",
                       "S15","S16","S17","S18","S19","S20","S21","S22","S23","S24","S25","S26","S27",
                       "S28","S29","S30","S31","S32","S33","S34","S35","S36","S37","S38","S39","S40",
                       "S41","S42","S43","S44","S45","S46","S47"))) %>% 
  group_by(sample, item, value) %>% 
  summarise(n = length(value)) %>% 
  mutate(p = ifelse(sample == "Dataset 1", n/nrow(dataset1), n/nrow(dataset2))) %>% 
  na.omit() %>% 
  ggplot(., aes(x = factor(value), y = p, fill = sample)) +
  facet_wrap(~ item, ncol = 4) +
  geom_bar(stat = "identity", position = position_dodge()) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .25), labels = c("0%", "25%", "50%", "75%", "100%")) +
  labs(y = "Total count of cases",
       x = "Symptom severity",
       fill = "Sample") +
  theme_classic() +
  theme(legend.position = "top",
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 14, color = "black"),
        legend.text = element_text(size = 12, color = "black"),
        legend.title = element_text(size = 13, color = "black"))

Sample comparison

rbind(dataset1, dataset2) %>%
  mutate(MW = apply(., 1, mean, na.rm = T)) %>%
  mutate(sample = factor(c(rep("Dataset 1", nrow(dataset1)), rep("Dataset 2", nrow(dataset2))))) %>%
  dplyr::select(MW, sample) %>%
  group_by(sample) %>%
  summarise(M = mean(MW, na.rm = T),
            SD = sd(MW, na.rm = T),
            Min = min(MW, na.rm = T),
            Max = max(MW, na.rm = T),
            Skew = skew(MW, na.rm = T),
            Kurtosis = kurtosi(MW, na.rm = T)) %>% 
  kable(., "html",
        digits = 2, 
        caption = "Table: Key statistics for the global severity of symtoms for all 47 gender-unspecific SOMS-7 items for both samples") %>% 
  kable_styling("striped", full_width = F)
Table: Key statistics for the global severity of symptoms for all 47 gender-unspecific SOMS-7 items for both samples
sample M SD Min Max Skew Kurtosis
Dataset 1 0.83 0.45 0.15 3.02 1.25 2.19
Dataset 2 0.98 0.65 0.02 4.00 1.23 1.97

The mean symptom severity was compared between the samples using Welch's t-test.

rbind(dataset1, dataset2) %>%
  mutate(MW = apply(., 1, mean, na.rm = T)) %>%
  mutate(sample = factor(c(rep("Dataset 1", nrow(dataset1)), rep("Dataset 2", nrow(dataset2))))) %>%  dplyr::select(MW, sample) -> .comparison

t.test(MW ~ sample, data = .comparison) #t(678.41) = -3.91, p < .001

    Welch Two Sample t-test

data:  MW by sample
t = -3.9088, df = 678.41, p-value = 0.0001021
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.23044444 -0.07634035
sample estimates:
mean in group Dataset 1 mean in group Dataset 2 
              0.8279417               0.9813341 
psych::cohen.d(.comparison, "sample") # d = .27 [0.12, 0.41]
Call: psych::cohen.d(x = .comparison, group = "sample")
Cohen d statistic of difference between two means
   lower effect upper
MW  0.11   0.26  0.41

Multivariate (Mahalanobis) distance between groups
[1] 0.26
r equivalent of difference between two means
  MW 
0.12 

The standardized mean difference (Cohen's d) was:

lsr::cohensD(MW ~ sample, data = .comparison) #0.26
[1] 0.2570848

Correlations

# polychoric correlations
Dataset1_poly <- cor_auto(dataset1[,1:47])
Variables detected as ordinal: S_01; S_02; S_03; S_04; S_05; S_06; S_07; S_08; S_09; S_10; S_11; S_12; S_13; S_14; S_15; S_16; S_17; S_18; S_19; S_20; S_21; S_22; S_23; S_24; S_25; S_26; S_27; S_28; S_29; S_30; S_31; S_32; S_33; S_34; S_35; S_36; S_37; S_38; S_39; S_40; S_41; S_42; S_43; S_44; S_45; S_46; S_47
Dataset2_poly <- cor_auto(dataset2[,1:47])
Variables detected as ordinal: S_01; S_02; S_03; S_04; S_05; S_06; S_07; S_08; S_09; S_10; S_11; S_12; S_13; S_14; S_15; S_16; S_17; S_18; S_19; S_20; S_21; S_22; S_23; S_24; S_25; S_26; S_27; S_28; S_29; S_30; S_31; S_32; S_33; S_34; S_35; S_36; S_37; S_38; S_39; S_40; S_41; S_42; S_43; S_44; S_45; S_46; S_47
# Spearman correlation
dataset1_cor <- cor(dataset1[,1:47], use = "pairwise", method = "spearman")
dataset2_cor <- cor(dataset2[,1:47], use = "pairwise", method = "spearman")

# backup correlation data frame, because we modify them later
.dataset1_cor <- cor(dataset1[,1:47], use = "pairwise", method = "spearman")
.dataset2_cor <- cor(dataset2[,1:47], use = "pairwise", method = "spearman")
rbind(merge(rbind(dataset1_cor %>% 
                    .[lower.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table(),
                  dataset1_cor %>% 
                    .[lower.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table() %>% 
                    prop.table(.)*100) %>% 
              as.data.frame() %>% 
              mutate(statistic = c("Frequency (n)", "Percentage (%)")),
            rbind(Dataset1_poly %>% 
                    .[upper.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table(),
                  Dataset1_poly %>% 
                    .[upper.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table() %>% 
                    prop.table(.)*100) %>% 
              as.data.frame() %>% 
              mutate(statistic = c("Frequency (n)", "Percentage (%)")),
            by = "statistic"),
      merge(rbind(dataset2_cor %>% 
                    .[lower.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table(),
                  dataset2_cor %>% 
                    .[lower.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table() %>% 
                    prop.table(.)*100) %>% 
              as.data.frame() %>% 
              mutate(statistic = c("Frequency (n)", "Percentage (%)")),
            rbind(Dataset2_poly %>% 
                    .[upper.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table(),
                  Dataset2_poly %>% 
                    .[upper.tri(.)] %>% 
                    abs() %>% 
                    cut(., breaks = c(0, .10, .30, .50, 1), include.lowest = T) %>% 
                    table() %>% 
                    prop.table(.)*100) %>% 
              as.data.frame() %>% 
              mutate(statistic = c("Frequency (n)", "Percentage (%)")),
            by = "statistic")) %>% 
    kable(., "html",
        digits = 2,
        caption = "Table: Sum of correlation coefficients in both Datasets",
        col.names = c("statistic", 
                      "[< .1]", "[.1, .3]", "[.3, .5]", "[> .5]",
                      "[< .1]", "[.1, .3]", "[.3, .5]", "[> .5]")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 1, "Spearman correlation" = 4, "Polychoric correlations (NPD)" = 4)) %>% 
  pack_rows("Dataset 1", 1, 2) %>% 
  pack_rows("Dataset 2", 3, 4)
Table: Frequencies of correlation coefficients by magnitude in both datasets
Spearman correlation
Polychoric correlations (NPD)
statistic [< .1] [.1, .3] [.3, .5] [> .5] [< .1] [.1, .3] [.3, .5] [> .5]
Dataset 1
Frequency (n) 309.00 688.00 76.00 8.00 152.00 577.00 301.00 51.00
Percentage (%) 28.58 63.64 7.03 0.74 14.06 53.38 27.84 4.72
Dataset 2
Frequency (n) 60.00 799.00 207.00 15.00 11.00 460.00 552.00 58.00
Percentage (%) 5.55 73.91 19.15 1.39 1.02 42.55 51.06 5.37

Eigenvalues

# calculate eigenvalues
.Dataset1_cor_eigen <- dataset1_cor  %>% eigen(.) %>% .$values
.Dataset1_poly_eigen <- Dataset1_poly %>% eigen(.) %>% .$values
.Dataset2_cor_eigen <- dataset2_cor %>% eigen(.) %>% .$values
.Dataset2_poly_eigen <- Dataset2_poly %>% eigen(.) %>%  .$values

# plotting eigenvalues
cbind(.Dataset1_cor_eigen, .Dataset1_poly_eigen, .Dataset2_cor_eigen, .Dataset2_poly_eigen) %>% 
  as.data.frame(.) %>% 
  mutate(item = as.numeric(rownames(.))) %>% 
  gather(., eigen, value, -item) %>% 
  mutate(sample = ifelse(grepl("Dataset2", eigen), "dataset2", "dataset1"),
         type = ifelse(grepl("poly", eigen), "Polychoric (NPD)", "Spearman")) %>% 
  ggplot(., aes(y = value, x = item, color = type)) +
  facet_wrap(~ sample, scales = "free") +
  geom_point() +
  geom_line() +
  geom_hline(aes(yintercept = 0), linetype = "dashed", color = "grey50") +
  scale_color_manual(values = c("orange", "blue")) +
  labs(x = "SOMS7-Itemnumber",
       y = "Eigenvalue",
       color = "Correlation") +
  theme_classic() +
  theme(legend.position = "bottom",
        axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 15, color = "black"))


Figure. Distribution of the eigenvalues for Spearman (blue) and polychoric (orange) correlations. Distributions are shown for both datasets (left: Dataset 1, right: Dataset 2). The polychoric correlation matrix was not positive definite, so the nearest positive definite matrix (NPD) was used.
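
cor_auto() applies this correction internally (via its forcePD argument); the following toy sketch, with a hypothetical 3 × 3 matrix and Matrix::nearPD, illustrates the idea:

library(Matrix)
# an inconsistent "correlation" matrix: r12 and r13 are strongly positive,
# yet r23 is strongly negative, which no real data set could produce
m <- matrix(c( 1,  .9,  .9,
              .9,   1, -.9,
              .9, -.9,   1), nrow = 3)
eigen(m)$values                        # one eigenvalue is negative: not positive definite
as.matrix(nearPD(m, corr = TRUE)$mat)  # nearest positive definite correlation matrix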

Comparison of correlation coefficients

The correlation between the Spearman and the polychoric correlation matrices was very low in both datasets (both |r| < .10). The average correlation, both Spearman and polychoric, was higher in Dataset 2. While in Dataset 1 the mean correlation was low, in Dataset 2 the mean polychoric correlation was medium-sized.

data.frame(sample = c("Dataset 1", "Dataset 2"),
           'mean r' = c(mean(dataset1_cor[lower.tri(dataset1_cor)]), mean(dataset2_cor[lower.tri(dataset2_cor)])),
           'mean rp' = c(mean(Dataset1_poly[lower.tri(Dataset1_poly)]), mean(Dataset2_poly[lower.tri(Dataset2_poly)])),
           r = c(cor(dataset1_cor %>% .[upper.tri(.)], Dataset1_poly[lower.tri(Dataset1_poly)]),
                 cor(dataset2_cor %>% .[upper.tri(.)], Dataset2_poly[lower.tri(Dataset2_poly)]))) %>% 
  kable(., "html",
        digit = 3,
        caption = "Table: Mean Spearman and polychoric correlation and correlation between both correlation matrices",
        col.names = c("Sample", "Spearman", "Polychoric", "r")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 1, "Mean correlation" = 2, " " = 1)) %>% 
  footnote(general = "r = correlation between Spearman and polychoric correlation matrices.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Mean Spearman and polychoric correlation and correlation between both correlation matrices
Mean correlation
Sample Spearman Polychoric r
Dataset 1 0.158 0.247 -0.043
Dataset 2 0.229 0.321 -0.074
Note. r = correlation between Spearman and polychoric correlation matrices.



For both datasets, a heatmap is shown in the figures below. The results indicate that correlations in Dataset 1 are lower and more often negative compared to Dataset 2.

# overwrite the upper triangle with polychoric correlations
# Note: do not use this correlation matrix later on, because the two triangles differ. Use the backup instead!
dataset1_cor[upper.tri(dataset1_cor)] = Dataset1_poly[upper.tri(Dataset1_poly)]
# plot correlations
dataset1_cor %>% 
  psych::corPlot(., xlas = 3,
                 main = "Dataset 1:\nSpearman correlation (below diagonal) and\npolychoric correlation (NPD-matrix, above diagonal)")


Figure. Heatmap of the Spearman (below diagonal) and polychoric (above diagonal, NPD) correlation matrices for Dataset 1.

# overwrite the upper triangle with polychoric correlations
# Note: do not use this correlation matrix later on, because the two triangles differ. Use the backup instead!
dataset2_cor[upper.tri(dataset2_cor)] = Dataset2_poly[upper.tri(Dataset2_poly)]
# plot correlations
dataset2_cor %>% 
  psych::corPlot(., xlas = 3,
                 main ="Dataset 2:\nSpearman correlation (below diagonal) and\npolychoric correlation (NPD-matrix, above diagonal)")


Figure. Heatmap of the Spearman (below diagonal) and polychoric (above diagonal, NPD) correlation matrices for Dataset 2.

Network

Estimation of \(\gamma\) and \(\lambda\)

Because the items are ordinal, Spearman correlations between all items and a Graphical Gaussian Model (GGM) were used to calculate the edges. In this model, connections between symptoms are represented as partial correlations (Epskamp, Borsboom, and Fried 2018), i.e. the relationship between two nodes controlling for all other relationships in the network. Thus, only the “true” relationship between two items is displayed. The network therefore corresponds to a series of regressions in which each symptom is the dependent variable and all other symptoms are potential predictors. Hence, the model is exploratory and data-driven.

Because the GGM estimates a large number of parameters, there is a risk of identifying false-positive edges. For this reason, a regularization technique, the so-called graphical lasso (glasso; for details see Tibshirani 1994; Friedman, Hastie, and Tibshirani 2008), was used. Regularization means that the partial correlations between nodes are calculated in a very conservative way: all small edges are set to exactly zero, so the network has fewer connections (a sparse network). This protects against the false assumption that a connection exists when in reality it does not. The strength of the regularization depends on the so-called tuning parameter \(\lambda\). The larger this parameter is, the more connections are removed, i.e. it yields sparser networks. If \(\lambda\) is zero, no connections are removed. Because it is unknown whether the true network is sparse or dense, \(\lambda\) was chosen empirically by estimating multiple networks with different \(\lambda\)-values. The best-fitting network is selected using the EBIC (Extended Bayesian Information Criterion; Epskamp, Borsboom, and Fried 2018; Foygel and Drton 2010). Smaller values indicate better models.

For the EBIC, the hyperparameter \(\gamma\) can be set. The larger \(\gamma\) is, the more models with fewer connections are preferred. Hence, it is important to choose this parameter with caution. Commonly \(\gamma\) = .50 is used (Foygel and Drton 2010). However, the choice of this value depends on how sensitive or specific a model should be: the lower \(\gamma\), the more likely it is that correlations will be considered “true” and remain within the network, whereas a high value of \(\gamma\) excludes false-positive correlations.
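
For reference, the criterion being minimized takes the following form (Foygel and Drton 2010), where \(L\) is the log-likelihood of the estimated model, \(E\) the number of non-zero edges, \(n\) the sample size and \(p\) the number of nodes:

\[\mathrm{EBIC}_{\gamma} = -2L + E \log(n) + 4 \gamma E \log(p)\]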
Since the individuals in the present study suffer from different and multiple physical symptoms, it is assumed that various weak connections remain. Therefore, a very low value of \(\gamma\) is not advisable, because many irrelevant relationships would remain. However, a very high value of \(\gamma\) should also not be chosen, because the heterogeneous and complex structure of persistent physical symptoms would not be reflected realistically. Since we cannot refer to any existing literature, the optimal value for \(\gamma\) was determined exploratorily: networks with different \(\gamma\)-values between 0 and .50 were estimated in steps of .05. For each \(\gamma\), a total of 100 \(\lambda\)-values were evaluated and the EBIC of each model was determined.
This procedure resulted in networks with 173 (Dataset 1) and 358 (Dataset 2) non-zero edges, i.e. 84% of all possible edges were set to zero in the first network and 67% in the second (see the edge-weight tables below).

Dataset 1

# vector with gamma values to test for
.gamma_ds1 <- seq(0,0.50,0.05)
# testing models
set.seed(1701)
lapply(.gamma_ds1, function(x) {EBICglasso(S = .dataset1_cor,
                                           returnAllResults = T,
                                           n = nrow(dataset1),
                                           nlambda = 100,
                                           gamma = x)}) -> model_ds1

The \(\lambda\) at the EBIC minimum for the commonly used hyperparameter (\(\gamma = .50\)) was very high. As shown above, the correlations in Dataset 1 were only small, and such a high \(\lambda\)-value would leave very few connections within the network. Therefore, a \(\gamma\) of .20 was used for the network estimation.

cbind(.gamma_ds1,
      unlist(lapply(model_ds1, function(x) {x$ebic[which.min(x$ebic)]})),
      unlist(lapply(model_ds1, function(x) {x$lambda[which.min(x$ebic)]}))) %>% 
  as.data.frame() %>% 
  dplyr::select(gamma = .gamma_ds1, lambda = V3, min_EBIC = V2) %>% 
  kable(., "html",
        caption = "Table: Estimated EBIC for different tuning parameters in Dataset 1",
        col.names = c("Gamma", "Lambda", "EBIC Minimum")) %>%
  kable_styling("striped", full_width = F) %>%
  footnote(general = "In sum, networks for 100 different lamdas, starting with 0, were estimated.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Estimated EBIC for different tuning parameters in Dataset 1
Gamma Lambda EBIC Minimum
0.00 0.0988604 10648.57
0.05 0.0988604 10898.06
0.10 0.0988604 11147.55
0.15 0.2283804 11339.38
0.20 0.2283804 11472.59
0.25 0.3636461 11585.74
0.30 0.3636461 11604.22
0.35 0.3636461 11622.70
0.40 0.3636461 11641.18
0.45 0.3636461 11659.66
0.50 0.3636461 11678.15
Note. In sum, networks for 100 different lambdas, starting with 0, were estimated.

Dataset 2

# vector with gamma values to test for
.gamma_ds2 <- seq(0,0.50,0.05)
# testing models
set.seed(1701)
lapply(.gamma_ds2, function(x) {EBICglasso(S = .dataset2_cor,
                                           returnAllResults = T,
                                           n = nrow(dataset2),
                                           # here we adjust the lambda.min.ratio;
                                           # this was a post-hoc adjustment after a warning message
                                           lambda.min.ratio = 0.1,
                                           nlambda = 100,
                                           gamma = x)}) -> model_ds2

Since the same \(\lambda\) produced the EBIC minimum for every \(\gamma\)-value, \(\gamma = .50\) was used as the hyperparameter in the network estimation for Dataset 2.

cbind(.gamma_ds2,
      unlist(lapply(model_ds2, function(x) {x$ebic[which.min(x$ebic)]})),
      unlist(lapply(model_ds2, function(x) {x$lambda[which.min(x$ebic)]}))) %>% 
  as.data.frame() %>% 
  dplyr::select(gamma = .gamma_ds2, lambda = V3, min_EBIC = V2) %>% 
  kable(., "html",
        caption = "Table: Estimated EBIC for different tuning parameters in Dataset 2",
        col.names = c("Gamma", "Lambda", "EBIC Minimum")) %>%
  kable_styling("striped", full_width = F) %>%
  footnote(general = "In sum, networks for 100 different lamdas, starting with 0, were estimated.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Estimated EBIC for different tuning parameters in Dataset 2
Gamma Lambda EBIC Minimum
0.00 0.0764673 19830.59
0.05 0.0764673 20106.26
0.10 0.0764673 20381.93
0.15 0.0764673 20657.60
0.20 0.0764673 20933.27
0.25 0.0764673 21208.94
0.30 0.0764673 21484.61
0.35 0.0764673 21760.28
0.40 0.0764673 22035.95
0.45 0.0764673 22311.62
0.50 0.0764673 22587.29
Note. In sum, networks for 100 different lambdas, starting with 0, were estimated.

Network estimate

Dataset 1

net_Dataset1 <- estimateNetwork(dataset1[,1:47], 
                                default = "EBICglasso",
                                tuning = 0.20,
                                lambda.min.ratio = 0.01,
                                corMethod = "cor", 
                                corArgs = list(method = "spearman"))
Estimating Network. Using package::function:
  - qgraph::EBICglasso for EBIC model selection
    - using glasso::glasso
plot(net_Dataset1, layout = "spring")

Frequency of zero and non-zero (in absolute values) edges in Dataset 1

# number of edges in the dataset1 network

rbind(table(cut(abs(net_Dataset1$results$optnet[upper.tri(net_Dataset1$results$optnet)]), breaks = c(0, 1.0e-10, 1), include.lowest = T)), #173
  table(cut(abs(net_Dataset1$results$optnet[upper.tri(net_Dataset1$results$optnet)]), breaks = c(0, 1.0e-10, 1), include.lowest = T))/(908+173)*100) %>% 
  data.frame() %>% 
  mutate(label = c("n", "%")) %>% select(label, everything()) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Zero and non-zero edge-weights in absolute values for dataset 1",
        col.names = c(" ", "Zero-edges", "Non-zero edges")) %>% 
  kable_styling("striped", full_width = F)
Table: Zero and non-zero edge-weights in absolute values for dataset 1
Zero-edges Non-zero edges
n 908 173
% 84 16
#173/1081 = 16%
#1-173/1081 = 84%

Frequency of positive and negative edge-weights in Dataset 1

# positive vs. negative edges in dataset1 network
data.frame(Var1 = net_Dataset1$results$optnet[upper.tri(net_Dataset1$results$optnet)]) %>% 
  mutate(Var1 = as.numeric(as.character(Var1))) %>% 
  mutate(posneg = ifelse(Var1 == 0, "zero",
                         ifelse(Var1 > 0, "positive", "negative"))) %>% 
  count(posneg) %>% mutate(p = n / sum(n)*100) %>% 
  filter(posneg != "zero") %>% mutate(p2 = n / sum(n)*100) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Frequency of positive and negative edge-weights in dataset 1",
        col.names = c("sign of correlation", "Frequency", "% (of total edges)", "% (if zero edges deleted)")) %>% 
  kable_styling("striped", full_width = F)
Table: Frequency of positive and negative edge-weights in dataset 1
sign of correlation Frequency % (of total edges) % (if zero edges deleted)
positive 173 16 100
# 16% positive edges, 84% zero-edges, no negative edges
# if zeros are removed: 100% positive edges

Dataset 2

net_Dataset2 <- estimateNetwork(dataset2[,1:47],
                                default = "EBICglasso",
                                tuning = 0.5,
                                lambda.min.ratio = 0.1,
                                corMethod = "cor",
                                corArgs = list(method = "spearman"))
Estimating Network. Using package::function:
  - qgraph::EBICglasso for EBIC model selection
    - using glasso::glasso
plot(net_Dataset2, layout = "spring")

Frequency of zero and non-zero (in absolute values) edges in Dataset 2

# number of edges in the dataset2 network
rbind(table(cut(abs(net_Dataset2$results$optnet[upper.tri(net_Dataset2$results$optnet)]), breaks = c(0, 1.0e-10, 1), include.lowest = T)), #358
      table(cut(abs(net_Dataset2$results$optnet[upper.tri(net_Dataset2$results$optnet)]), breaks = c(0, 1.0e-10, 1), include.lowest = T))/(723+358)*100) %>%
  data.frame() %>% 
  mutate(label = c("n", "%")) %>% select(label, everything()) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Zero and non-zero edge-weights in absolute values for dataset 2",
        col.names = c(" ", "Zero-edges", "Non-zero edges")) %>% 
  kable_styling("striped", full_width = F)
Table: Zero and non-zero edge-weights in absolute values for dataset 2
Zero-edges Non-zero edges
n 723.00 358.00
% 66.88 33.12
#358/1081 = 33%
#1-358/1081 = 67%

Frequency of positive and negative edge-weights in Dataset 2

# positive vs. negative edges in dataset2 network
data.frame(Var1 = net_Dataset2$results$optnet[upper.tri(net_Dataset2$results$optnet)]) %>% 
  mutate(Var1 = as.numeric(as.character(Var1))) %>% 
  mutate(posneg = ifelse(Var1 == 0, "zero",
                         ifelse(Var1 > 0, "positive", "negative"))) %>% 
  count(posneg) %>% mutate(p = n / sum(n)*100) %>% 
  filter(posneg != "zero") %>% mutate(p2 = n / sum(n)*100) %>% 
  kable(., "html",
        digits = 2,
        caption = "Table: Frequency of positive and negative edge-weights in dataset 2",
        col.names = c("sign of correlation", "Frequency", "% (of total edges)", "% (if zero edges deleted)")) %>% 
  kable_styling("striped", full_width = F)
Table: Frequency of positive and negative edge-weights in dataset 2
sign of correlation Frequency % (of total edges) % (if zero edges deleted)
negative 1 0.09 0.28
positive 357 33.02 99.72
# 33% positive edges, 0.09% negative and 66.9% zero-edges
# if zeros are removed: 99.7% positive and 0.28% negative edges (one edge, S04 <-> S17)

Bootstrapping

To analyse the stability of the networks, the procedure described by Epskamp and Fried (2018) was followed. Investigating stability requires bootstrapped confidence intervals (CIs) for edge-weights and centrality measures. Because the bootstrap analysis takes a long time, the bootstrapped data were saved once and are loaded each time the script runs. The script below provides all information necessary to replicate our bootstrap.

# CAUTION:
# all syntax below runs; uncomment if needed
# Note: set.seed() changed with R version 3.6.0 (see https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494)
# we used 6 cores for bootstrapping; if your machine has fewer cores, reduce this parameter
# the total time was noted by hand below each bootstrap; elapsed time depends on nBoots, caseN and available cores

# # dataset1 Network
# set.seed(1701)
# boot1_ds1 <- bootnet(net_Dataset1, nBoots = 5000, caseN = 50, nCores = 6)
# # Total time: 7:46 + 8:45 minutes

# set.seed(1701)
# boot2_ds1 <- bootnet(net_Dataset1, nBoots = 5000, caseN = 50, nCores = 6, type = "case", statistics = c("edge", "strength", "closeness", "betweenness"))
# # Total time: 6:56 + 8:58 minutes

# # dataset2 Network
# set.seed(1701)
# boot1_ds2 <- bootnet(net_Dataset2, nBoots = 5000, caseN = 50, nCores = 6)
# # Total time: 6:20 + 9:54 minutes

# set.seed(1701)
# boot2_ds2 <- bootnet(net_Dataset2, nBoots = 5000, caseN = 50, nCores = 6, type = "case", statistics = c("edge", "strength", "closeness", "betweenness"))
# # Total time: 5:08 + 9:05 minutes

# # export bootstrap data to save time
# # be careful if you have an older version of R and want to read the data:
# # you have to add version = 2 (see ?saveRDS)
# saveRDS(boot1_ds1, file = "Bootstrap_1 dataset1.RData")
# saveRDS(boot2_ds1, file = "Bootstrap_2 dataset1.RData")
# saveRDS(boot1_ds2, file = "Bootstrap_1 dataset2.Rdata")
# saveRDS(boot2_ds2, file = "Bootstrap_2 dataset2.Rdata")
# the syntax which creates this bootstrapped data can be seen above
boot1_ds1 <- readRDS("Bootstrap_1 dataset1.RData")
boot2_ds1 <- readRDS("Bootstrap_2 dataset1.RData")
boot1_ds2 <- readRDS("Bootstrap_1 dataset2.Rdata")
boot2_ds2 <- readRDS("Bootstrap_2 dataset2.Rdata")
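
Once loaded, the bootstrap objects can be inspected with the usual bootnet methods; the following calls are an illustrative sketch, not part of the original pipeline:

# edge-weight bootstrap: sample values with bootstrapped CIs
plot(boot1_ds1, labels = FALSE, order = "sample")
# case-dropping bootstrap: stability of the centrality indices
plot(boot2_ds1, statistics = c("strength", "closeness", "betweenness"))
# correlation-stability (CS) coefficients for the centrality measures
corStability(boot2_ds1)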

Cluster

A community is a group of symptoms that are closely interconnected with each other and only sparsely connected to the remaining nodes in the network. To identify such clusters of items, the so-called Walktrap algorithm was used (Pons and Latapy 2006). This algorithm uses random walks within the network to test which nodes are closely connected. A random walk is a path between two nodes across a network created by taking random steps (Newman 2010). Shorter random walks indicate that two nodes belong to the same community.
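
As a minimal toy illustration of the principle (hypothetical data, unrelated to the study networks), three densely connected “islands” joined by single bridge edges are recovered as three communities:

library(igraph)
set.seed(1701)
# three dense clusters of five nodes each, one random edge between cluster pairs
g <- sample_islands(islands.n = 3, islands.size = 5,
                    islands.pin = 0.8, n.inter = 1)
cluster_walktrap(g)$membership  # nodes within one island share a community label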

Cluster identification

To identify the clusters, igraph objects are needed.

# plots are hidden in the output, because they are identical to the plots in the section "Network estimate";
# we only need the qgraph objects, which we then transform into igraph objects
.net1 <- qgraph(.dataset1_cor,
                 graph = "glasso",
                 sampleSize = 254,
                 label.cex = 2,
                 tuning = 0.20,
                 layout = "spring")
.net2 <- qgraph(.dataset2_cor, 
                 graph = "glasso",
                 sampleSize = 574,
                 label.cex = 2,
                 tuning = 0.50,       
                 layout = "spring")
# we need this as igraph object
.net_ds1 <- as.igraph(.net1)
.net_ds2 <- as.igraph(.net2)

After the correct objects were created, walktrap algorithm was used.

# Dataset 1 walktrap
.cluster_walk_ds1 <- cluster_walktrap(.net_ds1, weights = E(.net_ds1)$weight)

# Dataset 2 walktrap
.cluster_walk_ds2 <- cluster_walktrap(.net_ds2, weights = E(.net_ds2)$weight)

In Dataset 1, 8 clusters were identified, whereas Dataset 2 consisted of 5 clusters. The following table shows which item belongs to which cluster.

.group_ds1 <- .cluster_walk_ds1$membership
.group_ds2 <- .cluster_walk_ds2$membership
# dummy-syntax to check cluster categories

cbind(data.frame(cluster = .group_ds1) %>%  mutate(item = rownames(.)),
      data.frame(cluster = .group_ds2) %>%  mutate(item = rownames(.))) %>% 
  data.frame() %>% 
  # no need of two columns with item-number
  dplyr::select(-4) %>% 
  dplyr::select(item, cluster, cluster.1) -> .cluster_member
  

.cluster_member %>% 
  mutate(label = .labels[1:47, "Englisch_Label"]) %>% 
  mutate(cluster = recode(cluster, 
                                 "1" = "Neurological symptoms",
                                 "2" = "Gastrointestinal symptoms",
                                 "3" = "Urogenital symptoms",
                                 "4" = "(Acute) Stress-associated symptoms",
                                 "5" = "Cardiovascular symptoms",
                                 "6" = "Musculoskeletal problems",
                                 "7" = "Rare Pseudoneurological symptoms",
                                 "8" = "Sexual indifference (loss of libido)"),
         cluster.1 = recode(cluster.1,
                                  "1" = "Gastrointestinal symptoms",
                                  "2" = "Neurological symptoms",
                                  "3" = "Cardiovascular symptoms",
                                  "4" = "Musculoskeletal problems",
                                  "5" = "Urogenital symptoms")) %>% 
  mutate(similar = ifelse(cluster == cluster.1, "yes", "no")) %>% 
  dplyr::select(item, label, everything()) %>%
  datatable(., rownames = F,
            caption = "Table: Item to cluster allocation for both networks",
            colnames = c("Item-No.", "Item label", "Cluster-ID in Dataset 1", " Cluster-ID in Dataset 2", "Similar"))

Cluster Plot Dataset 1

# manual cluster id 
.group_walk_ds1 <- list("Neurological symptoms" = c(1,34,35,39:42,44:46), #1
                        "Gastrointestinal symptoms" = c(2,7,10:17,20:21,23),  #2
                        "Musculoskeletal problems" = c(3:5), #6
                        "Cardiovascular symptoms" = c(6,18,19,22,24,25,28:30,36,37),#5
                        "Urogenital symptoms" = c(8,9,33,38), #3
                        "(Acute) Stress-associated\nsymptoms" = c(26,27,31), #4
                        "Rare Pseudoneurological\nsymptoms" = c(43,47), #7
                        "Sexual indifference\n(loss of libido)" = c(32)) #8

# plotting cluster

qgraph(.dataset1_cor, 
       graph = "glasso",
       groups = .group_walk_ds1,
       color = c("red", "blue", "green", "yellow", "orange", "white", "purple", "grey"),
       vsize = 7,
       sampleSize = 254,
       label.cex = 1,
       tuning = 0.20,
       layout = "spring",
       legend = F,
       mar = c(5,2,2,2))
# plot legend
legend(x = -1., y = -1.15, 
       legend = c("Neurological symptoms", 
                  "Gastrointestinal symptoms",
                  "Musculoskeletal problems",
                  "Cardiovascular symptoms",
                  "Urogenital symptoms",
                  "(Acute) Stress-associated symptoms",
                  "Rare Pseudoneurological symptoms",
                  "Sexual indifference (loss of libido)"),
       ncol = 3,
       bty = "n",
       pt.cex = 2,
       pch = 21,
       pt.bg = c("red", "blue", "green", "yellow", "orange", "white", "purple", "grey"),
       col = "black")

Figure. Visualization of Dataset 1 network communities based on the Walktrap algorithm. Items with the same color belong to the same community.

Cluster Plot Dataset 2

# manual cluster id
.group_walk_ds2 <- list("Neurological symptoms" = c(1, 30:32, 34,35,39:47),
                        "Gastrointestinal symptoms" = c(2, 7,10:17,20:23),
                        "Musculoskeletal problems" = c(3:5),                   
                        "Cardiovascular symptoms" = c(6,18,19,24:29,36,37),
                        "Urogenital symptoms" = c(8,9,33,38))
# plotting cluster
qgraph(.dataset2_cor, 
       graph = "glasso",
       groups = .group_walk_ds2,
       color = c("red", "blue", "green", "yellow", "orange"),
       vsize = 7,
       sampleSize = 574,
       label.cex = 1,
       tuning = 0.50,       # matches the tuning used for the Dataset 2 network above
       lambda.min.ratio = 0.1,
       layout = "spring",
       legend = F,
       mar = c(5,2,2,2))
# plot legend
legend(x = -1., y = -1.15, 
       legend = c("Neurological symptoms", 
                  "Gastrointestinal symptoms",
                  "Musculoskeletal problems",
                  "Cardiovascular symptoms",
                  "Urogenital symptoms"),
       ncol = 3,
       bty = "n",
       pt.cex = 2,
       pch = 21,
       pt.bg = c("red", "blue", "green", "yellow", "orange"),
       col = "black")

Figure. Visualization of Dataset 2 network communities based on the Walktrap algorithm. Items with the same color belong to the same community.

Network inference

Node centrality

Dataset 1

# let's create a big table with centrality measures for all items and the cluster-id for Dataset 1
.x <- list(c("Neurological symptoms",1,34,35,39:42,44:46),
           c("Gastrointestinal symptoms", 2,7,10:17,20:21,23),
           c("Musculoskeletal problems", 3:5),
           c("Cardiovascular symptoms", 6,18,19,22,24,25,28:30,36,37),
           c("Urogenital symptoms", 8,9,33,38),
           c("(Acute) Stress-associated symptoms", 26,27,31),
           c("Rare Pseudoneurological symptoms", 43,47),
           c("Sexual indifference (loss of libido)", 32)) %>% 
  lapply(., function(x){
  as.data.frame(x) %>% 
    mutate(item = paste("S_", sep = "", x), c = x[[1]])
  }) %>% 
  lapply(., function(x){
    x[-1,]})
# merge list
.x <- plyr::ldply(.x, rbind)
# add cluster-id to centrality statistics
.x <- merge(.x, data.frame(item2 = names(dataset1),
                           Degree = as.vector(centrality(net_Dataset1)$InDegree),
                           Betweenness = as.vector(centrality(net_Dataset1)$Betweenness),
                           Closeness = as.vector(centrality(net_Dataset1)$Closeness)) %>% 
  mutate(item = paste("S_", sep = "", row.names(.))), by = "item")

.x %>% 
  dplyr::select(item = item2, c, Degree : Closeness) %>% 
  .[order(.$item),] -> .x

.x$label <- .labels[1:47, "Englisch_Label"]
.x %>% dplyr::select(item, label, everything()) -> .x

The following table shows a summary of node strength, betweenness and closeness for all symptoms in Dataset 1.

.x %>% 
  dplyr::select("Node strength" = Degree, Betweenness, Closeness) %>% 
  psych::describe() %>% t() %>% .[-c(1:2),] %>% 
  kable(., "html",
        caption = "Table: Central tendencies for the main centrality measures for Dataset 1") %>% 
  kable_styling("striped", full_width = F)
Table: Central tendencies for the main centrality measures for Dataset 1
Node strength Betweenness Closeness
mean 0.4618689 108.0851064 0.0004813
sd 0.2530268 106.5228756 0.0001165
median 0.4894712 88.0000000 0.0005061
trimmed 0.4638240 94.6666667 0.0004977
mad 0.2781987 106.7472000 0.0000819
min 0.0143394 0.0000000 0.0001438
max 0.9648651 416.0000000 0.0006333
range 0.9505257 416.0000000 0.0004894
skew -0.1333145 1.0397497 -1.3745027
kurtosis -0.9874402 0.3315381 1.1899919
se 0.0369078 15.5379583 0.0000170



A complete list of the centrality measures for each node in Dataset 1 is displayed in the next table.

# table
DT::datatable(.x, rownames = F, 
              filter= "top", 
              caption = "Table: Centrality measures for each item of Dataset 1",
              colnames = c("Item", "Itemlabel", "Cluster", "Node strength", "Betweenness", "Closeness"),
              extensions = "Buttons",
              options = list(dom = "Bfrtip", pageLength = 7,
                             buttons = c("csv", "pdf"))) %>% 
  formatRound("Degree", 3) %>% 
  formatRound("Betweenness", 0) %>% 
  formatRound("Closeness", 5)
.x %>% 
  dplyr::select(-c, -label) %>% 
  mutate_at(vars(Degree, Betweenness, Closeness), function(x) { scale(x, scale = T) }) %>% 
  gather(., centrality, measure, -item) %>% 
  mutate(centrality = recode_factor(centrality, 
                                    "Degree" = "Node strength",
                                    "Betweenness" = "Betweenness",
                                    "Closeness" = "Closeness")) %>% 
  ggplot(., aes(y = item, x = measure)) +
  facet_wrap(~ centrality, scales = "free_x") +
  geom_point() +
  geom_line(group = "1") +
  scale_x_continuous(limits = c(-3,3), breaks = seq(-3, 3, 1)) +
  labs(x = "z-value",
       y = "SOMS7 item",
       title = "Dataset 1") +
  theme_classic() + 
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 18, color = "black"))
Warning: attributes are not identical across measure variables;
they will be dropped

Figure. Standardized centrality measures (betweenness, closeness and node strength) for Dataset 1.

Dataset 2

# table with centrality measures for all items and cluster-id for dataset2
.y <- list(c("Neurological symptoms", 1, 30:32, 34,35,39:47),
           c("Gastrointestinal symptoms", 2, 7,10:17,20:23),
           c("Musculoskeletal problems", 3:5),
           c("Cardiovascular symptoms", 6,18,19,24:29,36,37),
           c("Urogenital symptoms", 8,9,33,38)) %>% 
  lapply(., function(x){
  as.data.frame(x) %>% 
    mutate(item = paste("S_", sep = "", x),
           c2 = x[[1]])
  }) %>% 
  lapply(., function(x){
    x[-1,]})
.y <- plyr::ldply(.y, rbind)
# add cluster-id to centrality statistics
.y <- merge(.y, data.frame(item2 = names(dataset1),
                           Degree = as.vector(centrality(net_Dataset2)$InDegree),
                           Betweenness = as.vector(centrality(net_Dataset2)$Betweenness),
                           Closeness = as.vector(centrality(net_Dataset2)$Closeness)) %>% 
  mutate(item = paste("S_", sep = "", row.names(.))), by = "item")

.y %>%
  dplyr::select(item = item2, c2, Degree : Closeness) %>% 
  .[order(.$item),] -> .y

.y$label <- .labels[1:47, "Englisch_Label"]
.y %>% dplyr::select(item, label, everything()) -> .y

The following table shows a summary for node strength, betweenness and closeness for all symptoms in Dataset 2.

.y %>% 
  dplyr::select("Node strength" = Degree, Betweenness, Closeness) %>% 
  psych::describe() %>% t() %>% .[-c(1:2),] %>% 
  kable(., "html",
        caption = "Table: Central tendencies for the main centrality measures for Dataset 2") %>% 
  kable_styling("striped", full_width = F)
Table: Central tendencies for the main centrality measures for Dataset 2
Node strength Betweenness Closeness
mean 0.8884063 82.297872 0.0008958
sd 0.1765296 61.775159 0.0000721
median 0.8821866 68.000000 0.0008891
trimmed 0.8951438 76.000000 0.0008973
mad 0.1867661 35.582400 0.0000708
min 0.4599306 0.000000 0.0006930
max 1.2126870 298.000000 0.0010205
range 0.7527565 298.000000 0.0003276
skew -0.2919511 1.265499 -0.1905283
kurtosis -0.3657249 1.744582 -0.0002469
se 0.0257495 9.010833 0.0000105



A complete list of the centrality measures for each node in Dataset 2 is displayed in the next table.

# table 
DT::datatable(.y, rownames = F, 
              filter= "top", 
              caption = "Table: Centrality measures for each item of Dataset 2",
              colnames = c("Item", "Itemlabel", "Cluster", "Node strength", "Betweenness", "Closeness"),
              extensions = "Buttons",
              options = list(dom = "Bfrtip", pageLength = 7,
                             buttons = c("csv", "pdf"))) %>% 
  formatRound("Degree", 3) %>% 
  formatRound("Betweenness", 0) %>% 
  formatRound("Closeness", 5)
.y %>% 
  dplyr::select(-c2, -label) %>% 
  mutate_at(vars(Degree, Betweenness, Closeness), function(x) { scale(x, scale = T) }) %>% 
  gather(., centrality, measure, -item) %>% 
  mutate(centrality = recode_factor(centrality, 
                                    "Degree" = "Node strength",
                                    "Betweenness" = "Betweenness",
                                    "Closeness" = "Closeness")) %>% 
  ggplot(., aes(y = item, x = measure)) +
  facet_wrap(~ centrality, scales = "free_x") +
  geom_point() +
  geom_line(group = "1") +
  scale_x_continuous(limits = c(-3.5,3.5), breaks = seq(-3, 3, 1)) +
  labs(x = "z-value",
       y = "SOMS7 item",
       title = "Dataset 2") +
  theme_classic() + 
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 18, color = "black"))
Warning: attributes are not identical across measure variables;
they will be dropped

Figure. Standardized centrality measures (betweenness, closeness and node strength) for Dataset 2.

Edge weights

Dataset 1

merge(getWmat(net_Dataset1) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% 
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_ds1) %>%        
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        unite(., Edge, node2, node1, sep = " - ") %>% 
        as.data.frame() %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),] %>% 
  datatable(., 
            rownames = F,
            caption = "Table: Edge-weights for each edge within the network for Dataset 1 with bootstrap confidence intervals",
            colnames = c("Edge", "Weight", "lower 95% CI", "upper 95% CI"),
            extensions = "Buttons",
            options = list(dom = "Bfrtip", buttons = c("csv", "pdf"))) %>% 
    formatRound(c("weight", "CIlower", "CIupper"), 3)

Dataset 2

merge(getWmat(net_Dataset2) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% 
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_ds2) %>%        
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        unite(., Edge, node2, node1, sep = " - ") %>% 
        as.data.frame() %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),] %>% 
  datatable(., 
            rownames = F,
            caption = "Table: Edge-weights for each edge within the network for Dataset 2 with bootstrap confidence intervals",
            colnames = c("Edge", "Weight", "lower 95% CI", "upper 95% CI"),
            extensions = "Buttons",
            options = list(dom = "Bfrtip", buttons = c("csv", "pdf"))) %>% 
    formatRound(c("weight", "CIlower", "CIupper"), 3)

Cluster inference

After cluster identification, mean centrality indices for each cluster were calculated. Additionally, content-related similarities of the items were investigated to label the identified clusters. Five clusters are comparable across both data sets.

Descriptives

The table below shows the results for both datasets.

#### Dataset 1
# part 1 for cluster summary table: mean strength, sum of nodes
merge(.x %>% 
        dplyr::select(cluster = c, Degree) %>%
        gather(., centrality, value, -cluster) %>%
        group_by(cluster, centrality) %>%
        summarise(strength = sum(value)) %>%
        spread(., centrality, strength) %>%
        as.data.frame() %>%
        mutate(cluster = as.character(cluster)),
      data.frame(cluster = 1:8,
                 n = as.vector(table(.cluster_walk_ds1$membership)),
                 p = as.vector(round(prop.table(table(.cluster_walk_ds1$membership))*100,2))) %>% 
        mutate(cluster = recode(cluster, 
                          "1" = "Neurological symptoms",
                          "2" = "Gastrointestinal symptoms",
                          "3" = "Urogenital symptoms",
                          "4" = "(Acute) Stress-associated symptoms",
                          "5" = "Cardiovascular symptoms",
                          "6" = "Musculoskeletal problems",
                          "7" = "Rare Pseudoneurological symptoms",
                          "8" = "Sexual indifference (loss of libido)")),
        by = "cluster") %>% 
  # part 2: add aggregate cluster-id with dataset..
  merge(., merge(dataset1 %>% gather(., item, value), 
                 .x %>% dplyr::select(item, cluster = c),
                 by = "item") %>% 
          # remove item id
          dplyr::select(-item) %>% 
          # calculate mean values for each cluster
  group_by(cluster) %>% 
  summarise(Mdn =median(value, na.rm = T),
            M = mean(value, na.rm = T),
            SD = sd(value, na.rm = T),
            Min = min(value, na.rm = T),
            Max = max(value, na.rm = T)), 
  by = "cluster") -> .cluster_summary_table_ds1
  
# part 3: Cronbach's alpha
merge(dataset1 %>% 
        mutate(dummyid = as.character(1:nrow(dataset1))) %>% 
        gather(., item, value, -dummyid), 
      .x %>% dplyr::select(item, cluster = c),
      by = "item") %>% 
  group_split(cluster) %>% 
  lapply(function(x) { 
    x <- x %>% dplyr::select(-cluster) %>% 
      spread(item, value) %>%
      dplyr::select(-dummyid)
    fi <- names(x[1])
    a <- ifelse(length(x) < 3, NA, psych::alpha(x)$total$raw_alpha)
    rbind(fi, a)
    }) -> .cluster_scales_cronbach_ds1
# note: this procedure scrambled the cluster IDs,
# so the first item serves as the identifier for each cluster:
# 1st = Neurological symptoms
# 2nd = Gastrointestinal symptoms
# 3rd = Musculoskeletal problems
# 4th = Cardiovascular symptoms
# 5th = Urogenital symptoms
# 6th = (Acute) Stress-associated symptoms
# clusters 7 and 8 contain fewer than three items, so Cronbach's alpha is not calculated for them
# add this to the previous result
merge(.cluster_summary_table_ds1,
      .cluster_scales_cronbach_ds1 %>% 
        unlist() %>% 
        as.data.frame() %>% 
        mutate(a = as.character(`.`)) %>%
        filter(nchar(a) > 4 | is.na(a)) %>% 
        dplyr::select(a) %>% 
        mutate(cluster = c("Neurological symptoms", "Gastrointestinal symptoms", "Musculoskeletal problems", 
                           "Cardiovascular symptoms", "Urogenital symptoms", "(Acute) Stress-associated symptoms", 
                           "Rare Pseudoneurological symptoms", "Sexual indifference (loss of libido)")),
      by = "cluster") -> .cluster_summary_table_ds1
#### Dataset 2
# part 1 for cluster summary table: mean strength, sum of nodes
merge(.y %>%
        dplyr::select(cluster = c2, Degree) %>%
        gather(., centrality, value, -cluster) %>%
        group_by(cluster, centrality) %>%
        summarise(strength = sum(value)) %>%
        spread(., centrality, strength) %>%
        as.data.frame() %>%
        mutate(cluster = as.character(cluster)),
      data.frame(cluster = 1:5,
                 n = as.vector(table(.cluster_walk_ds2$membership)),
                 p = as.vector(round(prop.table(table(.cluster_walk_ds2$membership))*100,2))) %>% 
        mutate(cluster = recode(cluster,
                          "1" = "Gastrointestinal symptoms",
                          "2" = "Neurological symptoms",
                          "3" = "Cardiovascular symptoms",
                          "4" = "Musculoskeletal problems",
                          "5" = "Urogenital symptoms")),
      by = "cluster") %>% 
  # part 2: add aggregate cluster-id with dataset..
  merge(., merge(dataset2 %>% gather(., item, value), 
      .y %>% dplyr::select(item, cluster = c2),
      by = "item") %>% 
        # remove item id
        dplyr::select(-item) %>% 
        # calculate mean values for each cluster
        group_by(cluster) %>% 
        summarise(Mdn =median(value, na.rm = T),
                  M = mean(value, na.rm = T),
                  SD = sd(value, na.rm = T),
                  Min = min(value, na.rm = T),
                  Max = max(value, na.rm = T)),
      by = "cluster") -> .cluster_summary_table_ds2

# part 3: Cronbach's alpha
merge(dataset2 %>% 
        mutate(dummyid = as.character(1:nrow(dataset2))) %>% 
        gather(., item, value, -dummyid), 
      .y %>% dplyr::select(item, cluster = c2),
      by = "item") %>% 
  group_split(cluster) %>% 
  lapply(function(x) { 
    x <- x %>% dplyr::select(-cluster) %>% 
      spread(item, value) %>%
      dplyr::select(-dummyid)
    fi <- names(x[1])
    a <- ifelse(length(x) < 3, NA, psych::alpha(x)$total$raw_alpha)
    rbind(fi, a)
    }) -> .cluster_scales_cronbach_ds2
# note: this procedure scrambled the cluster IDs,
# so the first item serves as the identifier for each cluster:
# 1st = Neurological symptoms
# 2nd = Gastrointestinal symptoms
# 3rd = Musculoskeletal problems
# 4th = Cardiovascular symptoms
# 5th = Urogenital symptoms
# add this to the previous result
merge(.cluster_summary_table_ds2,
      .cluster_scales_cronbach_ds2 %>% 
        unlist() %>% 
        as.data.frame() %>% 
        mutate(a = as.character(`.`)) %>%
        filter(nchar(a) > 4 | is.na(a)) %>% 
        dplyr::select(a) %>% 
        mutate(cluster = c("Neurological symptoms", "Gastrointestinal symptoms", "Musculoskeletal problems", 
                           "Cardiovascular symptoms", "Urogenital symptoms")),
      by = "cluster") -> .cluster_summary_table_ds2
# final table
rbind(.cluster_summary_table_ds1,
      .cluster_summary_table_ds2) %>% 
  mutate(a = as.numeric(as.character(a))) %>% 
  # new order
  dplyr::select(cluster, n, p, Degree, everything()) %>% 
  kable(., "html",
        digits = 3,
        caption = "Table: Key statistics for all clusters in both Datasets and for scales based on the items within each cluster",
        col.names =  c("Cluster", "Sum of nodes", "% of nodes", "Cluster strength", "Median", "M", "SD", "Min", "Max", "Cronbachs alpha")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 1, "Network cluster statistics" = 3, "Aggregated scale statistics" = 6)) %>% 
  pack_rows("Dataset 1", 1, 8) %>% 
  pack_rows("Dataset 2", 9, 13) %>% 
  footnote(general = "Cronbachs Alpha was calculated only if more than two items were available. Cluster strength refers to the sum score of node strength within the cluster.", 
           footnote_as_chunk = T, general_title = "Note.")
Table: Key statistics for all clusters in both Datasets and for scales based on the items within each cluster
Network cluster statistics
Aggregated scale statistics
Cluster Sum of nodes % of nodes Cluster strength Median M SD Min Max Cronbach's alpha
Dataset 1
(Acute) Stress-associated symptoms 3 6.38 1.441 0 1.021 1.278 0 4 0.708
Cardiovascular symptoms 11 23.40 6.397 0 0.873 1.193 0 4 0.810
Gastrointestinal symptoms 13 27.66 6.800 0 0.841 1.177 0 4 0.833
Musculoskeletal problems 3 6.38 1.630 2 1.859 1.409 0 4 0.731
Neurological symptoms 10 21.28 3.363 0 0.674 1.111 0 4 0.705
Rare Pseudoneurological symptoms 2 4.26 0.252 0 0.032 0.272 0 4 NA
Sexual indifference (loss of libido) 1 2.13 0.022 1 1.470 1.479 0 4 NA
Urogenital symptoms 4 8.51 1.804 0 0.354 0.861 0 4 0.709
Dataset 2
Cardiovascular symptoms 11 23.40 10.775 1 1.024 1.247 0 4 0.862
Gastrointestinal symptoms 14 29.79 12.002 0 0.946 1.231 0 4 0.850
Musculoskeletal problems 3 6.38 3.065 2 1.843 1.449 0 4 0.850
Neurological symptoms 15 31.91 12.517 0 0.802 1.220 0 4 0.832
Urogenital symptoms 4 8.51 3.397 0 0.366 0.878 0 4 0.689
Note. Cronbach's alpha was calculated only if more than two items were available. Cluster strength refers to the sum score of node strength within the cluster.

Dataset 1

data.frame(Neuro = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Neurological symptoms", "item"]], 1, mean, na.rm = T),
           Gastro = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Gastrointestinal symptoms", "item"]], 1, mean, na.rm = T),
           Muscu = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Musculoskeletal problems", "item"]], 1, mean, na.rm = T),
           Cardio = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Cardiovascular symptoms", "item"]], 1, mean, na.rm = T),
           Uro = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Urogenital symptoms", "item"]], 1, mean, na.rm = T),
           Stress = apply(dataset1[, names(dataset1) %in% .x[.x$c == "(Acute) Stress-associated symptoms", "item"]], 1, mean, na.rm = T),
           RPS = apply(dataset1[, names(dataset1) %in% .x[.x$c == "Rare Pseudoneurological symptoms", "item"]], 1, mean, na.rm = T),
           # cluster 8 has only one item...
           Sexual = dataset1[, names(dataset1) %in% .x[.x$c == "Sexual indifference (loss of libido)", "item"]]) %>%
  #cor(., use = "pairwise.complete.obs") %>% 
  corrgram(., main = "Correlogram of cluster in Dataset 1",
           lower.panel = panel.conf,
           upper.panel = panel.pts,
           diag.panel = panel.density)


Figure. Correlogram of the eight clusters found in Dataset 1. Correlations with their 95%-confidence intervals are printed below the diagonal, scatterplots are printed in upper diagonal. The diagonal shows the density. Cluster names are shown in the diagonal: Neuro = Neurological symptoms, Gastro = Gastrointestinal symptoms, Muscu = Musculoskeletal problems, Cardio = Cardiovascular symptoms, Uro = Urogenital symptoms, Stress = (Acute) Stress-associated symptoms, RPS = Rare Pseudoneurological symptoms, Sexual = Sexual indifference (loss of libido).

Dataset 2

data.frame(Neuro = apply(dataset2[, names(dataset2) %in% .y[.y$c2 == "Neurological symptoms", "item"]], 1, mean, na.rm = T),
           Gastro = apply(dataset2[, names(dataset2) %in% .y[.y$c2 == "Gastrointestinal symptoms", "item"]], 1, mean, na.rm = T),
           Muscu = apply(dataset2[, names(dataset2) %in% .y[.y$c2 == "Musculoskeletal problems", "item"]], 1, mean, na.rm = T),
           Cardio = apply(dataset2[, names(dataset2) %in% .y[.y$c2 == "Cardiovascular symptoms", "item"]], 1, mean, na.rm = T),
           Uro = apply(dataset2[, names(dataset2) %in% .y[.y$c2 == "Urogenital symptoms", "item"]], 1, mean, na.rm = T)) %>%
  #cor(., use = "pairwise.complete.obs") %>% 
  corrgram(., main = "Correlogram of cluster in Dataset 2",
           lower.panel = panel.conf,
           upper.panel = panel.pts,
           diag.panel = panel.density)


Figure. Correlogram of the five clusters found in Dataset 2. Correlations with their 95%-confidence intervals printed below the diagonal, scatterplots printed in upper diagonal. The diagonal shows the density of the cluster mean score. The cluster names are shown in the diagonal: Neuro = Neurological symptoms, Gastro = Gastrointestinal symptoms, Muscu = Musculoskeletal problems, Cardio = Cardiovascular symptoms, Uro = Urogenital symptoms.

Key nodes and edges

Central nodes

# in & out-degree are identical because it is an undirected network, so get one of them for edge-strength (= degree)
# which means you can chose if you use $InDegree or $OutDegree in the selection

cbind(
data.frame(head(sort(centrality(net_Dataset1)$OutDegree, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),
data.frame(head(sort(centrality(net_Dataset2)$OutDegree, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Betweenness
data.frame(head(sort(centrality(net_Dataset1)$Betweenness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

data.frame(head(sort(centrality(net_Dataset2)$Betweenness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Closeness
data.frame(head(sort(centrality(net_Dataset1)$Closeness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),
data.frame(head(sort(centrality(net_Dataset2)$Closeness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item)
) -> .mostcentralnodes
# coding dummy names...
names(.mostcentralnodes) <- c("a", "b", "c", "d", "e", "f")
.mostcentralnodes %>% 
  mutate(Rang = 1:nrow(.)) %>% 
  dplyr::select(Rang, everything()) %>% 
  # Top 5 of the most central nodes
  .[-6,] %>% 
  kable(., "html",
        caption = "Table: Top 5 of the most central nodes in both networks",
        col.names = c("Rang","Dataset 1", "Dataset 2", "Dataset 1", "Dataset 2", "Dataset 1", "Dataset 2"),
        align = "c") %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 1,"Node strength" = 2, "Betweenness" = 2, "Closeness" = 2)) %>% 
  footnote(general = "The rang of the nodes indicates its position. A higher rang indicates, that this node is more central then other nodes.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 of the most central nodes in both networks
Node strength
Betweenness
Closeness
Rank Dataset 1 Dataset 2 Dataset 1 Dataset 2 Dataset 1 Dataset 2
1 S_12 S_04 S_30 S_10 S_30 S_34
2 S_28 S_26 S_34 S_34 S_11 S_10
3 S_02 S_39 S_11 S_02 S_34 S_39
4 S_05 S_25 S_38 S_39 S_15 S_37
5 S_27 S_05 S_14 S_37 S_12 S_19
Note. The rank indicates the position of a node, with rank 1 marking the most central node.

Non-Central nodes

cbind(
data.frame(head(sort(centrality(net_Dataset1)$OutDegree, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),
data.frame(head(sort(centrality(net_Dataset2)$OutDegree, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Betweenness
data.frame(head(sort(centrality(net_Dataset1)$Betweenness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

data.frame(head(sort(centrality(net_Dataset2)$Betweenness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Closeness
data.frame(head(sort(centrality(net_Dataset1)$Closeness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),
data.frame(head(sort(centrality(net_Dataset2)$Closeness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item)
) -> .mostuncentralnodes
# coding dummy names...
names(.mostuncentralnodes) <- c("a", "b", "c", "d", "e", "f")
.mostuncentralnodes %>% 
  mutate(Rang = 1:nrow(.)) %>% 
  dplyr::select(Rang, everything()) %>% 
  # top 5 
  .[-6,] %>% 
  kable(., "html",
        caption = "Table: Top 5 of the less central nodes in both networks",
        col.names = c("Rang","Dataset 1", "Dataset 2", "Dataset 1", "Dataset 2", "Dataset 1", "Dataset 2"),
        align = "c") %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 1,"Node strength" = 2, "Betweenness" = 2, "Closeness" = 2)) %>% 
  footnote(general = "The rang of the nodes indicates its position. A higher rang indicates, that this node is less central then other nodes.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 of the least central nodes in both networks
Node strength
Betweenness
Closeness
Rank Dataset 1 Dataset 2 Dataset 1 Dataset 2 Dataset 1 Dataset 2
1 S_01 S_43 S_01 S_16 S_42 S_43
2 S_42 S_17 S_07 S_31 S_43 S_31
3 S_32 S_13 S_21 S_43 S_01 S_03
4 S_31 S_45 S_31 S_17 S_47 S_45
5 S_45 S_21 S_32 S_40 S_32 S_09
Note. The rank indicates the position of a node, with rank 1 marking the least central node.

Strongest edges

The top 5 of the strongest associations (edges) between nodes are listed below:

rbind(
  merge(getWmat(net_Dataset1) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% head(10) %>% 
        # every second row is a duplicate -> removing those
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_ds1) %>%
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        filter(node1 == "S_26" & node2 == "S_27" |
                 node1 == "S_04" & node2 == "S_05" |
                 node1 == "S_28" & node2 == "S_29" |
                 node1 == "S_06" & node2 == "S_25" |
                 node1 == "S_20" & node2 == "S_23") %>%
        as.data.frame() %>% 
        unite(., Edge, node2, node1, sep = " - ") %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),],
  merge(getWmat(net_Dataset2) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% head(10) %>% 
        # every second row is a duplicate -> removing those
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_ds2) %>%
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        filter(node1 == "S_26" & node2 == "S_27" |
                 node1 == "S_04" & node2 == "S_05" |
                 node1 == "S_20" & node2 == "S_23" |
                 node1 == "S_06" & node2 == "S_25" |
                 node1 == "S_09" & node2 == "S_38") %>%
        as.data.frame() %>% 
        unite(., Edge, node2, node1, sep = " - ") %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),]
) %>% 
  mutate(sample = c(rep("dataset1", 5), rep("dataset2", 5))) -> .top5edges


kable(.top5edges[,-5], "html",
        digits = 3,
        caption = "Table: Top 5 edges in both datasets with their weights and 95% confidence intervals",
        row.names = F,
        col.names = c("Edge", "Weight", "lower bound", "upper bound")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 2, "95% CI" = 2)) %>% 
  pack_rows("Dataset 1", 1, 5) %>% 
  pack_rows("Dataset 2", 6, 10) %>% 
  footnote(general = "The confidence intervals were bootstrapped (B = 5000).",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 edges in both datasets with their weights and 95% confidence intervals
95% CI
Edge Weight lower bound upper bound
Dataset 1
S_27 - S_26 0.424 0.325 0.523
S_05 - S_04 0.390 0.281 0.499
S_29 - S_28 0.309 0.193 0.424
S_25 - S_06 0.280 0.171 0.388
S_23 - S_20 0.270 0.156 0.384
Dataset 2
S_27 - S_26 0.479 0.409 0.549
S_05 - S_04 0.444 0.372 0.516
S_23 - S_20 0.384 0.309 0.459
S_25 - S_06 0.378 0.305 0.452
S_38 - S_09 0.372 0.279 0.465
Note. The confidence intervals were bootstrapped (B = 5000).

Network comparison

Pattern for centrality measures

data.frame("Node strength1" = as.vector(centrality(net_Dataset1)$InDegree),
           Betweenness1 = as.vector(centrality(net_Dataset1)$Betweenness),
           Closeness1 = as.vector(centrality(net_Dataset1)$Closeness),
           "Node strength2" = as.vector(centrality(net_Dataset2)$InDegree),
           Betweenness2 = as.vector(centrality(net_Dataset2)$Betweenness),
           Closeness2 = as.vector(centrality(net_Dataset2)$Closeness)) %>%
  mutate_all(., function(x){scale(x, scale = T)}) %>% 
  mutate(ItemNr = rownames(.)) %>% 
  dplyr::select(ItemNr, everything()) %>% 
  gather(., cond, val, -ItemNr) %>% 
  mutate(sample = ifelse(grepl("1$", cond), "Dataset 1", "Dataset 2")) %>% 
  mutate(cond = gsub("[[:digit:]]$", "", cond)) %>% 
  mutate(cond = recode_factor(cond,
                              "Node.strength" = "Node strength",
                              "Betweenness" = "Betweenness",
                              "Closeness" = "Closeness")) %>% 
  ggplot(., aes(y = val, x = as.numeric(ItemNr), color = sample)) +
  facet_wrap(~ cond, scales = "free_x") +
  geom_point(size = 1.5) +
  geom_line(aes(group = sample)) +
  scale_x_continuous(limits = c(1,47), breaks = seq(1,47,1)) +
  scale_y_continuous(limits = c(-3.5, 3.5), breaks = seq(-3, 3, 1)) +
  scale_color_manual(values = c("blue", "red"),
                     labels = c("Dataset 1", "Dataset 2")) +
  labs(x = "Item-No.",
       y = "z-value",
       color = "Sample") +
  coord_flip() +
  theme_classic() +
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 16, color = "black"),
        panel.grid.major.y = element_line(color = "lightgrey", linetype = "dashed"),
        legend.position = "bottom")
Warning: attributes are not identical across measure variables;
they will be dropped

Coefficient of similarity

# data preparation for coefficient of similarity
# we need a correlation matrix of the network
.dataset1.net <- EBICglasso(.dataset1_cor,
                            n = nrow(dataset1),
                            gamma = 0.20,
                            lambda.min.ratio = 0.01)
.Dataset2_net <- EBICglasso(.dataset2_cor,
                            n = nrow(dataset2),
                            gamma = 0.50,
                            lambda.min.ratio = 0.1)

# calculate coefficient of similarity -> written in text!
# cor(.dataset1.net[lower.tri(.dataset1.net)], .Dataset2_net[lower.tri(.Dataset2_net)], method = "spearman") #.43

The coefficient of similarity (Rhemtulla et al. 2016) is the correlation between the edge weights of both networks. This correlation is r = 0.43 (R\(^2\) = 0.18). The edge weights shared only little variance; the two matrices did not correspond well to each other.
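The shared variance reported in parentheses follows directly from squaring the correlation:

# R squared follows from squaring the coefficient of similarity
# 0.43^2 # = 0.1849, i.e. approximately .18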

Network comparison test

# we need the raw data without missing values for the NCT
dataset1 %>% dplyr::select(S_01 : S_46) %>% na.omit() -> .dat1
dataset2 %>% dplyr::select(S_01 : S_46) %>% na.omit() -> .dat2
# fix seed to reproduce same results every time
# set.seed(1701)
# NCT_output <- NetworkComparisonTest::NCT(.dat1, .dat2,
#                                          gamma = .50,
#                                          it  = 500,
#                                          progressbar = F,
#                                          test.edges = T,
#                                          edges = "all")

# saving NCT output to speed things a little bit up
# saveRDS(NCT_output, file = "NCT SOMS7.RData")
NCT_output <- readRDS("NCT SOMS7.RData")

To date, the NCT can only compare correlation matrices which are multivariate normally distributed (Van Borkulo et al. 2017). In order to use the NCT, we first compared our Spearman correlation matrices with ones based on Pearson correlations.

# preparing data
# we need Pearson and Spearman correlation matrices for both data sets
.Dataset1_s <- dataset1 %>% dplyr::select(S_01 : S_47) %>% corr.test(., method = "spearman") %>% .$r
.Dataset1_p <- dataset1 %>% dplyr::select(S_01 : S_47) %>% corr.test(., method = "pearson") %>% .$r

.Dataset2_s <- dataset2 %>% dplyr::select(S_01 : S_47) %>% corr.test(., method = "spearman") %>% .$r
.Dataset2_p <- dataset2 %>% dplyr::select(S_01 : S_47) %>% corr.test(., method = "pearson") %>% .$r

# now we calculate the correlation coefficients
# output written directly in markdown so hiding it here
# cor(.Dataset1_s[lower.tri(.Dataset1_s)], .Dataset1_p[lower.tri(.Dataset1_p)], method = "spearman") #.94
# cor(.Dataset2_s[lower.tri(.Dataset2_s)], .Dataset2_p[lower.tri(.Dataset2_p)], method = "spearman") #.97

The correlation between the Spearman and Pearson correlation matrices for Dataset 1 was r = 0.94 (R\(^2\) = 0.88), while for Dataset 2 the correlation between both matrices was r = 0.97 (R\(^2\) = 0.94). These correlations are high enough to justify the use of the NCT.

# # only to show the Code. R code was used inline with the text
# # mean global network strength of Dataset 1
# NCT_output$glstrinv.sep[1] # 2.245405
# # Mean global network strength of Dataset 2
# NCT_output$glstrinv.sep[2] # 19.64762
# # difference between mean global strength
# NCT_output$glstrinv.real   # 17.40222
# # significance test (p-value)
# NCT_output$glstrinv.pval      # 0.028

The results of the network comparison test indicated that the mean global network strength was 2.25 in Dataset 1 and 19.65 in Dataset 2.

The network comparison test covers three aspects: invariant network structure, invariant global strength and invariant edge strength. The test of invariant network structure was not significant, indicating that both networks had the same structure (M = 0.4, p = 0.11). The test of invariant global strength showed a significant difference (S = 17.4, p = 0.028) between the global strength of Dataset 1 (2.25) and Dataset 2 (19.65), indicating that the overall level of connectivity differed between both samples. A Bonferroni-corrected post-hoc comparison of all individual edges revealed that only 58 out of 1035 (5.6%) possible edge weights differed significantly between the two networks. Overall, most of the differences (95.36%) fell in the range 0.00 < edge-weight difference < 0.10.
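For reference, the structure-invariance statistics reported above can be read from the NCT output in the same way as the global strength statistics; nwinv.real and nwinv.pval are the corresponding fields of a NetworkComparisonTest result:

# M, the maximum edge-weight difference between the networks
# NCT_output$nwinv.real  # 0.4
# p-value of the invariant network structure test
# NCT_output$nwinv.pval  # 0.11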
The next figure shows the results of the network comparison test.

plot_grid(ggplot() +
  geom_histogram(aes(x = NCT_output$glstrinv.perm), binwidth = 1, color = "black") +
  geom_rug(aes(x = NCT_output$glstrinv.perm)) +
  geom_point(aes(x = NCT_output$glstrinv.real, y = 0), fill = "red", size = 4, shape = 25) +
  labs(x = "Maximum difference in global network strength",
       y = "Total count") +
  theme_classic(),
  ggplot() +
  geom_histogram(aes(x = NCT_output$nwinv.perm), binwidth = 0.02, color = "black") +
  geom_rug(aes(x = NCT_output$nwinv.perm)) +
  geom_point(aes(x = NCT_output$nwinv.real, y = 0), fill = "red", size = 4, shape = 25) +
  labs(x = "Maximum difference in edge-weight",
       y = "Total count")+
  theme_classic())


Figure. Distribution of the differences in global network strength (left panel) and the differences in edge-weights (right panel). The red triangle marks the test difference between both samples.

Edge-weight differences

There are 1035 edges for which one can test whether the weight of an edge between two nodes in one network differs from the weight of the same edge in the other network. After testing, a Bonferroni-Holm correction is applied to reduce the effect of multiple testing. Small sample sizes make it difficult to detect (existing) significant differences. Therefore, the significance test should not be the only guide; the absolute difference, as an effect size, may also be considered.
The following table summarizes our findings. Most of the differences between the edge weights were minimal (~70%). Around 95% of the tested edge weights had a difference below .10. Only 0.2% showed medium-sized differences.
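As a brief aside on the correction itself (illustrative raw p-values only, not the NCT output):

# illustration of the Holm correction on five hypothetical raw p-values
.p_raw <- c(0.001, 0.010, 0.030, 0.040, 0.200)
p.adjust(.p_raw, method = "holm")
# the smallest p-value is multiplied by the number of tests, the second
# smallest by one less, and so on (with monotonicity enforced), which is
# less conservative than multiplying every p-value by the number of tests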

.nct_diffs <- NCT_output$einv.real
.nct_diffs[upper.tri(.nct_diffs)] <- NA
diag(.nct_diffs) <- NA
data.frame(table(cut(.nct_diffs, 
                     breaks = c(0, .001,.05, .10, .15, .20, .25, .30, .35, .40), 
                     include.lowest = T)),
           p = as.vector(prop.table(table(cut(.nct_diffs, 
                                              breaks = c(0, .001,.05, .10, .15, .20, .25, .30, .35, .40), 
                                              include.lowest = T)))*100)) %>% 
  rename(Korrelation = Var1,
         n = Freq,
         p = p) %>% 
  mutate(k = cumsum(100*n/sum(n))) %>% 
  kable(., "html",
        digits = 2,
        caption = "Tabelle: Frequency of differences between edge-weighs (5er steps)",
        col.names = c("Range of values", "n", "%", "Cum%")) %>% 
  kable_styling("striped", full_width = F)
Table: Frequency of differences between edge-weights (steps of .05)
Range of values n % Cum%
[0,0.001] 727 70.24 70.24
(0.001,0.05] 178 17.20 87.44
(0.05,0.1] 82 7.92 95.36
(0.1,0.15] 24 2.32 97.68
(0.15,0.2] 12 1.16 98.84
(0.2,0.25] 6 0.58 99.42
(0.25,0.3] 4 0.39 99.81
(0.3,0.35] 1 0.10 99.90
(0.35,0.4] 1 0.10 100.00

After the Bonferroni-Holm correction, only 58 of the tested differences (5.6%) remained statistically significant.

The next table provides the edge-weight differences for all tested edges with a difference greater than zero.

# part 1 (labels)
.nct <- NCT_output$einv.real
.nct[upper.tri(.nct)] <- NA
.nct %>% as.data.frame() -> .nct
names(.nct) <- .labels$Englisch_Label[1:46]
.nct %>% 
  mutate(to = .labels$Englisch_Label[1:46]) %>% 
  gather(., from, val, -to,) %>%
  # remove self
  filter(!to == from) %>% 
  # remove NA
  # 1035 edges remains
  na.omit() -> .nct
  

# part 2 (only item no.)
.nct2 <- NCT_output$einv.real
.nct2[upper.tri(.nct2)] <- NA
.nct2 %>% as.data.frame() -> .nct2
names(.nct2) <- .labels$Item[1:46]
.nct2 %>% 
  mutate(to = .labels$Item[1:46]) %>% 
  gather(., from, val, -to) %>%
  # remove self
  filter(!to == from) %>% 
  # remove NA
  na.omit() %>% 
  # 1035 edges remains
  unite(., edge, from, to, sep = "-", remove = T) -> .nct2

# combine part 1 and part 2 and remove the duplicated column
.nct_edges <- cbind(.nct, .nct2)
.nct_edges <- .nct_edges[,-5]

.nct_edges %>% 
  dplyr::select(edge, from, to, everything()) %>%
  filter(val > 0) %>% 
  .[order(.$val, decreasing = T),] %>% 
  datatable(., 
            caption = "Table: Differences in edge-weight between both networks in descending order",
            rownames = F,
            colnames = c("Edge","Label Node 1", "Label Node 2", "Difference in edge-weight"),
            extensions = "Buttons",
            options = list(dom = "Bfrtip", pageLength = 10,
                             buttons = c("csv", "pdf"))) %>% 
  formatRound("val", 3)

Network stability

Stability of centrality measures

plot_grid(plot(boot2_ds1, statistics = "all") + ggtitle("Dataset 1"), 
          plot(boot2_ds2, statistics = "all") + ggtitle("Dataset 2"), nrow = 2)


Figure. Visualization of the correlation between centrality measures of the original sample (y-axis) and the case-dropping bootstrapped sample (x-axis). The percentage on the x-axis indicates the amount of cases which remain in the bootstrapped sample.

CS-Coefficient

The coefficient of stability (CS-coefficient) was calculated for both networks. The output is shown below. A value of CS(r = .70) < .25 indicates that a centrality index is not stable; stability can be assumed if CS(r = .70) > .50.
Summary
Most of the CS-coefficients were not high enough to meet this threshold.

Dataset 1
The CS-coefficient was zero for betweenness and closeness. For node strength the coefficient was below .25; for the edges, CS(r = .70) = .47. All values for the network in Dataset 1 indicate very low stability.

corStability(boot2_ds1, cor = 0.7)
=== Correlation Stability Analysis === 

Sampling levels tested:
   nPerson Drop%   n
1       64  74.8  31
2       67  73.6  32
3       71  72.0  46
4       74  70.9  45
5       78  69.3  45
6       82  67.7  46
7       85  66.5  55
8       89  65.0  53
9       93  63.4  65
10      96  62.2  55
11     100  60.6  63
12     103  59.4  78
13     107  57.9  68
14     111  56.3  71
15     114  55.1  55
16     118  53.5  79
17     122  52.0  83
18     125  50.8  69
19     129  49.2  80
20     132  48.0  94
21     136  46.5  85
22     140  44.9  89
23     143  43.7 111
24     147  42.1  88
25     151  40.6 111
26     154  39.4 108
27     158  37.8 110
28     161  36.6 121
29     165  35.0  96
30     169  33.5 102
31     172  32.3 113
32     176  30.7 108
33     180  29.1 121
34     183  28.0 118
35     187  26.4 130
36     190  25.2 126
37     194  23.6 128
38     198  22.0 135
39     201  20.9 126
40     205  19.3 136
41     209  17.7 158
42     212  16.5 134
43     216  15.0 174
44     220  13.4 150
45     223  12.2 147
46     227  10.6 159
47     230   9.4 153
48     234   7.9 154
49     238   6.3 147
50     241   5.1 149

Maximum drop proportions to retain correlation of 0.7 in at least 95% of the samples:

betweenness: 0 
  - For more accuracy, run bootnet(..., caseMin = 0, caseMax = 0.051) 

closeness: 0 
  - For more accuracy, run bootnet(..., caseMin = 0, caseMax = 0.051) 

edge: 0.465 
  - For more accuracy, run bootnet(..., caseMin = 0.449, caseMax = 0.48) 

strength: 0.22 
  - For more accuracy, run bootnet(..., caseMin = 0.209, caseMax = 0.236) 

Accuracy can also be increased by increasing both 'nBoots' and 'caseN'.



Dataset 2
The CS-coefficients for betweenness and closeness are both > .25, but both are far too low to be interpreted as stable. Node strength was below .50. Only the CS-coefficient for the edges was above .50.

corStability(boot2_ds2, cor = 0.7)
=== Correlation Stability Analysis === 

Sampling levels tested:
   nPerson Drop%   n
1      144  74.9  85
2      152  73.5  97
3      160  72.1  88
4      168  70.7 108
5      176  69.3  89
6      184  67.9 102
7      193  66.4  92
8      201  65.0 101
9      209  63.6 103
10     217  62.2 123
11     225  60.8 117
12     234  59.2  99
13     242  57.8 113
14     250  56.4  79
15     258  55.1 103
16     266  53.7 101
17     275  52.1 110
18     283  50.7  92
19     291  49.3 100
20     299  47.9  89
21     308  46.3 100
22     316  44.9 100
23     324  43.6 101
24     332  42.2 101
25     340  40.8 100
26     349  39.2  95
27     357  37.8 101
28     365  36.4  78
29     373  35.0  93
30     381  33.6 105
31     390  32.1 100
32     398  30.7  94
33     406  29.3 100
34     414  27.9  89
35     422  26.5  93
36     430  25.1 101
37     439  23.5 117
38     447  22.1  87
39     455  20.7 103
40     463  19.3 117
41     472  17.8  99
42     480  16.4 120
43     488  15.0 115
44     496  13.6 105
45     504  12.2 113
46     512  10.8  89
47     521   9.2 108
48     529   7.8  90
49     537   6.4  97
50     545   5.1  98

Maximum drop proportions to retain correlation of 0.7 in at least 95% of the samples:

betweenness: 0.279 
  - For more accuracy, run bootnet(..., caseMin = 0.265, caseMax = 0.293) 

closeness: 0.321 
  - For more accuracy, run bootnet(..., caseMin = 0.307, caseMax = 0.336) 

edge: 0.578 
  - For more accuracy, run bootnet(..., caseMin = 0.564, caseMax = 0.592) 

strength: 0.436 
  - For more accuracy, run bootnet(..., caseMin = 0.422, caseMax = 0.449) 

Accuracy can also be increased by increasing both 'nBoots' and 'caseN'.

Node Strength Differences

Significant differences in node strength are displayed graphically. Black areas mark significant differences between two nodes; grey areas mark non-significant differences. The diagonal displays the estimated mean node strength across all bootstrap samples, from the weakest node in the lower left to the strongest node in the upper right.
This test did not correct for multiple testing.
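A single pair of nodes can also be compared directly; a hedged sketch using bootnet's differenceTest() (the item pair below is chosen arbitrarily for illustration):

# optional: test one arbitrarily chosen pair of nodes directly
# differenceTest(boot1_ds1, "S_26", "S_27", measure = "strength")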

Summary
There were fewer significant differences between nodes in Dataset 1 than in Dataset 2.

Dataset 1

plot(boot1_ds1, "strength", order = "sample", labels = T, plot = "difference") + ggtitle("Dataset 1")
Expected significance level given number of bootstrap samples is approximately: 0.05

Figure. Significant differences in node strength for Dataset 1.

Dataset 2

plot(boot1_ds2, "strength", order = "sample", labels = T, plot = "difference") + ggtitle("Dataset 2")
Expected significance level given number of bootstrap samples is approximately: 0.05

Figure. Significant differences in node strength for Dataset 2.

Edge-weight accuracy

The following figures show the edges (y-axis) between two nodes and their weights (x-axis) in the sense of a partial correlation. The edges are sorted by the size of the partial correlation. The red dots represent the edge weights of the original network, and the grey areas represent the 95% confidence interval around each weight. The precision of the confidence intervals depends directly on the sample size. The black dots represent the mean values across all bootstrap samples.

Summary
Most of the confidence intervals overlap, which means that the edges between the nodes do not differ significantly in their strength.
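The numeric values behind these intervals can be inspected directly from the bootstrap summaries, using the same summary() call as for the edge-weight tables above:

# optional: inspect the bootstrapped CIs behind the plot, e.g. for Dataset 1
# summary(boot1_ds1) %>%
#   dplyr::select(type, node1, node2, starts_with("CI")) %>%
#   filter(type == "edge") %>%
#   head()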

Dataset 1

plot(boot1_ds1, labels = F, order = "sample") + ggtitle("Dataset 1")

Figure. Edges (y-axis) between two nodes and (on x-axis) their weighting in the sense of a partial correlation.

Dataset 2

plot(boot1_ds2, labels = F, order = "sample") + ggtitle("Dataset 2")

Figure. Edges (y-axis) between two nodes and (on x-axis) their weighting in the sense of a partial correlation.

Edge-weight differences

Because of the large number of edges, a plot of the edge-weight differences was not readable, so only the code is shown below. Running the following lines takes up a lot of working memory.

Dataset 1

# you will need more than 16GB RAM for this to render...
# plot(boot1_ds1, "edge", plot = "difference", onlyNonZero = TRUE, order = "sample", labels = F)

Dataset 2

# you will need more than 40GB RAM for this to render...
# plot(boot1_ds2, "edge", plot = "difference", onlyNonZero = TRUE, order = "sample", labels = F)

Cross sample network

We aggregated both data sets into one and estimated a cross-sample network. Again, stability, the number of clusters and the most important nodes and edges were determined. The estimation of this cross-sample network rests on the assumption that both samples are comparable. In addition, such a network gives an idea of the average relationships between symptoms across all 828 patients. One strength is that considerably more participants were available to estimate the network parameters.

# aggregate data
both <- rbind(dataset1, dataset2)

# correlation matrix for cross-sample
.both_cor <- cor(both, use = "pairwise", method = "spearman")
# cross network estimate
net_both <- estimateNetwork(both,
                            default = "EBICglasso",
                            tuning = 0.50,
                            lambda.min.ratio = 0.001,
                            corMethod = "cor", 
                            corArgs = list(method = "spearman"))
Estimating Network. Using package::function:
  - qgraph::EBICglasso for EBIC model selection
    - using glasso::glasso
# bootstrapping...
# set.seed(1701)
# boot1_both <- bootnet(net_both, nBoots = 5000, caseN = 50, nCores = 6)
# # Total time: 9:12 + 5:01 minutes

# set.seed(1701)
# boot2_both <-bootnet(net_both, nBoots = 5000, caseN = 50, nCores = 6, type = "case", statistics = c("edge", "strength", "closeness", "betweenness"))
# # Total time: 07:38 + 5:02 minutes

# # save it
# saveRDS(boot1_both, file = "Bootstrap_1 AllDatasets.RData")
# saveRDS(boot2_both, file = "Bootstrap_2 AllDatasets.RData")

# load bootstrap data
boot1_both <- readRDS("Bootstrap_1 AllDatasets.RData")
boot2_both <- readRDS("Bootstrap_2 AllDatasets.RData")
.net3 <- qgraph(.both_cor,
                 graph = "glasso",
                 sampleSize = 828,
                 label.cex = 2,
                 tuning = .50,
                 lambda.min.ratio = 0.001,
                 layout = "spring")
# as igraph
.net_both <- as.igraph(.net3)
# walktrap
.cluster_walk_both <- cluster_walktrap(.net_both, weights = E(.net_both)$weight)

.group_walk_both <- list("Gastrointestinal symptoms" = c(2, 7, 10:17, 20:23),
                         "Neurological symptoms" = c(1, 30,34,35,37,39,40:47),
                         "Cardiovascular symptoms" = c(6,18,19,24:29,31,32,36),
                         "Musculoskeletal problems" = c(3,4,5),
                         "Urogenital symptoms" = c(8,9,33,38))

Network with Cluster

qgraph(.both_cor, 
       graph = "glasso",
       groups = .group_walk_both,
       color = c("blue", "red", "yellow", "green", "orange"),
       vsize = 7,
       sampleSize = 828,
       label.cex = 1,
       tuning = 0.50,
       lambda.min.ratio = 0.001,
       layout = "spring",
       legend = F,
       mar = c(5,2,2,2))

# plot legend
legend(x = -1., y = -1.15, 
       legend = c("Neurological symptoms", 
                  "Gastrointestinal symptoms",
                  "Musculoskeletal problems",
                  "Cardiovascular symptoms",
                  "Urogenital symptoms"),
       ncol = 3,
       bty = "n",
       pt.cex = 2,
       pch = 21,
       pt.bg = c("red", "blue", "green", "yellow", "orange"),
       col = "black")


Figure. Network structure for 47 symptoms from the SOMS-7T based on an aggregated cross-sample dataset (N = 828). Items with the same color belong to the same community.

Network Inference

.z <- list(c("Neurological symptoms",1, 30,34,35,37,39,40:47),
           c("Gastrointestinal symptoms", 2, 7, 10:17, 20:23),
           c("Musculoskeletal problems", 3:5),
           c("Cardiovascular symptoms", 6,18,19,24:29,31,32,36),
           c("Urogenital symptoms", 8,9,33,38)) %>% 
  lapply(., function(x){
  as.data.frame(x) %>% 
    mutate(item = paste("S_", sep = "", x), c = x[[1]])
  }) %>% 
  lapply(., function(x){
    x[-1,]})
# merge list
.z <- plyr::ldply(.z, rbind)
# add cluster-id to centrality statistics
.z <- merge(.z, data.frame(item2 = names(both),
                           Degree = as.vector(centrality(net_both)$InDegree),
                           Betweenness = as.vector(centrality(net_both)$Betweenness),
                           Closeness = as.vector(centrality(net_both)$Closeness)) %>% 
  mutate(item = paste("S_", sep = "", row.names(.))), by = "item")

.z %>% 
  dplyr::select(item = item2, c, Degree : Closeness) %>% 
  .[order(.$item),] -> .z

.z$label <- .labels[1:47, "Englisch_Label"]
.z %>% dplyr::select(item, label, everything()) -> .z



Summary

.z %>% 
  dplyr::select("Node strength" = Degree, Betweenness, Closeness) %>% 
  psych::describe() %>% t() %>% .[-c(1:2),] %>% 
  kable(., "html",
        caption = "Table: Central tendencies for the main centrality measures of the cross-sample network") %>% 
  kable_styling("striped", full_width = F)
Table: Central tendencies for the main centrality measures of the cross-sample network
Node strength Betweenness Closeness
mean 0.8791340 74.553191 0.0008931
sd 0.1754636 60.981397 0.0000780
median 0.8966017 60.000000 0.0008901
trimmed 0.8866111 66.820513 0.0008948
mad 0.2169545 41.512800 0.0000593
min 0.4087010 0.000000 0.0006773
max 1.1849385 286.000000 0.0010436
range 0.7762374 286.000000 0.0003662
skew -0.3891301 1.422432 -0.2703184
kurtosis -0.4948183 2.092059 0.5843139
se 0.0255940 8.895051 0.0000114



Key statistics for each node in the network

# table 
DT::datatable(.z, rownames = F, 
              filter= "top", 
              caption = "Table: Centrality measures for each item in the cross-sample network",
              colnames = c("Item", "Itemlabel", "Cluster", "Node strength", "Betweenness", "Closeness"),
              extensions = "Buttons",
              options = list(dom = "Bfrtip", pageLength = 7,
                             buttons = c("csv", "pdf"))) %>% 
  formatRound("Degree", 3) %>% 
  formatRound("Betweenness", 0) %>% 
  formatRound("Closeness", 5)



Centrality plot

.z %>% 
  dplyr::select(-c, -label) %>% 
  # z-standardize each centrality index for comparability
  mutate_at(vars(Degree, Betweenness, Closeness), function(x) { scale(x, scale = T) }) %>% 
  gather(., centrality, measure, -item) %>% 
  mutate(centrality = recode_factor(centrality, 
                                    "Degree" = "Node strength",
                                    "Betweenness" = "Betweenness",
                                    "Closeness" = "Closeness")) %>% 
  ggplot(., aes(y = item, x = measure)) +
  facet_wrap(~ centrality, scales = "free_x") +
  geom_point() +
  geom_line(group = "1") +
  scale_x_continuous(limits = c(-3,3), breaks = seq(-3, 3, 1)) +
  labs(x = "z-value",
       y = "SOMS7 item",
       title = "Cross-Sample Dataset") +
  theme_classic() + 
  theme(axis.text = element_text(size = 12, color = "black"),
        axis.title = element_text(size = 13, color = "black"),
        strip.text = element_text(size = 18, color = "black"))
Warning: attributes are not identical across measure variables;
they will be dropped
Warning: Removed 1 rows containing missing values (geom_point).

Figure. Standardized centrality indices (node strength, betweenness, closeness) for the cross-sample network (Datasets 1 and 2 combined). The warning about a removed point refers to one betweenness value whose z-value (about 3.5) lies outside the plotted range of ±3.
Edge weights

merge(getWmat(net_both) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% 
        # the weight matrix is symmetric, so every second row is a duplicate
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_both) %>%        
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        unite(., Edge, node2, node1, sep = " - ") %>% 
        as.data.frame() %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),] %>% 
  datatable(., 
            rownames = F,
            caption = "Table: Edge-weights for each edge within the cross-sample network with bootstrap confidence intervals",
            colnames = c("Edge", "Weight", "lower 95% CI", "upper 95% CI"),
            extensions = "Buttons",
            options = list(dom = "Bfrtip", buttons = c("csv", "pdf"))) %>% 
    formatRound(c("weight", "CIlower", "CIupper"), 3)

Key nodes & edges

Nodes with highest centrality

cbind(
# Node strength (for this undirected network, OutDegree equals strength)
data.frame(head(sort(centrality(net_both)$OutDegree, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Betweenness
data.frame(head(sort(centrality(net_both)$Betweenness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Closeness
data.frame(head(sort(centrality(net_both)$Closeness, decreasing = T))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item)
) -> .mostcentralnodes_both
# placeholder column names; the display names are set in kable() below
names(.mostcentralnodes_both) <- c("a", "b", "c")
.mostcentralnodes_both %>% 
  mutate(Rang = 1:nrow(.)) %>% 
  dplyr::select(Rang, everything()) %>% 
  # keep the top 5 only
  .[-6,] %>% 
  kable(., "html",
        caption = "Table: Top 5 most central nodes in the cross-sample dataset",
        col.names = c("Rank", "Node strength", "Betweenness", "Closeness"),
        align = "c") %>% 
  kable_styling("striped", full_width = F) %>% 
  footnote(general = "Nodes are ordered by rank; rank 1 is the most central node on the respective measure.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 most central nodes in the cross-sample dataset

Rank   Node strength   Betweenness   Closeness
   1   S_26            S_10          S_02
   2   S_12            S_02          S_19
   3   S_05            S_34          S_10
   4   S_25            S_22          S_18
   5   S_28            S_35          S_34

Note. Nodes are ordered by rank; rank 1 is the most central node on the respective measure.



Nodes with lowest centrality

cbind(
# Node strength (for this undirected network, OutDegree equals strength)
data.frame(head(sort(centrality(net_both)$OutDegree, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Betweenness
data.frame(head(sort(centrality(net_both)$Betweenness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item),

# Closeness
data.frame(head(sort(centrality(net_both)$Closeness, decreasing = F))) %>% 
  mutate(Item = rownames(.)) %>% dplyr::select(Item)
) -> .mostcentralnodes_both
# placeholder column names; the display names are set in kable() below
names(.mostcentralnodes_both) <- c("a", "b", "c")
.mostcentralnodes_both %>% 
  mutate(Rang = 1:nrow(.)) %>% 
  dplyr::select(Rang, everything()) %>% 
  # keep the bottom 5 only
  .[-6,] %>% 
  kable(., "html",
        caption = "Table: Top 5 least central nodes in the cross-sample dataset",
        col.names = c("Rank", "Node strength", "Betweenness", "Closeness"),
        align = "c") %>% 
  kable_styling("striped", full_width = F) %>% 
  footnote(general = "Nodes are ordered by rank; rank 1 is the least central node on the respective measure.",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 least central nodes in the cross-sample dataset

Rank   Node strength   Betweenness   Closeness
   1   S_43            S_16          S_43
   2   S_17            S_31          S_31
   3   S_13            S_42          S_21
   4   S_21            S_43          S_16
   5   S_31            S_21          S_45

Note. Nodes are ordered by rank; rank 1 is the least central node on the respective measure.



Strongest edges

merge(getWmat(net_both) %>% 
        as.data.frame() %>% 
        mutate(node1 = rownames(.)) %>% 
        gather(., node2, weight, -node1) %>% 
        .[order(.$weight, decreasing = T),] %>% head(10) %>% 
        # every second row is a duplicate -> removing those
        filter(row_number() %% 2 != 0) %>% 
        unite(., Edge, node1, node2, sep = " - "),
      summary(boot1_both) %>%
        dplyr::select(type, node1, node2, starts_with("CI")) %>%
        filter(node1 == "S_26" & node2 == "S_27" |
                 node1 == "S_04" & node2 == "S_05" |
                 node1 == "S_06" & node2 == "S_25" |
                 node1 == "S_20" & node2 == "S_23" |
                 node1 == "S_28" & node2 == "S_29") %>%
        as.data.frame() %>% 
        unite(., Edge, node2, node1, sep = " - ") %>% 
        dplyr::select(-type),
      by = "Edge") %>% 
  .[order(.$weight, decreasing = T),] %>% 
  kable(., "html",
        digits = 3,
        caption = "Table: Top 5 edges in the cross-sample dataset with weights and 95% confidence intervals",
        row.names = F,
        col.names = c("Edge", "Weight", "lower bound", "upper bound")) %>% 
  kable_styling("striped", full_width = F) %>% 
  add_header_above(., c(" " = 2, "95% CI" = 2)) %>% 
  footnote(general = "The confidence intervals were bootstrapped (B = 5000).",
           footnote_as_chunk = T, general_title = "Note.")
Table: Top 5 edges in the cross-sample dataset with weights and 95% confidence intervals

                                   95% CI
Edge          Weight   lower bound   upper bound
S_27 - S_26    0.498         0.441         0.555
S_05 - S_04    0.464         0.401         0.526
S_25 - S_06    0.381         0.320         0.442
S_23 - S_20    0.380         0.316         0.444
S_29 - S_28    0.368         0.299         0.438

Note. The confidence intervals were bootstrapped (B = 5000).
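boot1_both, the bootstrap object behind these intervals, was estimated earlier in this document. A call along the following lines produces such a nonparametric edge-weight bootstrap (a sketch with assumed arguments, not the verbatim code).

# sketch: nonparametric bootstrap of the edge weights, B = 5000 resamples,
# assuming net_both is the output of estimateNetwork()
boot1_both <- bootnet(net_both, nBoots = 5000, type = "nonparametric")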

Stability & Accuracy

Stability Plot

plot(boot2_both, statistics = "all") + ggtitle("Cross-Sample Dataset")

Figure. Average correlation between the centrality indices of the original sample and those of the case-dropping bootstrapped samples (y-axis), plotted against the percentage of cases retained in the bootstrapped samples (x-axis).
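The case-dropping bootstrap object boot2_both underlying this plot was also created earlier; a minimal sketch of such a call (argument values are assumptions) is:

# sketch: case-dropping bootstrap for centrality stability
boot2_both <- bootnet(net_both, nBoots = 5000, type = "case",
                      statistics = c("strength", "closeness", "betweenness"))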

CS-Coefficient

corStability(boot2_both, cor = 0.7)
=== Correlation Stability Analysis === 

Sampling levels tested:
   nPerson Drop%   n
1      207  75.0  93
2      219  73.6  84
3      231  72.1 103
4      242  70.8 101
5      254  69.3  87
6      266  67.9 104
7      278  66.4 119
8      290  65.0  98
9      302  63.5 101
10     313  62.2  93
11     325  60.7 101
12     337  59.3  95
13     349  57.9  95
14     361  56.4 112
15     373  55.0  99
16     384  53.6 107
17     396  52.2 112
18     408  50.7 104
19     420  49.3  99
20     432  47.8 102
21     444  46.4 105
22     455  45.0 108
23     467  43.6 110
24     479  42.1  98
25     491  40.7  93
26     503  39.3 101
27     515  37.8  92
28     526  36.5 103
29     538  35.0 113
30     550  33.6 106
31     562  32.1  86
32     574  30.7 113
33     586  29.2  92
34     597  27.9 102
35     609  26.4  97
36     621  25.0  98
37     633  23.6 102
38     645  22.1  97
39     656  20.8  90
40     668  19.3  96
41     680  17.9 103
42     692  16.4  84
43     704  15.0  89
44     716  13.5 108
45     727  12.2 106
46     739  10.7 114
47     751   9.3  88
48     763   7.9  98
49     775   6.4 106
50     787   5.0  93

Maximum drop proportions to retain correlation of 0.7 in at least 95% of the samples:

betweenness: 0.393 
  - For more accuracy, run bootnet(..., caseMin = 0.378, caseMax = 0.407) 

closeness: 0.393 
  - For more accuracy, run bootnet(..., caseMin = 0.378, caseMax = 0.407) 

edge: 0.693 
  - For more accuracy, run bootnet(..., caseMin = 0.679, caseMax = 0.708) 

strength: 0.55 
  - For more accuracy, run bootnet(..., caseMin = 0.536, caseMax = 0.564) 

Accuracy can also be increased by increasing both 'nBoots' and 'caseN'.
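Following the rule of thumb of Epskamp, Borsboom, and Fried (2018), the CS-coefficient should not fall below 0.25 and should preferably exceed 0.5. By that standard, edge weights (0.69) and node strength (0.55) can be regarded as stable, whereas betweenness and closeness (both 0.39) should be interpreted with some care.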



Node-Strength Differences

plot(boot1_both, "strength", order = "sample", labels = T, plot = "difference") + ggtitle("Cross-Sample Dataset")
Expected significance level given number of bootstrap samples is approximately: 0.05

Figure. Significant differences in node strength for the cross-sample network. Black cells = significant difference between two nodes; grey cells = no significant difference. The diagonal shows the estimated mean node strength across all bootstrap samples, ordered from the weakest node in the lower left to the strongest node in the upper right.
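Individual pairwise comparisons behind this plot can also be run directly with bootnet's differenceTest(); for example (a sketch, using two of the strongest nodes from the centrality tables above):

# sketch: bootstrapped difference test for the strength of two nodes
differenceTest(boot1_both, "S_26", "S_12", measure = "strength")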

Edge-Weight Stability

plot(boot1_both, labels = F, order = "sample") + ggtitle("Cross-Sample Dataset")

Figure. Edges (y-axis) between two nodes and their weight (x-axis) in the sense of a partial correlation. Red dots = edge weights of the original network. Grey areas = 95% confidence interval around each weight. Black dots = mean values across all bootstrap samples.

End of Analysis

timestamp()
##------ Fri May 29 13:05:10 2020 ------##
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lsr_0.5                     corrgram_1.13              
 [3] DT_0.11                     labelled_2.2.1             
 [5] cowplot_1.0.0               NetworkComparisonTest_2.2.1
 [7] kableExtra_1.1.0            bootnet_1.2.4              
 [9] ggplot2_3.2.1               igraph_1.2.4.2             
[11] qgraph_1.6.4                psych_1.9.12.31            
[13] tidyr_1.0.0                 dplyr_0.8.3                
[15] knitr_1.26                  pacman_0.5.1               

loaded via a namespace (and not attached):
  [1] R.utils_2.9.2        tidyselect_0.2.5     lme4_1.1-21         
  [4] htmlwidgets_1.5.1    grid_3.6.1           TSP_1.1-7           
  [7] munsell_0.5.0        codetools_0.2-16     withr_2.1.2         
 [10] parcor_0.2-6         colorspace_1.4-1     NetworkToolbox_1.3.3
 [13] highr_0.8            rstudioapi_0.10      stats4_3.6.1        
 [16] labeling_0.3         huge_1.3.4           mnormt_1.5-5        
 [19] farver_2.0.1         vctrs_0.2.1          generics_0.0.2      
 [22] xfun_0.9             gclus_1.3.2          R6_2.4.1            
 [25] doParallel_1.0.15    smacof_2.0-0         seriation_1.2-8     
 [28] bitops_1.0-6         assertthat_0.2.1     promises_1.1.0      
 [31] scales_1.1.0         nnet_7.3-12          gtable_0.3.0        
 [34] weights_1.0          rlang_0.4.2          zeallot_0.1.0       
 [37] cmprsk_2.2-9         splines_3.6.1        lazyeval_0.2.2      
 [40] acepack_1.4.1        wordcloud_2.6        broom_0.5.3         
 [43] checkmate_1.9.4      yaml_2.2.0           reshape2_1.4.3      
 [46] abind_1.4-5          longitudinal_1.1.12  crosstalk_1.0.0     
 [49] d3Network_0.5.2.1    ppls_1.6-1.1         backports_1.1.5     
 [52] httpuv_1.5.2         Hmisc_4.3-0          tools_3.6.1         
 [55] lavaan_0.6-5         ggm_2.3              ellipsis_0.3.0      
 [58] gplots_3.0.1.2       RColorBrewer_1.1-2   polynom_1.4-0       
 [61] Rcpp_1.0.3           plyr_1.8.5           base64enc_0.1-3     
 [64] purrr_0.3.3          rpart_4.1-15         pbapply_1.4-2       
 [67] viridis_0.5.1        zoo_1.8-6            haven_2.2.0         
 [70] cluster_2.1.0        survey_3.36          magrittr_1.5        
 [73] data.table_1.12.8    openxlsx_4.1.4       mvtnorm_1.0-12      
 [76] matrixcalc_1.0-3     whisker_0.4          mitml_0.3-7         
 [79] xtable_1.8-4         mime_0.8             hms_0.5.3           
 [82] evaluate_0.14        rio_0.5.16           jpeg_0.1-8.1        
 [85] etm_1.0.5            readxl_1.3.1         gridExtra_2.3       
 [88] shape_1.4.4          compiler_3.6.1       ellipse_0.4.1       
 [91] tibble_2.1.3         mice_3.7.0           KernSmooth_2.23-15  
 [94] crayon_1.3.4         minqa_1.2.4          R.oo_1.23.0         
 [97] htmltools_0.4.0      later_1.0.0          mgcv_1.8-28         
[100] corpcor_1.6.9        Formula_1.2-3        DBI_1.1.0           
[103] relaimpo_2.2-3       mgm_1.2-7            MASS_7.3-51.4       
[106] boot_1.3-22          IsingSampler_0.2     Matrix_1.2-17       
[109] IsingFit_0.3.1       car_3.0-6            readr_1.3.1         
[112] heplots_1.3-5        mitools_2.4          R.methodsS3_1.7.1   
[115] gdata_2.18.0         parallel_3.6.1       pan_1.6             
[118] BDgraph_2.62         forcats_0.4.0        pkgconfig_2.0.3     
[121] registry_0.5-1       numDeriv_2016.8-1.1  foreign_0.8-71      
[124] xml2_1.2.2           foreach_1.4.7        pbivnorm_0.6.0      
[127] webshot_0.5.2        rvest_0.3.5          stringr_1.4.0       
[130] digest_0.6.23        Epi_2.40             rmarkdown_2.0       
[133] cellranger_1.1.0     htmlTable_1.13.3     dendextend_1.13.2   
[136] curl_4.3             shiny_1.4.0          gtools_3.8.1        
[139] jomo_2.6-10          rjson_0.2.20         nloptr_1.2.1        
[142] jsonlite_1.6         lifecycle_0.1.0      nlme_3.1-140        
[145] glasso_1.11          carData_3.0-3        viridisLite_0.3.0   
[148] pillar_1.4.3         lattice_0.20-38      fastmap_1.0.1       
[151] GeneNet_1.2.13       httr_1.4.1           plotrix_3.7-7       
[154] survival_2.44-1.1    glue_1.3.1           networktools_1.2.1  
[157] zip_2.0.4            fdrtool_1.2.15       png_0.1-7           
[160] iterators_1.0.12     candisc_0.8-0        glmnet_3.0-2        
[163] stringi_1.4.3        nnls_1.4             latticeExtra_0.6-29 
[166] caTools_1.17.1.3     eigenmodel_1.11     

Code: Matthias Sehlbrede
Text: Katharina Senger & Matthias Sehlbrede

References

Csárdi, Gábor, and Tamás Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems 1695. http://www.interjournal.org/manuscript_abstract.php?361100992.

Epskamp, Sacha, Denny Borsboom, and Eiko I. Fried. 2018. “Estimating Psychological Networks and Their Accuracy: A Tutorial Paper.” Behavior Research Methods 50 (1): 195–212. https://doi.org/10.3758/s13428-017-0862-1.

Epskamp, Sacha, Angélique O. J. Cramer, Lourens J. Waldorp, Verena D. Schmittmann, and Denny Borsboom. 2012. “Qgraph: Network Visualizations of Relationships in Psychometric Data.” Journal of Statistical Software 48 (4). https://doi.org/10.18637/jss.v048.i04.

Epskamp, Sacha, and Eiko I. Fried. 2018. “A Tutorial on Regularized Partial Correlation Networks.” Psychological Methods 23 (4): 617–34. https://doi.org/10.1037/met0000167.

Foygel, Rina, and Mathias Drton. 2010. “Extended Bayesian Information Criteria for Gaussian Graphical Models.” http://arxiv.org/pdf/1011.6640v1.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. “Sparse Inverse Covariance Estimation with the Graphical Lasso.” Biostatistics (Oxford, England) 9 (3): 432–41. https://doi.org/10.1093/biostatistics/kxm045.

Larmarange, Joseph. 2019. “Labelled: Manipulating Labelled Data.” https://CRAN.R-project.org/package=labelled.

McKnight, Patrick E. 2007. Missing Data: A Gentle Introduction. New York: Guilford Press.

Navarro, Daniel J. 2015. Learning Statistics with R: A Tutorial for Psychology Students and Other Beginners. Adelaide, Australia: University of Adelaide.

Newman, Mark E. J. 2010. Networks: An Introduction. Oxford: Oxford University Press.

Pons, Pascal, and Matthieu Latapy. 2006. “Computing Communities in Large Networks Using Random Walks.” Journal of Graph Algorithms and Applications 10 (2): 191–218.

Revelle, William. 2018. Psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. http://CRAN.R-project.org/package=psych.

Rhemtulla, Mijke, Eiko I. Fried, Steven H. Aggen, Francis Tuerlinckx, Kenneth S. Kendler, and Denny Borsboom. 2016. “Network Analysis of Substance Abuse and Dependence Symptoms.” Drug and Alcohol Dependence 161: 230–37. https://doi.org/10.1016/j.drugalcdep.2016.02.005.

Rief, Winfried, W Hiller, and J Heuser. 1997. SOMS - Screening Für Somatoforme Störungen. Bern: Huber.

Tibshirani, Robert. 1994. “Regression Shrinkage and Selection via the Lasso.”

Van Borkulo, Claudia, Lynn Boschloo, Jolanda J. Kossakowski, Pia Tio, Robert Schoevers, Denny Borsboom, and Lourens Waldorp. 2017. “Comparing Network Structures on Three Aspects: A Permutation Test.” Unpublished. https://doi.org/10.13140/RG.2.2.29455.38569.

Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. New York: Springer.

Wickham, Hadley, Romain Francois, Lionel Henry, and Kirill Müller. 2017. “Dplyr: A Grammar of Data Manipulation.” https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Lionel Henry. 2017. “Tidyr: Easily Tidy Data with ’Spread()’ and ’Gather()’ Functions.” https://CRAN.R-project.org/package=tidyr.

Wilke, Claus O. 2019. “Cowplot: Streamlined Plot Theme and Plot Annotations.” https://CRAN.R-project.org/package=cowplot.

Wright, Kevin. 2018. “Corrgram: Plot a Correlogram.” https://CRAN.R-project.org/package=corrgram.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2019. “DT: A Wrapper of the Javascript Library ’Datatables’.” https://CRAN.R-project.org/package=DT.

Zhu, Hao. 2019. “KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax.” https://CRAN.R-project.org/package=kableExtra.