본문 바로가기
Statistics/Bayesian

GLMM 이해하기 - 2 (+EDA 꿀팁)

by bents 2021. 2. 22.

LMMs model the variance hierarchically,

estimating the processes that generate among-group variation in means, as well as variation within groups.

 

Fixed effects, Random effects로 어떤 변수를 선택해야 할까?

Absolute rules for how to classify something as a fixedor random effect generally are not useful because that decision can change depending the goals of the analysis
*각 비교집단의 모평균에 차이점이 있는지 확인해보자

- Fixed effects : assumes the five ‘group’ means(mu) are all independent of one another, and share a common residual variance (등분산)

- Random intercept effects : assumes that the five measured group means(mu) are only a subset of the realised possibilities drawn from a ‘global’ set of population means

 

내가 알아내고자 하는 관계에 해당하는 변수들이 fixed effect와 response에 해당함.

따로 계층변수를 가공해서 설명변수처럼 사용할 수 도 있다.

 

GLMM'S properties

1. Controlling for non-independence among data points

hypothesis : the probability of successful breeding for an animal is a function of its body mass.

If we had measured animals from multiple sampling sites, we might wish to fit ‘sampling site’ as a random intercept, and estimate a common slope (change in breeding success) for body mass across all sampling sites by fitting it as a fixed effect:

glmer(successful.breed ∼ body.mass + (1|sample.site), family=binomial)

hypothesis : the strength of the effect (slope) of body mass on breeding success varies depending on the sampling sites

glmer(successful.breed ∼ body.mass + (body.mass|sample.site), family=binomial)

2. Improving the accuracy of parameter estimation

Random Effect는 각 그룹별 평균이 전체 평균의 분포에서 가져온 값이라고 가정하기 때문에  그룹별 평균은 전체 모평균에 회귀함. 이를 Shirinkage라고 하며 표준오차를 줄여주는 효과가 있음. 각 그룹별 표본수가 없을 때에 유용하며, CLT에 의해 정규분포가정도 사용할만함.

3. Estimating variance components

Hypothesis : whether different females tend to produce consistently different clutch masses

‘FemaleID’ : a random intercept ,

we estimate the among-female variance in our trait of interest.& estimate the residual variance term.

--> ‘intra-class correlation coefficient’ that measures individual repeatability in our trait

Variance component analysis is a powerful tool for partitioning variation in a focal trait among interesting groups

Model <- lmer(ClutchMass ∼ 1 + (1|FemaleID)

4. Making predictions for unmeasured groups

분산분석도 되고, 관측치의 평균에서 그룹의 평균을 예측할 수 도 있음.

 

모델링시, 주의사항

  1. 그룹의 종류가 5개 이상이어야 한다. they are quite ‘data hungry’; requiring at least five ‘levels’ (groups) for a random intercept term to achieve robust estimates of variance (Gelman & Hill, 2007; Harrison, 2015). With <5 levels, the mixed model may not be able to estimate the among-population variance accurately. In this case, the variance estimate will either collapse to zero, making the model equivalent to an ordinary GLM (Gelman & Hill, 2007, p. 275) or be non-zero but incorrect if the small number of groups that were sampled are not representative of true distribution of means (Harrison, 2015).

  2. 그룹별 표본크기의 불균형이 심각하면 안된다.  models can be unstable if sample sizes across groups are highly unbalanced i.e. if some groups contain very few data. These issues are especially relevant to random slope models (Grueber et al., 2011).

  3. 그룹별 분산의 유효성을 결정해야 한다. an important issue is the difficulty in deciding the ‘significance’ or ‘importance’ of variance among groups. The variance of a random effect is inevitably at least zero, but how big does it need to be considered of interest? Fitting a factor as a fixed effect provides a statement of the significance of differences (variation) among groups relatively easily. Testing differences among levels of a random effect is made much more difficult for frequentist analyses, though not so in a Bayesian framework (Kéry, 2010, see ‘Testing Significance of Random Effects’ section).

  4. 아래의 경우를 따져서 잘못된 모수추정을 막자. an issue that is not often addressed is that of mis-specification of random effects. GLMMs are powerful tools, but incorrectly parameterising the random effects in the model could yield model estimates that are as unreliable as ignoring the need for random effects altogether.

Examples include: 

(i) failure to recognise non-independence caused by nested structures in the data
e.g. multiple clutch measures from a single bird;
(ii) failing to specify random 
slopes to prevent constraining slopes of predictors to be identical across clusters 
(iii) testing the significance of fixed effects at the wrong ‘level’ of hierarchical models that ultimately leads to pseudoreplication and inflated Type I error rates. Traditionally users of LMMs might have used 
F-tests of significance. F-tests are ill-advised for unbalanced experimental designs and irrelevant for non-Gaussian error structures, but they at least provide a check of model hierarchy using residual degrees of freedom for fixed effects. The now-standard use of likelihood ratio tests (LRT) of significance in LMMs means that users and readers have little opportunity to check the position of significance tests in the hierarchy of likelihoods.

모델 설계하기

1. 모델 가정 충족시키기

additivity of the linear predictors,

independence of errors,

equal variance of errors (homoscedasticity) 

normality of errors 

- 모델 가정을 충족시키지 못하면, 모델의 성능을 담보할 수 없다.

 

1) (이분산성) 종속변수를 변환시킴.

- mean-variance stabilising transformations 을 통해 그룹별 이분산성을 완화시킬 수 있음.

- varIdent (R package)

 

2) (비정규성) 

: 포아송분포(개수), 이항분포(조건부 개수), 베르누이분포(반응여부)를 따르는 데이터..

- 정규근사 시키는 변형시킴 ( 박스콕스 ...)

- GLMM과 링크함수 사용하기

*링크함수가 일반 데이터 변형과 무엇이 다른가?
the mean of a log-transformed response (using a data transformation)
is not identical to the logarithm of a fitted mean (using a link function)
.

*그러나 변형은 어디까지나 상황마다 다를 수 있으니 둘다 써보고 "선택"해야 함.

*데이터 변형은 변수간의 관계를 변화시킬수 있음.

*GLMM은 이런 조정(변형)하는 부분을 제거시킨 모델임.(가정없어도 모델링가능)

 

2. 랜덤효과 변수 간의 관계에 따라 모델링이 달라진다.

- Random intercept : 클러스터 변수끼리 계층관계가 존재하면 nested / 서로 독립적이면 cross

# nested : Woodland & Female ID
# cross : Year
Clutch Mass ∼ Foraging Rate + (1|Woodland/Female ID)+ (1|Year)

3. 랜덤 slope는 데이터가 충분하다면 무조건 넣는게 좋음. 

random slope models require large numbers of data to estimate variances and covariances accurately

 

4. 랜덤효과 외에 상호작용효과, 고정효과 들도 모델링시, 고려대상.

모든 효과를 넣은 모델을  Maximal model, global model 이라고 함.

과연 predictor가 유의미한지는 연구자의 지식와 데이터 이해도에 달려 있다.

- 가장 최악 : 가진 데이터에 비해 과도하게 많은 모수를 가진 모델

- 변수선택은 stepwise ,  LRT, p-value로 함.

-  예측변수의 공산성, 표준화, 중심화 하고, 모델 성능은 AIC로 하면 됨.

Stability of variance components

+ testing significance of random effects

 

5. 시뮬레이션으로 모델 적합성 평가하기 (+top AIC)

 

출처 ] peerj.com/articles/4794/#


표준화

# 정규화/표준화/스케일링
from sklearn import preprocessing

data["age_scaled"] = preprocessing.scale(data.age.values)

단순회귀 시각화

from sklearn.linear_model import LinearRegression

# construct our linear regression model
model = LinearRegression(fit_intercept=True)
x = data.age_scaled
y = data.bounce_time

# fit our model to the data
model.fit(x[:, np.newaxis], y)

# and let's plot what this relationship looks like 
xfit = np.linspace(-3, 3, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);

회귀 잔차 시각화

# 방법1
import yellowbrick
from yellowbrick.regressor import ResidualsPlot

# Instantiate the linear model and visualizer
visualizer = ResidualsPlot(model = model)

visualizer.fit(x[:, np.newaxis], y)  # Fit the training data to the model
visualizer.poof()                    # Draw/show/poof the data


#방법2
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

ax = sns.residplot(x = "age_scaled", y= "bounce_time", data = data, lowess = True)
ax.set(ylabel='Observed - Prediction')
plt.show()

산점도 (카테고리변수 - 수치형 변수)

sns.catplot(x="county", y="bounce_time", data=data, kind = "swarm")

산점도 (카테고리별 산점도)

sns.catplot(x="county", y="bounce_time", data=data, kind = "swarm")

 

카테고리별 회귀선 

grid = sns.lmplot(x = "age_scaled", y = "bounce_time", col = "county", 
sharex=False, col_wrap = 4, data = data, height=4)

#residplot! 꼭 같이 쓰기

Source]

www.kaggle.com/ojwatson/mixed-models