Gaussian process

Smriti Kumar Sinha. Trust and reputation modeling and management based on a social approach has been shown to provide the necessary safeguards against malicious interacting partners. At the heart of any trust modeling and management mechanism, predicting trust values in order to decide whether to interact at a future time is a key component.

Trust prediction estimates the potentially unknown trustworthiness of a target partner from its previously observed behaviour and from recommendations received from other peers.

In this paper, a trust prediction model is proposed in which a Markov model detects the behaviour pattern that may prevail at a future time point, and the trust value is then obtained from a Gaussian process using the detected pattern.

Our Bayesian methods require minimal tuning and provide results with uncertainty estimates, so they can easily be used by a healthcare provider to assist with patient care. EHR data have been analyzed using Bayesian nonparametric methods, but the literature on using ICD codes as covariates is sparse.

Henao et al. summarize EHR data in terms of counts of medication usage, laboratory tests, and diagnoses specified using ICD codes. If the EHR data only contain information about the presence or absence of disease, then Bayesian nonparametric extensions of low-rank matrix factorization perform better in discovering latent disease groups [11]. Recently, deep learning has been widely used for automated regression and classification using EHR data, but the major focus remains on mining clinical notes rather than on using the information encoded in ICD codes or on providing uncertainty estimates [15].

A major limitation of all these approaches is that they ignore the additional structure provided by the membership of diagnoses in different chronic conditions. A widespread practice is to define the similarity of two patients as proportional to the number of common ICD codes in their diagnoses or to the Jaccard index. Our similarity measure is superior to this practice in that its value can be positive for a pair of non-identical ICD codes, and its magnitude depends on the degree of overlap between the pair.

In fact, our similarity measure belongs to the class of string kernels and extends a restricted version of the boundrange kernel, which has been widely used in text mining and information retrieval [9, 18]. The similarity between a pair of ICD codes is defined as the total number of common substrings between them, where a substring always begins at the first character of an ICD code. This similarity measure is extended to a pair of subsets as a weighted average of the similarities of all ICD code pairs formed using the two subsets.
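To make this concrete, below is a minimal sketch of the prefix-counting similarity and its extension to subsets, assuming uniform pair weights; the function names are illustrative, and the down-weighting of commonly occurring diagnoses used in the paper is not reproduced here.

```python
def icd_similarity(code_a: str, code_b: str) -> int:
    """Number of common substrings that start at the first character,
    i.e. the count of shared prefixes of the two ICD codes."""
    shared = 0
    for i in range(1, min(len(code_a), len(code_b)) + 1):
        if code_a[:i] == code_b[:i]:
            shared += 1
        else:
            break
    return shared


def subset_similarity(codes_a, codes_b) -> float:
    """Extension to subsets of ICD codes: a (here uniformly) weighted
    average of the similarities over all code pairs from the two subsets."""
    pairs = [(a, b) for a in codes_a for b in codes_b]
    if not pairs:
        return 0.0
    return sum(icd_similarity(a, b) for a, b in pairs) / len(pairs)
```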

Our second major contribution is to use the similarity between patients for nonparametric regression and classification using GP priors. This is done in two stages.

First, using the similarity measure, we define the equivalents of polynomial, exponential, and squared exponential kernels on subsets of ICD codes, which represent chronic conditions. Second, we use these kernels as covariance functions for defining GP priors indexed by subsets of ICD codes.
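As an illustration, the sketch below builds squared exponential (SE) and polynomial covariance functions from the subset similarity defined earlier and assembles the covariance matrix of a GP prior indexed by the patients' code subsets. This is one standard construction and the paper's exact parameterizations may differ; se_kernel, poly_kernel, and gram_matrix are illustrative names that reuse the subset_similarity sketch from above.

```python
import numpy as np


def se_kernel(T_a, T_b, length_scale=1.0):
    """Squared exponential kernel on code subsets, built from the distance
    induced by the subset similarity (one common construction)."""
    d2 = (subset_similarity(T_a, T_a) + subset_similarity(T_b, T_b)
          - 2.0 * subset_similarity(T_a, T_b))
    return np.exp(-d2 / (2.0 * length_scale ** 2))


def poly_kernel(T_a, T_b, degree=2, offset=1.0):
    """Polynomial kernel on code subsets."""
    return (subset_similarity(T_a, T_b) + offset) ** degree


def gram_matrix(subsets, kernel=se_kernel):
    """Covariance matrix of a GP prior indexed by the patients' subsets."""
    n = len(subsets)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(subsets[i], subsets[j])
    return K
```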

These GP priors are used for fitting Bayesian nonparametric regression and classification models that are tuned for EHR data analysis [16].

We develop MCMC algorithms for automated fitting of these models that provide theoretically guaranteed estimation of posterior uncertainty [4]. The minimal tuning requirements make our method extremely attractive for routine applications in analyzing EHR data and for assisting healthcare providers in predicting risks related to chronic conditions.

We verify our claims through empirical studies and show that the proposed method provides a more detailed quantification of dependence of primary cancer sites on chronic conditions. Developing operational measures of multimorbidity is an active research area due to its importance in improving patient care. This serves as a motivation for developing a classifier that accounts for the grouping of diagnoses into chronic conditions.

The cohort included patients 18 years or older who were diagnosed with a malignant solid neoplasm of any type or site. The data included the patients' diagnoses encoded as ICD codes and the race and marital status of every patient. We filtered patients who had at least one of the 58 non-cancerous chronic conditions, which determined the final sample size. Unfortunately, the Bayesian toolkit provides limited options for answering such biomedical questions.

The main goal is to estimate marginal associations between chronic conditions and primary cancer sites, and a minor goal is to predict the primary cancer site using the ICD codes, which are structured strings; for example, J encodes allergy and D50 and D51 denote different kinds of anaemia. Clearly, using dummy variables to represent these ICD codes is inappropriate because D50 and D51 denote closely related diseases yet are assigned unrelated dummy variables.

Motivated by the text mining literature, a useful similarity measure for strings is defined by counting the number of matching substrings of different lengths. Accounting for the hierarchical structure of ICD codes, if we modify this definition by requiring the substrings to always start at the beginning of the code, then the similarities of the three pairs (J, D50), (J, D51), and (D50, D51) are 0, 0, and 2, respectively, which is more realistic. In the next section, we develop this idea further to account for the extra structure provided by the 58 chronic conditions while down-weighting the contributions of commonly occurring diagnoses.
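As a quick check, the icd_similarity sketch from above reproduces these three values, using the codes exactly as quoted in the text.

```python
print(icd_similarity("J", "D50"))    # 0: no shared prefix
print(icd_similarity("J", "D51"))    # 0: no shared prefix
print(icd_similarity("D50", "D51"))  # 2: shared prefixes "D" and "D5"
```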

The kernel function on two ICD codes is defined as a Euclidean dot product between their feature maps. The kernel in (2) belongs to the class of string kernels that are widely used in text mining and information retrieval [18]. Two popular string kernels are the spectrum and boundrange kernels [9].

The k-spectrum kernel defines the similarity as the number of matching substrings of length exactly k between two strings. The k-boundrange kernel instead defines the similarity as the number of matching substrings of length at most k. The substrings can be non-contiguous in both kernels, and the substring matches can be weighted by a factor that decays exponentially with the substring length.
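The sketch below implements the simplest contiguous, unweighted variants of the two kernels; the non-contiguous, exponentially weighted versions mentioned above are more involved and are not reproduced here.

```python
from collections import Counter


def substrings(s: str, k: int) -> Counter:
    """Counts of all contiguous substrings of s with length exactly k."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s: str, t: str, k: int) -> int:
    """k-spectrum kernel: matches of substrings of length exactly k."""
    cs, ct = substrings(s, k), substrings(t, k)
    return sum(cs[sub] * ct[sub] for sub in cs)


def boundrange_kernel(s: str, t: str, k: int) -> int:
    """k-boundrange kernel: matches of substrings of length at most k."""
    return sum(spectrum_kernel(s, t, j) for j in range(1, k + 1))
```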

In the motivating EHR data, the covariates include subsets of ICD codes that have an additional structure specified by their relation to a chronic condition. The proof of Theorem 3 is in the supplementary material along with the other proofs. The polynomial kernel has a finite-dimensional feature map, whereas we use the SE kernel in our experiments due to its popularity. We illustrate the application of six kernel functions in capturing the similarity of patients in the UIHC EHR data, where we expect patients with common primary sites to have similar sets of chronic conditions (Figure 1).

To highlight the similarity of patients, we have grouped the patients into six blocks depending on their primary cancer sites. The diagonal and off-diagonal blocks are clearest for the radial basis function kernel. The spectrum and boundrange kernel matrices fail to capture the similarities of patients, except those with cancer of the brain and nervous system, where the diagnoses are very similar compared to cancers at other sites (Figures 1e and 1f).

We model the effects of x_i and T_i independently.

Figure 1: The color palette represents 0 and 1 using dark blue and dark red, respectively. The last two kernel matrices are obtained using the spectrum and boundrange kernels, popular string kernels in text mining, with substring length equal to 3. The block correlation structures are better captured in (b), (c), and (d) than in (a), (e), and (f), indicating the superiority of the proposed kernel functions.

The derivation of this algorithm is in the supplementary materials along with the other derivations. The algorithm in steps 1-4 is a variant of the Gibbs sampling algorithm for posterior inference in univariate spatial linear models, in which the Metropolis-Hastings step is replaced by an ESS step in step 2 and a spatial location is replaced by T [1].

Unlike the Metropolis-Hastings step, the ESS step is free of any proposal tuning, which is preferred in automated applications (a minimal sketch of the ESS update is given below). This setup ensures that the MCMC algorithms for inference and prediction in the classification and regression models are very similar. Furthermore, by Theorem 1 in Polson et al., the MCMC algorithm for posterior inference and predictions in (16) follows from arguments similar to those used in Section 4.
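Below is a minimal sketch of a single ESS update for the latent GP vector, assuming a zero-mean GP prior with lower Cholesky factor K_chol and a generic log-likelihood function; the remaining Gibbs updates of the full algorithm are omitted.

```python
import numpy as np


def ess_step(f, K_chol, log_lik, rng):
    """One elliptical slice sampling update of the latent GP values f,
    given the lower Cholesky factor of the prior covariance; no proposal
    tuning is required."""
    nu = K_chol @ rng.standard_normal(len(f))   # prior draw defining the ellipse
    log_u = log_lik(f) + np.log(rng.uniform())  # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)       # initial angle and bracket
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_prop = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_prop) > log_u:
            return f_prop
        # shrink the bracket towards the current state and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```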

First, Polson et al. introduce the Polya-Gamma data augmentation (PG-DA) strategy for Bayesian inference in logistic models. Second, Wang and Roy [19] develop a sampling algorithm based on the PG-DA strategy for posterior inference in a Bayesian logistic linear mixed model with independent Gaussian random effects and prove its geometric ergodicity (a schematic PG-DA update is sketched below). Using the GP prior in Theorem 3 is equivalent to embedding the string T in [0, 1]^d via the feature map, and the GP covariance kernels are defined using this embedding. We now define the regularity of functions using three function spaces.
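For reference, here is a schematic sketch of the PG-DA update for GP classification under a zero-mean prior with covariance matrix K; the truncated-series Polya-Gamma sampler and the explicit matrix inversions are simplifications for illustration and are not the sampler analyzed in the paper.

```python
import numpy as np


def sample_pg_approx(z, rng, n_terms=200):
    """Approximate PG(1, z) draw via a truncated version of the
    infinite convolution-of-gammas representation (illustration only)."""
    k = np.arange(1, n_terms + 1)
    g = rng.standard_exponential(n_terms)  # Gamma(1, 1) draws
    return np.sum(g / ((k - 0.5) ** 2 + (z / (2.0 * np.pi)) ** 2)) / (2.0 * np.pi ** 2)


def pg_da_update(f, y, K, rng):
    """One PG-DA sweep: augment with omega, then draw the latent GP values
    from their Gaussian full conditional given omega and the 0/1 responses y."""
    omega = np.array([sample_pg_approx(fi, rng) for fi in f])
    kappa = y - 0.5
    cov = np.linalg.inv(np.linalg.inv(K) + np.diag(omega))
    mean = cov @ kappa
    return rng.multivariate_normal(mean, cov)
```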

We introduce some notation for stating the theoretical results and reformulate the models of Section 4 accordingly; the training data are denoted y_1, . . ., y_n. The proof of Theorem 4 is in the supplementary material. In this section, the covariance kernel of the GP is the SE kernel, and the focus is on classification because our main goal is to address questions relevant to the UIHC EHR data; the supplementary materials contain the results for GP-based regression using (8). In the simulated and real data analyses, the cutoff for predicting the response is estimated using the Receiver Operating Characteristic (ROC) curve.
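One common convention for choosing the cutoff from the ROC curve is to maximize Youden's J statistic; the sketch below does this with scikit-learn and is an assumption about the criterion rather than the paper's stated rule.

```python
import numpy as np
from sklearn.metrics import roc_curve


def roc_cutoff(y_true, prob):
    """Cutoff maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, prob)
    return thresholds[np.argmax(tpr - fpr)]

# usage: cutoff = roc_cutoff(y_test, predicted_prob); labels = predicted_prob >= cutoff
```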

We compare the predictive performance of all the methods using accuracy, area under the ROC curve (AUC), sensitivity, and specificity on the test data. We compare GP-based classification with parametric and nonparametric models. The parametric competitors are logistic regression and its regularized versions that penalize the regression coefficients using the lasso and ridge penalties.

Because the SE kernel in (5) is a restriction of the boundrange kernel, we use support vector machine (SVM) and kernel ridge regression (KRR) with the spectrum and boundrange kernels as the nonparametric competitors.

We also use the kernel function in (5) for defining feature vectors. We use glmnet [3] for regularized logistic regression, ranger [20] for random forest, and kernlab [8] for SVM and KRR.

The performance metrics are evaluated using the pROC package [17]. The tuning parameters in all the methods are selected using the recommended settings. We use the spectrum and boundrange kernels in kernlab with 3 and 4 as the length of the matching substring, respectively. This means that the matches in the two kernels include the first three and four contiguous characters in a pair of ICD codes, which are also the most informative [2]; therefore, we expect the results for kernlab and our method to be very similar.

The MCMC algorithms run for 10,000 iterations. The draws from the first 5,000 iterations are discarded as burn-in, and every fifth draw in the remaining chain is used for posterior inference and prediction. We expect all methods to have comparable performance in the first two simulations and to differ significantly in the third one. The three simulations are replicated ten times. First two simulations. Both simulations have four chronic conditions.

The chronic conditions are further structured into two groups: the first group includes the first and third chronic conditions, and the second group includes the remaining two. We assign nine codes to every patient, where six of the nine codes are drawn from the 12 codes defining the first or second group of chronic conditions and the remaining three come from the other group.

The response y_i is simulated independently from Bernoulli(p_i), where p_i is specified through a double sum of coefficients over the four chronic conditions and the codes assigned to patient i. The second simulation is a slight modification of the first. The first simulation has a linear decision boundary in terms of the 24 codes.
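A hedged sketch of this sampling scheme is given below; the code labels, the equal group probabilities, the standard-normal coefficients, and the logistic link are all assumptions made for illustration, since the exact codes and coefficients of the original formula are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# 24 hypothetical codes, six per chronic condition; conditions 1 and 3 form
# the first group and conditions 2 and 4 form the second, as described above.
codes = [f"C{c}{j}" for c in range(4) for j in range(6)]
group_a = [codes[i] for c in (0, 2) for i in range(6 * c, 6 * c + 6)]
group_b = [c for c in codes if c not in group_a]
coef = rng.normal(size=len(codes))  # assumed linear coefficients


def simulate_patient():
    """Nine codes per patient: six from one group, three from the other,
    with a Bernoulli response whose logit is linear in the code indicators."""
    main, other = (group_a, group_b) if rng.uniform() < 0.5 else (group_b, group_a)
    diag = set(rng.choice(main, 6, replace=False)) | set(rng.choice(other, 3, replace=False))
    x = np.array([c in diag for c in codes], dtype=float)
    p = 1.0 / (1.0 + np.exp(-x @ coef))
    return diag, rng.binomial(1, p)


data = [simulate_patient() for _ in range(200)]
```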

Logistic regression and its regularized versions perform the best in the first simulation, but their performance deteriorates slightly in the second simulation due to the presence of interactions between dummy variables. Random forest using the codes as covariates has the worst performance in both simulations. After including SE kernel-based covariates, random forest achieves the same level of performance as the other methods in both simulations.

The remaining methods are based on non-linear kernels, so they are more flexible in adapting to both linear and non-linear decision boundaries. Compared to the spectrum and boundrange kernels, the SE kernel in (5) is better tuned for modeling the extra structure provided by the chronic conditions.

Third simulation. The response y_i is simulated independently from Bernoulli(p_i), where p_i is defined through a nonlinear periodic function of the code indicators; see (29). The GP with the SE covariance kernel is still among the top performers (Table 1). This simulation has a nonlinear periodic decision boundary; therefore, the performance of random forest and the parametric models, including logistic regression and its regularized versions, deteriorates significantly.

While all the kernel-based methods are suited to modeling the non-linear periodic decision boundary in (29), the SE kernel is better tuned than the spectrum and boundrange kernels for modeling the hierarchical structure of ICD codes.

Our simulation studies suggest that kernel-based methods are better suited to applications where we expect interactions among the diagnoses. Furthermore, the GP with the SE kernel is easily extended to account for additional biomedical information, is tuned for modeling the structure of ICD codes, and produces results similar to those of SVM or KRR with the spectrum and boundrange kernels in the absence of any additional structure.

Table 1: Performance comparisons for the three simulation studies. Every entry in the table is the average of its values across ten replications. SE-GP is the proposed method, and SE kernel-based features are obtained from the kernel matrix estimated using the proposed method.

There are six types of cancer sites: brain and other nervous system (brain), breast, urinary system, respiratory system, female genital system, and digestive system. The major and minor goals of the analysis are to estimate the marginal associations of the 58 chronic conditions with the cancer sites and to predict the primary cancer site using the patient diagnoses and marital status, respectively.

We use the methods from the previous section to achieve both goals. There is no principled way of using chronic conditions as covariates in logistic regression and its penalized extensions, so we use only KRR, SVM, and SE-GP for estimating the marginal associations between the 58 chronic conditions and the primary cancer sites.

To this end, we include T_1, . . . as covariates. SE-GP is among the top performers in all five classification models (Tables 2-6).

We do not provide results for KRR and logistic regression with no penalty because the former fails due to a line search error and the latter cannot be used because the number of dummy variables is larger than the sample size. This is confirmed by the relatively high accuracy of logistic regression with the ridge and lasso penalties in predicting cancers of the brain and urinary system; however, these methods do not use the information encoded in the ICD codes, so their sensitivity and specificity vary widely; for example, logistic regression with the lasso penalty has a high accuracy of 0.

On the other hand, SE-GP is among the top performers in terms of accuracy, sensitivity, and specificity in all five classification models. The features obtained from the SE kernel matrix are also promising in that, if we use them as covariates in logistic regression and its penalized extensions, then the AUCs are relatively large in all five classification models.

The performance of the default random forest is the worst in all five classification models. SE-GP also performs better than SVM with the spectrum and boundrange kernels because the SE kernel accounts for the additional structure provided by the 58 chronic conditions. The SE kernel is a restriction of the boundrange kernel that is tuned for EHR data analysis, so the sensitivity and specificity of SE-GP are much higher than those of the SVM in all five classification models.

Most importantly, the estimates of the marginal associations between chronic conditions and primary cancer sites agree closely with those obtained using SVM with the spectrum and boundrange kernels (Table 7). Additionally, these credible intervals cover the corresponding estimates obtained using SVM. Based on our results in the simulations and this section, we conclude that SE-GP outperforms its competitors in estimating the marginal associations between chronic conditions and primary cancer sites and in predicting the primary cancer site using diagnoses and demographic information as predictors.

First, the sample size in the motivating application is small, so repeated computations and evaluations of the kernel function are relatively inexpensive. This, however, becomes problematic as the sample size grows moderately large.

The kernels developed in this work can be immediately extended for nonparametric regression and classification in large sample settings using low-rank kernels based on inducing points [14].
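A generic sketch of such a low-rank construction, via the Nystrom (inducing-point) approximation of the patient kernel matrix, is shown below; the specific low-rank kernels of [14] are not reproduced, and the function and argument names are illustrative.

```python
import numpy as np


def nystrom_approx(subsets, inducing_idx, kernel):
    """Low-rank approximation K ~ K_nm K_mm^{-1} K_mn built from a small
    set of inducing patients indexed by inducing_idx."""
    m = [subsets[i] for i in inducing_idx]
    K_mm = np.array([[kernel(a, b) for b in m] for a in m])
    K_nm = np.array([[kernel(a, b) for b in m] for a in subsets])
    # a small jitter stabilizes the factorization of the m x m block
    L = np.linalg.cholesky(K_mm + 1e-8 * np.eye(len(m)))
    W = np.linalg.solve(L, K_nm.T)  # W.T @ W equals K_nm K_mm^{-1} K_mn
    return W.T @ W
```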

Second, every element of the kernel matrix is a similarity measure between a pair of patients. This can be used as an input to algorithms for unsupervised learning with ICD codes, such as clustering of subsets, data visualization, and dimension reduction [18, Chapters 6, 8] (a brief sketch follows below). Finally, we are exploring the application of a product partition model with regression on diagnoses, where the similarity measure on diagnoses is defined using the kernel matrix.
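A brief sketch of feeding a precomputed patient-by-patient kernel matrix into off-the-shelf unsupervised learners is shown below; scikit-learn is used for illustration, and the choices of six clusters and two components are arbitrary.

```python
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import KernelPCA


def unsupervised_views(K, n_clusters=6, n_components=2):
    """Cluster patients and embed them in two dimensions using a
    precomputed kernel (similarity) matrix K."""
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(K)
    embedding = KernelPCA(n_components=n_components,
                          kernel="precomputed").fit_transform(K)
    return labels, embedding
```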

References

[1] S. Banerjee, B. P. Carlin, and A. E. Gelfand. Hierarchical Modeling and Analysis for Spatial Data.

[2] D. L. Vetrano, G. Onder, L. Gimeno-Feliu, C. Coscollar-Santaliestra, A. Pisciotta, S. Angleman, R. Melis, and G. Santoni. Assessing and measuring chronic multimorbidity in the older population: a proposal for its operationalization.

[3] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent.


