Statistics Question

Suppose I conduct a survey of 10 people, asking them to rate a movie from 0 to 4 stars. Allowable answers are 0, 1, 2, 3, and 4. The mean is 2.0 stars. How do I calculate the certainty (or uncertainty) about this 2.0 star rating? Ideally, I would like a number between 0 and 1, where 0 represents complete uncertainty and 1 represents complete certainty. It seems clear that the case where the 10 people choose (2, 2, 2, 2, 2, 2, 2, 2, 2, 2) would be the most certain, while the case where the
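One hedged way to get such a number is to compare the spread of the answers with the largest spread the scale allows. The sketch below assumes (this is not stated in the question) that the least certain case is an even split between 0 and 4 stars, whose population standard deviation is 2.0. The function name `certainty` and the scaling are illustrative choices, not an established statistic.

```python
from statistics import pstdev


def certainty(ratings, lo=0, hi=4):
    """Illustrative agreement score in [0, 1]: 1 - sd / (largest possible sd)."""
    # On a scale bounded by lo and hi, the population standard deviation
    # cannot exceed (hi - lo) / 2, reached by an even split between the ends.
    max_sd = (hi - lo) / 2
    return 1 - pstdev(ratings) / max_sd


unanimous = certainty([2] * 10)            # everyone answers 2
polarized = certainty([0] * 5 + [4] * 5)   # even split between 0 and 4
print(unanimous, polarized)
```

Both example lists have a mean of 2.0, so the score separates cases that the mean alone cannot.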

Statistics How do I evaluate the effectiveness of an algorithm that predicts probabilities?

I need to evaluate the effectiveness of algorithms that predict the probability of something occurring. My current approach is to use "root mean squared error", i.e., the square root of the mean of the squared errors, where the error is 1.0 - prediction if the event occurred, or prediction if the event did not occur. The algorithms have no specific applications, but a common one will be to come up with a prediction of an event occurring for each of a variety of options, and then selecting the opt
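The error definition described above can be sketched as follows. Coding the outcome as 1 (occurred) or 0 (did not occur) makes both cases collapse to `outcome - prediction`; the mean of these squared errors, before the square root, is commonly called the Brier score. The function name `rmse` and the sample numbers are made up for illustration.

```python
import math


def rmse(predictions, outcomes):
    """Root mean squared error of probability forecasts.

    predictions: probabilities in [0, 1]
    outcomes:    1 if the event occurred, 0 if it did not
    """
    if len(predictions) != len(outcomes) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    # |outcome - prediction| is 1 - p when the event occurred and p otherwise.
    squared = [(o - p) ** 2 for p, o in zip(predictions, outcomes)]
    return math.sqrt(sum(squared) / len(squared))


score = rmse([0.9, 0.2, 0.6], [1, 0, 0])
print(score)
```

Lower is better: a perfect forecaster scores 0, and one that always predicts 0.5 scores 0.5.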

Statistics Entropy and Information Gain

Simple question I hope. If I have a set of data like this:

Classification  attribute-1  attribute-2
Correct         dog          dog
Correct         dog          dog
Wrong           dog          cat
Correct         cat          cat
Wrong           cat          dog
Wrong           cat          dog

Then what is the information gain of attribute-2 relative to attribute-1? I've computed the entropy of the whole data set:

-(3/6)log2(3/6)-(3/6)log2(3/6)=1

Then I'm stuck! I think you need to
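As a sketch of the usual next step, the code below computes the information gain of each attribute with respect to the Classification column: the entropy of the whole set minus the size-weighted entropy of the subsets obtained by splitting on that attribute. Reading "information gain" this way is an assumption about what the question intends; the helper names are illustrative.

```python
from collections import Counter
from math import log2

# (classification, attribute-1, attribute-2), copied from the question
rows = [
    ("Correct", "dog", "dog"),
    ("Correct", "dog", "dog"),
    ("Wrong", "dog", "cat"),
    ("Correct", "cat", "cat"),
    ("Wrong", "cat", "dog"),
    ("Wrong", "cat", "dog"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr_index, label_index=0):
    """Entropy of all labels minus the weighted entropy after the split."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[label_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[label_index] for row in rows]) - remainder


gain_attr1 = information_gain(rows, 1)
gain_attr2 = information_gain(rows, 2)
print(gain_attr1, gain_attr2)
```

For these six rows, splitting on attribute-1 gives a gain of about 0.082 bits, while splitting on attribute-2 gives essentially 0, because both of its subsets are still half Correct and half Wrong.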

Statistics DistributionFitTest[] for custom distributions in Mathematica

I have PDFs and CDFs for two custom distributions, a means of generating RandomVariates for each, and code for fitting parameters to data. Some of this code I've posted previously at: Calculating expectation for a custom distribution in Mathematica Some of it follows: nlDist /: PDF[nlDist[alpha_, beta_, mu_, sigma_], x_] := (1/(2*(alpha + beta)))*alpha* beta*(E^(alpha*(mu + (alpha*sigma^2)/2 - x))* Erfc[(mu + alpha*sigma^2 - x)/(Sqrt[2]*sigma)] + E^(beta*(-mu + (beta*sig

Statistics Calculating accuracy of a predicted value

I have a multi-layer neural-network-based estimator that takes as input the past arrival times of vehicles and estimates the arrival time of the next vehicle (with a backpropagation algorithm). Based on a certain threshold (e.g., 10 sec), the estimator classifies the predicted time as high or low (1 or 0). My problem is: based on the observed and predicted/estimated arrival times (1's & 0's), how do I calculate the accuracy (or the correct prediction rate) of the overall prediction?
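A minimal sketch of the correct prediction rate, assuming the observed and predicted classes are available as two equal-length lists of 0s and 1s (the sample lists are made up):

```python
def accuracy(observed, predicted):
    """Fraction of positions where the predicted class equals the observed one."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


observed = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
rate = accuracy(observed, predicted)
print(rate)
```

If one class is much rarer than the other, this single rate can look good even for a poor estimator, so it may be worth counting the matches separately for the 1s and the 0s as well.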

Statistics How can I set "999" as the DEFAULT missing value in SPSS/PASW?

I'm importing a very large dataset into SPSS. Many fields in the dataset contain a "999" value, indicating a missing value. I want to instruct SPSS to view them as such. However, by default each variable in SPSS is set to have "no missing values". In variable view, you have to define "999" as the "discrete missing value" for each variable. With hundreds of variables, though, this is a lot of work. Therefore: is there a way to define "discrete missing value 999" as the default missing valu

Statistics Estimating a distribution's parameters from only the data mean and std. dev

I need to estimate the parameters (shape, scale) of a truncated gamma distribution, but I only know the mean and std. dev. of the data; I do not have the data set itself. Given the mean and std. dev. of a data set from a truncated gamma distribution, how do I find the shape and scale parameters of the distribution? I know MLE may be useful for solving this problem, but it depends on knowing the whole data set. Any help would be appreciated.

Statistics A non-parametric test in SAS

I have a small dataset consisting of three distinct observations on each of three variables, say x1, x2, x3, and the accompanying response y, on which I would like to perform Analysis of Variance to test whether the means are equal. data anova; input var obs resp; cards; 1 1 1.1 1 2 .5 1 3 -2.1 2 1 4.2 2 2 3.7 2 3 .8 3 1 3.2 3 2 2.8 3 3 6.3 ; proc anova data=anova; class var; model resp=var; run; All good so far. Now however, I would like to use a permutation test to check the p-value of the

Statistics Logistic regression - changes in the deviance

I'm reading about logistic regression and I came across a phrase that I can't understand. The sentence is as follows (from the book Introductory Statistics with R by Peter Dalgaard): "Changes in the deviance caused by a model reduction will be approximately Chi-squared distributed with degrees of freedom equal to the change in the number of parameters" Could someone explain this phrase to me? To calculate this change I use the Probability density function or the Cumulative distribution fun

Statistics Statistical test for checking stationarity/non-stationarity of a categorical time series

I am searching for a statistical test for checking whether a categorical (nominal) time series is stationary or not. However, all the tests I have found so far (Dickey-Fuller, Priestley-Subba Rao (PSR), Wavelet Spectrum) are for real values. Does someone know such a test for categorical data? E.g., series = ['dog','cat', 'mouse',....,'dog','rabbit','cat'....] is an example of the series I deal with. Thanks

Statistics Weighted average by category filtered by time period

I've got a headache which I would love some help with. I have the following table:

Table 1:
Date        Hour  Volume  Value  Price
10/09/2018  1     10      400    40.0
10/09/2018  2     80      200    2.5
10/09/2018  3     14      190    13.6
10/09/2018  4     74      140    1.9
11/09/2018  1     34      547    16.1
11/09/2018  2     26      849    32.7
11/09/2018  3     95      279    2.9
11/09/2018  4     31      216    7.0

Then what I want to do is view

Statistics What is the difference between drift and trend in a time series?

I am analyzing a time series using the unit root test. I am stuck in that I do not understand the difference between trend and drift in the time series. Is it correct to say that trend is a feature of a time series, whose average changes over time and that drift is a feature of a time series, whose variance changes over time?

Statistics How to extract comparable results from TukeyHSD test (after two-way ANOVA) in R?

I did a two-way ANOVA test and ran a post-hoc Tukey test in R. I also extracted the significant rows from the post-hoc test. My question is: is there a way to select only the rows that are comparable (at least one of the two independent variables matches)? Using my data as an example, I am comparing differences between months and sites. Detected significance is only meaningful if one of the two IVs remains the same. Hence, even if a significant difference is detected between two sites at two diffe

Statistics Confidence Distribution Definition - Modern Definition: Uniform Condition Requirement

I am currently working on the concept of a confidence distribution and, based on Wikipedia, the modern definition has two conditions. The second condition states: "The true parameter value θ = θ0, Hn(θ0) ≡ Hn(Xn, θ0), as a function of the sample Xn, follows the uniform distribution U[0, 1]" I am not sure what that means, or how I can check whether the distribution is uniform over the sample in this case.

Statistics Is SRM in Google Optimize (Bayesian Model) a thing?

So checking for Sample Ratio Mismatch is good for data quality. But in Google Optimize I can't influence the sample size or do anything about it. My problem is that, out of 15 A/B tests, only 2 experiments had no SRM. (I used this tool.) On the other hand, the Bayesian model deals with things like different sample sizes, so I don't need to worry about them, but opinions on this topic differ. Is SRM really a problem in Google Optimize or can I ignor

Statistics Calculating the sum of individual squared deviations in SAS

I'm trying to calculate the individual squared deviations to perform some calculations based on these values. I have the following dataset: data have; input testid $ level $ values; datalines; HITT1D LC1 0.45 HITT1D LC1 0.49 HITT1D LC1 0.47 HITT1D LC2 0.43 HITT1D LC2 0.39 HITT1D LC2 0.42 HITT1D LC3 0.66 HITT1D LC3 0.63 HITT1D LC3 0.64 HBEF5D LC1 0.45 HBEF5D LC1 0.49 HBEF5D LC1 0.47 HBEF5D LC2 0.43 HBEF5D LC2 0.39 HBEF5D LC2 0.42 HBEF5D

Statistics Programmatically Create Survival Curves

My team created a web application (a research database for a cancer center) that I am using. I am wondering if anyone has an idea about drawing survival curves programmatically. I searched everywhere and couldn't find anything.

Statistics Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile). Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints. What would you recommend? Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints. In that case, can you think of something that would work well enough?

Statistics How to give weightage to values while calculating similarities/dissimilarities

If I have the following data: Empid Salary Age Experience 1 25000 24 4 2 40000 27 5 3 55000 32 7 4 27000 25 5 5 53000 30 5 and if I normalize all the above values using Min-Max normalization technique so that all values lie between 0 and 1 to get the following normalized data: Empid Salary Age Experience 1 0.0000333 0.1000000 0.2000000 2 0.5000000 0.4000000 0.4000000 3 1.0000000 0.9000000 0.8000000

Statistics Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase. However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period. One precaution I took was: When multiple stocks can be bought on the same day the GA

Statistics Statistical average as the number of items averaged increases, regardless of their value

A statistics question (I think) for calculating potential profits as I increase the number of products I sell: If I have 5 individual products that I know from historical data will sell an average of 10 each per day, can I count on that average staying the same if I increase my product number to 10 items? Or should I assume more items will force my average down, all else being equal? What about if the number of products increases to 50, or 500? Statistically speaking, what should happen to th

Statistics Modified Cox method in calculating confidence intervals for Log-Normal distribution

As titled, I want to find confidence intervals for something Log-Normally distributed. I found this paper online that suggests the use of t instead of the usual z (Section 3.4) for better coverage (Modified Cox method) when the number of observations is small. However, I do not understand how the degrees of freedom for t are selected. The paper says: t, with degrees of freedom based on the d.f. for the estimate of (variance). I'm unable to figure out how the author ended up with a degree of free

Statistics How to find the distribution of values in a dataset and generate random values based on this distribution?

I have a dataset of 100 cases. Each case has a class {I,II,III,IV,V} and values A and V; each class appears exactly 20 times in the dataset: Class A V 5 2 3 1 3 5 3 2 3 2 3 5 3 2 3 1 2 4 1 2 4 1 4 4 2 3 3 2 3 4 I want to generate another 100 cases based on this set. Am I correct in assuming that I should find the distribution of A and the distribution of V per class? calculate the joint distribution of A &

Statistics Multiple linear regression with missing covariates

Imagine I have a dataset like df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8)) df y x1 x2 1 11 23 NA 2 12 NA 9 3 13 27 2 4 14 20 9 5 15 20 7 6 16 21 8 If I perform a multiple linear regression, I get m <- lm(y~x1+x2, data=df) summary(m) Call: lm(formula = y ~ x1 + x2, data = df) Residuals: 3 4 5 6 -1.744e-01 -1.047e+00 -4.233e-16 1.221e+00 Coefficients: Estimate Std. Error t value Pr(>|t|) (Inter

Statistics Calculation of variance function equation

I have an error in this code: I want to calculate the variance of the values in the lists x1 and x2. Any recommendations? def my_var(L): s = 0 t = 0 u = 0 for i in range(0, len(L)): s += L[i] t = s/len(L) u += ((L[i]-t)*(L[i]-t)) return u / len(L) x1 = [1, 3, 4, -3, 8] x2 = [1, -4, 7, 2] v1 = my_var(x1) v2 = my_var(x2) print(v1) print(v2)
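The original indentation is lost in the excerpt, so the exact bug cannot be pinned down, but one likely cause is that the mean is updated inside the same loop that accumulates the squared deviations, so each deviation uses a partial mean. A minimal corrected sketch, assuming the population variance (dividing by n, as the question's code does) is what is wanted:

```python
def my_var(values):
    """Population variance: mean of squared deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty list is undefined")
    # First pass: the mean must be final before any deviation is computed.
    mean = sum(values) / n
    # Second pass: accumulate squared deviations from that mean.
    return sum((x - mean) ** 2 for x in values) / n


x1 = [1, 3, 4, -3, 8]
x2 = [1, -4, 7, 2]
v1 = my_var(x1)
v2 = my_var(x2)
print(v1)
print(v2)
```

For the sample variance, divide by `n - 1` instead; the standard library's `statistics.pvariance` and `statistics.variance` cover both cases.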

Statistics What software with a good GUI can do normalization of a large matrix of spectroscopy data?

What software with a good GUI can do normalization of a large matrix of spectroscopy data? I have done that in Python code and tried StatsDirect. What other software has implemented algorithms to perform data normalization on Raman spectroscopy data or a similar 300x5000 data table? Is there any way to do it using SPSS, STATA, MiniTab, Statistica, or SAS? Thanks.

Statistics Is the Akaike information criterion (AIC) unit-dependent?

One formula for AIC is: AIC = 2k + n*Log(RSS/n) Intuitively, if you add a parameter to your model, your AIC will decrease (and hence you should keep the parameter), if the increase in the 2k term due to the new parameter is offset by the decrease in the n*Log(RSS/n) term due to the decreased residual sum of squares. But isn't this RSS value unit-specific? So if I'm modeling money, and my units are in millions of dollars, the change in RSS with adding a parameter might be very small, and won't
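A small numerical check of the concern, using the formula quoted above. Changing units multiplies every residual by a constant c, so RSS is multiplied by c^2 and every model's AIC shifts by the same amount, n*Log(c^2); differences between models fitted to the same data are therefore unchanged. The RSS values, n, and parameter counts below are hypothetical.

```python
import math


def aic(k, n, rss):
    """AIC = 2k + n*Log(RSS/n), as quoted in the question."""
    return 2 * k + n * math.log(rss / n)


n = 50
rss_small_model = 4.0e12   # hypothetical RSS in dollars^2, k = 2 parameters
rss_big_model = 3.5e12     # hypothetical RSS in dollars^2, k = 3 parameters

# Dollars -> millions of dollars: residuals scale by c, RSS by c**2.
c = 1e-6

diff_dollars = aic(3, n, rss_big_model) - aic(2, n, rss_small_model)
diff_millions = (aic(3, n, rss_big_model * c**2)
                 - aic(2, n, rss_small_model * c**2))
print(diff_dollars, diff_millions)
```

The absolute AIC values differ between the two unit systems, but the comparison depends only on the ratio of the two RSS values, which has no units.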

Statistics How to calculate error in pre/post condition regression analysis?

I'm comparing machinery upgrade data and evaluating the performance gain after a machinery upgrade. I have data (GT Power, ST Power) for both pre and post conditions. These are my x = GT Power and y = ST Power variables. The question I am asking is: after the upgrade, at the same GT Power (independent variable), how much did the ST Power (dependent variable) increase? Having a regression analysis as shown in the table below, I can take the difference of the pre and post lines for one full year. From there I

Statistics Idiomatic Mode function in Clojure

I'm learning Clojure and would like some advice on idiomatic usage. As part of a small statistics package, I have a function to calculate the mode of a set of data. (Background: The mode is the most common value in a set of data. There are almost a dozen published algorithms to calculate it. The one used here is from "Fundamentals of Biostatistics" 6th Ed by Bernard Rosner.) (defn tally-map " Create a map where the keys are all of the unique elements in the input sequence and the values rep

Statistics Calculating quartiles of distributed data

Not entirely sure if this is an appropriate forum for this. I have a small database cluster (4 boxes); each machine has a shard of the overall dataset. I need to calculate quartiles for a specific data point, but I need to do it without ever having access to the entire dataset at once. Is this even possible? Edit: I would prefer the exact answer, but a reasonable approximation would probably work as well.

Statistics Calculating confidence while doing classification

I am using a Naive Bayes algorithm to predict movie ratings as positive or negative. I have been able to rate movies with 81% accuracy. I am, however, trying to assign a 'confidence level' for each of the ratings as well. I am trying to identify how I can tell the user something like "we think that the review is positive with 80% confidence". Can someone help me understand how I can calculate a confidence level to our classification result?
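One common reading of "confidence" here is the posterior probability of the predicted class. A hedged sketch, assuming the classifier already produces a log score per class (log prior plus the sum of log likelihoods); the function name and the sample scores are hypothetical:

```python
import math


def posterior(log_scores):
    """Normalize per-class log scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating to avoid underflow when
    # the log scores are very negative, as they are for long reviews.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical log scores for one review
probs = posterior({"positive": -104.2, "negative": -105.6})
label = max(probs, key=probs.get)
print(label, probs[label])
```

Because Naive Bayes treats features as independent, these probabilities tend to be pushed toward 0 or 1, so it is worth checking them against held-out data before reporting them as percentages.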

Ehcache statistics with Spring Boot

I have a Spring Boot application with Ehcache configured as below: @Bean public EhCacheManagerFactoryBean ehCacheManagerFactoryBean() { EhCacheManagerFactoryBean ehCacheManagerFactoryBean = new EhCacheManagerFactoryBean(); ehCacheManagerFactoryBean.setConfigLocation(new ClassPathResource("ehcache.xml")); //ehCacheManagerFactoryBean.setCacheManagerName("messageCache"); ehCacheManagerFactoryBean.setShared(true); return ehCacheManagerFactoryBean; } @Bean public EhCacheCacheManager cacheMan

Statistics PSPP multi-response questions

We are totally new to PSPP and have a question. We have imported data from LimeSurvey, and that worked fine, but we have multiple-response questions in the survey. After a lot of Googling we have found details about creating MRSets but do not understand how to go about creating them. Can anyone point us in the right direction? Rich

Statistics If I want to conduct analysis using a quartile, what is the correct way to say this?

Do I "use a quartile" or do I "use quartiles"? Is "use" even the correct word? Do I need to be more wordy and say "use quartile summary statistics?" Should I say "use quartile analysis" or something like that? EDIT: We have a large team of reviewers reviewing a large group of applicants. Each applicant will be ranked by each reviewer using a number scale. When completed we will use a quartile to analyze the ranking.

Statistics How do I use request latency as a statistic in Spinnaker automated canary analysis?

I'm using Spinnaker on Google Kubernetes Engine with Stackdriver as a log and statistics source. I'm having trouble getting timing/distribution metrics working with the Spinnaker automated canary analysis setup. My application logs request information for each API request coming in. They include all the request details so I can create log-based stats. A fairly truncated log entry looks like this: { "jsonPayload": { "format_parameters": { "ElapsedMilliseconds": "4.9541", "Met

Statistics Running an analysis separately for every category of a variable

I want to recreate the following table using my own dataset. Note that there is a p-value for every row (in this case every medication), based on Fisher's exact test. I take this to mean the p-value for the association between the individual medication and BPD. When I try to run Fisher's exact test in SPSS, I get a p-value for the whole table. Assume I have a dataset showing all my patients, with a column on whether they received a specific drug coded as yes/no, and a column on whether th

Statistics Tuning bandwidth and finding the covariance(A,B)

I have 2 questions. I want to find the covariance of two matrices A and B. Before standardizing them, cov(A,B) looks like this (a 10*10 diagonal matrix where the main diagonal is about 1.25): Columns 1 through 9 1.2510 -0.0024 -0.0024 0.0015 -0.0034 -0.0002 -0.0002 0.0009 0.0008 -0.0024 1.2494 -0.0033 0.0021 -0.0048 -0.0003 -0.0002 0.0012 0.0012 -0.0024 -0.0033 1.2495 0.0021 -0.0047 -0.0003 -0.0002 0.0012 0.0011 0.0015

Statistics stata Calculating probabilities from a T-distribution

My greatest nemesis is understanding the Stata commands needed to manipulate the data to solve each question. I am working with Stata to solve this set of questions: Calculating probabilities from a T-distribution. Use the Stata functions “display”, “t”, “ttail”, “invt” and “invttail” to answer the following questions. For each question, use a T-distribution with degrees of freedom equal to 16. In your answers, give the full stata command you use, as well as the numerical answer. NOTES: stata

Statistics How can I write the probability density function of the generalized exponential distribution as an exponential family?

I want to use the GAM method and the generalized exponential distribution for my project. I know the GAM method is a generalization of the GLM method and that the distribution of the response variable must be in the exponential family. The probability density function (pdf) of the generalized exponential distribution is as follows: f(x; alpha, eta) = alpha * eta * exp(-eta*x) * ((1-exp(-eta*x))^(alpha-1)) The CDF of this distribution is as follows: F(x; alpha, eta) = (1-exp(-eta*x))^alpha The alpha is shape parameter and the eta is scale para

Statistics Performance vs weather Regression analysis

I am pretty new to statistics and trying to get my head around the best way to analyse the performance data against weather data. My hypothesis is that the performance data is affected by the weather data and I want to prove that. I tried to plot the daily performance data & meantemp data on a scatterplot to do regression analysis and it looks odd. I think this is due to negative values in the weather data. Below is all the weather information that is available to me: fog,

Statistics How to create a window and find the mode for the created window in pandas (aggregate, window, and find mode)

I have a data frame (df) like this:

cpt    units
36430  2
36440  5
36450  10
36430  1
36440  5
36450  10
36430  2
36440  6
36450  11

I need to add a new column (called mode) to the above data frame after aggregating on the column (cpt), windowing, and finding the mode of the column (units). I expect to see something like this:

cpt    units  mode
36430  2      2
36440  5      5
36450  10     10
36430  1      2
36440  5      5
36450  10     10
36430  2      2
36440  6      5
36450  11     10

I have tried with following code d
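The expected column can be reproduced with a dependency-free sketch: group the units by cpt, take the most frequent value in each group, then map it back onto every row. This uses plain Python rather than pandas so it stays self-contained; in pandas the same idea would typically be a `groupby` on `cpt` followed by `transform`, which is an assumption about the intended approach rather than the asker's code.

```python
from collections import Counter, defaultdict

# Columns copied from the question
cpt = [36430, 36440, 36450, 36430, 36440, 36450, 36430, 36440, 36450]
units = [2, 5, 10, 1, 5, 10, 2, 6, 11]

# Collect the units observed for each cpt code.
units_by_cpt = defaultdict(list)
for code, value in zip(cpt, units):
    units_by_cpt[code].append(value)

# Most frequent value per cpt; on a tie, the value seen first wins.
mode_by_cpt = {code: Counter(values).most_common(1)[0][0]
               for code, values in units_by_cpt.items()}

# Broadcast each group's mode back to the original row order.
mode = [mode_by_cpt[code] for code in cpt]
print(mode)
```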
