Suppose I survey 10 people, asking each to rate a movie from 0 to 4 stars. Allowable answers are 0, 1, 2, 3, and 4.
The mean is 2.0 stars.
How do I calculate the certainty (or uncertainty) about this 2.0 star rating? Ideally, I would like a number between 0 and 1, where 0 represents complete uncertainty and 1 represents complete certainty.
It seems clear that the case where the 10 people choose ( 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ) would be the most certain, while the case where the

I need to evaluate the effectiveness of algorithms which predict the probability of something occurring.
My current approach is to use "root mean squared error", i.e. the square root of the mean of the squared errors, where the error is 1.0 - prediction if the event occurred, or prediction if the event did not occur.
The algorithms have no specific applications, but a common one will be to come up with a prediction of an event occurring for each of a variety of options, and then selecting the opt

Simple question I hope.
If I have a set of data like this:
Classification attribute-1 attribute-2
Correct dog dog
Correct dog dog
Wrong dog cat
Correct cat cat
Wrong cat dog
Wrong cat dog
Then what is the information gain of attribute-2 relative to attribute-1?
I've computed the entropy of the whole data set: -(3/6)log2(3/6)-(3/6)log2(3/6)=1
Then I'm stuck! I think you need to

I have PDFs and CDFs for two custom distributions, a means of generating RandomVariates for each, and code for fitting parameters to data. Some of this code I've posted previously at:
Calculating expectation for a custom distribution in Mathematica
Some of it follows:
nlDist /: PDF[nlDist[alpha_, beta_, mu_, sigma_],
x_] := (1/(2*(alpha + beta)))*alpha*
beta*(E^(alpha*(mu + (alpha*sigma^2)/2 - x))*
Erfc[(mu + alpha*sigma^2 - x)/(Sqrt[2]*sigma)] +
E^(beta*(-mu + (beta*sig

I have a multi-layer neural network based estimator (trained with backpropagation) that takes the past arrival times of vehicles as inputs and estimates the arrival time of the next vehicle. Based on a certain threshold (e.g., 10 sec), the estimator classifies the predicted time as high or low (1 or 0). My problem is: given the observed and predicted arrival times (as 1s and 0s), how do I calculate the accuracy (the correct prediction rate) of the overall prediction?
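One way to compute the correct prediction rate asked about here is to count the positions where the observed and predicted labels agree and divide by the total count. The following is a minimal sketch, not a definitive implementation: the names `to_label` and `accuracy` and the sample arrival times are assumptions made for illustration, and the 10-second threshold is taken from the example in the question.

```python
def to_label(arrival_time, threshold=10.0):
    """Classify an arrival time as high (1) or low (0) relative to a threshold."""
    return 1 if arrival_time > threshold else 0


def accuracy(observed, predicted):
    """Fraction of positions where the observed and predicted labels agree."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("need at least one observation")
    matches = sum(1 for o, p in zip(observed, predicted) if o == p)
    return matches / len(observed)


# Hypothetical arrival times in seconds (illustrative values only).
observed_times = [8.0, 12.5, 9.1, 15.0, 11.2]
predicted_times = [7.5, 11.0, 10.4, 14.2, 9.8]

observed_labels = [to_label(t) for t in observed_times]
predicted_labels = [to_label(t) for t in predicted_times]
print(accuracy(observed_labels, predicted_labels))
```

If one class is much more common than the other, plain accuracy can look good even for a poor classifier, so it may be worth also counting the four cells of the confusion matrix separately.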

I'm importing a very large dataset into SPSS. Many fields in the dataset contain a "999" value, indicating a missing value. I want to instruct SPSS to treat them as such. However, by default each variable in SPSS is set to "no missing values". In variable view, you have to define "999" as the "discrete missing value" for each variable. With hundreds of variables, though, this is a lot of work.
Therefore: is there a way to define "discrete missing value 999" as the default missing valu

I have run the following commands in SPSS, but they produce an error:
STRING NSAL(A8).
IF(EDU>12 AND GENDER='M') RECODE SAL (0 THRU 75000='A') (75001 THRU HI='B') INTO NSAL.
EXECUTE.
Where have I made a mistake?

I need to estimate the parameters (shape, scale) of a truncated gamma distribution.
However, I only know the mean and standard deviation of the data; I do not have the data set itself.
Given the mean and standard deviation of a data set drawn from a truncated gamma
distribution, how do I find the shape and scale parameters of the distribution?
I know MLE may be useful for solving this problem, but it depends on
having the whole data set.
Any help would be appreciated.

I have a small dataset consisting of three distinct observations on each of three variables, say x1, x2, x3, and the accompanying response y, on which I would like to perform an analysis of variance to test whether the means are equal.
data anova;
input var obs resp;
cards;
1 1 1.1
1 2 .5
1 3 -2.1
2 1 4.2
2 2 3.7
2 3 .8
3 1 3.2
3 2 2.8
3 3 6.3
;
proc anova data=anova;
class var;
model resp=var;
run;
All good so far. Now however, I would like to use a permutation test to check the p-value of the

I'm reading about logistic regression and I came across a sentence that I can't understand. It is as follows (from the book Introductory Statistics with R, by Peter Dalgaard):
"Changes in the deviance caused by a model reduction will be approximately Chi-squared distributed with degrees of freedom equal to the change in the number of parameters"
Could someone explain this sentence to me? To calculate this change I use the Probability density function or the Cumulative distribution fun

I am searching for a statistical test to check whether a categorical (nominal) time series is stationary. However, all the tests I have found so far (Dickey-Fuller, Priestley-Subba Rao (PSR), Wavelet Spectrum) are for real values. Does anyone know of such a test for categorical data?
For example, series = ['dog','cat', 'mouse',....,'dog','rabbit','cat'....] is the kind of series I deal with.
Thanks

I've got a headache which I would love some help with
So I have the following table:
Table 1:
Date Hour Volume Value Price
10/09/2018 1 10 400 40.0
10/09/2018 2 80 200 2.5
10/09/2018 3 14 190 13.6
10/09/2018 4 74 140 1.9
11/09/2018 1 34 547 16.1
11/09/2018 2 26 849 32.7
11/09/2018 3 95 279 2.9
11/09/2018 4 31 216 7.0
Then what I want to do is view

I am analyzing a time series using the unit root test.
I am stuck in that I do not understand the difference between trend and drift in the time series.
Is it correct to say that trend is a feature of a time series, whose average changes over time and that drift is a feature of a time series, whose variance changes over time?

Tags: Statistics
anova, posthoc, statistical-test, tukey
I did a two-way ANOVA test and ran a post-hoc Tukey test in R. I also extracted the significant rows from the post-hoc test.
My question is: is there a way to select only the rows that are comparable (at least one of the two independent variables matches)?
Using my data as an example, I am comparing differences between months and sites. Detected significance is only meaningful if one of the two IV's remains the same. Hence, even significant difference is detected between two sites at two diffe

Tags: Statistics
bayesian, confidence-interval, uniform, hypothesis-test
I am currently working on the concept of a confidence distribution. According to the modern definition on Wiki, there are two conditions. The second condition states: "The true parameter value θ = θ0, Hn(θ0) ≡ Hn(Xn, θ0), as a function of the sample Xn, follows the uniform distribution U[0, 1]"
I am not sure what that means, or how I can check whether the distribution is uniform over the sample in this case.

I'm trying to calculate the standard error of the log rate ratio (equals exp{3.5}) using the information from the R regression summary table. Thanks!

Tags: Statistics
bayesian, ab-testing, google-optimize
Checking for Sample Ratio Mismatch (SRM) is good for data quality.
But in Google Optimize I can't influence the sample size or do anything to counter it.
My problem is that out of 15 A/B tests, only 2 experiments had no SRM.
(I used this tool: https://www.lukasvermeer.nl/srm/microsite/)
On the other hand, the Bayesian model deals with things like different sample sizes, so I shouldn't need to worry about it, but opinions on this topic differ.
Is SRM really a problem in Google Optimize or can I ignor

I'm trying to calculate the individual squared deviations to perform some calculations based on these values.
I have the following dataset:
data have;
input testid $ level $ values;
datalines;
HITT1D LC1 0.45
HITT1D LC1 0.49
HITT1D LC1 0.47
HITT1D LC2 0.43
HITT1D LC2 0.39
HITT1D LC2 0.42
HITT1D LC3 0.66
HITT1D LC3 0.63
HITT1D LC3 0.64
HBEF5D LC1 0.45
HBEF5D LC1 0.49
HBEF5D LC1 0.47
HBEF5D LC2 0.43
HBEF5D LC2 0.39
HBEF5D LC2 0.42
HBEF5D

I am using VB.NET / ASP.NET.
My team created a web application (a research database for a cancer center).
I am wondering if anyone has an idea about drawing survival curves programmatically.
I searched everywhere and couldn't find anything.

How do I compute the generalized mean for extreme values of p (very close to 0, or very large) with reasonable computational error?
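One common way to keep the power mean M_p = ((1/n) Σ x_i^p)^(1/p) numerically stable is to work in log space: factor out the largest exponent so nothing overflows, and use `expm1`/`log1p` so that small values of p do not suffer from cancellation. The following is a minimal sketch under the assumption that all inputs are strictly positive; the function name `generalized_mean` is hypothetical, and p = 0 and p = ±∞ are handled as their limits (geometric mean, max, min).

```python
import math


def generalized_mean(values, p):
    """Power mean of strictly positive values, evaluated in log space."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    if math.isinf(p):
        # Limits p -> +inf and p -> -inf are the maximum and the minimum.
        return max(values) if p > 0 else min(values)
    logs = [math.log(v) for v in values]
    n = len(logs)
    if p == 0:
        # Limit p -> 0 is the geometric mean.
        return math.exp(sum(logs) / n)
    scaled = [p * l for l in logs]
    m = max(scaled)
    # mean(exp(s)) = exp(m) * (1 + mean(expm1(s - m))); every s - m <= 0,
    # so exp cannot overflow, and expm1/log1p stay accurate for tiny |p|.
    correction = math.log1p(sum(math.expm1(s - m) for s in scaled) / n)
    return math.exp((m + correction) / p)


print(generalized_mean([1.0, 2.0, 3.0], 1))     # arithmetic mean
print(generalized_mean([1.0, 2.0, 4.0], -1))    # harmonic mean
print(generalized_mean([1.0, 4.0], 1e-9))       # close to the geometric mean
print(generalized_mean([1.0, 2.0, 3.0], 1e6))   # close to the maximum
```

The direct formula would overflow for something like `[1e300, 1e300]` with p = 1000, whereas the log-space version only ever exponentiates non-positive numbers before the final step.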

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
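One approach that may work well enough is reservoir sampling: keep a fixed-size uniform random sample of the stream and compute the quantile from that sample. This is only a sketch of one option (streaming estimators such as the P² algorithm are alternatives); the class name `ReservoirQuantile`, the default capacity, and the linear interpolation rule are assumptions made for illustration, and the interpolation rule may differ from the default used by `Quantile`.

```python
import random


class ReservoirQuantile:
    """Approximate a quantile of a stream from a fixed-size uniform sample."""

    def __init__(self, capacity=1000, seed=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def update(self, x):
        """Process one new data point (Algorithm R reservoir sampling)."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Keep x with probability capacity / seen, replacing a random slot.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        """Linearly interpolated q-quantile of the retained sample."""
        if not self.sample:
            raise ValueError("no data seen yet")
        ordered = sorted(self.sample)
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        return ordered[lo] + frac * (ordered[hi] - ordered[lo])


est = ReservoirQuantile(capacity=100, seed=0)
for x in range(1, 12):
    est.update(x)
print(est.quantile(0.5))
```

While the stream is shorter than the capacity, the result is exact for this interpolation rule; beyond that, memory stays bounded and the estimate becomes approximate, with accuracy depending on the capacity.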

I'm building an OLAP Analysis with Pentaho's BI Suite (Community Edition). Many of my measures are standard deviations of the variables in my fact tables.
Does anyone have a tip on how to define a Standard Deviation aggregation function in Schema Workbench? Lots of my jobs could benefit from it.
Thanks in advance!

By LPM I mean that the dependent variable is polychotomous (e.g. 1, 2, 3, 4) and NOT binary (1 or 0).
I know how to transform the coefficients manually by reverse calculating the PDF. Is there any command in SAS that would do it automatically?
If I start with, let's say,
proc logistic
OR
proc probit
how do I explain what happens if the independent variable changes by 1 unit?

Tags: Statistics
normalization, data-mining, recommendation-engine
If I have the following data:
Empid Salary Age Experience
1 25000 24 4
2 40000 27 5
3 55000 32 7
4 27000 25 5
5 53000 30 5
and if I normalize all the above values using Min-Max normalization technique so that all values lie between 0 and 1 to get the following normalized data:
Empid Salary Age Experience
1 0.0000333 0.1000000 0.2000000
2 0.5000000 0.4000000 0.4000000
3 1.0000000 0.9000000 0.8000000

Tags: Statistics
genetic-algorithm, prediction, generalization
I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to overfit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA

A statistics question (I think) for calculating potential profits as I increase the number of products I sell:
If I have 5 individual products that I know from historical data will sell an average of 10 each per day, can I count on that average staying the same if I increase my product number to 10 items? Or should I assume more items will force my average down, all else being equal?
What about if the number of products increases to 50, or 500? Statistically speaking, what should happen to th

Tags: Statistics
var, statsmodels, statistical-test, causality
I've got two overlapping time series, i.e. graphically there doesn't appear to be any lag at all, but when I construct a VAR model of the two columns, an unusually high lag (e.g. 10) is returned.
Could someone explain why this is?

As titled, I want to find confidence intervals for something Log-Normally distributed.
I found this paper online that suggests using t instead of the usual z (Section 3.4) for better coverage (the modified Cox method) when the number of observations is small. However, I do not understand how the degrees of freedom for t are selected. The paper says:
t, with degrees of freedom based on the d.f. for the estimate of (variance).
I'm unable to figure out how the author ended up with a degree of free

I have a dataset of 100 cases. Each case has a class {I,II,III,IV,V} and two values, A and V; each class appears exactly 20 times in the dataset:
Class A V
5 2 3
1 3 5
3 2 3
2 3 5
3 2 3
1 2 4
1 2 4
1 4 4
2 3 3
2 3 4
I want to generate another 100 cases based on this set. Am I correct in assuming that I should
find the distribution of A and the distribution of V per class?
calculate the joint distribution of A &

Let X1, X2,...,Xn be discrete random variables. I'm looking for a way to prove the random variables are independent but not identically distributed.
Can anyone suggest some ideas?

Imagine I have a dataset like
df <- data.frame(y=c(11:16), x1=c(23,NA,27,20,20,21), x2=c(NA,9,2,9,7,8))
df
y x1 x2
1 11 23 NA
2 12 NA 9
3 13 27 2
4 14 20 9
5 15 20 7
6 16 21 8
If I perform a multiple linear regression, I get
m <- lm(y~x1+x2, data=df)
summary(m)
Call:
lm(formula = y ~ x1 + x2, data = df)
Residuals:
3 4 5 6
-1.744e-01 -1.047e+00 -4.233e-16 1.221e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Inter

I have an error in this code. I want to calculate the variance of the values in each of the lists x1 and x2. Any recommendations?
def my_var(L):
    s = 0
    t = 0
    u = 0
    for i in range(0, len(L)):
        s += L[i]
        t = s/len(L)
        u += ((L[i]-t)*(L[i]-t))
    return u / len(L)
x1 = [1, 3, 4, -3, 8]
x2 = [1, -4, 7, 2]
v1 = my_var(x1)
v2 = my_var(x2)
print(v1)
print(v2)
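A likely cause of the wrong result, assuming the line that computes `t` runs inside the loop, is that the mean is recomputed from a partial sum on every iteration, so each deviation is measured against a running value rather than the final mean. The sketch below computes the mean first and then averages the squared deviations. It returns the population variance (dividing by n, as the original's final division does), and the name `my_var_fixed` is hypothetical; `statistics.pvariance` from the standard library is printed alongside for comparison.

```python
import statistics


def my_var_fixed(L):
    """Population variance: mean of squared deviations from the mean."""
    if not L:
        raise ValueError("need at least one value")
    mean = sum(L) / len(L)
    return sum((x - mean) ** 2 for x in L) / len(L)


x1 = [1, 3, 4, -3, 8]
x2 = [1, -4, 7, 2]
print(my_var_fixed(x1), statistics.pvariance(x1))
print(my_var_fixed(x2), statistics.pvariance(x2))
```

If the sample variance is wanted instead, the divisor would be `len(L) - 1`, which corresponds to `statistics.variance`.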

What software with a good GUI can normalize a large matrix of spectroscopy data? I have done this in Python code and tried StatsDirect. What other software implements algorithms for normalizing Raman spectroscopy data or a similar 300X5000 data table? Is there any way to do it using SPSS, STATA, MiniTab, Statistica, or SAS? Thanks.

One formula for AIC is:
AIC = 2k + n*Log(RSS/n)
Intuitively, if you add a parameter to your model, your AIC will decrease (and hence you should keep the parameter), if the increase in the 2k term due to the new parameter is offset by the decrease in the n*Log(RSS/n) term due to the decreased residual sum of squares. But isn't this RSS value unit-specific? So if I'm modeling money, and my units are in millions of dollars, the change in RSS with adding a parameter might be very small, and won't

I'm comparing machinery upgrade data and evaluating the performance gain after a machinery upgrade.
I have data (GT Power, ST Power) for both pre- and post-upgrade conditions. My variables are x = GT Power and y = ST Power.
My question is: after the upgrade, at the same GT Power (independent variable), how much did the ST Power (dependent variable) increase?
With a regression analysis as shown in the table below, I can take the difference between the pre and post lines for one full year. From there I

I have some social media (SoMe) data: the number of likes and a timestamp (when the post was published). Based on this, I want to determine what time of day is best to post to maximize the number of likes received.
What analysis can I do to determine this based only on the number of likes a post receives and the timestamp?
Is it possible to visualize this in a plot, maybe a boxplot?

I'm learning Clojure and would like some advice on idiomatic usage. As part of a small statistics package, I have a function to calculate the mode of a set of data. (Background: The mode is the most common value in a set of data. There are almost a dozen published algorithms to calculate it. The one used here is from "Fundamentals of Biostatistics" 6th Ed by Bernard Rosner.)
(defn tally-map
" Create a map where the keys are all of the unique elements in the input
sequence and the values rep

Not entirely sure if this is an appropriate forum for this.
I have a small database cluster (4 boxes); each machine has a shard of the overall dataset.
I need to calculate quartiles for a specific data point, but I need to do it without ever having access to the entire dataset at once.
Is this even possible?
Edit: I would prefer the exact answer, but a reasonable approximation would probably work as well.

I am using a Naive Bayes algorithm to predict movie ratings as positive or negative. I have been able to rate movies with 81% accuracy. I am, however, trying to assign a 'confidence level' for each of the ratings as well.
I am trying to work out how I can tell the user something like "we think that the review is positive with 80% confidence". Can someone help me understand how to calculate a confidence level for our classification result?
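One option is to report the posterior class probability that Naive Bayes already computes internally. For each class, the classifier has a log joint score, log P(class) + Σ log P(word | class); normalizing these scores across classes gives P(class | review). The following is a minimal sketch: the function name `posterior_from_log_scores` and the example scores are assumptions made for illustration, not output from any real model.

```python
import math


def posterior_from_log_scores(log_scores):
    """Normalize per-class log joint scores into posterior probabilities."""
    m = max(log_scores.values())
    # Subtract the max before exponentiating so long reviews do not underflow.
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Hypothetical log joint scores for one review (illustrative values only).
log_scores = {"positive": math.log(0.08), "negative": math.log(0.02)}
probs = posterior_from_log_scores(log_scores)
label = max(probs, key=probs.get)
print(label, probs[label])
```

Because of the independence assumption, Naive Bayes posteriors tend to be pushed toward 0 or 1, so it may be worth checking them against held-out data (and calibrating if needed) before presenting them to users as a confidence level.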

I have a Spring Boot application with Ehcache configured as below:
@Bean
public EhCacheManagerFactoryBean ehCacheManagerFactoryBean() {
EhCacheManagerFactoryBean ehCacheManagerFactoryBean = new EhCacheManagerFactoryBean();
ehCacheManagerFactoryBean.setConfigLocation(new ClassPathResource("ehcache.xml"));
//ehCacheManagerFactoryBean.setCacheManagerName("messageCache");
ehCacheManagerFactoryBean.setShared(true);
return ehCacheManagerFactoryBean;
}
@Bean
public EhCacheCacheManager cacheMan

We are totally new to PSPP and have a question.
We have imported data from LimeSurvey and that has imported fine but we have multiple response questions in the survey.
After a lot of Googling we have found details about creating MRSets at http://www.gnu.org/software/pspp/manual/html_node/MRSETS.html but do not understand how we go about creating this.
Can anyone point us in the right direction?
Rich

Do I "use a quartile" or do I "use quartiles"? Is "use" even the correct word?
Do I need to be more wordy and say "use quartile summary statistics?"
Should I say "use quartile analysis" or something like that?
EDIT:
We have a large team of reviewers reviewing a large group of applicants. Each applicant will be ranked by each reviewer using a number scale. When completed we will use a quartile to analyze the ranking.

I'm using Spinnaker on Google Kubernetes Engine with Stackdriver as a log and statistics source. I'm having trouble getting timing/distribution metrics working with the Spinnaker automated canary analysis setup.
My application logs request information for each API request coming in. They include all the request details so I can create log-based stats.
A fairly truncated log entry looks like this:
{
"jsonPayload": {
"format_parameters": {
"ElapsedMilliseconds": "4.9541",
"Met

I want to recreate the following table using my own dataset.
Note that there is a p-value for every row (in this case every medication), based on Fisher's exact test. I take this to mean the p-value for the association between the individual medication and BPD.
When I try to run Fisher's exact test in SPSS, I get a p-value for the whole table.
Assume I have a dataset showing all my patients, with a column on whether they received a specific drug coded as yes/no, and a column on whether th

I'm trying to work out the differences between some diversity indices in Past, but I can't plot the standard deviation of my data, so I can't tell whether there is any significant difference between the populations.
I'm starting to wonder whether I should use other software, such as R.

I have 2 questions
I want to find the covariance of two matrices A and B. Before standardizing them, cov(A,B) looks like this (a 10*10 diagonal matrix where the main diagonal is about 1.25):
Columns 1 through 9
1.2510 -0.0024 -0.0024 0.0015 -0.0034 -0.0002 -0.0002 0.0009 0.0008
-0.0024 1.2494 -0.0033 0.0021 -0.0048 -0.0003 -0.0002 0.0012 0.0012
-0.0024 -0.0033 1.2495 0.0021 -0.0047 -0.0003 -0.0002 0.0012 0.0011
0.0015

My greatest nemesis is understanding the Stata commands needed to manipulate the data for each question. I am working with Stata to solve this set of questions:
Calculating probabilities from a T-distribution. Use the Stata functions “display” “t” “ttail”, “invt” and “invttail” to answer the following questions. For each question, use a T-distribution with degrees of freedom equal to 16. In your answers, give the full stata command you use, as well as the numerical answer. NOTES: stata

I want to use the GAM method with the generalized exponential distribution for my project. I know that GAM is a generalization of GLM and that the distribution of the response variable must be in the exponential family. The probability density function (PDF) of the generalized exponential distribution is as follows:
f(x; alpha, eta) = alpha * eta * exp(-eta*x) * ((1-exp(-eta*x))^(alpha-1))
The CDF of this distribution is as follows:
F(x; alpha, eta) = (1-exp(-eta*x))^alpha
The alpha is shape parameter and the eta is scale para

Tags: Statistics
regression, linear-regression, weather, pearson-correlation
I am pretty new to statistics and am trying to get my head around the best way to analyse performance data against weather data.
My hypothesis is that the performance data is affected by the weather data, and I want to test that.
I tried plotting the daily performance data and the meantemp data on a scatterplot for a regression analysis, and it looks odd. I think this is due to negative values in the weather data.
Below is all the weather information that is available to me:
fog,

I have a data frame (df) like this:
cpt units
36430 2
36440 5
36450 10
36430 1
36440 5
36450 10
36430 2
36440 6
36450 11
I need to add a new column (called mode) to the above data frame by grouping on the cpt column, windowing, and finding the mode of the units column. I expect to see something like this:
cpt units mode
36430 2 2
36440 5 5
36450 10 10
36430 1 2
36440 5 5
36450 10 10
36430 2 2
36440 6 5
36450 11 10
I have tried the following code:
d
