1. ## Statistics help

Any expert statisticians in the Playground? I'm trying to do something that goes a bit beyond what my high school statistics class covered, and I want a second opinion on whether I got it right. Details here.

2. ## Re: Statistics help

Originally Posted by Douglas
Any expert statisticians in the Playground? I'm trying to do something that goes a bit beyond what my high school statistics class covered, and I want a second opinion on whether I got it right. Details here.
warty goblin is the main professional statistician (and expert woodworker) I know of on this forum. Maybe you can use your mod powers to summon him?

3. ## Re: Statistics help

My statistics is not good enough (anymore) to reliably make a comment on the procedure (though I see no obvious problem).
But I wonder what you mean by 'first' and 'last' cards in the deck. The order in which you constructed your deck? Or the order Arena displays the cards when looking at it? I feel that, unless it was a weird intended bug, it's unlikely the client remembers in what order you constructed the deck.

4. ## Re: Statistics help

I'm not sure just from glancing at it whether your method will work or not. I'll have a harder look this weekend; right now I have to finish my cornflakes so I can head back to the statistics mines...

Originally Posted by Brother Oni
warty goblin is the main professional statistician (and expert woodworker) I know of on this forum. Maybe you can use your mod powers to summon him?
Thanks, but I'm hardly an expert woodworker. If I was, I'd have figured out how to carve faces by now.

5. ## Re: Statistics help

My gut says you should first be testing to see if you can reject the null hypothesis that the shuffling is fair and random.

I may be mistaken, but I think a multivariate binomial distribution is more appropriate for what you're doing, or the small population expansion.

6. ## Re: Statistics help

This sounds awesome.

I'm doing a Master's in statistics currently, so I'll try commenting on a couple things. I definitely defer to the more experienced, though.

I think you are going about it in the right way, if I'm reading you right.

My plan:

1. Run my implementation of the bug one billion times, recording frequency distributions for the first and last 24 cards before shuffling showing up in the first 7 after shuffling.
2. Use a chi-square two sample test to compute two p-values - one for the first 24 cards in the deck, the other for the last 24, in both cases comparing data from the game vs data from my simulation. As I understand it, I need the two sample variation rather than the more common Pearson's version because my predicted distribution is itself generated by a random sample rather than derived theoretically.
3. Use Fisher's method to combine these p-values into one.
4. Compare the result with the chosen significance level of 0.05.
In step #1, you are simulating what you expect is happening during shuffling. That is, a bug is making it non-random.
You use these simulations to make up two distributions: for the first 24 cards, and for the last 24 cards.

In step #2, you run two tests to see if "first 24 real" is similar to "first 24 simulated", and same for last 24. I'd defer to warty goblin about the correctness of steps 2-4, but that sounds reasonable.
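If it helps, steps 3 and 4 together reduce to just a few lines. A toy sketch in Java (the 9.488 cutoff is the standard chi-square critical value for 4 degrees of freedom at alpha = 0.05, which is what Fisher's method compares against when combining two p-values):

```java
// Sketch of Fisher's method for combining two p-values (steps 3-4 of the plan).
// For k independent p-values, X = -2 * sum(ln p_i) follows a chi-square
// distribution with 2k degrees of freedom under the null hypothesis.
public class FisherCombine {
    // Critical value of chi-square with 4 df at the 0.05 significance level.
    static final double CHI2_4DF_CRIT_05 = 9.488;

    // Fisher's combined test statistic for two p-values.
    static double fisherStatistic(double p1, double p2) {
        return -2.0 * (Math.log(p1) + Math.log(p2));
    }

    // True if the combined test rejects the null at alpha = 0.05.
    static boolean reject(double p1, double p2) {
        return fisherStatistic(p1, p2) > CHI2_4DF_CRIT_05;
    }

    public static void main(String[] args) {
        // Two moderate p-values can reject in combination...
        System.out.println(reject(0.04, 0.04)); // prints true
        // ...while two unremarkable ones do not.
        System.out.println(reject(0.5, 0.5));   // prints false
    }
}
```

Note this only holds if the two p-values really are independent, which is the assumption being made here.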

Note that, instead of using a p-value of 0.05 as the straight cut-off, you could take the attitude of degrees of evidence. < 0.05 is "strong evidence"; < 0.01 is "very strong evidence".
If you plan on presenting this to Wizards of the Coast, they might find it intriguing even if you have a p-value < .1.

As a side note, if you wanted to simply prove non-randomness, I'd recommend simulating truly random (well, computer pseudo-random RNG) shuffling and comparing its distribution to your real data, to see if there's a difference.
But here you're trying to prove a specific sort of non-randomness.
On the other hand, it might be worthwhile for you to test just if the shuffling is non-random in general as well as for what you expect is the particular bug.

I have 8 bins and unequal sample sizes, so 8 degrees of freedom.
Usually, the degrees of freedom are the number of bins - 1. (It can be less if you have to estimate any of your parameters.)

Originally Posted by Astral Avenger
My gut says you should first be testing to see if you can reject the null hypothesis that the shuffling is fair and random.

I may be mistaken, but I think a multivariate binomial distribution is more appropriate for what you're doing, or the small population expansion.
I'd agree with his gut response. At least, that would make folk take the "now I'm testing for this specific sort of non-randomness" claim more seriously, since they'd already be fairly convinced something is off.

Hypergeometric (or some sort of multivariate hypergeometric?) might also be more appropriate, although for large samples the difference between it and the binomial mostly disappears.
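To make that concrete: for a fair shuffle, the number of "marked" cards in the opening hand is hypergeometric. A quick sketch, with the sizes taken from the thread (60-card deck, 7-card hand, 24 cards of interest):

```java
// Hypergeometric probabilities for a fair shuffle: P(k of the 24 marked cards
// appear in a 7-card hand drawn from a 60-card deck), i.e.
//   P(k) = C(24, k) * C(36, 7 - k) / C(60, 7)
public class FairHandDistribution {
    // Binomial coefficient C(n, k), computed in doubles (fine at this scale).
    static double choose(int n, int k) {
        double result = 1.0;
        for (int i = 0; i < k; i++) {
            result = result * (n - i) / (i + 1);
        }
        return result;
    }

    // P(exactly k marked cards in hand) under the hypergeometric model.
    static double pMarkedInHand(int k, int deck, int marked, int hand) {
        return choose(marked, k) * choose(deck - marked, hand - k)
                / choose(deck, hand);
    }

    public static void main(String[] args) {
        double total = 0.0;
        for (int k = 0; k <= 7; k++) {
            double p = pMarkedInHand(k, 60, 24, 7);
            System.out.printf("%d in hand: %.6f%n", k, p);
            total += p;
        }
        System.out.printf("sum: %.6f%n", total); // sums to 1
    }
}
```

The mean works out to 7 * 24 / 60 = 2.8 marked cards per hand, which gives a baseline to compare any simulated distribution against.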

7. ## Re: Statistics help

Originally Posted by Kato
My statistics is not good enough (anymore) to reliably make a comment on the procedure (though I see no obvious problem).
But I wonder what you mean by 'first' and 'last' cards in the deck. The order in which you constructed your deck? Or the order Arena displays the cards when looking at it? I feel that, unless it was a weird intended bug, it's unlikely the client remembers in what order you constructed the deck.
You can export a deck from Arena, which works by putting a list of the cards in it into your clipboard so you can paste it somewhere. The order of that list is always the same, and is determined by the order you added cards to the deck when you built it. When a deck is written to the game logs, the same order is used. I'm pretty sure this is the order that the game uses internally to store the deck, and most likely is also the order that gets input to the shuffler.

Originally Posted by Astral Avenger
My gut says you should first be testing to see if you can reject the null hypothesis that the shuffling is fair and random.

Originally Posted by Astral Avenger
I may be mistaken, but I think a multivariate binomial distribution is more appropriate for what you're doing, or the small population expansion.
On looking it up, it looks like that is indeed the type of distribution I'm working with, but I didn't find anything on how to test whether two samples are from the same such distribution.

Originally Posted by JeenLeen
Note that, instead of using a p-value of 0.05 as the straight cut-off, you could take the attitude of degrees of evidence. < 0.05 is "strong evidence"; < 0.01 is "very strong evidence".
If you plan on presenting this to Wizards of the Coast, they might find it intriguing even if you have a p-value < .1.
A low p-value would actually be an indication that I'm wrong, not that I'm right.

Originally Posted by JeenLeen
As a side note, if you wanted to simply prove non-randomness, I'd recommend simulating truly random (well, computer pseudo-random RNG) shuffling and comparing its distribution to your real data, to see if there's a difference.
But here you're trying to prove a specific sort of non-randomness.
On the other hand, it might be worthwhile for you to test just if the shuffling is non-random in general as well as for what you expect is the particular bug.
Already did that, as linked above. The distribution a correct shuffler is supposed to have is easy to derive from pure theory, and that's the first thing I tested. My choice of hypothesis for the specific bug is based on the patterns I observed in that first test.

Originally Posted by JeenLeen
Usually, the degrees of freedom are the number of bins - 1. (It can be less if you have to estimate any of your parameters.)
According to the source I'm using for how to do the two sample chi-squared test, if the sample sizes are different then it's just the number of bins. If you have a more authoritative source that says otherwise, please tell me where I can find it. When I try searching for such things, the overwhelming majority of results are about a regular Pearson's chi-squared test, which has an assumption that doesn't match my situation - that the predicted distribution is a theoretical one, known exactly with no variance.
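For concreteness, one standard form of the two-sample statistic with unequal totals (this sketch follows the chstwo formulation from Numerical Recipes, which I'm assuming matches what my source describes; the square-root rescaling is what handles the different sample sizes):

```java
// Two-sample chi-square statistic for binned data with unequal totals,
// following the chstwo formulation (Numerical Recipes): each sample is
// rescaled by the square root of the ratio of totals so histograms of
// different sizes can be compared bin by bin.
public class TwoSampleChiSquare {
    // Returns the chi-square statistic for two histograms over the same bins.
    // With unequal totals the degrees of freedom are the number of nonempty
    // bins; with equal totals, subtract one.
    static double statistic(double[] r, double[] s) {
        double sumR = 0.0, sumS = 0.0;
        for (double x : r) sumR += x;
        for (double x : s) sumS += x;
        double k = Math.sqrt(sumS / sumR); // rescales sample r
        double h = Math.sqrt(sumR / sumS); // rescales sample s
        double chi2 = 0.0;
        for (int i = 0; i < r.length; i++) {
            if (r[i] == 0.0 && s[i] == 0.0) continue; // skip empty bins
            double d = k * r[i] - h * s[i];
            chi2 += d * d / (r[i] + s[i]);
        }
        return chi2;
    }

    public static void main(String[] args) {
        // Identical shapes at different sample sizes give a statistic of 0.
        double[] r = {100, 200, 300, 400};
        double[] s = {50, 100, 150, 200};
        System.out.println(statistic(r, s)); // ~0
        // For 8 bins (df = 8), the 0.05 critical value is about 15.507.
    }
}
```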

Originally Posted by JeenLeen
Hypergeometric (or some sort of multivariate hypergeometric?) might also be more appropriate, although for large samples the difference between it and binomial mostly disappear.
Hypergeometric is what a correct shuffle is supposed to have. This bug results in something different.

Thanks for the comments! I'm hoping to post my detailed study plan on reddit today, but I want to be reasonably sure I'm doing the analysis right first.

Incidentally, the numbers for the simulation row in my example are real. Running the billion shuffles and tabulating the results took somewhere around an hour, I think.

8. ## Re: Statistics help

You're using the two sample chi square correctly so far as I can tell. However, combining the two p-values via Fisher's Method isn't quite right here, because the tests aren't independent. Since all the cards have to end up somewhere in the deck, the location of the first card in the shuffled deck is not independent of the location of the second.

The most obvious solution is to do one test from the beginning by working directly with the simulated joint distribution. So when you simulate/tabulate the data, record the probabilities for all 60 cards ending up in your hand, then calculate the marginal distribution of cards from the first 24 and last 24 by summing over that. This directly gets you a Monte Carlo approximation to the appropriate distribution, so you can calculate a single p-value via the chi square test.

Because the deck is fairly large, however, the dependence between the cards will be fairly weak, so this won't change your answer very much. Further, because the dependence is by necessity negative, your current method is very slightly conservative, which, if you're going to be wrong, is the direction to be wrong in.
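A sketch of that joint tabulation (using a correct Fisher-Yates shuffle just to show the mechanics; the A + B <= 7 dependence shows up directly in which cells can ever be populated):

```java
import java.util.Random;

// Tabulates the joint distribution of (A, B): A = cards from the first 24,
// B = cards from the last 24, appearing in a 7-card opening hand. The
// marginal distributions can then be read off by summing rows or columns.
public class JointHandTabulation {
    // Standard (correct) Fisher-Yates shuffle.
    static void shuffle(int[] deck, Random rng) {
        for (int i = 0; i < deck.length; i++) {
            int j = rng.nextInt(deck.length - i) + i;
            int tmp = deck[i]; deck[i] = deck[j]; deck[j] = tmp;
        }
    }

    // joint[a][b] = number of hands with a first-24 cards and b last-24 cards.
    static long[][] tabulate(int trials, long seed) {
        Random rng = new Random(seed);
        long[][] joint = new long[8][8];
        int[] deck = new int[60];
        for (int t = 0; t < trials; t++) {
            for (int c = 0; c < 60; c++) deck[c] = c;
            shuffle(deck, rng);
            int a = 0, b = 0;
            for (int c = 0; c < 7; c++) {
                if (deck[c] < 24) a++;       // card from the first 24
                else if (deck[c] >= 36) b++; // card from the last 24
            }
            joint[a][b]++;
        }
        return joint;
    }

    public static void main(String[] args) {
        long[][] joint = tabulate(100000, 42L);
        // Cells with a + b > 7 are impossible: A and B are dependent,
        // which is exactly why their p-values aren't independent.
        for (int a = 0; a <= 7; a++)
            for (int b = 0; b <= 7; b++)
                if (a + b > 7 && joint[a][b] != 0)
                    throw new AssertionError("impossible cell populated");
    }
}
```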

9. ## Re: Statistics help

How is the non-independence of positions of different cards relevant to my use of Fisher's method? That non-independence affects the results for how many early/late cards are in the opening hand, which is what goes into the two sample chi square test. Fisher's method doesn't come in until that's already done.

10. ## Re: Statistics help

Huh, I never felt like MGA's shuffling was askew... but then again I'm not too attentive or whatever counts here. If it was really bad I guess I would have noticed.

So, I know I cannot really be of much help but if I may ask anyway...(assuming this isn't just an exercise to improve your statistics background)
what do you think the bug is and how much does it diverge from the 'proper' distribution? (going by your simulation)

11. ## Re: Statistics help

Originally Posted by Douglas
How is the non-independence of positions of different cards relevant to my use of Fisher's method? That non-independence affects the results for how many early/late cards are in the opening hand, which is what goes into the two sample chi square test. Fisher's method doesn't come in until that's already done.
Fisher's method combines p-values from independent tests. Your tests aren't independent because your p-values are derived from tests of dependent random variables. If A is the number of first 24 cards in your hand, and B is the number of the last 24, then you necessarily have that A + B <= 7.

12. ## Re: Statistics help

Originally Posted by Kato
Huh, I never felt like MGA's shuffling was askew... but then again I'm not too attentive or whatever counts here. If it was really bad I guess I would have noticed.

So, I know I cannot really be of much help but if I may ask anyway...(assuming this isn't just an exercise to improve your statistics background)
what do you think the bug is and how much does it diverge from the 'proper' distribution? (going by your simulation)
The correct way to do a Fisher-Yates shuffle (which is what lead developer Chris Clay has said they're using) goes like this:
Code:
```
for (int i = 0; i < deck.length; i++) {
    int swapIndex = random.nextInt(deck.length - i) + i;
    int temp = deck[i];
    deck[i] = deck[swapIndex];
    deck[swapIndex] = temp;
}
```
I think the bug is that Arena is actually doing this:
Code:
```
for (int i = 0; i < deck.length; i++) {
    int swapIndex = random.nextInt(deck.length);
    int temp = deck[i];
    deck[i] = deck[swapIndex];
    deck[swapIndex] = temp;
}
```
For a 60 card deck, counting how many of the first 24 cards in the decklist get drawn and how many of the last 24, the distributions from my simulation look like this. Each value is the probability of drawing that many of those cards in the opening 7 card hand.
|          | 0 in hand | 1 in hand | 2 in hand | 3 in hand | 4 in hand | 5 in hand | 6 in hand | 7 in hand |
|----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| first 24 | 0.009336  | 0.068686  | 0.201692  | 0.306143  | 0.259227  | 0.122308  | 0.029739  | 0.002869  |
| last 24  | 0.046986  | 0.194165  | 0.319792  | 0.271807  | 0.128615  | 0.033814  | 0.004575  | 0.000245  |
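For anyone who wants to reproduce this at a smaller scale than a billion shuffles, here's a cut-down version of the simulation, tracking only the mean number of first-24 cards in the hand (exact numbers will wobble a bit since this uses far fewer trials; a fair shuffle should average 7 * 24 / 60 = 2.8):

```java
import java.util.Random;

// Compares the mean number of first-24 cards in the opening 7-card hand
// between the correct Fisher-Yates shuffle and the suspected buggy variant
// that draws swapIndex from the whole deck on every iteration.
public class ShuffleBiasDemo {
    static void shuffle(int[] deck, Random rng, boolean buggy) {
        for (int i = 0; i < deck.length; i++) {
            int j = buggy ? rng.nextInt(deck.length)          // suspected bug
                          : rng.nextInt(deck.length - i) + i; // correct
            int tmp = deck[i]; deck[i] = deck[j]; deck[j] = tmp;
        }
    }

    // Mean count of the original first 24 cards appearing in the first 7.
    static double meanFirst24InHand(int trials, boolean buggy, long seed) {
        Random rng = new Random(seed);
        int[] deck = new int[60];
        long hits = 0;
        for (int t = 0; t < trials; t++) {
            for (int c = 0; c < 60; c++) deck[c] = c;
            shuffle(deck, rng, buggy);
            for (int c = 0; c < 7; c++) if (deck[c] < 24) hits++;
        }
        return (double) hits / trials;
    }

    public static void main(String[] args) {
        // Fair shuffle lands near 2.8; per the table above, the buggy
        // variant lands noticeably higher (around 3.24).
        System.out.println(meanFirst24InHand(200000, false, 1L));
        System.out.println(meanFirst24InHand(200000, true, 2L));
    }
}
```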

Originally Posted by warty goblin
Fisher's method combines p-values from independent tests. Your tests aren't independent because you're p-values are derived from tests of dependent random variables. If A is the number of first 24 cards in your hand, and B is the number of the last 24, then you necessarily have that A + B <= 7.
I see. I expect the effect of that on the aggregate statistics is really tiny, especially over a large sample size, and weakened even further by the fact that some games will only be counted in one or the other because, for example, the 24th and 25th cards are the same (and I can't distinguish which copy got drawn).

If I really want to be rigorous about this detail, it would be far simpler to separate games into two groups, and check only the first 24 cards for one group and the last 24 for the other. I actually did that for the simulation results, doing a separate set of 1 billion shuffles for each distribution.

Sounds like you're saying this is ok to ignore because of how small it is and what direction it's in?

13. ## Re: Statistics help

Wow, it took me way too long to code this (mostly because I made stupid mistakes, not because I couldn't figure it out, but still... and I used Octave). (Also, not sure if my implementation is the most efficient, but it works.)

Okay, the difference seems really obvious now, so if you have the raw data it should be clear whether they use the wrong algorithm, which would be a bit embarrassing... Also, it seems weird that MGA doesn't sort cards alphabetically or something. But apparently not. So let me know if I should abuse this bug in the future
