  1. #1
     BardGuy (Ettin in the Playground)
     Join Date: Jan 2009

     programming the programmatically impractical

    Point of thread: any tips on efficiently doing way too many things in a program, particularly in R, though general guidelines are welcome. Solid 'give up' advice is welcome, too.

    I'm trying to write a program that checks all combinations of answering questions right on a test. Assume binary answers, so either a 1 (got it right) or a 0 (got it wrong).
    If I recall the professor correctly, this amounts to checking approximately 2^x things, where x is the number of items on the test. That is, for a 30-item test, I'm already at over 1 billion things to check.
    And it's possible that checking each thing might take a minute or so, plus my software (R) errors out if I try to store 2^30 things.
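    To give a concrete idea of what "all combinations" means, here's a small sketch (the item count is arbitrary). For 30 items the full table is 2^30 rows and can't be stored, so I'd have to generate each pattern from its index instead.
    Code:
    # All 2^n response patterns for a small test -- fine for n = 5,
    # but a 30-item test would mean 2^30 rows, far too big to store.
    n_items  <- 5
    patterns <- expand.grid(rep(list(0:1), n_items))
    nrow(patterns)  # 2^5 = 32

    # For 30 items, generate one pattern at a time from its index instead:
    pattern_k <- function(k, n) as.integer(intToBits(k))[1:n]
    pattern_k(5, 30)  # the 0/1 pattern for index 5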

    So I have two issues.
    1) processing time
    2) storing results
    I think I can get around #2 by not storing the result of each check, but just keeping a counter of the number of fails and exiting after so many fails, something like the sketch below. Maybe record the fails and output those. (It's failures we care about. Ideally, nothing fails. I might set it so that the program exits after 100 fails.*)
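    Roughly what I have in mind for the counter (just a sketch; check_combination() is a stand-in for whatever test ends up getting run on each combination):
    Code:
    # Fail-counter sketch.  check_combination() is a placeholder for the
    # actual independence check; assume it returns TRUE when a check fails.
    max_fails <- 100
    fails     <- 0
    fail_log  <- integer(0)

    for (k in 0:(2^30 - 1)) {
      if (check_combination(k)) {     # hypothetical per-combination check
        fails    <- fails + 1
        fail_log <- c(fail_log, k)    # record only the failing combinations
        if (fails >= max_fails) break # bail out after 100 fails
      }
    }
    fails
    fail_log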

    I think I can get around #1 basically just by tolerating that the program runs for a few hours. To test this, I tried a simple procedure like
    Code:
    x <- 2^30
    start <- Sys.time()             # record the time before the loop
    for (i in 1:x) y <- Sys.time()  # 2^30 iterations of a cheap function call
    Sys.time() - start              # elapsed time for the whole loop
    and checking how long it takes to run 2^30 loops of a mere function call, comparing the time after the loop finishes to the start time recorded beforehand.
    Since I'm trying something that's hard, I think it's okay if it takes a long time to run as long as it, well, runs. Or at least runs overnight for most test sizes. (But it's run for almost an hour so far and isn't close to finishing... I should probably add a checker that prints something every 10 million operations or so, so I have some idea of how long it's taking; something like the sketch below. Probably gonna stop it for tonight.)
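    A sketch of the progress check (same timing loop as above, just with a print every 10 million iterations):
    Code:
    # Print a progress line every 10 million iterations so there's some
    # idea of how long the full 2^30 run will take.
    x     <- 2^30
    start <- Sys.time()
    for (i in 1:x) {
      y <- Sys.time()
      if (i %% 1e7 == 0) {
        cat(i, "iterations done, elapsed:", format(Sys.time() - start), "\n")
      }
    }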

    More Context
    In case anyone here knows about the topic, what I'm trying to do is write an R program to check test data for strong local independence. In Item Response Theory (sorta like the modern statistics for test/assessment data), most tests simply check for weak local independence (or something essentially equivalent) and work on the assumption that weak LI implies strong LI. For a class project (and maybe publication in a journal), I'd like to try to make a strong LI procedure.
    If you know of any journal articles talking about strong LI, please point me to them. I've found a lot of cool articles on weak LI, essential unidimensionality, DIMTEST/DETECT, and similar concepts, but they just aren't what I'm looking for.

    *caveat
    To be more precise: weak LI usually checks independence between each pairwise combination (e.g., items 1 and 2, then 1 and 3, then 1 and 4... and so on), rejecting LI if a significant correlation is found. I plan to check the pairwise combinations first. If I'm getting fails there, early on in the program, then I'm pretty sure there will be a lot more fails once it gets into the more complicated stuff.
    That is, strong LI implies weak LI, so if weak LI already fails, strong LI has to fail too and I have no real reason to waste time checking it.
    On the other hand, if I'm seeing my fails scattered amongst the later combinations, then just outputting the first 100 fails is probably enough for the person using the program to determine if they care or not. Failing strong LI might not matter (from a practical standpoint) if it's just a few scattered item combinations that show dependence.
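    For that pairwise first pass, something like this sketch (resp is a hypothetical respondents-by-items 0/1 matrix; a chi-square test stands in for whatever association test is appropriate, and the conditioning on ability that a real LI check needs is ignored here):
    Code:
    # Pairwise (weak LI) pass: test every item pair for association.
    # 'resp' is a hypothetical N x 30 matrix of 0/1 responses.
    n_items <- ncol(resp)
    pairs   <- combn(n_items, 2)      # all pairwise item combinations
    pvals   <- apply(pairs, 2, function(p) {
      chisq.test(table(resp[, p[1]], resp[, p[2]]))$p.value
    })
    sum(pvals < 0.05)                 # number of pairwise "fails"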
    Last edited by JeenLeen; 2018-03-13 at 10:15 PM.

  2. #2
     Titan in the Playground
     Join Date: May 2007, Location: Tail of the Bellcurve, Gender: Male

     Re: programming the programmatically impractical

    Well, number one, if you're checking a billion things, you are going to find dependence whether it's real or not. You'll need some form of false discovery rate control. I'm also not sure exactly what condition you're checking to call something a failure, but that's likely to be where most of your computing time goes, so make sure that code is optimized. If you aren't storing the results, then the overhead should be negligible, and therefore you can estimate your running time fairly accurately by running a reasonably sized case (say 2^10) and multiplying.
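    Both of those are a couple of lines in R (a sketch; pvals and run_checks() are stand-ins for whatever your checks actually produce):
    Code:
    # Benjamini-Hochberg false discovery rate control on the raw p-values.
    # 'pvals' is a stand-in for the p-values produced by the checks.
    adjusted <- p.adjust(pvals, method = "BH")
    sum(adjusted < 0.05)    # checks that still count as failures

    # Rough running-time estimate: time a small case and scale up.
    # run_checks() is a stand-in for the actual checking routine.
    t_small <- system.time(run_checks(2^10))["elapsed"]
    t_small * 2^20          # approximate seconds for the full 2^30 case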

    This also strikes me as a problem that is likely to be pretty easy to run in parallel. R has somewhat indifferent support for this, but it does do it, and it should cut your running time by a factor of the number of cores you can throw at it. Particularly handy if you can get access to a multi-core server and set 16 or 32 processors to hammer away at it.
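    A sketch with the base parallel package (check_combination() is again a placeholder for the per-combination check):
    Code:
    library(parallel)

    # mclapply forks worker processes (Unix/Mac; on Windows use parLapply
    # with a cluster instead).  check_combination() is a placeholder for
    # whatever the per-combination test ends up being; indices are chunked
    # here at 2^20 just to keep the result list manageable.
    results <- mclapply(0:(2^20 - 1), check_combination,
                        mc.cores = detectCores())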

    Also, if you're worried about speed, and are doing fairly basic operations, you really want to not use R. R sucks for speed, even for an interpreted language. You'll do better in Java, and better again in C or C++.

  3. #3
     Kato (Titan in the Playground)
     Join Date: Apr 2008, Location: Germany, Gender: Male

     Re: programming the programmatically impractical

    I'm not entirely sure what you mean to do... It sounds to me like you want to study all 2^30 possible combinations of answers, but I wonder what you expect to get from this. It seems to me much more fruitful to analyze a specific set of actual tests and look for correlations among them. Unless you have a massive sample size, that should be far fewer than the billion or so combinations the former task would require you to check.
    Though I'm probably misunderstanding what you are looking for...

    Edit: the bigger issue to me seems to be that if I understand strong (in)dependence correctly, you will possibly need to check on the order of 30! ~ 2.7 * 10^32 combinations... I highly recommend doing a preliminary check for weak independence and only then looking for strong, or at least starting with a shorter test, but even 10! seems like a lot... Unless, again, I'm misunderstanding.
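    For scale, the raw numbers:
    Code:
    2^30           # ~1.07e9 response patterns for a 30-item test
    factorial(30)  # ~2.65e32
    factorial(10)  # 3628800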
    Last edited by Kato; 2018-03-14 at 06:06 AM.

  4. #4
     Firbolg in the Playground
     Join Date: Dec 2010

     Re: programming the programmatically impractical

    Assuming I understand correctly that you want to use observational data to check independence relationships...

    If you're talking about comparing all tabular entries in the distribution p(x1,x2,x3,...,xn|z) with p(x1|z)*p(x2|z)*p(x3|z)*...*p(xn|z), or with combinations like p(x1,x2|z)*p(x3|z)*p(x4,x5|z), etc., then keep in mind that you aren't going to have 2^30 tests' worth of data. This means that at most as many cells of the joint distribution as you have tests will actually be non-zero, while at the same time, assuming that none of your questions are answered 100% correctly or 100% incorrectly, the product of the component distributions is always going to be non-zero.

    The issue here is not that it's computationally expensive to check all the independence relationships, but rather that you will probably have insufficient data to actually resolve the question of independence over all possible independence relationships admitted by this particular set of variables.
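    To make the sparsity point concrete, a sketch with made-up names (resp is a hypothetical respondents-by-items 0/1 matrix, z a factor grouping respondents by ability level): with N respondents, at most N cells of the 2^n joint table can be non-zero, while the product of the item marginals is non-zero everywhere.
    Code:
    # Compare the empirical joint distribution of two items (within one
    # level of z) with the product of their marginals.  'resp' and 'z'
    # are hypothetical: resp is an N x n 0/1 matrix, z an ability grouping.
    grp <- resp[z == levels(z)[1], ]

    joint   <- table(grp[, 1], grp[, 2]) / nrow(grp)   # p(x1, x2 | z)
    marg1   <- table(grp[, 1]) / nrow(grp)             # p(x1 | z)
    marg2   <- table(grp[, 2]) / nrow(grp)             # p(x2 | z)
    product <- outer(marg1, marg2)                     # p(x1|z) * p(x2|z)

    joint - product   # sparse data shows up as empty cells in 'joint'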
