
Anyone want to help me with my statistics?

potatocubed
10-22-2009, 06:15 AM
So, I've got some data. It looks a lot like this:

Alice|Pam|302
Bob|Pam|251
Alice|Quentin|64
Bob|Quentin|58

Except, you know, I've got hundreds of samples.

I'm trying to judge how well the copyeditors are doing, based on how many errors the proofreaders turn up in their work. The complication is that the proofers have wildly variable standards regarding which things are worth correcting and which things aren't. What I'm trying to do is remove the 'proofreader bias' from the data.

My plan to do so is this:

1. Find the mean average of all the results from a given proofreader. (M)
2. Multiply that number by x, so that it becomes 50. (xM = 50)
3. Multiply all the original results by x to get a 'normalised' (is that even the right term?) set of results.

Would that work to do what I want? If not, what would? It's more important that the final method be simple rather than super-rigorously accurate - I'm looking for general tendencies rather than exact values.
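
The three steps above can be sketched roughly as follows. This is only an illustration using the four sample rows from the post; the function name `normalise_by_proofreader` and the constant `TARGET_MEAN` are made up for the example, and the real data set would have hundreds of rows.

```python
from collections import defaultdict

# Sample rows from the post: (copyeditor, proofreader, errors found).
rows = [
    ("Alice", "Pam", 302),
    ("Bob", "Pam", 251),
    ("Alice", "Quentin", 64),
    ("Bob", "Quentin", 58),
]

TARGET_MEAN = 50  # assumed name; the value each proofreader's mean is rescaled to


def normalise_by_proofreader(rows, target=TARGET_MEAN):
    # Step 1: mean M of all results from each proofreader.
    by_proofer = defaultdict(list)
    for _, proofer, errors in rows:
        by_proofer[proofer].append(errors)
    means = {p: sum(v) / len(v) for p, v in by_proofer.items()}
    # Step 2: x = target / M, so that x * M == target.
    # Step 3: multiply each original result by its own proofreader's x.
    return [(ce, p, errors * target / means[p]) for ce, p, errors in rows]


normalised = normalise_by_proofreader(rows)
for ce, p, score in normalised:
    print(f"{ce}|{p}|{score:.1f}")
```

On the sample rows this prints 54.6 and 45.4 for Pam's two results and 52.5 and 47.5 for Quentin's, so each proofreader's results now average 50.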

douglas
10-22-2009, 06:21 AM
Originally posted by potatocubed:
3. Multiply all the original results by x to get a 'normalised' (is that even the right term?) set of results.

This will exaggerate proofreader bias, not eliminate it. You want to divide by M instead. If your data has to be integers or you just prefer not working with results that are all near 1, multiply the original data by 100 before dividing by M.

Telonius
10-22-2009, 11:12 AM
One other variable you might want to consider - how many pages/words are the copyeditors doing apiece? If Alice is doing 100 pages and Bob's only doing 20, that won't show up if you're just looking at number of errors found.

Tirian
10-22-2009, 12:22 PM
It seems to me like a sound strategy (considering douglas' correction, of course). If you feel like your description is handwavey, it would be very intuitive to describe it with percentages instead of normalization (which seems like a sensible word from a linear algebra perspective, and of course I put a 'z' in there being on the other side of the Atlantic.)

In other words, in the sample you give, Alice makes up 55% of Pam's corrections and 52% of Quentin's, and based on those two compatible data points (and the hundreds of others that you have, of course) you can evaluate the validity of the hypothesis that Alice is a more error-prone copyeditor than Bob.
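
The percentage view can be sketched like this, again using only the four sample rows. The function name `shares_by_proofreader` is an assumption for the example; it sums per copyeditor/proofreader pair so that it would also cope with many rows per pair.

```python
from collections import defaultdict

# Sample rows from the first post: (copyeditor, proofreader, errors found).
rows = [
    ("Alice", "Pam", 302),
    ("Bob", "Pam", 251),
    ("Alice", "Quentin", 64),
    ("Bob", "Quentin", 58),
]


def shares_by_proofreader(rows):
    """Percentage of each proofreader's total corrections that fall on each copyeditor."""
    proofer_totals = defaultdict(int)
    pair_totals = defaultdict(int)
    for copyeditor, proofer, errors in rows:
        proofer_totals[proofer] += errors
        pair_totals[(copyeditor, proofer)] += errors
    return {
        (ce, p): 100 * total / proofer_totals[p]
        for (ce, p), total in pair_totals.items()
    }


shares = shares_by_proofreader(rows)
for (ce, p), pct in shares.items():
    print(f"{ce} is {pct:.0f}% of {p}'s corrections")
```

For the sample this gives Alice 55% of Pam's corrections and 52% of Quentin's, matching the figures above.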

Roukon
10-22-2009, 12:37 PM
Originally posted by Telonius:
One other variable you might want to consider - how many pages/words are the copyeditors doing apiece? If Alice is doing 100 pages and Bob's only doing 20, that won't show up if you're just looking at number of errors found.

This is one thing I would stress. You want to make sure that they have similar stimuli so you have the best basis for comparison. Ideally, you would want them to copyedit and proofread the same item.

As to your original question, the easiest way to eliminate the "proofreader bias" would be to pre-test them by giving your proofreaders the same document and comparing the errors they find with each other until you have some of them that have similar scores for the proofreading. Then you use those individuals for the actual proofreading, and since they have a similar "proofreading score" you won't have as much of a bias as if you used random people.

However, it sounds like you have already collected your data. You can still use the method I just described, but you will be throwing out many of the data points you collected, as they will only complicate or distort your findings. If this is for a report of any type (even a school one, which I assume it is) you will need to report that you excluded some data points and why you did this.

If you want more help, PM me. I am a psychology graduate with an interest in statistics. However, I feel I should mention that stats are rarely simple and quick to do.

Later Days,
Roukon

potatocubed
10-22-2009, 04:14 PM
Thanks for all your help guys. I'll make douglas' correction and see if I can't do something with the page counts as well - I have that data, it's just less convenient to pull out of the system.

Pyrian
10-22-2009, 04:47 PM
Without page counts or a similar measure of comparison, the data is meaningless, unless they're generally evenly distributed anyway. "Joe has four times as many errors as Hugo!" "Um, but Joe's been working here eight times as long..." Even with page counts, the method you're using can easily introduce errors if certain proofreaders are significantly more likely to have proofread certain copyeditors - i.e., if Pam proofread a lot of Alice's work and Alice was very good, Pam would get weighted as a lenient proofreader.

Thajocoth
10-22-2009, 08:11 PM
Assuming no more columns than that, I'd say modify corrections based on each proofreader's standard deviations.
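
One common way to read "modify corrections based on each proofreader's standard deviations" is a z-score: subtract the proofreader's mean and divide by that proofreader's standard deviation. That reading is an assumption, not something spelled out in the post, and the function name `z_scores_by_proofreader` is invented for this sketch. It uses the population standard deviation (`statistics.pstdev`); the sample version (`statistics.stdev`) would be an equally reasonable choice.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Sample rows from the first post: (copyeditor, proofreader, errors found).
rows = [
    ("Alice", "Pam", 302),
    ("Bob", "Pam", 251),
    ("Alice", "Quentin", 64),
    ("Bob", "Quentin", 58),
]


def z_scores_by_proofreader(rows):
    by_proofer = defaultdict(list)
    for _, proofer, errors in rows:
        by_proofer[proofer].append(errors)
    stats = {p: (mean(v), pstdev(v)) for p, v in by_proofer.items()}

    result = []
    for copyeditor, proofer, errors in rows:
        m, sd = stats[proofer]
        # If a proofreader's counts never vary, sd is 0; report 0.0 instead of dividing by zero.
        z = (errors - m) / sd if sd else 0.0
        result.append((copyeditor, proofer, z))
    return result


z_scores = z_scores_by_proofreader(rows)
```

With only two rows per proofreader, as in the sample, the scores are necessarily +1 and -1, so the method only becomes informative on the full data set.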

If there are any more columns besides what's shown, I'd recommend using 10-fold cross-validation, but that requires:
A) Explaining it
B) You to write code (or draw very long decision trees)

It's a form of machine learning... A way of extracting information from a data table. Basically, you'd declare Copyeditor to be your "class" variable, since that's what you want to learn about. Then you normalize your results a bit. Personally, I'd take the average of each proofreader's corrections and divide each of that proofreader's corrections by it. A better way to handle it would be to calculate each proofreader's standard deviation and use that to group their corrections into A, B, C, D & E. (It's bad for any column to have too many values... 5 is actually pushing it, but acceptable.) Then remove the proofreader column, as it's now incorporated into the corrections column. Any remaining quantitative columns would be broken down into ranges instead of using flat numbers. That's all preparation for machine learning in general.

10-fold cross-validation means you divide the data into 10 equal pieces. For each piece, you generate the rules from the other 9 pieces, then test those rules on the current piece to see how accurate they are. The average of those accuracies is the accuracy of your resulting rule-set. Rule generation would wind up being, like, "If A and B and C, then Copyeditor is Bob." This is, of course, very much a summary and not enough information really to do this.
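
The fold-by-fold loop described here can be sketched as below. Everything in it is illustrative: `k_fold_indices`, `cross_validate` and `majority_rule` are made-up names, and `majority_rule` is only a placeholder learner that always predicts the most common class. It stands in for real rule generation, which the post does not specify.

```python
from collections import Counter


def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    if n < k:
        raise ValueError("need at least k rows")
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_rule(train_features, train_labels):
    """Placeholder learner (an assumption): ignore features, predict the commonest class."""
    commonest = Counter(train_labels).most_common(1)[0][0]
    return lambda features: commonest


def cross_validate(features, labels, learn, k=10):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = learn(train_x, train_y)  # rules come from the other k-1 pieces
        correct = sum(predict(features[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))  # tested on the current piece
    return sum(accuracies) / len(accuracies)
```

In practice the rows would normally be shuffled before being split into folds, so that any ordering in the original table does not skew the pieces.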

EDIT:
Originally posted by potatocubed:
...the page counts as well - I have that data, it's just less convenient to pull out of the system.

Definitely incorporate the page counts, but again, that should adjust corrections (corrections per 100 pages?), so it wouldn't bring you into machine learning territory.
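
A corrections-per-100-pages adjustment might look like the sketch below. The thread gives no real page counts, so the pairing of error counts with page counts in `jobs` is invented purely for illustration (it borrows the 100-page and 20-page figures from the earlier example), and `errors_per_100_pages` is an assumed name.

```python
def errors_per_100_pages(errors, pages):
    """Scale a raw error count to a rate per 100 pages of copyedited text."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return 100 * errors / pages


# Hypothetical workloads: (copyeditor, errors found, pages copyedited).
jobs = [
    ("Alice", 302, 100),
    ("Bob", 58, 20),
]

rates = {copyeditor: errors_per_100_pages(errors, pages)
         for copyeditor, errors, pages in jobs}
```

With these invented numbers Alice has far more raw errors than Bob, but the rates come out close (302 versus 290 per 100 pages), which is the effect the page-count adjustment is meant to expose.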

RS14
10-22-2009, 08:50 PM
Originally posted by douglas:
This will exaggerate proofreader bias, not eliminate it. You want to divide by M instead. If your data has to be integers or you just prefer not working with results that are all near 1, multiply the original data by 100 before dividing by M.

x = 50/M. If I understand what is being done correctly, multiplying the data set by x is the same as dividing by M, just with a scaling factor of 50.
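
A quick numeric check of that equivalence, using Pam's two sample results from the first post (the variable names are just for this sketch):

```python
import math

M = (302 + 251) / 2   # mean of Pam's two sample results
value = 302           # Alice's result from Pam

x = 50 / M
via_multiply = value * x           # original plan: multiply by x
via_divide = 50 * (value / M)      # divide by M, then scale by 50
douglas_variant = 100 * value / M  # multiply by 100 before dividing by M

same_result = math.isclose(via_multiply, via_divide)
only_a_constant_apart = math.isclose(douglas_variant, 2 * via_multiply)
```

Both flags come out true: the two approaches differ only by a constant factor (50 versus 100), so they rank the copyeditors identically.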