STATISTICAL METHODS FOR RESEARCH WORKERS
By Ronald A. Fisher (1925)
Posted April 2000
VI
THE CORRELATION COEFFICIENT
30. No quantity is more characteristic of modern statistical work than the correlation coefficient, and no method has been applied successfully to such various data as the method of correlation. Observational data in particular, in cases where we can observe the occurrence of various possible contributory causes of a phenomenon, but cannot control them, has been given by its means an altogether new importance. In experimental work proper its position is much less central; it will be found useful in the exploratory stages of an enquiry, as when two factors which had been thought independent appear to be associated in their occurrence; but it is seldom, with controlled experimental conditions, that it is desired to express our conclusion in the form of a correlation coefficient.
One of the earliest and most striking successes of the method of correlation was in the biometrical study of inheritance. At a time when nothing was known of the mechanism of inheritance, or of the structure of the germinal material, it was possible by this method to demonstrate the existence of inheritance, and to [p. 139] "measure its intensity"; and this in an organism in which experimental breeding could not be practised, namely, Man. By comparison of the results obtained from the physical measurements in man with those obtained from other organisms, it was established that man's nature is not less governed by heredity than that of the rest of the animate world. The scope of the analogy was further widened by demonstrating that correlation coefficients of the same magnitude were obtained for the mental and moral qualities in man as for the physical measurements.
These results are still of fundamental importance, for not only is inheritance in man still incapable of experimental study, and existing methods of mental testing are still unable to analyse the mental disposition, but even with organisms suitable for experiment and measurement, it is only in the most favourable cases that the several factors causing fluctuating variability can be resolved, and their effects studied, by Mendelian methods. Such fluctuating variability, with an approximately normal distribution, is characteristic of the majority of the useful qualities of domestic plants and animals; and although there is strong reason to think that inheritance in such cases is ultimately Mendelian, the biometrical method of study is at present alone capable of holding out hopes of immediate progress.
We give in Table 31 an example of a correlation table. It consists of a record in compact form of the stature of 1376 fathers and daughters. (Pearson and Lee's data.) The measurements are grouped in [p. 140-141] [table] [p. 142] inches, and those whose measurement was recorded as an integral number of inches have been split; thus a father recorded as of 67 inches would appear as 1/2 under 66.5 and 1/2 under 67.5. Similarly with the daughters; in consequence, when both measurements are whole numbers the case appears in four quarters. This gives the table a confusing appearance, since the majority of entries are fractional, although they represent frequencies. It is preferable, if bias in measurement can be avoided, to group the observations in such a way that each possible observation lies wholly within one group.
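The splitting rule described above can be sketched in a few lines. This is only an illustration: the function names `split_measurement` and `cell_weights` are hypothetical, and the sketch assumes one-inch groups centred on the half-inch (66.5, 67.5, and so on), as the example of the 67-inch father suggests.

```python
from fractions import Fraction
import math

def split_measurement(inches):
    """Return {group centre: weight} for one recorded height."""
    if float(inches).is_integer():
        # A whole-inch record sits on a group boundary: half goes each way.
        return {inches - 0.5: Fraction(1, 2), inches + 0.5: Fraction(1, 2)}
    # Any other record lies wholly inside a single one-inch group.
    return {math.floor(inches) + 0.5: Fraction(1)}

def cell_weights(father, daughter):
    """Return {(father centre, daughter centre): weight} for one pair."""
    return {
        (f, d): wf * wd
        for f, wf in split_measurement(father).items()
        for d, wd in split_measurement(daughter).items()
    }

# A pair recorded as 67 and 63 inches is spread over four cells.
print(cell_weights(67, 63))
```

The weights for any one pair always sum to unity, so the table total still equals the number of pairs, even though most individual entries are fractional.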
The most obvious feature of the table is that cases do not occur in which the father is very tall and the daughter very short, and vice versa; the upper right-hand and lower left-hand corners of the table are blank, so that we may conclude that such occurrences are too rare to occur in a sample of about 1400 cases. The observations recorded lie in a roughly elliptical figure lying diagonally across the table. If we mark out the region in which the frequencies exceed 10 it appears that this region, apart from natural irregularities, is similar, and similarly situated. The frequency of occurrence increases from all sides to the central region of the table, where a few frequencies over 30 may be seen. The lines of equal frequency are roughly similar and similarly situated ellipses. In the outer zone observations occur only occasionally, and therefore irregularly; beyond this we could only explore by taking a much larger sample.
The table has been divided into four quadrants by [p. 143] marking out central values of the two variates; these values, 67.5 inches for the fathers and 63.5 inches for the daughters, are near the means. When the table is so divided it is obvious that the lower right-hand and upper left-hand quadrants are distinctly more populous than the other two; not only are more squares occupied, but the frequencies are higher. It is apparent that tall men have tall daughters more frequently than the short men, and vice versa. The method of correlation aims at measuring the degree to which this association exists.
The marginal totals show the frequency distributions of the fathers and the daughters respectively. These are both approximately normal distributions, as is frequently the case with biometrical data collected without selection. This marks a frequent difference between biometrical and experimental data. An experimenter would perhaps have bred from two contrasted groups of fathers of, for example, 63 and 72 inches in height; all his fathers would then belong to these two classes, and the correlation coefficient, if used, would be almost meaningless. Such an experiment would serve to ascertain the regression of daughter's height on father's height, and so to determine the effect on the daughters of selection applied to the fathers, but it would not give us the correlation coefficient which is a descriptive observational feature of the population as it is, and may be wholly vitiated by selection.
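The contrast between regression and correlation under selection can be illustrated by a small simulation. The sketch below is not drawn from Pearson and Lee's data: it assumes a hypothetical population of standardised, normally correlated pairs with correlation 0.5, and it imitates the experimenter by keeping only the cases in which the first variate lies more than 1.5 standard deviations from its mean. The helper `moments` and all numerical settings are assumptions made for illustration.

```python
import math
import random

def moments(pairs):
    """Return (slope of y on x, correlation) from deviations about the means."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(1925)   # fixed seed so the run is repeatable
RHO = 0.5                   # hypothetical population correlation

population = []
for _ in range(100_000):
    x = rng.gauss(0.0, 1.0)
    y = RHO * x + math.sqrt(1 - RHO ** 2) * rng.gauss(0.0, 1.0)
    population.append((x, y))

# "Experimental" sample: keep only two contrasted groups of x values.
selected = [(x, y) for x, y in population if abs(x) > 1.5]

slope_all, r_all = moments(population)
slope_sel, r_sel = moments(selected)
print(f"unselected: slope={slope_all:.3f}  r={r_all:.3f}")
print(f"selected:   slope={slope_sel:.3f}  r={r_sel:.3f}")
```

Under these assumptions the regression slope should be nearly the same in both samples, while the correlation computed from the selected sample should be markedly higher than that of the population.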
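Just as normal variation with one variate may be specified by a frequency formula in which the [p. 144] logarithm of the frequency is a quadratic function of the variate, so with two variates the frequency may be expressible in terms of a quadratic function of the values of the two variates. We then have a normal correlation surface, for which the frequency may conveniently be written in the form
$$df = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\; e^{-\frac{1}{2(1-\rho^2)}\left\{\frac{x^2}{\sigma_1^2}-\frac{2\rho xy}{\sigma_1\sigma_2}+\frac{y^2}{\sigma_2^2}\right\}}\,dx\,dy.$$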
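In this expression $x$ and $y$ are the deviations of the two variates from their means, $\sigma_1$ and $\sigma_2$ are the two standard deviations, and $\rho$ is the correlation between $x$ and $y$. The correlation in the above expression may be positive or negative, but cannot exceed unity in magnitude; it is a pure number without physical dimensions. If $\rho = 0$, the expression for the frequency degenerates into the product of the two factors
$$\frac{1}{\sigma_1\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma_1^2}}\,dx \;\cdot\; \frac{1}{\sigma_2\sqrt{2\pi}}\,e^{-\frac{y^2}{2\sigma_2^2}}\,dy,$$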
showing that the limit of the normal correlation surface, when the correlation vanishes, is merely that of two normally distributed variates varying in complete independence. At the other extreme, when $\rho$ is +1 or -1, the variation of the two variates is in strict proportion, so that the value of either may be calculated accurately from that of the other. In other words, we cease strictly to have two variates, but merely two measures of the same variable quantity.
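The degeneration at zero correlation can be checked numerically. The sketch below assumes the standard bivariate normal density for the normal correlation surface; the function names and the trial values of the deviations and standard deviations are hypothetical.

```python
import math

def normal_density(x, sigma):
    """Density of a normal variate with mean 0 and standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def correlation_surface(x, y, sigma1, sigma2, rho):
    """Bivariate normal density for deviations x, y from the means (|rho| < 1)."""
    q = (x / sigma1) ** 2 - 2 * rho * (x / sigma1) * (y / sigma2) + (y / sigma2) ** 2
    norm = 2 * math.pi * sigma1 * sigma2 * math.sqrt(1 - rho ** 2)
    return math.exp(-q / (2 * (1 - rho ** 2))) / norm

# With rho = 0 the surface should equal the product of the two one-variate densities.
x, y, s1, s2 = 1.3, -0.7, 2.0, 1.5   # hypothetical trial values
joint = correlation_surface(x, y, s1, s2, 0.0)
product = normal_density(x, s1) * normal_density(y, s2)
print(math.isclose(joint, product))
```

The function excludes the extreme cases of correlation +1 or -1, where the density is no longer defined because the two variates are in strict proportion.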
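If we pick out the cases in which one variate has an assigned value, we have what is termed an array; [p. 145] the columns and rows of the table may, except as regards variation within the group limits, be regarded as arrays. With normal correlation the variation within an array may be obtained from the general formula, by giving $x$ a constant value, (say) $a$, and dividing by the total frequency with which this value occurs; then we have
$$df = \frac{1}{\sigma_2\sqrt{2\pi}\,\sqrt{1-\rho^2}}\; e^{-\frac{1}{2(1-\rho^2)}\left(\frac{y}{\sigma_2}-\frac{\rho a}{\sigma_1}\right)^2}\,dy,$$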
showing (i.) that the variation of $y$ within the array is normal; (ii.) that the mean value of $y$ for that array is $\rho a\sigma_2/\sigma_1$, so that the regression of $y$ on $x$ is linear, with regression coefficient
$$\rho\,\frac{\sigma_2}{\sigma_1};$$
and (iii.) that the variance of $y$ within the array is $\sigma_2^2(1-\rho^2)$, and is the same within each array. We may express this by saying that of the total variance of $y$ the fraction $(1-\rho^2)$ is independent of $x$, while the remaining fraction, $\rho^2$, is determined by, or calculable from, the value of $x$.
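The mean and variance within an array can be checked by simulation. The sketch below is an illustration under stated assumptions: the population values of the standard deviations and the correlation are hypothetical, and the array is approximated by the cases whose $x$ falls within a narrow band about the assigned value $a$, much as a column of a grouped table would be.

```python
import math
import random
import statistics

SIGMA1, SIGMA2, RHO = 1.0, 2.0, 0.5   # hypothetical population values
A, HALF_WIDTH = 1.0, 0.1              # array: x within A +/- HALF_WIDTH

rng = random.Random(31)               # fixed seed so the run is repeatable
array_y = []
for _ in range(200_000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x = SIGMA1 * z1
    # y has standard deviation SIGMA2 and correlation RHO with x.
    y = SIGMA2 * (RHO * z1 + math.sqrt(1 - RHO ** 2) * z2)
    if abs(x - A) <= HALF_WIDTH:
        array_y.append(y)

expected_mean = RHO * A * SIGMA2 / SIGMA1        # rho * a * sigma2 / sigma1
expected_var = SIGMA2 ** 2 * (1 - RHO ** 2)      # sigma2^2 * (1 - rho^2)
print("array mean:    ", statistics.fmean(array_y), "expected:", expected_mean)
print("array variance:", statistics.pvariance(array_y), "expected:", expected_var)
```

Because the band has finite width, the simulated values agree with the formulae only approximately; narrowing the band reduces this grouping effect at the cost of fewer cases in the array.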
These relations are reciprocal, the regression of $x$ on $y$ is linear, with regression coefficient $\rho\sigma_1/\sigma_2$; the correlation $\rho$ is thus the geometric mean of the two regressions. The two regression lines representing the mean value of $x$ for given $y$, and the mean value of $y$ for given $x$, cannot coincide unless $\rho = \pm 1$. The variation of $x$ within an array in which $y$ is fixed, is normal with variance equal to $\sigma_1^2(1-\rho^2)$, so that we may say that of the variance of $x$ the fraction $(1-\rho^2)$ [p. 146] is independent of $y$, and the remaining fraction, $\rho^2$, is determined by, or calculable from, the value of $y$.
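The geometric-mean relation between the correlation and the two regressions also holds for the analogous quantities computed from a sample. The sketch below assumes this sample form of the relation; the six pairs of measurements are hypothetical and are not taken from Pearson and Lee's data.

```python
import math

# Hypothetical paired measurements, in inches.
xs = [64.0, 66.5, 67.0, 68.5, 70.0, 71.5]
ys = [61.5, 62.0, 64.5, 63.0, 65.5, 66.0]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
dx = [x - mean_x for x in xs]   # deviations of x from its mean
dy = [y - mean_y for y in ys]   # deviations of y from its mean

sxx = sum(d * d for d in dx)
syy = sum(d * d for d in dy)
sxy = sum(a * b for a, b in zip(dx, dy))

b_yx = sxy / sxx                 # sample regression of y on x
b_xy = sxy / syy                 # sample regression of x on y
r = sxy / math.sqrt(sxx * syy)   # sample correlation

# The geometric mean takes the common sign of the two regressions.
geometric_mean = math.copysign(math.sqrt(b_yx * b_xy), sxy)
print(r, geometric_mean)
```

The two printed values should agree to within rounding error, since the product of the two regression coefficients is the square of the correlation.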
Such are the formal mathematical consequences of normal correlation. Much biometric data certainly shows a general agreement with the features to be expected on this assumption; though I am not aware that the question has been subjected to any sufficiently critical enquiry. Approximate agreement is perhaps all that is needed to justify the use of the correlation as a quantity descriptive of the population; its efficacy in this respect is undoubted, and it is not improbable that in some cases it affords a complete description of the simultaneous variation of the variates.
31. The Statistical Estimation of the Correlation
Just as the mean and the standard deviation of a normal population in one variate may be most satisfactorily estimated from the first two moments of the observed distribution, so the only satisfactory estimate of the correlation, when the variates are normally correlated, is found from the "product moment." If $x$ and $y$ represent the deviations of the two variates from their means, we calculate the three statistics $s_1$, $s_2$, $r$ by the three equations
$$n s_1^2 = S(x^2), \qquad n s_2^2 = S(y^2), \qquad n r s_1 s_2 = S(xy);$$
then $s_1$ and $s_2$ are estimates of the standard deviations $\sigma_1$ and $\sigma_2$, and