<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The AstroStat Slog &#187; Efron</title>
	<atom:link href="http://groundtruth.info/AstroStat/slog/tag/efron/feed/" rel="self" type="application/rss+xml" />
	<link>http://groundtruth.info/AstroStat/slog</link>
	<description>Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders</description>
	<lastBuildDate>Fri, 09 Sep 2011 17:05:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>missing data</title>
		<link>http://groundtruth.info/AstroStat/slog/2008/missing-data/</link>
		<comments>http://groundtruth.info/AstroStat/slog/2008/missing-data/#comments</comments>
		<pubDate>Mon, 27 Oct 2008 13:24:22 +0000</pubDate>
		<dc:creator>hlee</dc:creator>
				<category><![CDATA[Astro]]></category>
		<category><![CDATA[Cross-Cultural]]></category>
		<category><![CDATA[Data Processing]]></category>
		<category><![CDATA[Stat]]></category>
		<category><![CDATA[bootstrap]]></category>
		<category><![CDATA[catalog]]></category>
		<category><![CDATA[Efron]]></category>
		<category><![CDATA[estimator]]></category>
		<category><![CDATA[ignorable]]></category>
		<category><![CDATA[imputation]]></category>
		<category><![CDATA[incompleteness]]></category>
		<category><![CDATA[Little]]></category>
		<category><![CDATA[MAR]]></category>
		<category><![CDATA[MCAR]]></category>
		<category><![CDATA[missing data]]></category>
		<category><![CDATA[nonparametric]]></category>
		<category><![CDATA[Rubin]]></category>
		<category><![CDATA[Schafer]]></category>
		<category><![CDATA[survey]]></category>

		<guid isPermaLink="false">http://groundtruth.info/AstroStat/slog/?p=359</guid>
		<description><![CDATA[The notion of missing data differs between the two communities, statisticians and astronomers. I tend to think that missing data carry as much information as observed data. Astronomers&#8230; I&#8217;m not sure how they think, but my impression so far is that when a value is missing in one attribute/variable of an object/observation/informant, all the other attributes of that object [...]]]></description>
			<content:encoded><![CDATA[<p>The notion of <b>missing data</b> differs between the two communities, statisticians and astronomers. I tend to think that missing data carry as much information as observed data. Astronomers&#8230; I&#8217;m not sure how they think, but my impression so far is that when a value is missing in one attribute/variable of an object/observation/informant, all the other attributes of that object become useless, because the object is dropped from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of <b>imputation</b> in astronomical publications, or any statistical justification of how missing data are handled with respect to inference strategies. Instead, astronomers talk about <b>incompleteness</b> within different variables. To make this vague argument concrete, consider a catalog of multiple magnitudes. To draw a color-magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color-magnitude diagram, and any inference based on that diagram will not include that star. Nonetheless, one will still try to understand how the proportions of observed stars vary with color and magnitude. <span id="more-359"></span></p>
<p>I guess this cultural difference originates from the quality of data. The data sets that statisticians typically handle are small enough that a child can count the data points. Astronomical data sets are so large that only rounded numbers of stars in a catalog are discussed, and dropping some cases with missing data won&#8217;t affect the final results. </p>
<p>Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs because of observational challenges in a survey. Such data with missing values can be put through statistically rigorous data analysis instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.</p>
<p>In statistics, using the information in cases with missing data adds information about the quantity that the inference method tries to recover. Even if the error bars are larger, it&#8217;s better to have them than nothing. My question is <u>what are statistical proposals for astronomers to handle missing data?</u> I would like to find such a list; in its place, I give a few somewhat nontechnical papers that explain the types of missing data defined below, along with a few statistics books/articles that statisticians often cite.</p>
<ul>
<li><em>Data mining and the impact of missing data</em> by M.L. Brown and J.F.Kros, Industrial Management and Data Systems (2003)  Vol. 103, No. 8, pp.611-621
</li>
<li><em>Missing Data: Our View of the State of the Art</em> by J.L.Schafer and J.W.Graham, Psychological Methods (2002) Vol.7, No. 2, pp. 147-177
</li>
<li><em>Missing Data, Imputation, and the Bootstrap</em> by B. Efron, JASA (1994) Vol. 89, No. 426, p. 463- and D.B. Rubin&#8217;s comment</li>
<li><a href="http://www.stat.psu.edu/~jls/mifaq.html">The multiple imputation FAQ page (web)</a> by J. Schafer </li>
<li><em>Statistical Analysis with Missing Data</em> by R.J.A. Little and D.B.Rubin (2002) 2nd ed. New York: Wiley.</li>
<li><a href="http://www.secondmoment.org/articles/missingdata.php">The Curse of the Missing Data (web)</a> by Yong Kim
</li>
<li><em>A Review of Methods for Missing Data</em> by T.D.Pigott, Edu. Res. Eval. (2001) 7(4),pp.353-383 (survey of missing data analysis strategies and illustration with &#8220;asthma data&#8221;)</li>
</ul>
<p>Pigott discusses missing data methods for a general audience in plain terms under the following categories: <i>complete-cases, available-cases, single-value imputation,</i> and the more recent <i>model-based methods, maximum likelihood for multivariate normal data,</i> and <i>multiple imputation.</i> Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).  </p>
<p>Most introductory articles begin with common assumptions like <b>missing at random (MAR)</b> or <b>missing completely at random (MCAR)</b>, but these do not seem to apply to typical astronomical data sets (I don&#8217;t know exactly why yet &#8211; I cannot provide counterexamples as proof &#8211; but that&#8217;s what I have observed and been told). Currently, I would like to find ways to link statistical thinking about missing data and modeling to astronomical data with missing values, by discovering what their missingness properties have in common. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions:</p>
<ul>
<li> <i>data missing at random</i> : missing for reasons related to completely observed variables in the data set</li>
<li> <i>data missing completely at random</i> : the complete cases are a random sample of the originally identified set of cases </li>
<li> <i>non-ignorable missing data</i> : the reasons for the missing observations depend on the values of those variables.</li>
<li> <i>outliers treated as missing data</i></li>
<li> <i>the assumption of an ignorable response mechanism.</i>
</li>
</ul>
<p>Statistical research on missing data is traditionally conducted in a setting where the complete data are available, and the goal is to characterize a missing data analysis method by comparing results from the complete data with results obtained after dropping observations of the variables of interest. Simulations make it possible to emulate these different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data. In real applications, such a step is not necessary, but for the statistical/theoretical validation and acceptance of new missing data analysis methods, the comparison between results from complete data and from data with missing values is unavoidable. </p>
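<p>To make the simulation idea concrete, here is a minimal sketch in Python with NumPy. It is not taken from any of the papers above: the linear model, the logistic missingness probabilities, and all variable names are assumptions chosen for illustration. It starts from complete simulated data, deletes values of y under three mechanisms, and compares a complete-case mean with a regression-based estimate that uses the fully observed x.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Complete data: x is always observed; y is the variable that can go missing.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # the true mean of y is 0


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


# Probability that y is missing under three (assumed) mechanisms.
mechanisms = {
    "MCAR": np.full(n, 0.3),     # unrelated to any variable
    "MAR": logistic(2.0 * x),    # depends only on the fully observed x
    "MNAR": logistic(2.0 * y),   # depends on the value that goes missing
}

results = {}
for name, p_missing in mechanisms.items():
    observed = rng.random(n) >= p_missing
    # Complete-case estimate: drop every case with a missing y.
    cc_mean = y[observed].mean()
    # Regression estimate: fit y on x with complete cases, predict for all x.
    slope, intercept = np.polyfit(x[observed], y[observed], 1)
    reg_mean = (intercept + slope * x).mean()
    results[name] = (cc_mean, reg_mean)
    print(f"{name}: complete-case {cc_mean:+.3f}, regression {reg_mean:+.3f}")
```

<p>In this toy setting the complete-case mean stays near the true value only under MCAR, the regression on x recovers it under MAR, and neither estimate does under the non-ignorable mechanism.</p>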
<p>Contrary to my belief that statistical analysis with missing data applies universally, it seems that, so far, only regression-type strategies can cope with missing data, despite the diverse categories of missing data. Often in multivariate data analysis in astronomy, the relationship between response variables and predictors is not clear. More frequently, there are no responses, and the joint distribution of the given variables is of more interest. Without knowing the data-generating distribution/model, analyzing arbitrarily built models with missing data, for imputation and for estimation, seems biased. This gap in handling different data types is the motivation for introducing statistical missing data analysis to astronomers, even though statistical strategies for handling missing data may look very limited. I believe, however, that some &#8220;new&#8221; concepts from missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, which is favored by astronomers, many of whom apply principal component analysis (PCA) nowadays. Understanding the conditions for the multivariate normal distribution and missing data more rigorously would lead astronomers to project their data analysis onto the regression analysis space, since numerous survey projects, in addition to the emergence of new catalogs, pose questions about relationships among observed variables or estimated parameters. The broad area of regression analysis embraces missing data in various ways. Likewise, vast astronomical surveys and catalogs need to move forward in adopting proper data analysis tools that include missing data: the scientific objective of surveys is to find relationships among variables empirically rather than from laws of physics, and missing data are not ignorable. I think that tactics from missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.</p>
<p>Statisticians and other scientists who use statistics might name the strategies of missing data analysis slightly differently; my way of listing the strategies described in the texts above is as follows:</p>
<ul>
<li>complete case analysis (caveat: relatively few cases may be left for the analysis and MCAR is assumed),
</li>
<li>available case analysis (pairwise deletion, or deleting selected variables; caveat: correlations in variable pairs),
</li>
<li>single-value imputation (typically the mean value is imputed, which biases results and underestimates the variance; not recommended),
</li>
<li>maximum likelihood, and
</li>
<li>multiple imputation (the last two rest on two assumptions: multivariate normality and an ignorable missing data mechanism)
</li>
</ul>
<p>and the following are imputation strategies:</p>
<ul>
<li>mean substitution,</li>
<li>case substitution (scientific knowledge authorizes substitution),</li>
<li>hot deck imputation (values drawn from the next most similar case, though it is difficult to define what is &#8220;similar&#8221;),</li>
<li>cold deck imputation (values imputed from external sources),</li>
<li>regression imputation (prediction from independent variables; mean imputation is a special case), and </li>
<li> multiple imputation</li>
</ul>
<p>Some might prefer the following listing (adapted from Gelman and Hill&#8217;s regression analysis book):</p>
<ul>
<li>simple missing data approaches that retain all the data</li>
<ol>
<li>-mean imputation</li>
<li>-last value carried forward</li>
<li>-using information from related observations</li>
<li>-indicator variables for missingness of categorical predictors</li>
<li>-indicator variables for missingness of continuous predictors</li>
<li>-imputation based on logical rules</li>
</ol>
<li>random imputation of a single variable</li>
<li>imputation of several missing variables</li>
<li>model based imputation</li>
<li>combining inferences from multiple imputations</li>
</ul>
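<p>The last item, combining inferences from several imputed data sets, is commonly done with the combining rules attributed to Rubin: average the point estimates, and add the average within-imputation variance to an inflated between-imputation variance. The sketch below assumes Python with NumPy and assumes that each imputed data set has already been analyzed separately; the function name <code>pool_rubin</code> and the numbers are made up for illustration.</p>

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()          # pooled point estimate
    within = variances.mean()          # average within-imputation variance
    between = estimates.var(ddof=1)    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical results from m = 5 imputed data sets (made-up numbers).
est = [10.2, 9.8, 10.5, 10.0, 9.9]
var = [0.40, 0.38, 0.42, 0.41, 0.39]
pooled, total = pool_rubin(est, var)
print(f"pooled estimate {pooled:.2f}, total variance {total:.4f}")
```

<p>The between-imputation term is what single-value imputation ignores: it carries the extra uncertainty that comes from not knowing the missing values.</p>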
<p><!-- nonparametric fashion to impute missing data? the methodology shall data type dependent,  though. --></p>
<p>Statistical missing data analysis states its assumptions explicitly, in contrast to subjective data processing aimed at producing a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs: missing data, including outliers, reside in the catalogs, but after a subjective data cleaning step they do not appear in the plots. Statistics, on the other hand, explicitly spells out the assumptions and conditions on missing data. However, I don&#8217;t know what is proper or correct from a scientific viewpoint. No such explicit account exists, and judgments about the assumptions on missing data and how to process them are left to astronomers. Moreover, astronomers have advantages, such as knowledge of physics, for imputing data more suitably and subtly.</p>
<p><b>Schafer</b> and <b>Graham</b> state that, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest, not to estimate, predict, or recover missing observations, nor to obtain the same results that we would have seen with complete data. </p>
<p>The following quote from the above web link (<b>Y. Kim</b>) says more.</p>
<blockquote><p>
Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine. </p></blockquote>
<p>Missing entries in astronomical catalogs are unpreventable; therefore, one needs statistically improved strategies more than ever, because as the volume of surveys and catalogs increases, the amount of missing data grows proportionally. Or the current method of using complete data (getting rid of all observations with at least one missing entry) could be the only way to go. There is more room left to discuss strategies case by case, which will come in a future post. This one is already too long.</p>
]]></content:encoded>
			<wfw:commentRss>http://groundtruth.info/AstroStat/slog/2008/missing-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Survival Analysis: A Primer</title>
		<link>http://groundtruth.info/AstroStat/slog/2008/survival-analysis-a-primer/</link>
		<comments>http://groundtruth.info/AstroStat/slog/2008/survival-analysis-a-primer/#comments</comments>
		<pubDate>Tue, 08 Jul 2008 23:27:38 +0000</pubDate>
		<dc:creator>hlee</dc:creator>
				<category><![CDATA[Fitting]]></category>
		<category><![CDATA[Stat]]></category>
		<category><![CDATA[arXiv]]></category>
		<category><![CDATA[censored]]></category>
		<category><![CDATA[Efron]]></category>
		<category><![CDATA[Feigelson]]></category>
		<category><![CDATA[Freedman]]></category>
		<category><![CDATA[massive data]]></category>
		<category><![CDATA[Nelson]]></category>
		<category><![CDATA[Petrosian]]></category>
		<category><![CDATA[survival analysis]]></category>
		<category><![CDATA[truncated]]></category>

		<guid isPermaLink="false">http://groundtruth.info/AstroStat/slog/?p=340</guid>
		<description><![CDATA[Astronomers are confronted with various kinds of censored and truncated data. Often these types of data are named after the famous scientists who generalized them, as in the Eddington bias. When censored or truncated data become the subject of study in statistics, instead of naming them, statisticians try to model them so that the uncertainty can be quantified. This area [...]]]></description>
			<content:encoded><![CDATA[<p>Astronomers are confronted with various kinds of censored and truncated data. Often these types of data are named after the famous scientists who generalized them, as in the Eddington bias. When censored or truncated data become the subject of study in statistics, instead of naming them, statisticians try to model them so that the uncertainty can be quantified. This area is called <strong>survival analysis</strong>. If your library has a subscription to <i>The American Statistician</i> and you are an astronomer who handles censored or truncated data sets, this primer will be useful for briefly conceptualizing the statistical jargon of survival analysis and for characterizing the uncertainties residing in your data. <span id="more-340"></span></p>
<blockquote><p>
<strong> Survival Analysis: A Primer</strong> by David A. Freedman<br />
The American Statistician, May 2008, Vol. 62, No.2, pp. 110-119
</p></blockquote>
<p>This article explains the basics of survival analysis and adds criticism of previously conducted studies. Since the examples are from medical studies, astronomers may not be interested in reading the whole article. Nonetheless, Freedman defines the concepts of survival analysis, such as the survival function, the hazard rate, the Kaplan-Meier estimator, and the proportional hazards model, with clarity and conciseness. For example, if &#964; (a positive random variable indicating the waiting time to failure) is Weibull, the hazard rate takes exactly the form of the power law celebrated in astronomy (I think that modifying pdfs to reflect censoring and truncation may lead to more robust results than fitting power laws, unless the parameters of the power laws have astrophysical implications and survival analysis approaches cannot provide the same parametrization). </p>
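<p>The Weibull remark can be verified in one line. The derivation below assumes the parametrization in which the survival function is exp(-a t^b) with a, b positive; the symbols a and b are chosen here for illustration, and other texts use a scale/shape form that differs only by a change of constants.</p>

```latex
S(t) = P(\tau > t) = \exp(-a t^{b}), \qquad
f(t) = -\frac{dS}{dt} = a b\, t^{b-1} \exp(-a t^{b}), \qquad
h(t) = \frac{f(t)}{S(t)} = a b\, t^{b-1}.
```

<p>The hazard rate is therefore a pure power law in t with index b-1, and it reduces to a constant (the exponential distribution) when b = 1.</p>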
<p>The commonality between power laws and Pareto distributions, together with the frequent appearance of power laws in astronomical journals, would lead one to expect frequent applications of survival analysis to astronomical data; on the contrary, there are not many. </p>
<p>Though there are more, here are a few references relevant to survival analysis that use examples from astronomy or appeared in astronomical journals:</p>
<ul>
<li><b>Nonparametric Methods for Doubly Truncated Data</b> by B Efron and V Petrosian. (subscription required)<br />
<i> Journal of the American Statistical Association,</i> Vol. 94, pp. 824-834 (1999)
</li>
<li><b>Survival Analysis of the Gamma-Ray Burst Data</b> by B Efron and V Petrosian. (subscription required)<br />
<i> Journal of the American Statistical Association,</i> Vol. 89, pp. 452-462 (1994)
</li>
<li><a href="http://adsabs.harvard.edu/abs/1992ApJ...399..345E">A simple test of independence for truncated data with applications to redshift surveys</a> by B Efron and V Petrosian<br />
<i>ApJ,</i> Vol. 399, pp.345-352 (1992)
</li>
<li><a href="http://adsabs.harvard.edu/abs/1985ApJ...293..192F">Statistical methods for astronomical data with upper limits. I &#8211; Univariate distributions</a> by Feigelson and Nelson<br />
<i>ApJ,</i> Vol. 293, pp.192-206 (1985)</li>
<li><b>Nonparametric Estimation of the Slope of a Truncated Regression</b> by Bhattacharya, Chernoff, and Yang (subscription required)<br />
The Annals of Statistics, Vol. 11(2), pp. 505-514 (1983)
</li>
</ul>
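<p>As a concrete illustration of the Kaplan-Meier estimator mentioned earlier, here is a minimal sketch for right-censored data, assuming Python with NumPy. It is a textbook product-limit computation, not code from any of the papers listed; the function name <code>kaplan_meier</code> and the toy sample are made up. Astronomical upper limits are left-censored, so in practice the variable would first have to be transformed (for example by flipping its sign) before an estimator of this form applies.</p>

```python
import numpy as np


def kaplan_meier(times, observed):
    """Return distinct event times and the Kaplan-Meier survival estimate.

    times    : measured value for each object (event time or censoring time)
    observed : True if the event was seen, False if right-censored
    """
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(times[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)            # still under observation at t
        events = np.sum(times[observed] == t)   # events occurring exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Toy sample: six objects, two of them censored (made-up numbers).
t = [2, 3, 3, 5, 8, 9]
seen = [True, True, False, True, False, True]
event_times, surv = kaplan_meier(t, seen)
for time, value in zip(event_times, surv):
    print(f"S({time:g}) = {value:.3f}")
```

<p>Censored objects never trigger a drop in the curve, but they stay in the risk set until their censoring time, which is how the estimator uses their partial information instead of discarding them.</p>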
<p>Note that these papers address only particular statistical questions, with a general introduction to survival analysis and definitions of estimators, based on relatively small data sets. Facing massive survey data with truncation and heterogeneous measurement errors, astronomy could open a new era of survival analysis. </p>
<p>Lastly, there are studies of the Pareto distribution, some of which are presented in the slog (use &#8220;search&#8221; with Pareto). More statistical papers on survival analysis in astronomy are welcome additions; please let me know.</p>
<p><!--The mathematics does say something about statistical practice. At least in the setting of Example 5, and contrary to general opinion, the model does not use time-to-event data. It uses only the ranks: which subject failed first, which failed second, and so forth. That, indeed, is what enables the fitting procedure to get around problems created by the intractable likelihood function.--></p>
]]></content:encoded>
			<wfw:commentRss>http://groundtruth.info/AstroStat/slog/2008/survival-analysis-a-primer/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>[Quote] Bootstrap and MCMC</title>
		<link>http://groundtruth.info/AstroStat/slog/2007/quote-bootstrap-vs-mcmc/</link>
		<comments>http://groundtruth.info/AstroStat/slog/2007/quote-bootstrap-vs-mcmc/#comments</comments>
		<pubDate>Tue, 01 Jan 2008 00:48:59 +0000</pubDate>
		<dc:creator>hlee</dc:creator>
				<category><![CDATA[Quotes]]></category>
		<category><![CDATA[arXiv]]></category>
		<category><![CDATA[Bayesian]]></category>
		<category><![CDATA[bootstrap]]></category>
		<category><![CDATA[Efron]]></category>
		<category><![CDATA[Frequentist]]></category>
		<category><![CDATA[MCMC]]></category>

		<guid isPermaLink="false">http://groundtruth.info/AstroStat/slog/2007/quote-bootstrap-vs-mcmc/</guid>
		<description><![CDATA[ The Bootstrap and Modern Statistics  Brad Efron (2000), JASA Vol. 95 (452), p. 1293-1296. 
 If the bootstrap is an automatic processor for frequentist inference, then MCMC is its Bayesian counterpart.


Sometime in my second year of studying statistics, I said that bootstrap and MCMC are equivalent but reflect different streams in statistics. The [...]]]></description>
			<content:encoded><![CDATA[<p><strong> The Bootstrap and Modern Statistics </strong> Brad Efron (2000), JASA Vol. 95 (452), p. 1293-1296. </p>
<blockquote><p> If the bootstrap is an automatic processor for frequentist inference, then MCMC is its Bayesian counterpart.
</p></blockquote>
<p><span id="more-209"></span><br />
Sometime in my second year of studying statistics, I said that bootstrap and MCMC are equivalent but reflect different streams in statistics. The response to this comment was <i> &#8216;that&#8217;s nonsense.&#8217; </i>  Although I have forgotten the details of the circumstances, I was hurt and didn&#8217;t try to prove my point. Years later, the occasion immediately came to the surface when I saw this sentence. </p>
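<p>For readers who have not seen the bootstrap, the &#8220;automatic processor&#8221; in the quote amounts to very little code. The following minimal sketch, assuming Python with NumPy, estimates a standard error and a percentile interval for a sample median; the simulated exponential sample and the choice of 2000 resamples are arbitrary illustrations, and only the frequentist half of the comparison is shown.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=200)   # stand-in for observed data

# Resample with replacement and recompute the statistic each time.
n_boot = 2000
medians = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    medians[b] = np.median(resample)

se = medians.std(ddof=1)                        # bootstrap standard error
lo, hi = np.percentile(medians, [2.5, 97.5])    # percentile interval
print(f"median {np.median(sample):.3f}, se {se:.3f}, "
      f"95% interval [{lo:.3f}, {hi:.3f}]")
```

<p>No formula for the sampling distribution of the median is needed; the resampling loop replaces it.</p>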
]]></content:encoded>
			<wfw:commentRss>http://groundtruth.info/AstroStat/slog/2007/quote-bootstrap-vs-mcmc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

