A Tutorial in Data Science: Lecture 1 – The Foundations of Data Science

by Justin Petrillo | Jan 12, 2021 | Math Lecture

Data Science is the study of the datum, which is the Being of a being in its there-being. The datum of a being is its having been measured, and thus observed in one of its states. The science of data is thus how to interrogate data so as to reveal the being that has been measured, and thus the whole from the part that was revealed. The datum expresses the quantification of a quality of an object, whether that quality is as basic as mere existence or concerns its relations to other objects. “What is” communicates itself through functional channels of measurement “as what it is in how it appears to be.” The datum is the informatic particulate, whose interpretation reveals the informatic waveform continuum of its being. Thus, Data Science concerns itself with the question of the discretized datum of Being, and thus also with its dual, the interpretive continuum of Being. While only a finitude of nature may be measured, the continuum of its being is the question of understanding. The continuum is the hermeneutical or interpretative understanding of being, while discreteness is the articulation of being; it must first be before it can make itself known by articulation.

The Scientific Process: Question-Asking & Temporality

The scientific process is the temporality of question-asking, and as such the basis for temporality, which ontologically answers the epistemological questions in terms of genesis, the ‘coming forth from’ that underlies notions of causality.

The business of science is the procedure of knowledge accumulation. It begins with a qualitative hermeneutic experience of a phenomenon from being-in-the-world, which is articulated into its parts through the process of rationalization, and it ends with the quantitative formalization of claim-making, whereby empirical quantities are measured and the confidence in the validity of propositional claims is also measured. Statistics is this last, secondary act of measurement: the method of determining the validity of a hypothetical propositional claim about empirical nature. A claim that is not particularly valid will likely be true only some of the time, or only under specific and uncommon conditions. Ultimately, thus, within a domain of consideration, statistics answers the question of the universality of the claims made about nature through empirical methods of observation. It may be that two opposing claims are both true in the sense that each is true half the time under random observation, or within half the space of contextual conditionalities. The scientific process, as progress, relies on methods that, over a linear time of repeated experimental cycles, increase the validity of the claims as the knowledge of nature approaches universality, itself always merely a horizon within the phenomenology of empiricism. This progressive scientific process is called ‘discovery,’ or merely research, although in practice it is highly non-linear.

The scientific process is a branching process, as the truth of a claim is found to be dependent upon its conditions, and those conditions are found dependent on further conditionals. This structure of rationality is that of a tree. In reality, this branching process is open-ended, yet in the imaginary ideal of human studies it is finite, with the elemental leaves of the rationality tree presumed to be common sense. A single claim (C) has a relative validity (V) due to the truth of an underlying, or conditioning, claim $C_i$, given as $V_{C_i}(C)=V(C,C_i)$. We may understand the validity of claims through probability theory, in that the relative validity of a claim based on a conditioning claim is the probability that the claim is true conditioned on $C_i$: $V(C,C_i)=P(C|C_i)$. In general, we will refer to the object under investigation, about which C is a claim, as the primary variable X, and to the subject performing the investigation, by whom $C_i$ is hypothesized (as a cognitive action), as the secondary variable Y. Thus, the orientation of observation, i.e. the time-arrow, is given as $\sigma: Y \rightarrow X$.

The question of inference is thus how to evaluate $P(C|C_i)$. Writing our assumptions as $A=C_i$ and the hypothesis as $H=C$, we seek the probability of validity of the hypothesis, $P(H|A)$.
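As a minimal sketch of this quantity, the relative validity $P(C|C_i)$ can be estimated as a relative frequency over recorded observations. The function name `conditional_validity` and the sample `records` are hypothetical, introduced only for illustration; the frequency estimate is one simple reading of the conditional probability, not the only one.

```python
from typing import Iterable, Tuple


def conditional_validity(observations: Iterable[Tuple[bool, bool]]) -> float:
    """Estimate V(C, C_i) = P(C | C_i) as a relative frequency.

    Each observation is a pair (c_i_holds, c_holds). Only observations in
    which the conditioning claim C_i holds count toward the estimate.
    """
    conditioned = [c for c_i, c in observations if c_i]
    if not conditioned:
        raise ValueError("C_i never holds, so P(C | C_i) is undefined")
    # True counts as 1 and False as 0, so the sum is the number of cases
    # in which C holds given that C_i holds.
    return sum(conditioned) / len(conditioned)


# Hypothetical records of the form (C_i holds, C holds).
records = [(True, True), (True, False), (True, True), (False, True), (True, True)]
print(conditional_validity(records))  # C holds in 3 of the 4 cases where C_i holds
```

Under this reading, two opposing claims that are each "true half the time" would both return 0.5 on the same conditioning claim.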

An observer (Y) makes an observation, from a particular position, of an event (X) with its own place, forming a space-time of the action of measurement. An observation-as-information is a complex quantum-bit, which within a space of investigation is a complex variable, representing a tree of observation-conditioning rationality resulting from the branching process of hypothesis formation, with each node a conditional hypothesis and each edge length the conditional probability. The gravitation of the system of measurement is the space-time tensor of its world-manifold, stable or chaotic over the time of interaction. We thus understand the positions of observers within a place of investigation, itself given, at least in its real-part component, by the object of investigation.

The Experimental Set-up: Breaking the Flow of Nature

Nature is explained by a parameterized model. Each parameter, as a functional aggregation of measurement samples, has itself a corresponding distribution as it occurs in nature along the infinite, universal horizon of measurement.

Let $X^n$ be a random variable representing the n qualities that can be measured for the thing under investigation, $\Omega$, itself the collected gathering of all its possible appearances $\omega \in \Omega$, such that $X^n:\Omega \rightarrow \mathbb{R}^n$. Each sampled measurement of $X^n$ through an interaction with $\omega$ is given as $\hat{X}^n(t_i)$, each one constituting a unit of indexable time in the catalogable measurement process. Thus, the set of sampled measurements, a ‘sample space,’ is indexed by a partition $\pi$ of ‘internally orderable’ test times within the measurement action: $\{ \hat{X}^n(t): t \in \pi \}$.

$\Omega$ is a state-system, i.e. the spatio-temporality of the thing in question, in that it has specific space-states $\omega$ at different times, $\Omega(t)=\omega$. $X$ is the function that measures $\omega$. What if measurement is not Real, but Complex: $X: \Omega \rightarrow \mathbb{C}$? While a real number results from a finite, approximate, or open-ended process of objective empirical measurement, an imaginary number results from a subjective intuition or presupposition to measurement. Every interaction with $\Omega$ lets it appear as $\omega$, which is quantified by $X$. From these interactions, we seek to establish truths about $\Omega$ by quantifying the probability that the Claim $C$ is correct, which is itself a quantifiable statement about $\Omega$.

In this set-up of statistical sampling, one will notice that the step-wise process-timing of a single actor performing n sequential measurements can be represented in the same way as n indexed actors performing simultaneous measurements, at least with regard to internal time accounting. In order to adopt the latter interpretational context, so as to preserve the common-sense notion of time as distinct from social space, one would represent all n simultaneous measurements as n dimensions of X. These are assumed to be generally the same in quality, in that all n actors sample the same object in the same way, yet distinct in some orderable indexical quality. Thus, in each turn of the round time (i.e. one unit), all actors perform independent and similar measurements. It may be, as in progressive action processes, that future actions are dependent on previous ones, and thus independence is only found within the sample space of a single time round. Alternatively, it may also be that the actors perform different actions, or are dependent upon each other in their interactions. Thus, the notion of actor(s) may be embedded in the space-time of the action of measurement. The embedding of a coordinated plurality of actors, in the most mundane sense of ‘collective progress,’ can be represented as follows: the group action of all independent and similar measurers completes itself in each round of time, with each round of the measurement process being similar to, but dependent on, the previous one. The progressive interaction may be represented as the inducer $I^+:X(t_i) \rightarrow X(t_{i+1})$, with the assumptions of similarity and independence as $\hat{x}_i(t) \sim \hat{x}_j(t) \ \& \ I(\hat{x}_i(t),\hat{x}_j(t))=0$ for $i \neq j$. We take $\hat{X}(t)$ to be a group of measurement actors/actions $\{ \hat{x}_i(t): i \in \pi \}$ that acts on $\Omega$ together, or simultaneously, to produce a singular measurement of one round time.
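The round structure described above can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the names `measure_round` and `run_rounds`, the Gaussian error model for "independent and similar" actors, the use of the mean as the singular measurement of a round, and the choice of a shrinking error scale as a stand-in for the inducer $I^+$.

```python
import random
from statistics import mean
from typing import List, Tuple


def measure_round(omega: float, n_actors: int, noise: float,
                  rng: random.Random) -> List[float]:
    """One round: n actors sample the same state omega in the same way.

    Independence and similarity are modelled as i.i.d. Gaussian errors
    (an assumption made for illustration).
    """
    return [omega + rng.gauss(0.0, noise) for _ in range(n_actors)]


def run_rounds(omega: float, n_actors: int, n_rounds: int, noise: float,
               decay: float = 0.5, seed: int = 0) -> List[Tuple[List[float], float]]:
    """Return (individual measurements, singular round measurement) per round.

    The hypothetical inducer I^+ is the rule noise -> decay * noise: each
    round is similar to the last but depends on it through the refined error.
    """
    rng = random.Random(seed)
    rounds = []
    for _ in range(n_rounds):
        xs = measure_round(omega, n_actors, noise, rng)
        rounds.append((xs, mean(xs)))  # aggregate the group into one value
        noise *= decay                 # I^+: carry the process to the next round
    return rounds


for t, (xs, x_hat) in enumerate(run_rounds(omega=10.0, n_actors=5, n_rounds=3, noise=1.0)):
    print(t, len(xs), round(x_hat, 3))
```

Within a round the measurements are independent of one another; across rounds they are linked only through the inducer, which matches the claim that independence holds only within the sample space of a single time round.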

Deriving The Distribution of Normalcy: The Unquestioned State of Nature

The question with measurement is not “what is the true distribution of the object in question in nature?”, for this uncountable nature cannot be known within the countable horizon of empirical science, but rather “what is the distribution of the parameter I am using to measure?”, and thus the mimetic relationship between subject and object in the activity of measurement. The underlying metric of the quality under investigation, itself arising due to an interaction of measurement as the distance function within the investigatory space-time, is $\mu$. As the central limit theorem states, averages of such measurements, each carrying an independent error of finite variance, will converge to normalcy. The reflective view on this convergence process as time-conditioning is backwards, in that all measurements as answers to the question come from this primitive state of normalcy, before distinction has been made by inquiry. We can describe analytically the distribution of that which has not been questioned, and is thus ‘atemporal,’ through the averaged measurements: the rate of change of the frequency $f$ of a particular sample measurement $x, x_0 \in X$, with respect to change along this line-space of measurement values, is negatively proportional, by the constant $k$, to the product of the distance from the true measurement ($\mu$) and the frequency:

$\frac{df}{dx}=-k(x-\mu)f(x)$

or in $\epsilon-\delta$ notation:

$\forall \epsilon > 0, \ \exists \delta(\epsilon)>0 \ s.t. \ \forall x, \ 0<|x_0-x|<\delta \rightarrow \bigg| \frac{f(x_0)-f(x)}{x_0-x} + k(x_0-\mu)f(x_0) \bigg|<\epsilon$
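The differential relation $\frac{df}{dx}=-k(x-\mu)f(x)$ can be checked numerically against a candidate solution. The sketch below assumes the Gaussian form $f(x)=\sqrt{k/2\pi}\,e^{-k(x-\mu)^2/2}$, i.e. a normal density with variance $1/k$, and tests whether the difference quotient at $x_0$ plus $k(x_0-\mu)f(x_0)$ is close to zero. The function names and the values of $\mu$, $k$, and $\delta$ are illustrative choices, not taken from the text.

```python
import math


def gaussian(x: float, mu: float, k: float) -> float:
    """Candidate solution f(x) = sqrt(k / (2*pi)) * exp(-k * (x - mu)**2 / 2)."""
    return math.sqrt(k / (2.0 * math.pi)) * math.exp(-0.5 * k * (x - mu) ** 2)


def ode_residual(x0: float, mu: float, k: float, delta: float = 1e-6) -> float:
    """Difference quotient at x0 plus k*(x0 - mu)*f(x0).

    The result should be near zero if f satisfies df/dx = -k (x - mu) f(x).
    """
    x = x0 - delta  # a nearby point with 0 < |x0 - x| <= delta
    quotient = (gaussian(x0, mu, k) - gaussian(x, mu, k)) / (x0 - x)
    return quotient + k * (x0 - mu) * gaussian(x0, mu, k)


mu, k = 2.0, 4.0  # illustrative values; k plays the role of 1 / sigma**2
# Largest residual on a grid of points within two units of mu.
worst = max(abs(ode_residual(mu + 0.1 * i, mu, k)) for i in range(-20, 21))
print(worst < 1e-4)
```

Shrinking `delta` makes the residual smaller, down to the limits of floating-point precision, which is the numerical counterpart of choosing $\delta(\epsilon)$ for a given $\epsilon$.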