Data collection and data inference

Data collection and data inference on the SAT test

SAT Subscore: Problem solving and data analysis

Data collection

Data collection is the process of gathering and measuring information on variables of interest so that the researcher can test hypotheses and evaluate outcomes.

Data can be collected with a sample or with a controlled experiment:

A sample is a small group selected from a large population by using a predefined sampling method. The sample must be representative and random.

A controlled experiment is an experiment performed on an experimental group, in which the researchers change the one factor being tested and hold all other factors constant.
Each controlled experiment must have a control group, in which the factor being tested in the experimental group is left unchanged. The participants of the control group must be randomly selected and must closely resemble the participants in the experimental group.

Data inference

Data inference is a generalization about a population that is based on statistics calculated from a small group (a sample) that is drawn from that population.

An estimate is an approximate value for a population, close enough to the true value, that is calculated from a sample taken from part of that population.
A sample proportion is a value calculated from the sample that we assume reflects the whole population.
The estimate formula: estimate = sample proportion * population

A margin of error is the degree of error in results obtained from random sampling surveys. It exists because the sample does not exactly match the population.
The range formula: range = estimate ± margin of error

Continue reading this page for detailed explanations and examples.

Data collection

Data collection is the process of gathering and measuring information on variables of interest so that the researcher can test hypotheses and evaluate outcomes. Data can be collected with a sample or with a controlled experiment.

Data collection with a sample

A sample is a small group selected from a large population by using a predefined sampling method. Samples are used when the population is too large for the study to include every member. The three most common types of sample surveys are e-mail surveys, telephone surveys, and interview surveys.

Sample characteristics:
The sample must be representative and random.
A representative sample is a sample that accurately reflects the examined characteristic of the whole population. A sample that includes members who don’t belong to the population is not representative.
A random sample is a sample chosen purely by chance, so that every member of the population has an equal chance of being selected. A sample that overrepresents or underrepresents a subgroup is not random.
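The idea of giving every member an equal chance can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample`, the labels in `members`, and the seed value are assumptions made for the example, not part of any SAT material.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members; each member is equally likely to be picked."""
    rng = random.Random(seed)  # a fixed seed only makes the illustration repeatable
    return rng.sample(population, sample_size)

# Hypothetical population of 80 members, labelled member_1 .. member_80.
members = [f"member_{i}" for i in range(1, 81)]
sample = simple_random_sample(members, 20, seed=1)

print(len(sample))       # number of members drawn
print(len(set(sample)))  # same number, because no member is drawn twice
```

Because the draw is made from the whole list, no subgroup is favored by design. Drawing only from part of the list (for example, only the first 40 members) would break that property.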

Consider the following example:

The school principal wants to estimate how many parents of a group of 80 pupils want their children to participate in an activity. Five sampling methods were suggested to her. Are the methods appropriate?

Method 1: Surveying the parents of 20 randomly chosen girls.
Method 2: Surveying 20 randomly chosen pupils.
Method 3: Surveying 20 randomly chosen parents of students on the student council.
Method 4: Sending the survey by SMS to 20 randomly chosen parents of students in the school.
Method 5: Surveying 3 randomly chosen parents of the students.

Method 1: Surveying the parents of 20 randomly chosen girls. This surveying method is bad because the sample is not random. It overrepresents the subgroup of parents of girls and underrepresents the subgroup of parents of boys.

Method 2: Surveying 20 randomly chosen pupils. This surveying method is bad because the sample is not representative. It includes pupils, who are not part of the population, since the population is the parents.

Method 3: Surveying 20 randomly chosen parents of students on the student council. This surveying method is bad because the sample is not random. It overrepresents the subgroup of parents of students on the student council.

Method 4: Sending the survey by SMS to 20 randomly chosen parents of students in the school. This surveying method is bad because the resulting sample is not random. Parents who feel positive about the activity are more likely to answer the SMS, so the answers overrepresent that subgroup.

Method 5: Surveying 3 randomly chosen parents of the students. This surveying method is bad because the sample is not representative. It includes only 3 parents out of the parents of 80 pupils, so it is too small.

Data collection with a controlled experiment

We can conduct a controlled experiment and draw conclusions about the population from its outcomes.

A controlled experiment is an experiment performed on an experimental group, in which the researchers change the one factor being tested and hold all other factors constant (as they were before the experiment).

An independent variable is the variable that the researchers change in the experimental group.

A correlation means that there is a relationship between two variables (a positive correlation means that the variables change in the same direction).

Causation means that a change in one variable is the cause of a change in another variable. In other words: one event is the result of the occurrence of the other event.

Note that correlation is not causation. If two variables correlate, it does not mean that one causes the other, since the correlation may be caused by a third variable that affects both of them.
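The role of a third variable can be illustrated with a small Python sketch. All of the data here is made up for the illustration: a hypothetical `temperature` series drives both `ice_cream_sales` and `sunburn_cases`, and the helper `pearson` is a hand-written correlation coefficient, not a library call.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: the third variable (temperature) determines both other variables.
temperature = [15, 18, 22, 25, 28, 31, 34]
ice_cream_sales = [2 * t + 10 for t in temperature]
sunburn_cases = [t - 12 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 2))  # very close to 1: a strong positive correlation
```

The two series correlate almost perfectly, yet neither is computed from the other. Both depend only on the temperature, which is the kind of situation the note above warns about.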

For example:
If we want to test the effect of a new medication, the experimental group could include 1,000 participants who receive the medication for 1 month. The independent variable is taking the new medication for 1 month. All the other factors, like the use of other medications, must remain constant, because a change in another factor may influence the outcome of the experiment, and we might wrongly assume that the influence came from the factor being tested. (If a participant stops taking a medication that lowers blood pressure, his blood pressure will rise, and the researchers may wrongly assume that the rise was a result of the new medication.)

Consider the following example:

Which experiment is appropriate for testing the effect of a new quit-smoking treatment on the population of smokers?

A. An experiment conducted on 500 smokers aged 20-40 that received the new treatment.

B. An experiment conducted on 500 smokers that received the new treatment and were put on a diet.

C. An experiment conducted on 500 smokers that received the new treatment.

D. An experiment conducted on 500 smokers who had low blood pressure and received the new treatment.

Answer A is not correct, since the sample is not random: it overrepresents the subgroup of smokers aged 20-40.

Answer B is not correct, since there are 2 independent variables instead of one. In addition to receiving the new treatment, the smokers were also put on a diet.

Answer D is not correct, since the sample is not random: it overrepresents the subgroup of smokers who had low blood pressure.

Answer C is correct: the sample is random and there is one independent variable.

The control group in a controlled experiment

In a control group, we don’t change the factor that is being tested in the experimental group (it stays as it was before the experiment). This means that all the factors are identical between the two groups except for the factor being tested.

Properties of a control group:

  • The participants must be randomly selected to be in the control group.
  • The participants must closely resemble the participants who are in the experimental group.

The purpose of the control group is to rule out alternative explanations of the experimental results.
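One common way to satisfy both properties is to split a single pool of participants at random. The sketch below assumes this approach; the function name `assign_groups`, the participant labels, and the pool size of 1,000 are hypothetical choices for the illustration.

```python
import random

def assign_groups(participants, seed=None):
    """Shuffle the pool and split it in half: (experimental group, control group)."""
    rng = random.Random(seed)     # a fixed seed only makes the illustration repeatable
    shuffled = list(participants)  # copy, so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pool of 1,000 volunteers.
volunteers = [f"participant_{i}" for i in range(1, 1001)]
experimental, control = assign_groups(volunteers, seed=7)

print(len(experimental), len(control))  # 500 500
```

Since both groups come from the same pool and chance alone decides who goes where, the groups tend to resemble each other in age, medical condition, and other characteristics.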

For example:
What is the control group in the medication example above?
Composition: The control group is composed of participants who do not receive the experimental medication. In addition, all the other conditions must be unchanged (stay as they were before the experiment). The participants of the control group must resemble the participants of the experimental group in their age, medical conditions, and other parameters.
Purpose: An unknown disease may occur in both the experimental group and the control group. Because the disease also appears in the control group, the researchers can identify it as an unexpected factor. Without a control group, they might wrongly assume that the disease was caused by the new medication.

Consider the following example:

Which control group is appropriate for testing the effect of a new quit-smoking treatment on the population of smokers?

A. A group of 500 smokers that received the new treatment.

B. A group of 500 male smokers.

C. A group of 500 smokers that were put on a diet.

D. A group of 500 smokers.

Answer A is not correct, since in the control group we don’t change the factor that is being tested, so all the conditions must be unchanged (the smokers in the control group shouldn’t receive the treatment).

Answer B is not correct, since the participants of the control group don’t resemble the participants of the experimental group (there are only male participants in the control group).

Answer C is not correct, since in the control group we don’t change any factor, so all the conditions must be unchanged (the smokers in the control group shouldn’t be put on a diet).

Answer D is correct, since all the factors are unchanged and the participants of the control group resemble the participants of the experimental group.

Data inference

Data inference is a generalization about a population that is based on statistics calculated from a small group (a representative sample) that is drawn from that population. In other words: instead of checking the whole population, we check only a part of it (a representative sample) and assume that the conclusion derived from the representative sample applies to the whole population.

An estimate calculation

An estimate is an approximate value for a population, close enough to the true value, that is calculated from a sample taken from part of that population.

A sample proportion is a value calculated from the sample that we assume reflects the whole population. A sample proportion can be written as a fraction or as a percentage. For example: 10 percent of the sample have a positive opinion about the surveyed subject.

The estimate formula: If we assume that a certain percentage found in the sample (a sample proportion) represents the percentage in the whole population, we can calculate an estimate by multiplying that percentage by the total number of members in the population.

estimate = sample proportion * population

Consider the following example:

The school principal wants to estimate the number of pupils who will participate in an activity. She surveys a representative sample of 30 pupils, and 10 of them answer that they will participate in the activity.

If there are 8 classes in the school with 20 pupils in each class, how many pupils are expected to participate in the activity?

The sample proportion is 10/30 ≈ 0.3333 = 33.33%.

The population is 8 * 20 = 160 pupils.

estimate = sample proportion * population
estimate = 0.3333 * 160 ≈ 53 pupils.

The answer is that approximately 53 pupils are expected to participate in the activity.
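The same calculation can be written as a short Python sketch. The function name `estimate_count` and its parameter names are illustrative assumptions; the numbers are the ones from the school example.

```python
def estimate_count(sample_hits, sample_size, population_size):
    """estimate = sample proportion * population"""
    sample_proportion = sample_hits / sample_size
    return sample_proportion * population_size

population = 8 * 20                           # 8 classes with 20 pupils each
estimate = estimate_count(10, 30, population)  # 10 of the 30 sampled pupils said yes

print(round(estimate))  # 53
```

Using the exact fraction 10/30 instead of the rounded value 0.3333 gives 53.33..., which still rounds to 53 pupils.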

A range calculation

A margin of error is the degree of error in results obtained from random sampling surveys. It exists because the sample does not exactly match the population. The margin of error is commonly given as a percentage. It is added to and subtracted from the estimate to give a range that is more likely to contain the true value than the single estimate is.

Note that:

  • Even after including a margin of error, there is no certainty that the estimate is correct.
  • A high margin of error indicates low confidence that the results represent the population.
  • The larger the sample, the smaller the margin of error (a bigger sample size increases the certainty of the prediction).

The range formula: We add the margin of error to the estimate and subtract it from the estimate to show the size of the error, which gives a range instead of a single estimate.

range = estimate ± margin of error
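As a minimal sketch, the range formula can be expressed in Python as follows. The function name `estimate_range` and the values 200 and 10 are hypothetical, and the sketch assumes that the margin of error has already been converted to the same units as the estimate.

```python
def estimate_range(estimate, margin_of_error):
    """range = estimate ± margin of error, returned as (low, high)."""
    return estimate - margin_of_error, estimate + margin_of_error

# Hypothetical values: an estimate of 200 with a margin of error of 10.
low, high = estimate_range(200, 10)

print(low, high)  # 190 210
```

The result is read as "between 190 and 210" rather than as the single value 200.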

Consider the following example:

15 percent of the sample participants own a dog and the margin of error for the sample is 2 percent. If there are 3,000 residents in the town, how many of them are expected to own a dog?

sample proportion = 15%

population = 3,000

estimate = sample proportion * population
estimate = 15% * 3,000 = 0.15 * 3,000 = 450

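The margin of error of 2 percent applies to the sample proportion, so the true proportion is expected to be between 13% and 17%. In terms of residents:
margin of error = 2% * 3,000 = 0.02 * 3,000 = 60

The number of dog owners is 450 ± 60, that is, between 390 and 510.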