Good, bad & the misunderstood

Statistical Estimation

By Rajiv Gupta

What public event has the active participation of 640 million people, with an even larger number of spectators? If you guessed the Indian general elections of 2024, you are correct. Nothing beats the Indian elections for drama, emotions, debates, and just “tamasha.” At the end of the seven-week-long exercise in democracy, before the results of the elections were announced, an added dramatic element was the announcement of the exit poll results. People awaited the results of the exit polls with almost the same fervour as they would the results of the actual election. Because opinion polls are not permitted during an election in India, the exit polls are the first indication, right or wrong, of the possible outcome of the election.

Although exit polls are described as predictions, they are really a way of estimating the votes that have already been cast. Nothing can, or should, change the number of votes each individual or party has received. The only reason an exit poll is treated as a forecast is that the physical counting of the actual votes has not yet taken place. Exit polls are developed using statistical sampling, which is necessary when the population of interest is too large to survey in full.

In the 2024 elections, approximately 640 million people voted at more than a million voting booths. So, the task for the pollsters is to determine a suitable sample from which to estimate how people have voted. Let us try to understand what can make these estimates an accurate or inaccurate reflection of the larger population they are attempting to describe.

The first issue to consider is the homogeneity of the population. Let us take a simple example. Consider a large container of a liquid in which some sugar is completely dissolved. If we wished to estimate the percentage of sugar in the liquid, we could take a small sample and evaporate the liquid to obtain the amount of sugar in the liquid. The percentage of sugar in the sample should be the same as in the large container because sugar has been dissolved completely and evenly. This would be an example of a homogeneous distribution of sugar in the liquid.

Now suppose that, instead of sugar that dissolves evenly in the entire liquid, we had a substance that only partially dissolved, while some of it remained suspended in lumps. The problem of estimating the percentage of this substance in the liquid becomes complicated. The percentage of the substance in one part of the liquid is not necessarily the same as in any other part. So, in order to draw a sample, or samples, which reflect the distribution of the substance in the liquid, we will need to have some idea of that distribution.

Now add several more substances that are distributed unevenly in the liquid, and you can begin to appreciate the complexity of drawing samples for exit polls. People with different identities including caste, religion, age, gender, region, and political allegiance are unevenly distributed in the country. In addition, the allegiance of voters changes over time. This makes the task of taking a representative sample, or samples, all the more difficult.
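The lumpy-liquid problem can be illustrated with a small simulation. This is only a sketch under invented assumptions: the three regions, their voter counts, the support levels for an imaginary "Party A", and the function names are all hypothetical and are not drawn from any real poll. It compares a poll taken in a single region with a poll spread across regions in proportion to their size.

```python
import random

# Hypothetical population: three regions of different sizes with different
# levels of support for an imaginary "Party A". All numbers are invented.
REGIONS = {
    "north": {"voters": 500_000, "support": 0.60},
    "south": {"voters": 300_000, "support": 0.30},
    "east": {"voters": 200_000, "support": 0.45},
}


def true_share(regions):
    """Overall support for Party A across the whole population."""
    total = sum(r["voters"] for r in regions.values())
    return sum(r["voters"] * r["support"] for r in regions.values()) / total


def poll_region(region, n, rng):
    """Interview n simulated voters in one region; return the share for Party A."""
    return sum(rng.random() < region["support"] for _ in range(n)) / n


def stratified_estimate(regions, n, rng):
    """Split n interviews across regions by size and weight the results by size."""
    total = sum(r["voters"] for r in regions.values())
    estimate = 0.0
    for region in regions.values():
        weight = region["voters"] / total
        interviews = max(1, round(n * weight))
        estimate += weight * poll_region(region, interviews, rng)
    return estimate


rng = random.Random(2024)  # fixed seed so the run is repeatable
north_only = poll_region(REGIONS["north"], 10_000, rng)
stratified = stratified_estimate(REGIONS, 10_000, rng)

print(f"true share:        {true_share(REGIONS):.3f}")
print(f"north-only sample: {north_only:.3f}")
print(f"stratified sample: {stratified:.3f}")
```

With the same number of interviews, the single-region poll settles near that region's own support level rather than the national figure, while the proportionally spread poll lands close to it. The catch, as noted above, is that the pollster must already know roughly how the population is distributed in order to build the strata.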

The problem is compounded by the fact that a pollster collects data not on the number of seats won, but on the share of votes each political party has received. However, the relationship between vote share and the number of seats won is not straightforward. This can be seen in the fact that the Bharatiya Janata Party’s vote share in the 2024 election was only slightly below its vote share in the 2019 election, 36.6 per cent versus 37.3 per cent, but it lost about 20 per cent of its seats between 2019 and 2024 (303 to 240). The conversion of the estimated vote share into the estimated number of seats won is therefore a major source of error.
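The arithmetic behind this point, and the reason a small change in vote share can move many seats, can be sketched as follows. The seat and vote-share figures are the ones quoted above. The ten-seat map, its margins, and the `seats_won` helper are hypothetical, and the model assumes a uniform swing in a two-party contest, which is a deliberate simplification of a multi-party election.

```python
def seats_won(margins, swing):
    """Count the seats a party still wins after a uniform swing against it.

    margins: the party's lead (+) or deficit (-) in each seat, in percentage points.
    swing:   percentage points of vote share the party loses in every seat.
             Assuming the lost votes go to its main rival, each margin
             shrinks by twice the swing.
    """
    return sum(1 for margin in margins if margin - 2 * swing > 0)


# Figures quoted in the article.
seats_2019, seats_2024 = 303, 240
share_2019, share_2024 = 37.3, 36.6

seat_loss_pct = 100 * (seats_2019 - seats_2024) / seats_2019
share_drop_points = share_2019 - share_2024

# Hypothetical ten-seat map with several narrow wins (invented margins).
toy_margins = [0.5, 1.0, 1.2, 3.0, 8.0, 15.0, -2.0, -6.0, 0.8, 20.0]
before = seats_won(toy_margins, swing=0.0)
after = seats_won(toy_margins, swing=0.7)

print(f"vote share drop: {share_drop_points:.1f} points")
print(f"seats lost:      {seat_loss_pct:.1f} per cent")
print(f"toy map:         {before} seats before, {after} seats after")
```

In the toy map, a swing of 0.7 points halves the party's tally from 8 seats to 4, because four of its wins were by margins narrower than 1.4 points. A pollster who estimates the vote share almost perfectly can still miss the seat count badly if many contests are this close.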

A third major source of error in estimation is the reluctance of some voters to reveal how they voted. This would be equivalent to a certain part of our earlier liquid container being inaccessible for sampling. There could be several reasons for people refusing to respond to the pollster. They may be afraid that, if they voted against the expected majority party, they might be targeted for persecution. Or they may simply not wish to be bothered by pollsters.

The reluctance of some people to participate in the polling process makes the sample non-random, which goes against the requirements for unbiased sampling. This can, and will, skew the results, as it may exclude people who voted in one particular manner. Also, people in rural areas are often wary of educated city folk asking questions and may even answer incorrectly on purpose. This further exacerbates the problem.
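The size of this skew is easy to work out once response rates are assumed. The sketch below is illustrative only: the 45 per cent true share and the 80 and 60 per cent response rates are invented numbers, and `observed_share` is a hypothetical helper, not a formula used by any particular polling agency.

```python
def observed_share(true_share, response_rate_a, response_rate_other):
    """Share for Party A among those who agree to answer the pollster.

    true_share:          Party A's actual share of all votes cast (0 to 1).
    response_rate_a:     fraction of Party A voters willing to respond.
    response_rate_other: fraction of all other voters willing to respond.
    """
    responders_a = true_share * response_rate_a
    responders_other = (1 - true_share) * response_rate_other
    return responders_a / (responders_a + responders_other)


# Invented example: Party A truly has 45 per cent of the vote.
equal_response = observed_share(0.45, 0.80, 0.80)
unequal_response = observed_share(0.45, 0.80, 0.60)

print(f"both groups respond equally: {equal_response:.3f}")
print(f"other voters more reluctant: {unequal_response:.3f}")
```

When both groups respond at the same rate, the poll recovers the true 45 per cent. When the other voters are more reluctant, the same party appears to have about 52 per cent, enough to turn an apparent loss into an apparent win. Interviewing more people does not help, because the distortion comes from who answers, not from how many.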

While it is tempting to offer solutions to the problem of sampling, it is perhaps more meaningful to understand and accept that all statistics used in estimation have an inherent error. The more heterogeneous the population, the bigger the potential for error. The homogeneity or heterogeneity of the population can be affected by what we tend to call a political “wave,” in which more people cast aside their differences and vote in a more homogeneous manner.
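For the idealised case of a simple random sample, this inherent error has a standard textbook form, the margin of error at roughly 95 per cent confidence. The sketch below uses that formula with an optional finite population correction; the function name and the sample sizes are chosen here purely for illustration. It measures sampling error only, so the errors from heterogeneity, seat conversion and non-response discussed in this article come on top of it.

```python
import math


def margin_of_error(p, n, population=None, z=1.96):
    """Approximate 95 per cent margin of error for an estimated proportion.

    p:          estimated proportion (0 to 1).
    n:          number of people in a simple random sample.
    population: total population size; if given, the finite population
                correction is applied.
    z:          normal critical value (1.96 for roughly 95 per cent confidence).
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error


for sample_size in (1_000, 10_000, 100_000):
    moe = margin_of_error(0.5, sample_size, population=640_000_000)
    print(f"n = {sample_size:>7,}: +/- {100 * moe:.2f} percentage points")
```

Under these assumptions the margin shrinks only with the square root of the sample size, so a hundredfold increase in interviews buys a tenfold reduction in sampling error, and none of that reduction touches the errors that arise when the sample is not representative.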

In 2024, as some pollsters did predict, there was no political “wave.” Perhaps the exit polls were more accurate in 2014 and 2019 because of the Modi “wave.” In the absence of a unifying theme or “wave,” people assume their identities based on economic need, caste, region, age, religion, etc. The pollsters in India have increased their sample sizes from several thousands to several lakhs. But even that number is a drop in the bucket of an electorate of 640 million.

They have also tried to reformulate their sampling plans. But I feel that what is required is also an education of the media and of the general public in terms of what they should reasonably expect from the estimates the pollsters provide. Moreover, is it really so important to have an estimate of a count a couple of days before the actual count itself? Something for the nation to think about. — INFA