How to use weights – Analysis guidance for weights, PSU, Strata

As discussed in other sections, Understanding Society is a probability survey with a complex sample design: most of the sub-samples were clustered and stratified, with unequal selection probabilities (i.e., not all population sub-groups were selected with the same probability). By default, most statistical software assumes that the data come from a simple random sample (SRS) in which all sub-groups have the same selection probability and any attrition or non-response occurs at random. Estimates and standard errors produced from Understanding Society data without further adjustment may therefore be biased.

Unweighted estimates will be biased towards groups that are over-represented in the sample (compared to the population) whenever the statistic being estimated differs between those groups and the rest of the population. For example, average pay is lower for most ethnic minority groups than for the White majority group, and ethnic minority groups are over-represented in the sample (because of the Ethnic Minority Boost Sample, EMBS, and the Immigrant and Ethnic Minority Boost Sample, IEMBS), so an unweighted estimate of UK average pay from these data will be too low. The weights provided are designed to counteract this, so weighted estimates will be unbiased estimates of population statistics.
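The effect of over-representation can be seen in a small made-up example. The sketch below does not use Understanding Society data: the group sizes, pay values and weights are hypothetical, chosen so that a group making up 10% of the population forms 40% of the sample. Each weight is the population share divided by the sample share.

```stata
* Toy illustration with hypothetical numbers (not Understanding Society data)
clear
set obs 1000

* 400 of the 1,000 sample members (40%) belong to the over-sampled group,
* which is assumed to be only 10% of the population
gen byte minority = _n > 600

* Assumed pay: 2000 for the majority group, 1500 for the over-sampled group
gen pay = cond(minority, 1500, 2000)

* Weight = population share / sample share: 0.10/0.40 and 0.90/0.60
gen wt = cond(minority, 0.25, 1.5)

* Unweighted mean is pulled towards the over-sampled group
summarize pay
scalar unweighted = r(mean)
display unweighted    // 1800

* Weighted mean recovers the population mix (0.9*2000 + 0.1*1500)
mean pay [pweight = wt]
scalar weighted = _b[pay]
display weighted      // 1950
```

In this toy case the unweighted mean understates the population mean by 150, which is the kind of bias the weights are designed to remove.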

Standard errors of estimates produced from a clustered sample are likely to be higher than those from a simple random sample (SRS) of the same size; the opposite is the case for stratified samples. Because most statistical software assumes an SRS design by default, the estimated standard errors will be incorrect unless the design is specified.

Most statistical software packages have specific commands that allow you to specify these design features. In Stata it is the svy suite of commands, in SPSS the Complex Samples module, in R the survey package, and in SAS the SURVEY procedures (for example, PROC SURVEYMEANS). Please take a look at the section on ‘Working with weights and complex survey design’ in our online courses ‘Introduction to Understanding Society using Stata, SPSS, SAS and R’. There is a separate Moodle course for each package, so choose the one for the software you use. That section provides a worksheet with a worked example to help you produce weighted estimates with correct standard errors in that software, together with the accompanying syntax and output files. For example, to produce an unbiased estimate of average monthly pay in the UK in 2009-10, with correct standard errors, using our data in Stata, you will need to do the following:

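* Load the Wave 1 individual response file
use a_indresp, clear

* Declare the design: PSU (cluster), cross-sectional weight and strata
svyset a_psu [pweight=a_indinus_xw], strata(a_strata) singleunit(scaled)

* Recode negative values (missing-value codes) to missing
replace a_paygu_dv=. if a_paygu_dv<0

* Design-based estimate of mean monthly pay
svy: mean a_paygu_dv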
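To see how much the design changes the standard errors in practice, Stata's estat effects command can be run after a svy estimation. The sketch below repeats the example above and then reports the design effect (DEFF), which compares the design-based variance with the variance expected under simple random sampling, and DEFT, the corresponding ratio for standard errors. It assumes the same Wave 1 file and variables, and is offered as an illustration rather than as part of the worked example in the online courses.

```stata
* Same set-up as the example above (assumes a_indresp is in the working directory)
use a_indresp, clear
svyset a_psu [pweight=a_indinus_xw], strata(a_strata) singleunit(scaled)
replace a_paygu_dv = . if a_paygu_dv < 0

* Design-based estimate of mean monthly pay
svy: mean a_paygu_dv

* Report DEFF and DEFT for the estimate just produced
estat effects
```

A DEFF above 1 indicates that the standard error is larger than it would be for a simple random sample of the same size.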

Each Understanding Society weight is set to zero for all sample units to which it does not apply. Specifying the correct weight in an analysis therefore automatically restricts the analysis to the appropriate sample. For example, around 2,000 persons in the file h_indresp have a zero value of h_indinui_lw. The persons with non-zero values of this weight are those who gave a full individual interview at all of Waves 6, 7 and 8 and the waves before these. For longitudinal analysis of data obtained in the individual interviews at Waves 6, 7 and 8, it is therefore sufficient to specify the weight h_indinui_lw (or to create your own tailored weight). The analysis sample can of course be restricted further by respondent characteristics (e.g. gender, age, ethnicity or employment status): the weight is appropriate for analysis of any demographic subset of the full sample to which it applies. Please see the ‘Selecting the correct weights’ section for details of all the weights provided and how to select the correct weight for each type of analysis.
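As a minimal sketch of the points above, the Stata lines below count the zero-weight cases in h_indresp and then estimate mean pay for women only. Two things here are assumptions rather than guidance from this page: the variable names h_paygu_dv and h_sex_dv (taken by analogy with the Wave 1 example, with 2 assumed to code female), and the use of Stata's subpop() option, which is a general Stata feature for subgroup estimation that keeps the full design information when computing standard errors.

```stata
* Assumes h_indresp is in the working directory
use h_indresp, clear

* Cases to which the longitudinal weight does not apply have weight zero
count if h_indinui_lw == 0

* Declare the design with the longitudinal weight
svyset h_psu [pweight=h_indinui_lw], strata(h_strata) singleunit(scaled)

* Recode negative values (missing-value codes) to missing
* (h_paygu_dv is an assumed variable name)
replace h_paygu_dv = . if h_paygu_dv < 0

* Indicator for the subgroup of interest (h_sex_dv == 2 assumed to mean female)
gen byte female = (h_sex_dv == 2)

* Estimate for the subgroup while retaining the full sample design
svy, subpop(female): mean h_paygu_dv
```

Zero-weight cases contribute nothing to the estimate, so no separate step is needed to drop them.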

Tips for analysts:

The weights provided have been developed for use when analysing data from various combinations of survey instruments in one of two ways:

  • When using data from a series of consecutive waves, e.g. a panel analysis. These are the longitudinal weights ending in _lw.
  • When using data from a single wave. These are the cross-sectional weights ending in _xw.

Running analysis on a calendar year or month

It is possible to run analysis relating to a calendar year or month with a few extra adjustments. However, to make this type of analysis easier, we now release calendar year cross-sectional datasets; these are planned for each year and are explained in the Data section under comparisons of calendar year data.

The survey sample is designed such that each sample month (identified by the variable w_month) is a random representative (once weighted) sample of the population with some exceptions:

  • Northern Ireland is only present in issue months 1-12 (first year of each wave)
  • BHPS is only present in issue months 1-12 (first year of each wave)
  • The IEMB sample is only present in issue months 13-24 (second year of each wave)

Because of this we recommend use of the us_lw weight in analysis. This weight correctly excludes BHPS and IEMB. For cross-sectional analysis we suggest using the weights produced for the calendar year datasets.

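Please also note that if you use months 13-24 you are excluding Northern Ireland from your analysis. If you use months 1-12, Northern Ireland will be over-represented unless you make an additional adjustment to the weight, because its whole sample falls in those months while the rest of the sample is spread over 24 months.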

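Stata syntax for the adjustment if you use months 1-12:

* Adjustment factor: 1 for everyone, halved for Northern Ireland
gen adj=1

replace adj=0.5 if w_country==4

* Adjusted weight (w_ stands for the wave prefix)
gen weight=w_xxxyyus_lw*adj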

We suggest that you use sample month / year (w_month) to identify the analysis sample rather than month / year of interview. For each sample month, interviews take place over 3-4 months, but the majority take place in the calendar month coinciding with the sample month. Interviews in later calendar months tend to be with sample members who are either hard to contact or reluctant to participate. Our weights are designed so that each whole sample month represents the population. If you omit the interviews from the calendar months following the sample month, you will exclude a category of respondents who tend to be very different from earlier respondents, so your analysis sample is unlikely to remain representative.

If you still want to define your analysis sample by month / year of interview (rather than sample month) there are two ways you can adjust for the late respondents:

  • Create a tailored adjustment to our weight.
  • Use late respondents from other issue months with our weights (see below).

For example, suppose you are interested in studying December 2014. The best option, giving the largest sample size, is to combine all interviews carried out in December 2014 from the following samples:

  • Wave 5 sample months 21, 22, 23 and 24
  • Wave 6 sample months 9, 10, 11 and 12

Then:

  • Create a new variable that equals the e_xxxxxus_zz weight for the Wave 5 interviews and the f_xxxxxus_zz weight for the Wave 6 interviews. No Northern Ireland adjustment is needed. No extra nonresponse adjustment is needed, as late respondents in the month 24 sample are compensated for by bringing in the late respondents from previous sample months. You will, however, need a scaling factor.
  • Use the psu and strata variables from xwave.dat to take into account clustering and stratification.
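The mechanics of stacking the two waves and building a single weight variable can be sketched with toy data. Everything in the block below is hypothetical: the identifiers, months and weight values are made up, and wt5 and wt6 merely stand in for the Wave 5 and Wave 6 weights named above. The scaling factor and the psu and strata variables are deliberately left out, since this page does not specify them further.

```stata
* Toy stand-in for Wave 5 interviews carried out in the month of interest
* (hypothetical identifiers, months and weights)
clear
input long pidp byte wave byte month double wt5
1 5 21 1.2
2 5 24 0.8
end
tempfile w5
save `w5'

* Toy stand-in for Wave 6 interviews carried out in the same month
clear
input long pidp byte wave byte month double wt6
3 6 9 1.1
4 6 12 0.9
end

* Stack the two waves into one analysis file
append using `w5'

* One weight variable: the Wave 5 weight for Wave 5 rows, the Wave 6 weight otherwise
gen double weight = cond(wave == 5, wt5, wt6)

list pidp wave month weight
```

With real data, the two input steps would be replaced by loading each wave's file and keeping only the interviews that took place in the calendar month of interest.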

Note that if you want to study January 2014, for example, the information will come from three waves: to compensate for the missing late respondents from Wave 5, sample month 1, you will need to include January respondents from Wave 4, sample months 22-24. The rest follows the example above. If you use respondents from calendar months / years from just one wave, you will need an extra adjustment for Northern Ireland and potentially also for late respondents (if your period of interest includes sample months 1, 2 or 3).
