
Data Processing

This tutorial was inspired by and adapted from Shawn A. Rhoads’ PSYC 347 Course [CC BY-SA 4.0 License].

Learning objectives

This notebook is intended to teach you basic python syntax for:

  1. Reading data from a file

  2. Summarizing data

  3. Filtering data

  4. Writing data to a file

One important package will be used in this notebook: pandas. Pandas is a powerful data analysis package. It is designed to make data analysis fast and easy in Python. It is also designed to work with data in a tabular format, which is a common way to store data.

Reading data from a CSV file

In this notebook, we are going to work with open data from Sarah’s 2022 paper:

Banker, S. M., Na, S., Beltrán, J., Koenigsberg, H. W., Foss-Feig, J. H., Gu, X., & Schiller, D. (2022). Disrupted computations of social control in individuals with obsessive-compulsive and misophonia symptoms. iScience, 25(7), 104617. https://doi.org/10.1016/j.isci.2022.104617

The data is available on OSF. I already cleaned it up a little bit. We will be using the Banker_et_al_2022_QuestionnaireData.csv file.

filename = 'https://raw.githubusercontent.com/Center-for-Computational-Psychiatry/course_spice/main/modules/resources/data/Banker_et_al_2022_QuestionnaireData.csv'

To read this file, we will use the pandas package. pandas is a Python library used for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for working with data from a variety of sources such as CSV files, SQL databases, and Excel spreadsheets. Pandas is commonly used in data science and machine learning applications, as well as in finance, social science, and other fields where data analysis is important.

A CSV file is a comma-separated values file, which allows data to be saved in a tabular format. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. CSV files can be opened in spreadsheet applications like Microsoft Excel or Google Sheets.
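To make the format concrete, here is a toy sketch (not part of the dataset used below) showing that a CSV is just plain text; pandas can read it from any file-like object, not only from a path or URL:

```python
import io
import pandas as pd

# A CSV is plain text: one header row, then one comma-separated record per line
csv_text = "name,age\nAlice,30\nBob,25\n"

# read_csv accepts any file-like object, so we can parse the string directly
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (2, 2)
```

The names and values here are made up purely for illustration; the real file we use below is read the same way, just from a URL instead of a string.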

# we will import pandas using the alias pd
import pandas as pd 

Above, we imported pandas using the import statement. This statement makes the pandas package available to us in our notebook. We will use import to make other packages available to us as well throughout this course and in your research.

We can now use the functions and data structures provided by pandas in our code. We will use the pandas package to read in the data from the CSV file.

Let’s start by reading in the data from the CSV file. We will use the read_csv() function from the pandas package to do this. The read_csv() function takes one required argument, the path (or URL) of the CSV file to read. The function returns a DataFrame object, which is a data structure provided by pandas for storing tabular data. We will assign the DataFrame object returned by read_csv() to a variable called our_data.

our_data = pd.read_csv(filename)

Making sense of your DataFrame

We can preview the first five rows of the DataFrame using the head() method. The head() method takes an optional argument, the number of rows to preview. If no argument is provided, head() returns the first five rows by default.

display(our_data.head())
prolific_id AttentionCheck Trait Anxiety Loneliness Depression Obsessive Compulsion Subjective Happiness Stress Misophonia Autism Spectrum ... Avoidant Personality Disorder Borderline Personality Disorder Apathy Eating Disorder Alcohol Use Disorder Age Sex Gender Income Education
0 546ec14dfdf99b2bc7ebd032 0 37 18 28 18 11 7 8 4.527778 ... 30 2 14 2 0 55 1 1 5 6
1 548491acfdf99b0379939cc0 0 53 19 46 18 12 14 7 3.472222 ... 18 3 11 11 0 28 2 2 8 5
2 54924b8efdf99b77ccedc1d5 0 56 20 51 20 11 11 2 3.055556 ... 25 3 19 26 0 19 1 1 5 4
3 5563984afdf99b672b5749b6 0 23 11 24 3 28 6 6 3.500000 ... 14 1 5 16 0 39 1 1 11 5
4 5588d358fdf99b304ee5674f 0 59 24 57 22 8 15 5 3.888889 ... 23 5 33 10 3 27 2 2 4 6

5 rows × 21 columns

You will see that we have a lot of columns in this DataFrame. We can use the shape attribute to see the dimensions of the DataFrame. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.

our_data.shape
(1175, 21)

We have 1175 rows and 21 columns. Typically, the rows in a DataFrame represent observations and the columns represent variables. In this case, each row represents a participant and each column represents a variable measured for each participant.
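Since shape is an ordinary (rows, columns) tuple, you can unpack it directly into two variables. A small sketch, using a toy DataFrame standing in for our_data:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it unpacks like any other tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 2 columns
```

This is handy when you want to report sample sizes programmatically rather than reading them off the printed tuple.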

To get a better sense of our variables, we can use the info() method. The info() method prints a summary of the DataFrame including the number of rows and columns, the number of non-null values in each column, and the data type of each column.

print(our_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1175 entries, 0 to 1174
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   prolific_id                      1175 non-null   object 
 1   AttentionCheck                   1175 non-null   int64  
 2   Trait Anxiety                    1175 non-null   int64  
 3   Loneliness                       1175 non-null   int64  
 4   Depression                       1175 non-null   int64  
 5   Obsessive Compulsion             1175 non-null   int64  
 6   Subjective Happiness             1175 non-null   int64  
 7   Stress                           1175 non-null   int64  
 8   Misophonia                       1175 non-null   int64  
 9   Autism Spectrum                  1175 non-null   float64
 10  Social Anxiety                   1175 non-null   int64  
 11  Avoidant Personality Disorder    1175 non-null   int64  
 12  Borderline Personality Disorder  1175 non-null   int64  
 13  Apathy                           1175 non-null   int64  
 14  Eating Disorder                  1175 non-null   int64  
 15  Alcohol Use Disorder             1175 non-null   int64  
 16  Age                              1175 non-null   int64  
 17  Sex                              1175 non-null   int64  
 18  Gender                           1175 non-null   int64  
 19  Income                           1175 non-null   int64  
 20  Education                        1175 non-null   int64  
dtypes: float64(1), int64(19), object(1)
memory usage: 192.9+ KB
None

You should notice a few key things from the output of the info() method. First, we have 1175 rows and 21 columns, which matches the output of the shape attribute. Second, we have 1175 non-null values in each column, which means that we have no missing data. Third, we have a mix of data types. Most of the columns contain integers, but some of the columns contain floating point numbers and some of the columns contain strings (objects).
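You can also confirm the "no missing data" reading programmatically rather than scanning the non-null counts by eye. A sketch using a toy DataFrame that deliberately contains one missing value:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [25, 31, np.nan], "Stress": [7, 14, 6]})

# isnull() marks missing cells as True; summing twice counts them
# across every column of the whole DataFrame
n_missing = toy.isnull().sum().sum()
print(n_missing)  # 1
```

Running the same check on our_data should return 0, matching what info() showed.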

We can also list out all of the columns in the DataFrame using the columns attribute. The columns attribute returns an Index object containing the column names.

print(our_data.columns)
Index(['prolific_id', 'AttentionCheck', 'Trait Anxiety', 'Loneliness',
       'Depression', 'Obsessive Compulsion', 'Subjective Happiness', 'Stress',
       'Misophonia', 'Autism Spectrum', 'Social Anxiety',
       'Avoidant Personality Disorder', 'Borderline Personality Disorder',
       'Apathy', 'Eating Disorder', 'Alcohol Use Disorder', 'Age', 'Sex',
       'Gender', 'Income', 'Education'],
      dtype='object')

Exploring variables within your DataFrame

Let’s explore some of our variables. We can start with ‘Age’. We can select the column with our_data['Age'] and then call the describe() method on it to get summary statistics. The describe() method takes no required arguments and returns a Series object, which is a data structure provided by pandas for storing a single column of data.

our_data['Age'].describe()
count    1175.000000
mean       35.883404
std        13.353503
min        18.000000
25%        25.000000
50%        33.000000
75%        44.000000
max        77.000000
Name: Age, dtype: float64

We can see that the mean age of our participants is 35.88 years old. The youngest participant is 18 years old and the oldest participant is 77 years old. The standard deviation is 13.35 years. This seems like a pretty young sample in terms of age! I would typically plot these data to get a better sense of the distribution of ages, but we will save that for the next module.

You can use describe() to get summary statistics for other columns as well. For example, we can use describe() to get summary statistics for the ‘Subjective Happiness’ column, which contains the total score on the Subjective Happiness Scale.

our_data['Subjective Happiness'].describe()
count    1175.000000
mean       17.645957
std         5.874810
min         4.000000
25%        13.000000
50%        18.000000
75%        22.000000
max        28.000000
Name: Subjective Happiness, dtype: float64

We can see that the mean score of our participants is 17.65, so our sample is, on average, pretty happy!

For data that are rank-ordered, such as “Income” or “Education”, computing the mean and standard deviation is not very informative. Instead, we can call the value_counts() method on the column to get a frequency count for each value. The value_counts() method takes no required arguments and returns a Series object indexed by the unique values in the column, with the counts as the values.

our_data['Income'].value_counts().sort_index()
1      79
2      84
3     114
4     102
5     101
6     125
7      73
8     104
9      73
10     75
11    165
12     80
Name: Income, dtype: int64

Above, we can see a table of frequencies for each rank (ranging from 1 to 12). The counts are fairly spread across the brackets, with a notable peak at rank 11 (165 participants) and smaller peaks in the lower-middle ranks (e.g., 3 and 6). Always check the actual frequency table rather than assuming a shape for rank-ordered data like income!
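Raw counts can be hard to compare across samples of different sizes. Passing normalize=True to value_counts() returns proportions instead of counts; a toy sketch:

```python
import pandas as pd

income = pd.Series([1, 1, 2, 3, 3, 3])

# normalize=True divides each count by the total number of values,
# so the resulting proportions sum to 1
props = income.value_counts(normalize=True).sort_index()
print(props)
```

On our_data['Income'], this would tell you directly what fraction of the sample falls in each bracket.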

Creating new variables

Can money buy happiness?

Let’s explore the relationship between income and happiness. To do this, let’s split participants into different income brackets. We will create a new column called ‘IncomeSplit’ that contains the string ‘High’ if the participant’s income rank is in the top 33.3%, ‘Medium’ if the participant’s income rank is in the middle 33.3%, and ‘Low’ if the participant’s income rank is in the bottom 33.3%.


We can use the qcut() function to perform a quantile split. Here, qcut() takes three arguments: the column (Series) to split, the number of quantile bins q, and the labels to assign to the bins. The qcut() function returns a Series object, which is a data structure provided by pandas for storing a single column of data.

our_data['IncomeSplit'] = pd.qcut(our_data['Income'], q=3, labels=['Low','Medium','High']) 
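To see how qcut() assigns bins, here is a toy example splitting nine evenly spaced values into three equal-sized groups:

```python
import pandas as pd

ranks = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# qcut splits on quantiles, so each bin receives (roughly) the same
# number of rows, regardless of how the values are spaced
bins = pd.qcut(ranks, q=3, labels=["Low", "Medium", "High"])
print(bins.value_counts().sort_index())  # 3 rows in each bin
```

Note that with real data containing ties (like our income ranks), the bins will be only approximately equal in size, as the value_counts() output below shows.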

This created a new column called “IncomeSplit”. We can see it now in our columns (all the way at the end).

display(our_data.head())
prolific_id AttentionCheck Trait Anxiety Loneliness Depression Obsessive Compulsion Subjective Happiness Stress Misophonia Autism Spectrum ... Borderline Personality Disorder Apathy Eating Disorder Alcohol Use Disorder Age Sex Gender Income Education IncomeSplit
0 546ec14dfdf99b2bc7ebd032 0 37 18 28 18 11 7 8 4.527778 ... 2 14 2 0 55 1 1 5 6 Low
1 548491acfdf99b0379939cc0 0 53 19 46 18 12 14 7 3.472222 ... 3 11 11 0 28 2 2 8 5 Medium
2 54924b8efdf99b77ccedc1d5 0 56 20 51 20 11 11 2 3.055556 ... 3 19 26 0 19 1 1 5 4 Low
3 5563984afdf99b672b5749b6 0 23 11 24 3 28 6 6 3.500000 ... 1 5 16 0 39 1 1 11 5 High
4 5588d358fdf99b304ee5674f 0 59 24 57 22 8 15 5 3.888889 ... 5 33 10 3 27 2 2 4 6 Low

5 rows × 22 columns

Now that we have our new variable, we can view the frequency counts for each level of the variable using the value_counts() method.

our_data['IncomeSplit'].value_counts().sort_index()
Low       480
Medium    375
High      320
Name: IncomeSplit, dtype: int64

Now, we can use the groupby() method to group the data by the ‘IncomeSplit’ column. The groupby() method takes a single argument, the name of the column to group by. The groupby() method returns a DataFrameGroupBy object, which is a data structure provided by pandas for storing grouped data.

We will again use the describe() method to get summary statistics for the ‘Subjective Happiness’ column, grouped by the ‘IncomeSplit’ column.

our_data.groupby('IncomeSplit')['Subjective Happiness'].describe()
count mean std min 25% 50% 75% max
IncomeSplit
Low 480.0 17.435417 5.958100 4.0 13.0 18.0 22.0 28.0
Medium 375.0 17.613333 5.776342 4.0 13.0 18.0 22.0 28.0
High 320.0 18.000000 5.865258 4.0 14.0 19.0 23.0 28.0

The effect is small, but it looks like participants’ mean happiness scores increase as the income bracket goes from “Low” to “High”!

Remember, it is important to think critically about your variables. Income is associated with many different factors (e.g., access to healthcare, access to education, job security, etc.), so it makes sense if people with more resources report being happier (or if people who report being happier have greater incomes – remember correlation does not imply causation).

Filtering data

Now that we have a sense of our data, let’s filter it down to a subset of participants. We can use the query() method to do this. The query() method takes a single argument, a string containing a boolean expression. The query() method returns a DataFrame object containing only the rows that match the boolean expression.

One variable in particular is the “AttentionCheck” column, which indicates whether a participant missed an attention check question. We can use the query() method to filter out participants who missed an attention check question. We can do this by querying the ‘AttentionCheck’ column to see if it is equal to 0 (i.e., no missed checks). We can then assign the filtered DataFrame to a new variable called our_data_clean.

our_data_clean = our_data.query("AttentionCheck == 0")
our_data_clean.shape
(1090, 22)

Looks like our new DataFrame contains 1090 rows (which is 85 fewer rows than our original DataFrame). This means that 85 participants missed an attention check question! We should exclude these participants from future analyses.
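The query() method is equivalent to boolean-mask indexing, which you will also see in a lot of pandas code. A toy sketch showing both forms produce the same result:

```python
import pandas as pd

toy = pd.DataFrame({"AttentionCheck": [0, 1, 0], "Age": [25, 31, 40]})

# The bracket form keeps only the rows where the boolean condition
# is True, exactly like toy.query("AttentionCheck == 0")
clean = toy[toy["AttentionCheck"] == 0]
print(clean.shape)  # (2, 2)
```

query() is often more readable for compound conditions (e.g., "AttentionCheck == 0 and Age >= 18"), while the bracket form is more common when the mask is built programmatically.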

Let’s see how “Age” and “Subjective Happiness” changed for this filtered sample. Let’s do this in a for-loop this time.

for var_name in ['Age','Subjective Happiness']:
    print(our_data_clean[var_name].describe(), end='\n\n')
count    1090.000000
mean       35.755963
std        13.295645
min        18.000000
25%        25.000000
50%        33.000000
75%        44.000000
max        77.000000
Name: Age, dtype: float64

count    1090.000000
mean       17.694495
std         5.872166
min         4.000000
25%        13.000000
50%        18.000000
75%        22.000000
max        28.000000
Name: Subjective Happiness, dtype: float64

Not much has drastically changed! This is good news – it means that the participants who missed an attention check question are not very different from the participants who did not miss an attention check question.

Saving your filtered data

To save our filtered data, we can use the to_csv() method. The to_csv() method takes the path of the file to save the DataFrame to; here we also pass index=False so that the row index is not written to the file as an extra column.

our_data_clean.to_csv('./resources/data/Banker_et_al_2022_QuestionnaireData_clean.csv', index=False)
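It is good practice to sanity-check a saved file by reading it back and comparing it to what you wrote. A sketch using a temporary directory and a toy DataFrame (so it runs anywhere, without the notebook’s folder layout):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write to a temporary CSV (index=False keeps the row index out of the
# file), then read it back and confirm nothing was lost
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_clean.csv")
    toy.to_csv(path, index=False)
    roundtrip = pd.read_csv(path)
    print(roundtrip.shape == toy.shape)  # True
```

For the real file above, reading it back should give the same (1090, 22) shape as our_data_clean.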

Next steps

Now, using our filtered dataframe, try this for any other variables you are interested in! Pick one! For example, you can look at the descriptive statistics for any of the psychiatric variables and group them by various demographic variables.

Carry out the following tasks:

  1. Pick one of the psychiatric variables (e.g., “Depression”, “Stress”, “Social Anxiety”)

  2. Use the describe() method to get summary statistics for that variable.

  3. Pick a second variable and create a new column based on the median split (hint: you can change q=3 to q=2 and labels=["Low","Medium","High"] to labels=["Low","High"] in the pd.qcut() function).

  4. Summarize the psychiatric variable by the new column you created in step 3 using the groupby() method.

# Insert your code here