
Data Processing

This tutorial was inspired by and adapted from Shawn A. Rhoads’ PSYC 347 Course [CC BY-SA 4.0 License].

Learning objectives

This notebook is intended to teach you basic python syntax for:

  1. Reading data from a file

  2. Summarizing data

  3. Filtering data

  4. Writing data to a file

One important package will be used in this notebook: pandas. Pandas is a powerful data analysis package. It is designed to make data analysis fast and easy in Python. It is also designed to work with data in a tabular format, which is a common way to store data.

Reading data from a CSV file

In this notebook, we are going to work with open data from Sarah’s 2022 paper:

Banker, S. M., Na, S., Beltrán, J., Koenigsberg, H. W., Foss-Feig, J. H., Gu, X., & Schiller, D. (2022). Disrupted computations of social control in individuals with obsessive-compulsive and misophonia symptoms. iScience, 25(7), 104617. https://doi.org/10.1016/j.isci.2022.104617

The data is available on OSF. I already cleaned it up a little bit. We will be using the Banker_et_al_2022_QuestionnaireData.csv file.

filename = 'https://raw.githubusercontent.com/Center-for-Computational-Psychiatry/course_spice/main/modules/resources/data/Banker_et_al_2022_QuestionnaireData.csv'

To read this file, we will use the pandas package. pandas is a Python library used for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for working with data from a variety of sources such as CSV files, SQL databases, and Excel spreadsheets. Pandas is commonly used in data science and machine learning applications, as well as in finance, social science, and other fields where data analysis is important.

A CSV file is a comma-separated values file, which allows data to be saved in a tabular format. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. CSV files can be opened in spreadsheet applications like Microsoft Excel or Google Sheets.
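To make the format concrete, here is a toy sketch (not part of the dataset used below) showing that a CSV is just plain text; pandas can read it from any file-like object, not only from a path or URL:

```python
import io
import pandas as pd

# A CSV is plain text: one header row, then one comma-separated record per line
csv_text = "name,age\nAlice,30\nBob,25\n"

# read_csv accepts any file-like object, so we can parse the string directly
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)  # (2, 2)
```

The names and values here are made up purely for illustration; the real file we use below is read the same way, just from a URL instead of a string.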

# we will import pandas using the alias pd
import pandas as pd 

Above, we imported pandas using the import statement. This statement makes the pandas package available to us in our notebook. We will use import to make other packages available to us as well throughout this course and in your research.

We can now use the functions and data structures provided by pandas in our code. We will use the pandas package to read in the data from the CSV file.

Let’s start by reading in the data from the CSV file. We will use the read_csv() function from the pandas package to do this. The read_csv() function takes one required argument, the path (or URL) of the CSV file to read. The function returns a DataFrame object, which is a data structure provided by pandas for storing tabular data. We will assign the DataFrame object returned by read_csv() to a variable called our_data.

our_data = pd.read_csv(filename)

Making sense of your DataFrame

We can preview the first five rows of the DataFrame using the head() method. The head() method takes an optional argument, the number of rows to preview. If no argument is provided, head() returns the first five rows by default.

display(our_data.head())
prolific_id AttentionCheck Trait Anxiety Loneliness Depression Obsessive Compulsion Subjective Happiness Stress Misophonia Autism Spectrum ... Avoidant Personality Disorder Borderline Personality Disorder Apathy Eating Disorder Alcohol Use Disorder Age Sex Gender Income Education
0 546ec14dfdf99b2bc7ebd032 0 37 18 28 18 11 7 8 4.527778 ... 30 2 14 2 0 55 1 1 5 6
1 548491acfdf99b0379939cc0 0 53 19 46 18 12 14 7 3.472222 ... 18 3 11 11 0 28 2 2 8 5
2 54924b8efdf99b77ccedc1d5 0 56 20 51 20 11 11 2 3.055556 ... 25 3 19 26 0 19 1 1 5 4
3 5563984afdf99b672b5749b6 0 23 11 24 3 28 6 6 3.500000 ... 14 1 5 16 0 39 1 1 11 5
4 5588d358fdf99b304ee5674f 0 59 24 57 22 8 15 5 3.888889 ... 23 5 33 10 3 27 2 2 4 6

5 rows × 21 columns

You will see that we have a lot of columns in this DataFrame. We can use the shape attribute to see the dimensions of the DataFrame. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.

our_data.shape
(1175, 21)

We have 1175 rows and 21 columns. Typically, the rows in a DataFrame represent observations and the columns represent variables. In this case, each row represents a participant and each column represents a variable measured for each participant.
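Since shape is an ordinary (rows, columns) tuple, you can unpack it directly into two variables. A small sketch, using a toy DataFrame standing in for our_data:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it unpacks like any other tuple
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 2 columns
```

This is handy when you want to report sample sizes programmatically rather than reading them off the printed tuple.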

To get a better sense of our variables, we can use the info() method. The info() method prints a summary of the DataFrame including the number of rows and columns, the number of non-null values in each column, and the data type of each column.

print(our_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1175 entries, 0 to 1174
Data columns (total 21 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   prolific_id                      1175 non-null   object 
 1   AttentionCheck                   1175 non-null   int64  
 2   Trait Anxiety                    1175 non-null   int64  
 3   Loneliness                       1175 non-null   int64  
 4   Depression                       1175 non-null   int64  
 5   Obsessive Compulsion             1175 non-null   int64  
 6   Subjective Happiness             1175 non-null   int64  
 7   Stress                           1175 non-null   int64  
 8   Misophonia                       1175 non-null   int64  
 9   Autism Spectrum                  1175 non-null   float64
 10  Social Anxiety                   1175 non-null   int64  
 11  Avoidant Personality Disorder    1175 non-null   int64  
 12  Borderline Personality Disorder  1175 non-null   int64  
 13  Apathy                           1175 non-null   int64  
 14  Eating Disorder                  1175 non-null   int64  
 15  Alcohol Use Disorder             1175 non-null   int64  
 16  Age                              1175 non-null   int64  
 17  Sex                              1175 non-null   int64  
 18  Gender                           1175 non-null   int64  
 19  Income                           1175 non-null   int64  
 20  Education                        1175 non-null   int64  
dtypes: float64(1), int64(19), object(1)
memory usage: 192.9+ KB
None

You should notice a few key things from the output of the info() method. First, we have 1175 rows and 21 columns, which matches the output of the shape attribute. Second, we have 1175 non-null values in each column, which means that we have no missing data. Third, we have a mix of data types. Most of the columns contain integers, but some of the columns contain floating point numbers and some of the columns contain strings (objects).
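You can also confirm the "no missing data" reading programmatically rather than scanning the non-null counts by eye. A sketch using a toy DataFrame that deliberately contains one missing value:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [25, 31, np.nan], "Stress": [7, 14, 6]})

# isnull() marks missing cells as True; summing twice counts them
# across every column of the whole DataFrame
n_missing = toy.isnull().sum().sum()
print(n_missing)  # 1
```

Running the same check on our_data should return 0, matching what info() showed.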

We can also list out all of the columns in the DataFrame using the columns attribute. The columns attribute returns an Index object containing the column names.

print(our_data.columns)
Index(['prolific_id', 'AttentionCheck', 'Trait Anxiety', 'Loneliness',
       'Depression', 'Obsessive Compulsion', 'Subjective Happiness', 'Stress',
       'Misophonia', 'Autism Spectrum', 'Social Anxiety',
       'Avoidant Personality Disorder', 'Borderline Personality Disorder',
       'Apathy', 'Eating Disorder', 'Alcohol Use Disorder', 'Age', 'Sex',
       'Gender', 'Income', 'Education'],
      dtype='object')

Exploring variables within your DataFrame

Let’s explore some of our variables. We can start with ‘Age’. We can select the column with our_data['Age'] and then call the describe() method on it to get summary statistics. The describe() method takes no required arguments and returns a Series object, which is a data structure provided by pandas for storing a single column of data.

our_data['Age'].describe()
count    1175.000000
mean       35.883404
std        13.353503
min        18.000000
25%        25.000000
50%        33.000000
75%        44.000000
max        77.000000
Name: Age, dtype: float64

We can see that the mean age of our participants is 35.88 years old. The youngest participant is 18 years old and the oldest participant is 77 years old. The standard deviation is 13.35 years. This seems like a pretty young sample in terms of age! I would typically plot these data to get a better sense of the distribution of ages, but we will save that for the next module.

You can use describe() to get summary statistics for other columns as well. For example, we can use describe() to get summary statistics for the ‘Subjective Happiness’ column, which contains the total score on the Subjective Happiness Scale.

our_data['Subjective Happiness'].describe()
count    1175.000000
mean       17.645957
std         5.874810
min         4.000000
25%        13.000000
50%        18.000000
75%        22.000000
max        28.000000
Name: Subjective Happiness, dtype: float64

We can see that the mean score of our participants is 17.65, so our sample is, on average, pretty happy!

For data that are rank-ordered, such as “Income” or “Education”, computing the mean and standard deviation is not very informative. Instead, we can call the value_counts() method on the column to get a frequency count for each value. The value_counts() method takes no required arguments and returns a Series object indexed by the unique values in the column, with the counts as the values.

our_data['Income'].value_counts().sort_index()
1      79
2      84
3     114
4     102
5     101
6     125
7      73
8     104
9      73
10     75
11    165
12     80
Name: Income, dtype: int64

Above, we can see a table of frequencies for each rank (ranging from 1 to 12). The counts are fairly spread across the brackets, with a notable peak at rank 11 (165 participants) and smaller peaks in the lower-middle ranks (e.g., 3 and 6). Always check the actual frequency table rather than assuming a shape for rank-ordered data like income!
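Raw counts can be hard to compare across samples of different sizes. Passing normalize=True to value_counts() returns proportions instead of counts; a toy sketch:

```python
import pandas as pd

income = pd.Series([1, 1, 2, 3, 3, 3])

# normalize=True divides each count by the total number of values,
# so the resulting proportions sum to 1
props = income.value_counts(normalize=True).sort_index()
print(props)
```

On our_data['Income'], this would tell you directly what fraction of the sample falls in each bracket.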

Creating new variables

Can money buy happiness?

Let’s explore the relationship between income and happiness. To do this, let’s split participants into different income brackets. We will create a new column called ‘IncomeSplit’ that contains the string ‘High’ if the participant’s income rank is in the top 33.3%, ‘Medium’ if the participant’s income rank is in the middle 33.3%, and ‘Low’ if the participant’s income rank is in the bottom 33.3%.


We can use the qcut() function to perform a quantile split. Here, qcut() takes three arguments: the column (Series) to split, the number of quantile bins q, and the labels to assign to the bins. The qcut() function returns a Series object, which is a data structure provided by pandas for storing a single column of data.

our_data['IncomeSplit'] = pd.qcut(our_data['Income'], q=3, labels=['Low','Medium','High']) 
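To see how qcut() assigns bins, here is a toy example splitting nine evenly spaced values into three equal-sized groups:

```python
import pandas as pd

ranks = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# qcut splits on quantiles, so each bin receives (roughly) the same
# number of rows, regardless of how the values are spaced
bins = pd.qcut(ranks, q=3, labels=["Low", "Medium", "High"])
print(bins.value_counts().sort_index())  # 3 rows in each bin
```

Note that with real data containing ties (like our income ranks), the bins will be only approximately equal in size, as the value_counts() output below shows.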

This created a new column called “IncomeSplit”. We can see it now in our columns (all the way at the end).

display(our_data.head())
prolific_id AttentionCheck Trait Anxiety Loneliness Depression Obsessive Compulsion Subjective Happiness Stress Misophonia Autism Spectrum ... Borderline Personality Disorder Apathy Eating Disorder Alcohol Use Disorder Age Sex Gender Income Education IncomeSplit
0 546ec14dfdf99b2bc7ebd032 0 37 18 28 18 11 7 8 4.527778 ... 2 14 2 0 55 1 1 5 6 Low
1 548491acfdf99b0379939cc0 0 53 19 46 18 12 14 7 3.472222 ... 3 11 11 0 28 2 2 8 5 Medium
2 54924b8efdf99b77ccedc1d5 0 56 20 51 20 11 11 2 3.055556 ... 3 19 26 0 19 1 1 5 4 Low
3 5563984afdf99b672b5749b6 0 23 11 24 3 28 6 6 3.500000 ... 1 5 16 0 39 1 1 11 5 High
4 5588d358fdf99b304ee5674f 0 59 24 57 22 8 15 5 3.888889 ... 5 33 10 3 27 2 2 4 6 Low

5 rows × 22 columns

Now that we have our new variable, we can view the frequency counts for each level of the variable using the value_counts() method.

our_data['IncomeSplit'].value_counts().sort_index()
Low       480
Medium    375
High      320
Name: IncomeSplit, dtype: int64

Now, we can use the groupby() method to group the data by the ‘IncomeSplit’ column. The groupby() method takes a single argument, the name of the column to group by. The groupby() method returns a DataFrameGroupBy object, which is a data structure provided by pandas for storing grouped data.

We will again use the describe() method to get summary statistics for the ‘Subjective Happiness’ column, grouped by the ‘IncomeSplit’ column.

our_data.groupby('IncomeSplit')['Subjective Happiness'].describe()
count mean std min 25% 50% 75% max
IncomeSplit
Low 480.0 17.435417 5.958100 4.0 13.0 18.0 22.0 28.0
Medium 375.0 17.613333 5.776342 4.0 13.0 18.0 22.0 28.0
High 320.0 18.000000 5.865258 4.0 14.0 19.0 23.0 28.0

The effect is small, but it looks like participants’ mean happiness scores increase as the income bracket goes from “Low” to “High”!

Remember, it is important to think critically about your variables. Income is associated with many different factors (e.g., access to healthcare, access to education, job security, etc.), so it makes sense if people with more resources report being happier (or if people who report being happier have greater incomes – remember correlation does not imply causation).

Filtering data

Now that we have a sense of our data, let’s filter it down to a subset of participants. We can use the query() method to do this. The query() method takes a single argument, a string containing a boolean expression. The query() method returns a DataFrame object containing only the rows that match the boolean expression.

One variable in particular is the “AttentionCheck” column, which indicates whether a participant missed an attention check question. We can use the query() method to filter out participants who missed an attention check question. We can do this by querying the ‘AttentionCheck’ column to see if it is equal to 0 (i.e., no missed checks). We can then assign the filtered DataFrame to a new variable called our_data_clean.

our_data_clean = our_data.query("AttentionCheck == 0")
our_data_clean.shape
(1090, 22)

Looks like our new DataFrame contains 1090 rows (which is 85 fewer rows than our original DataFrame). This means that 85 participants missed an attention check question! We should exclude these participants from future analyses.
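The query() method is equivalent to boolean-mask indexing, which you will also see in a lot of pandas code. A toy sketch showing both forms produce the same result:

```python
import pandas as pd

toy = pd.DataFrame({"AttentionCheck": [0, 1, 0], "Age": [25, 31, 40]})

# The bracket form keeps only the rows where the boolean condition
# is True, exactly like toy.query("AttentionCheck == 0")
clean = toy[toy["AttentionCheck"] == 0]
print(clean.shape)  # (2, 2)
```

query() is often more readable for compound conditions (e.g., "AttentionCheck == 0 and Age >= 18"), while the bracket form is more common when the mask is built programmatically.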

Let’s see how “Age” and “Subjective Happiness” changed for this filtered sample. Let’s do this in a for-loop this time.

for var_name in ['Age','Subjective Happiness']:
    print(our_data_clean[var_name].describe(), end='\n\n')
count    1090.000000
mean       35.755963
std        13.295645
min        18.000000
25%        25.000000
50%        33.000000
75%        44.000000
max        77.000000
Name: Age, dtype: float64

count    1090.000000
mean       17.694495
std         5.872166
min         4.000000
25%        13.000000
50%        18.000000
75%        22.000000
max        28.000000
Name: Subjective Happiness, dtype: float64

Not much has drastically changed! This is good news – it means that the participants who missed an attention check question are not very different from the participants who did not miss an attention check question.

Saving your filtered data

To save our filtered data, we can use the to_csv() method. The to_csv() method takes the path of the file to save the DataFrame to; here we also pass index=False so that the row index is not written to the file as an extra column.

our_data_clean.to_csv('./resources/data/Banker_et_al_2022_QuestionnaireData_clean.csv', index=False)
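It is good practice to sanity-check a saved file by reading it back and comparing it to what you wrote. A sketch using a temporary directory and a toy DataFrame (so it runs anywhere, without the notebook’s folder layout):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write to a temporary CSV (index=False keeps the row index out of the
# file), then read it back and confirm nothing was lost
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_clean.csv")
    toy.to_csv(path, index=False)
    roundtrip = pd.read_csv(path)
    print(roundtrip.shape == toy.shape)  # True
```

For the real file above, reading it back should give the same (1090, 22) shape as our_data_clean.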

Next steps

Now, using our filtered dataframe, try this for any other variables you are interested in! Pick one! For example, you can look at the descriptive statistics for any of the psychiatric variables and group them by various demographic variables.

Carry out the following tasks:

  1. Pick one of the psychiatric variables (e.g., “Depression”, “Stress”, “Social Anxiety”)

  2. Use the describe() method to get summary statistics for that variable.

  3. Pick a second variable and create a new column based on the median split (hint: you can change q=3 to q=2 and labels=["Low","Medium","High"] to labels=["Low","High"] in the pd.qcut() function).

  4. Summarize the psychiatric variable by the new column you created in step 3 using the groupby() method.

# Insert your code here