Data Processing¶
This tutorial was inspired by and adapted from Shawn A. Rhoads’ PSYC 347 Course [CC BY-SA 4.0 License].
Learning objectives¶
This notebook is intended to teach you basic python syntax for:
Reading data from a file
Summarizing data
Filtering data
Writing data to a file
One important package will be used in this notebook: pandas. Pandas is a powerful data analysis package. It is designed to make data analysis fast and easy in Python. It is also designed to work with data in a tabular format, which is a common way to store data.
Reading data from a CSV file¶
In this notebook, we are going to work with open data from Sarah’s 2022 paper:
Banker, S. M., Na, S., Beltrán, J., Koenigsberg, H. W., Foss-Feig, J. H., Gu, X., & Schiller, D. (2022). Disrupted computations of social control in individuals with obsessive-compulsive and misophonia symptoms. iScience, 25(7), 104617. https://doi.org/10.1016/j.isci.2022.104617
The data is available on OSF. I already cleaned it up a little bit. We will be using the Banker_et_al_2022_QuestionnaireData.csv file.
filename = 'https://raw.githubusercontent.com/Center-for-Computational-Psychiatry/course_spice/main/modules/resources/data/Banker_et_al_2022_QuestionnaireData.csv'
To read this file, we will use the pandas package. pandas is a Python library used for data manipulation and analysis. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for working with data from a variety of sources such as CSV files, SQL databases, and Excel spreadsheets. Pandas is commonly used in data science and machine learning applications, as well as in finance, social science, and other fields where data analysis is important.
A CSV file is a comma-separated values file, which allows data to be saved in a tabular format. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. CSV files can be opened in spreadsheet applications like Microsoft Excel or Google Sheets.
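Before reading the real file, here is a minimal sketch (with made-up data, not the study file) of the record/field structure described above; pandas can parse CSV text from any file-like object, not just a path or URL.

```python
import io

import pandas as pd

# Three records, each with three comma-separated fields, plus a header line
csv_text = """participant,age,score
p01,25,17
p02,31,22
p03,28,19
"""

# io.StringIO wraps the string so read_csv can treat it like a file
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)
print(small_df.shape)  # (3, 3): three records, three fields each
```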
# we will import pandas using the alias pd
import pandas as pd
Above, we imported pandas using the import statement. This statement makes the pandas package available to us in our notebook. We will use import to make other packages available to us as well throughout this course and in your research.
We can now use the functions and data structures provided by pandas in our code. We will use the pandas package to read in the data from the CSV file.
Let’s start by reading in the data from the CSV file. We will use the read_csv() function from the pandas package to do this. The read_csv() function takes the path (or URL) of the CSV file to read and returns a DataFrame object, which is a data structure provided by pandas for storing tabular data. We will assign the DataFrame object returned by read_csv() to a variable called our_data.
our_data = pd.read_csv(filename)
Making sense of your DataFrame¶
We can preview the first five rows of the DataFrame using the head() method. The head() method takes an optional argument, the number of rows to preview. If no argument is provided, head() returns the first five rows by default.
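As a quick sketch (using a small made-up DataFrame, not the study data), head() with and without an argument behaves like this:

```python
import pandas as pd

# Ten made-up rows so the default and custom previews differ
toy = pd.DataFrame({'participant': range(10), 'score': range(10, 20)})

print(len(toy.head()))   # 5 -- default is the first five rows
print(len(toy.head(3)))  # 3 -- or pass the number of rows you want
```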
display(our_data.head())
prolific_id | AttentionCheck | Trait Anxiety | Loneliness | Depression | Obsessive Compulsion | Subjective Happiness | Stress | Misophonia | Autism Spectrum | ... | Avoidant Personality Disorder | Borderline Personality Disorder | Apathy | Eating Disorder | Alcohol Use Disorder | Age | Sex | Gender | Income | Education | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 546ec14dfdf99b2bc7ebd032 | 0 | 37 | 18 | 28 | 18 | 11 | 7 | 8 | 4.527778 | ... | 30 | 2 | 14 | 2 | 0 | 55 | 1 | 1 | 5 | 6 |
1 | 548491acfdf99b0379939cc0 | 0 | 53 | 19 | 46 | 18 | 12 | 14 | 7 | 3.472222 | ... | 18 | 3 | 11 | 11 | 0 | 28 | 2 | 2 | 8 | 5 |
2 | 54924b8efdf99b77ccedc1d5 | 0 | 56 | 20 | 51 | 20 | 11 | 11 | 2 | 3.055556 | ... | 25 | 3 | 19 | 26 | 0 | 19 | 1 | 1 | 5 | 4 |
3 | 5563984afdf99b672b5749b6 | 0 | 23 | 11 | 24 | 3 | 28 | 6 | 6 | 3.500000 | ... | 14 | 1 | 5 | 16 | 0 | 39 | 1 | 1 | 11 | 5 |
4 | 5588d358fdf99b304ee5674f | 0 | 59 | 24 | 57 | 22 | 8 | 15 | 5 | 3.888889 | ... | 23 | 5 | 33 | 10 | 3 | 27 | 2 | 2 | 4 | 6 |
5 rows × 21 columns
You will see that we have a lot of columns in this DataFrame. We can use the shape attribute to see the dimensions of the DataFrame. The shape attribute returns a tuple containing the number of rows and columns in the DataFrame.
our_data.shape
(1175, 21)
We have 1175 rows and 21 columns. Typically, the rows in a DataFrame represent observations and the columns represent variables. In this case, each row represents a participant and each column represents a variable measured for each participant.
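Since shape is just a (rows, columns) tuple, it can also be unpacked directly into two variables. A sketch with made-up data (with the study file, this would give 1175 and 21):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Tuple unpacking: shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
```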
To get a better sense of our variables, we can use the info() method. The info() method prints a summary of the DataFrame including the number of rows and columns, the number of non-null values in each column, and the data type of each column.
print(our_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1175 entries, 0 to 1174
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 prolific_id 1175 non-null object
1 AttentionCheck 1175 non-null int64
2 Trait Anxiety 1175 non-null int64
3 Loneliness 1175 non-null int64
4 Depression 1175 non-null int64
5 Obsessive Compulsion 1175 non-null int64
6 Subjective Happiness 1175 non-null int64
7 Stress 1175 non-null int64
8 Misophonia 1175 non-null int64
9 Autism Spectrum 1175 non-null float64
10 Social Anxiety 1175 non-null int64
11 Avoidant Personality Disorder 1175 non-null int64
12 Borderline Personality Disorder 1175 non-null int64
13 Apathy 1175 non-null int64
14 Eating Disorder 1175 non-null int64
15 Alcohol Use Disorder 1175 non-null int64
16 Age 1175 non-null int64
17 Sex 1175 non-null int64
18 Gender 1175 non-null int64
19 Income 1175 non-null int64
20 Education 1175 non-null int64
dtypes: float64(1), int64(19), object(1)
memory usage: 192.9+ KB
None
You should notice a few key things from the output of the info() method. First, we have 1175 rows and 21 columns, which matches the output of the shape attribute. Second, we have 1175 non-null values in each column, which means that we have no missing data. Third, we have a mix of data types. Most of the columns contain integers, but one column contains floating point numbers and one contains strings (objects).
We can also list out all of the columns in the DataFrame using the columns attribute. The columns attribute returns an Index of column names.
print(our_data.columns)
Index(['prolific_id', 'AttentionCheck', 'Trait Anxiety', 'Loneliness',
'Depression', 'Obsessive Compulsion', 'Subjective Happiness', 'Stress',
'Misophonia', 'Autism Spectrum', 'Social Anxiety',
'Avoidant Personality Disorder', 'Borderline Personality Disorder',
'Apathy', 'Eating Disorder', 'Alcohol Use Disorder', 'Age', 'Sex',
'Gender', 'Income', 'Education'],
dtype='object')
Exploring variables within your DataFrame¶
Let’s explore some of our variables. We can start with ‘Age’. We can use the describe() method to get summary statistics for a column. We call describe() directly on the column; it returns a Series object, which is a data structure provided by pandas for storing a single column of data — here, summary statistics such as the count, mean, standard deviation, and quartiles.
our_data['Age'].describe()
count 1175.000000
mean 35.883404
std 13.353503
min 18.000000
25% 25.000000
50% 33.000000
75% 44.000000
max 77.000000
Name: Age, dtype: float64
We can see that the mean age of our participants is 35.88 years old. The youngest participant is 18 years old and the oldest participant is 77 years old. The standard deviation is 13.35 years. This seems like a pretty young sample in terms of age! I would typically plot these data to get a better sense of the distribution of ages, but we will save that for the next module.
You can use describe() to get summary statistics for other columns as well. For example, we can use describe() to get summary statistics for the ‘Subjective Happiness’ column, which contains the total score on the Subjective Happiness Scale.
our_data['Subjective Happiness'].describe()
count 1175.000000
mean 17.645957
std 5.874810
min 4.000000
25% 13.000000
50% 18.000000
75% 22.000000
max 28.000000
Name: Subjective Happiness, dtype: float64
We can see that the mean score of our participants is 17.64 – our sample is pretty happy!
For data that are rank-ordered, such as “Income” or “Education”, computing the mean and standard deviation is not very informative. Instead, we can use the value_counts() method to get a frequency count for each value in the column. Like describe(), value_counts() is called directly on the column and returns a Series object indexed by the unique values, with the frequency of each value.
our_data['Income'].value_counts().sort_index()
1 79
2 84
3 114
4 102
5 101
6 125
7 73
8 104
9 73
10 75
11 165
12 80
Name: Income, dtype: int64
Above, we can see a table of frequencies for each rank (ranging from 1 to 12). You should notice that participants are not evenly distributed across the ranks: the lower-to-middle ranks (e.g., 3, 4, 5, 6) are well represented, while there is also a noticeable spike at rank 11. Uneven, skewed distributions like this are common for rank-ordered income data!
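If proportions are easier to interpret than raw counts, value_counts(normalize=True) returns the share of observations at each value instead. A sketch with made-up ranks, not the study data:

```python
import pandas as pd

# Eight made-up rank responses
ranks = pd.Series([1, 1, 2, 3, 3, 3, 4, 4])

# normalize=True converts counts to proportions; sort_index orders by rank
proportions = ranks.value_counts(normalize=True).sort_index()
print(proportions)  # rank 3 is 3/8 = 0.375 of the sample
```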
Creating new variables¶
Can money buy happiness?¶
Let’s explore the relationship between income and happiness. To do this, let’s split participants into different income brackets. We will create a new column called ‘IncomeSplit’ that contains the string ‘High’ if the participant’s income rank is in the top 33.3%, ‘Medium’ if the participant’s income rank is in the middle 33.3%, and ‘Low’ if the participant’s income rank is in the bottom 33.3%.
We can use the qcut() function to perform a quantile split. Here, the qcut() function takes three arguments: the column (a Series) to split, the number of quantile bins (q), and labels for the bins. The qcut() function returns a Series object, which is a data structure provided by pandas for storing a single column of data.
our_data['IncomeSplit'] = pd.qcut(our_data['Income'], q=3, labels=['Low','Medium','High'])
This created a new column called “IncomeSplit”. We can see it now in our columns (all the way at the end).
display(our_data.head())
prolific_id | AttentionCheck | Trait Anxiety | Loneliness | Depression | Obsessive Compulsion | Subjective Happiness | Stress | Misophonia | Autism Spectrum | ... | Borderline Personality Disorder | Apathy | Eating Disorder | Alcohol Use Disorder | Age | Sex | Gender | Income | Education | IncomeSplit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 546ec14dfdf99b2bc7ebd032 | 0 | 37 | 18 | 28 | 18 | 11 | 7 | 8 | 4.527778 | ... | 2 | 14 | 2 | 0 | 55 | 1 | 1 | 5 | 6 | Low |
1 | 548491acfdf99b0379939cc0 | 0 | 53 | 19 | 46 | 18 | 12 | 14 | 7 | 3.472222 | ... | 3 | 11 | 11 | 0 | 28 | 2 | 2 | 8 | 5 | Medium |
2 | 54924b8efdf99b77ccedc1d5 | 0 | 56 | 20 | 51 | 20 | 11 | 11 | 2 | 3.055556 | ... | 3 | 19 | 26 | 0 | 19 | 1 | 1 | 5 | 4 | Low |
3 | 5563984afdf99b672b5749b6 | 0 | 23 | 11 | 24 | 3 | 28 | 6 | 6 | 3.500000 | ... | 1 | 5 | 16 | 0 | 39 | 1 | 1 | 11 | 5 | High |
4 | 5588d358fdf99b304ee5674f | 0 | 59 | 24 | 57 | 22 | 8 | 15 | 5 | 3.888889 | ... | 5 | 33 | 10 | 3 | 27 | 2 | 2 | 4 | 6 | Low |
5 rows × 22 columns
Now that we have our new variable, we can view the frequency counts for each level of the variable using the value_counts() method.
our_data['IncomeSplit'].value_counts().sort_index()
Low 480
Medium 375
High 320
Name: IncomeSplit, dtype: int64
Now, we can use the groupby() method to group the data by the ‘IncomeSplit’ column. The groupby() method takes a single argument, the name of the column to group by, and returns a DataFrameGroupBy object, which is a data structure provided by pandas for storing grouped data.

We will again use the describe() method to get summary statistics for the ‘Subjective Happiness’ column, grouped by the ‘IncomeSplit’ column.
our_data.groupby('IncomeSplit')['Subjective Happiness'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
IncomeSplit | ||||||||
Low | 480.0 | 17.435417 | 5.958100 | 4.0 | 13.0 | 18.0 | 22.0 | 28.0 |
Medium | 375.0 | 17.613333 | 5.776342 | 4.0 | 13.0 | 18.0 | 22.0 | 28.0 |
High | 320.0 | 18.000000 | 5.865258 | 4.0 | 14.0 | 19.0 | 23.0 | 28.0 |
The effect is small, but it looks like participants’ mean happiness scores increase as the income bracket goes from “Low” to “High”!
Remember, it is important to think critically about your variables. Income is associated with many different factors (e.g., access to healthcare, access to education, job security, etc.), so it makes sense if people with more resources report being happier (or if people who report being happier have greater incomes – remember correlation does not imply causation).
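If you wanted a single number for the raw association (rather than comparing three bins), one option is a Pearson correlation via the .corr() method. A sketch with made-up numbers, not the study data:

```python
import pandas as pd

# Six made-up participants with loosely increasing happiness by income rank
toy = pd.DataFrame({
    'Income': [1, 2, 3, 4, 5, 6],
    'Subjective Happiness': [12, 15, 14, 18, 17, 20],
})

# Series.corr computes the Pearson correlation by default
r = toy['Income'].corr(toy['Subjective Happiness'])
print(round(r, 3))  # strongly positive in this toy example
```

As the text above notes, a correlation like this still says nothing about causation.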
Filtering data¶
Now that we have a sense of our data, let’s filter it down to a subset of participants. We can use the query() method to do this. The query() method takes a single argument, a string containing a boolean expression, and returns a DataFrame object containing only the rows that match the expression.

One variable of particular interest is the “AttentionCheck” column, which indicates whether a participant missed an attention check question. We can use the query() method to filter out participants who missed an attention check question by querying whether the ‘AttentionCheck’ column is equal to 0. We then assign the filtered DataFrame to a new variable called our_data_clean.
our_data_clean = our_data.query("AttentionCheck == 0")
our_data_clean.shape
(1090, 22)
Looks like our new DataFrame contains 1090 rows (which is 85 fewer rows than our original DataFrame). This means that 85 participants missed an attention check question! We should exclude these participants from future analyses.
Let’s see how “Age” and “Subjective Happiness” changed for this filtered sample. Let’s do this in a for-loop this time.
for var_name in ['Age','Subjective Happiness']:
print(our_data_clean[var_name].describe(), end='\n\n')
count 1090.000000
mean 35.755963
std 13.295645
min 18.000000
25% 25.000000
50% 33.000000
75% 44.000000
max 77.000000
Name: Age, dtype: float64
count 1090.000000
mean 17.694495
std 5.872166
min 4.000000
25% 13.000000
50% 18.000000
75% 22.000000
max 28.000000
Name: Subjective Happiness, dtype: float64
Not much has drastically changed! This is good news – it means that the participants who missed an attention check question are not very different from the participants who did not miss an attention check question.
Saving your filtered data¶
To save our filtered data, we can use the to_csv() method. The to_csv() method takes the name of the file to save the DataFrame to; we also pass index=False so that the row index is not written to the file.
our_data_clean.to_csv('./resources/data/Banker_et_al_2022_QuestionnaireData_clean.csv', index=False)
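A quick way to check that a save worked is a round trip: write the file, read it back, and compare. A sketch using a made-up DataFrame and a hypothetical temporary file path, not the study data:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'participant': ['p01', 'p02'], 'score': [17, 22]})

# Write to a temporary location; index=False keeps the row index out of the file
path = os.path.join(tempfile.gettempdir(), 'roundtrip_check.csv')
toy.to_csv(path, index=False)
reloaded = pd.read_csv(path)

print(toy.equals(reloaded))  # True if the round trip preserved the data
```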
Next steps¶
Now, using our filtered DataFrame, try this out with other variables you are interested in! For example, you can look at the descriptive statistics for any of the psychiatric variables and group them by various demographic variables.
Carry out the following tasks:

1. Pick one of the psychiatric variables (e.g., “Depression”, “Stress”, “Social Anxiety”).
2. Use the describe() method to get summary statistics for that variable.
3. Pick a second variable and create a new column based on a median split (hint: you can change q=3 to q=2 and labels=["Low","Medium","High"] to labels=["Low","High"] in the pd.qcut() function).
4. Summarize the psychiatric variable by the new column you created in step 3 using the groupby() method.
# Insert your code here