Let’s talk big data

Big data can be messy and complicated or elegantly simple. Many big data projects begin from the need to answer specific questions, and with the right analytics in place, organizations can find actionable insights into their operations. In addition, big data allows different variations of computer-aided designs to be tested to see how even minor changes affect outcomes. Big data projects can obtain, process and analyze data in a variety of ways, and every data source has different characteristics and provides valuable information. With these possibilities in mind, the National Science Foundation awarded a $4.8 million grant to an education project called LearnSphere. LearnSphere is poised to hold large amounts of anonymous student information that is routinely collected for different data analysis purposes. Most importantly, it will allow for large-scale analysis, down to “being able to detect emotional states from keystroke data.”

So what does all this mean? I had a few questions, and Dr. Kenneth Koedinger, who is spearheading this effort, was willing to address my questions and concerns. Ken is a professor of Human-Computer Interaction and Psychology at Carnegie Mellon University. He has an M.S. in Computer Science, a Ph.D. in Cognitive Psychology, and experience teaching in an urban high school.

Although I admire the ambitions of the project, which aims for a deeper understanding of the learning process, my initial concern was the lack of individual student acknowledgment. Data analytics can be a great tool if we need to raise efficiency in production or predict consumer behavior, but students must not merely be viewed as products to be improved upon. How do we ensure that the use of such data does not aggravate existing bias and discrimination in education? What measures will protect our most at-risk students: students with learning differences, and students marginalized for their race, religion or nationality? And while the project insists that kids are not numbers, the data generated from students is treated as numbers so that it can be examined in the most unbiased way possible.

And this is where the project gets interesting. First, the data used for research has been de-identified. Demographic information has been removed from the data sets researchers are looking at. Records in the shared data sets are randomly assigned new identifiers, and they do not indicate race, geographic location or the school the information comes from. Measures have been taken to make it difficult to trace records back to particular students. The project maintains there is no back mapping to the native records.
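
To make the idea concrete, here is a minimal sketch of what this kind of de-identification could look like. It is only an illustration under assumptions of my own: the field names (`student_id`, `race`, `location`, `school`) and the function `deidentify` are hypothetical, and nothing here describes how LearnSphere actually processes its data.

```python
import secrets

# Hypothetical demographic field names; the real schema is not described in this post.
DEMOGRAPHIC_FIELDS = {"race", "location", "school"}

def deidentify(records, id_field="student_id"):
    """Return copies of the records with random identifiers and no demographic fields."""
    new_ids = {}  # original id -> random id, held only for the duration of this call
    cleaned = []
    for record in records:
        original = record[id_field]
        if original not in new_ids:
            # A random token has no relationship to the original identifier.
            new_ids[original] = secrets.token_hex(8)
        row = {key: value for key, value in record.items()
               if key != id_field and key not in DEMOGRAPHIC_FIELDS}
        row["anon_id"] = new_ids[original]
        cleaned.append(row)
    # The mapping is discarded when the function returns, so nothing
    # is kept that could link the new identifiers back to the originals.
    return cleaned
```

The same student keeps the same random identifier across their records, so learning can still be followed over time, but the lookup table that would connect that identifier to a real student is never stored.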

The main objective for LearnSphere is to improve student outcomes. What are the difficult parts of learning a course? Why do students have such a difficult time with certain mathematical problems and not others? For example, some might predict that math word problems will be harder for students to work through, but as it turns out, some students did better with a math problem that used language than with a bare equation on a piece of paper. The project will look at the learning barriers some students face and how we can use this information to improve learning designs. Where are students now in the learning process, and how can we be sure we provide as much support as we can with the data we possess? As Dr. Koedinger clarified for me, “we are studying the terrain on a hiking course rather than the hikers.” So how can a study like this make the terrain easier for our hikers? Well, LearnSphere aims to help us all identify those “expert blindspots.” For example, a teacher will know all his or her students, but there are spots a teacher just doesn’t see because of closeness to the students. LearnSphere aims to help teachers create effective teaching environments so that our kids can hike the next hill a bit more easily.

I still have some reservations about studying vast amounts of data. If data sets include student papers and extensive student data, the risk of re-identification exists. If we are studying data with such fine precision, at what point can the source of the data be traced back to particular students? There can be no guarantee that data will be fully non-identifiable without an independent qualified expert review and approval of the aggregation methodology. Leading de-identification experts need to be involved with a project of this scale. I also hope Dr. Koedinger will take advantage of the opportunity to work with two of the most highly respected experts on privacy, Professors Lorrie Cranor and Alessandro Acquisti, both of whom happen to be located nearby at Carnegie Mellon University.
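
One common way reviewers reason about this kind of risk is a k-anonymity check: count how many records share each combination of seemingly harmless attributes, and flag combinations that are rare enough to single someone out. The sketch below is my own illustration, not a method the project has said it uses; the function `small_groups`, the threshold `k`, and any attribute names passed to it are assumptions.

```python
from collections import Counter

def small_groups(records, quasi_identifiers, k=5):
    """Return attribute combinations shared by fewer than k records."""
    counts = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    # Any combination held by fewer than k records could help single out a student.
    return {combo: count for combo, count in counts.items() if count < k}
```

If, say, only one record combines a particular course with a particular grade level, that record is a candidate for re-identification even though it carries no name.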

I advocate for smart and ethical collection of data because I see the potential benefits of its use. We must remain mindful of the privacy issues raised by a project such as this. Will we truly be making the educational system more equitable, or will the algorithms of big data systematically discriminate against those learners already at a disadvantage? Projects like this can help, but only if the concerns of potential discrimination are carefully considered and the goal of helping individual students is paramount.

We ran out of time (or phone battery) when we spoke to Ken, but there will be more information clarifying some of my still-outstanding questions. Stay tuned…

In the meantime, here is a short video of Ken explaining his Learning Project
