So much data, so little time (and space and money and organization)

Last week at the National Radio Astronomy Observatory in Green Bank, WV, where I work, we hosted a conference called "Innovations in Data-Intensive Astronomy." While this specific title might put you off if you are not someone who cares about making innovations in data-intensive astronomy, don't stop reading! The issues discussed were relevant to all fields and subfields of science and are sociological as well as technical, practical as well as philosophical.

The conference was convened because of the huge amounts of data generated by modern telescopes, which often sample on the level of microseconds (meaning that they'll gather data for x microseconds and then dump that data into a file, so that for every second of observation time, you have a million datapoints to deal with). Given that observation times are on the scale of hours, that means that for a regular high-sampling observation, you could have 3.6 billion datapoints per hour hanging out in a file being like, "Whatchu gonna do with me?"
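For the curious, here is a back-of-the-envelope sketch of that arithmetic. It assumes exactly one sample per microsecond, and the 4 bytes per sample is purely my guess for illustration, not a spec from any real telescope. The function names are made up for this example.

```python
# Rough sample counts for a microsecond-sampled observation.
SAMPLES_PER_SECOND = 1_000_000  # assumption: one sample every microsecond


def datapoints(observation_seconds, samples_per_second=SAMPLES_PER_SECOND):
    """Number of datapoints gathered over an observation of the given length."""
    return observation_seconds * samples_per_second


def size_in_terabytes(n_datapoints, bytes_per_sample=4):
    """Rough file size in TB; bytes_per_sample=4 is an assumed value."""
    return n_datapoints * bytes_per_sample / 1e12


one_minute = datapoints(60)
one_hour = datapoints(60 * 60)
print(f"one minute: {one_minute:,} datapoints")
print(f"one hour:   {one_hour:,} datapoints, ~{size_in_terabytes(one_hour):.4f} TB")
```

A single stream at this rate is small next to a terabyte. Change `bytes_per_sample` or the observation length to see how quickly the total grows.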

Obviously stolen image.

These files quickly stretch their tentacles into the range of terabytes (1 terabyte being perhaps a few times the size of the average laptop hard drive). In the office next to mine sit about 50 one-terabyte external hard drives, all home to strings and strings and columns and columns of numbers that add up to lots of information about the universe (in the case of my next-door neighbor, about pulsars). The problem is not just in astronomy, but in all tech-heavy scientific fields, from bioinformatics to climatology.

The existence of offices that become disk sanctuaries, like some kind of scientific Ark of the Covenant, raises many questions. Since this blog has the name that it does, you may be ahead of this sentence in knowing that they are smaller questions.

The conference was convened to address these, so that the future does not leave all the science geeks crushed to death (both literally and figuratively) by their data (talk about biting the hand that feeds, UNIVERSE).

Some of the questions:

  1. How do we make this much data into something understandable by our puny brains?
  2. How do we make/buy computers that are fast enough to do this analysis, when all our budgets are being slaughtered?
  3. Can we store this much data, or do we have to delete it after we've extracted our results?
  4. How do we choose what to delete and what to save?
  5. Do we have a duty to The Future to archive data so that our results are still replicable?
  6. Should there be more universal access to data, especially if a program is funded by tax dollars?
  7. Should there be more universal software-based analysis tools so that scientists are not always doing hack-jobs of creating their own code that is slightly different at every institution?
  8. Should any of this be federally supported or mandated or managed so that data is more accessible, formats and tools are more universal, and processes are more open?
  9. Will our ability to take data soon outstrip our ability to understand it?

These questions, and more, were hotly debated by people in patterned button-down shirts in the Jansky Auditorium below my office.

I dutifully recorded and archived the sessions on Ustream.tv so that anyone can watch them (access to data, what what).

Check them out here, and weigh in on the blog. What to do about data? What do you think of a world piled high with hard drives?

Reader Comments (3)

I think this is a pretty common problem in a lot of fields right now. There was even a new field (bioinformatics) that grew out of biology due to the problem of dealing with mountains of DNA sequence data.

Also, my job in Colorado is going to be about the "data mountain" problem. The patients they treat with radiation therapy come in for 30-40 separate treatments. Each time they come in, they have a bunch of images taken of them, and a few other tests. They want me to sift through all those images and find something that correlates with the treatment outcomes.

May 9, 2011 | Unregistered Commenter Tripp

Are you going to write a visual-based algorithm to do it for you?

May 9, 2011 | Unregistered Commenter Sarah Scoles

