They say better late than never, right? Do people still blog in 2017? Let’s hope so!
The last time I updated was about two years ago, and my, how things have changed. These days I do more work on data analysis than visualization, and I’m learning even more: new programming languages, ways to structure code, ways to work better with my colleagues, ways to tell stories. New philosophies about what it means to be a coder and a journalist.
(Oh, I also fell off the radar while I battled stage four kidney failure, started an obsessive healthy eating and weight lifting habit, developed a blood disease anyway, landed on dialysis, got worked up for the second kidney transplant of my life, got said transplant from a fellow news nerd (because, of course that would happen!), recovered from the transplant and reimmersed myself in coding and DC life. More on all that is on my Facebook page and this other blog.)
I’ve made an official goal of sharing my knowledge here again, so share I shall, although maybe in shorter bursts. That’s probably a good thing for you, as well as me. Yes?
I know I’m better because I’m working on three projects simultaneously at work again (gosh, I love my job), and one thing I’ve noticed in all of them, and been wrestling with a lot, is the concept of “clean data”. In teaching, I often explained this as rows and columns having meaning and wanting to stay organized, so things belong in separate boxes. My definition has expanded since doing more ETL work (that’s extract, transform, load: the process of taking data given to you and putting it into a usable format; for me that has sometimes meant a MySQL database, more recently an R data frame).
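To make “separate boxes” concrete, here’s a minimal sketch in R; the data frame and its values are made up for illustration. Each row is one record, each column one kind of information, and each cell holds a single value of a single type.

```r
# A made-up example of clean, box-like data: one row per school,
# one column per variable, one value per cell.
schools <- data.frame(
  name       = c("Eastern High", "Wilson High"),
  enrollment = c(750, 1800),
  opened     = as.Date(c("1923-09-01", "1935-09-01")),
  stringsAsFactors = FALSE
)

str(schools)  # each column keeps one consistent type: chr, num, Date
```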
However, clean data also has a lot to do with like things being alike. For example, if you want to compare dates in a data set, sometimes dates are written mm/dd/yyyy, like 05/24/1986. But you could also have 24-05-86, which uses hyphens rather than slashes, a two-digit year instead of four, and switches the order of the month and the day. Without consistency, it’s hard to sort by the most recent date. There are tools to help with this standardization, which I am learning. In the past, I was willing to do this work but felt it wasn’t part of journalism; I’ve now accepted, and even enjoy, that getting to the analysis part is just as much journalism and just as important. A tech person outside of journalism recently commented to me that journalists work with less protected, more real-world data, which makes handling lots of different use cases both a bigger challenge and more important. Not something I had thought about before.
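As a sketch of what that standardization can look like in base R (the vector of dates here is hypothetical), you can parse each format with its own format string and fill in the gaps:

```r
# Mixed date formats, as they might arrive in a raw data set (made up).
dates <- c("05/24/1986", "24-05-86", "12/01/2015")

# Parse the mm/dd/yyyy values first; anything that doesn't match comes back NA.
parsed <- as.Date(dates, format = "%m/%d/%Y")

# Then try the dd-mm-yy pattern on whatever is still missing.
still_na <- is.na(parsed)
parsed[still_na] <- as.Date(dates[still_na], format = "%d-%m-%y")

parsed                           # all ISO dates now: 1986-05-24 1986-05-24 2015-12-01
sort(parsed, decreasing = TRUE)  # sorting by most recent finally works
```

If you’d rather not juggle format strings yourself, the lubridate package’s parse_date_time() can try several orders at once, something like parse_date_time(dates, orders = c("mdY", "dmy")).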
Secondly, I’ve started thinking of clean data as having one column for each type of information. I’ve dealt with a few cases where information in multiple columns needed to be combined, or broken into separate rows or records. I’ve been reading about and practicing reshaping data, particularly in R, and I’m starting to really like this language for its anticipation of common data concerns. I know I could handle a problem like this in Excel or Ruby (I tried before I discovered reshape), but R makes it a lot easier; there’s a quick sketch below. And while I respect these data munging tasks, I still can’t help but appreciate getting through them more efficiently to get to the good stuff. Reshape is covered nicely in the third week of Coursera’s R Programming series, which I highly recommend.
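Here’s a minimal sketch of that kind of reshape using reshape2’s melt(); the data frame, column names, and numbers are all invented for illustration:

```r
library(reshape2)

# Wide format: one row per city, one column per year's population (made-up numbers).
wide <- data.frame(
  city     = c("Washington", "Baltimore"),
  pop_2015 = c(672000, 622000),
  pop_2016 = c(681000, 615000),
  stringsAsFactors = FALSE
)

# melt() breaks the year columns into separate rows, so each row holds
# exactly one observation: a city, a year, a population.
long <- melt(wide, id.vars = "city",
             variable.name = "year", value.name = "population")

long
#>         city     year population
#> 1 Washington pop_2015     672000
#> 2  Baltimore pop_2015     622000
#> 3 Washington pop_2016     681000
#> 4  Baltimore pop_2016     615000
```

dcast() goes the other direction, and tidyr’s gather() and spread() cover similar ground if you’re in the tidyverse.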
If you’re wondering, my favorite R book is “R for Data Science” by Hadley Wickham, of course. You can access it for free here. I’m working my way through it with some of my colleagues at work, but I’m happy to talk about it with anyone at any time.
It’s getting late here, so that’ll be all for now, but I hope to get back in the habit of sharing things I learn, and my Twitter account is finally active again. I’m also seeking to get back into speaking at and attending conferences and workshops, so hopefully I’ll see you around here or in real life. Click “Contact Me” in the upper right to, well, you know. I’ve had a great time meeting, chatting and learning with all of you, and dearly hope it continues. More soon!