PyGrunn: ‘all you need is less’, rethinking big data - Berco Beute

Tags: pygrunn

(One of my summaries of a talk at the 2017 PyGrunn conference).

He has been doing some thinking lately, so he wants to talk at a higher level. The systems that we’re building are getting bigger and more important. If something goes down, real problems often occur.

Similarly, it is important to think about ethics in software. What do you build? Is it OK to build it?

Perhaps more important than the actual software: the data. There’s a difference between data and information. Information is data that’s been processed/interpreted/organized/categorized so that you can do more with it.

Data by itself is not meaningful. Once you add context, it starts to come to life. A context can be your application. In a way, “big data” is thus meaningless. Meaningless without (your application’s) context.

He showed a quote by Frank Zappa:

Information is not knowledge.
Knowledge is not wisdom.
Wisdom is not truth.
Truth is not beauty.
Beauty is not love.
Love is not music.
Music is the best...

The amount of available data is huge. It grows and grows and grows every year. A “Cambrian explosion”. Much of this data is generated by us. Should we generate it? Should we keep it around?

The “internet of things” will only make the growth bigger.

Data is scattered across organisations. Offline and online. Look at the amount of data that your government has about you. The Dutch government is pretty OK, but quite a number of other governments... Do you want them to have lots of data about you?

Information is also scattered across (your) devices.

The question: is that a problem?

  • There’s a cost component. Storing data costs money. It is expensive to collect. The less you collect, the less you have to pay.

  • It is vulnerable! Data is a liability: you’re responsible for the data you collect and store. So you have to take lots of care! Data is a valuable good: for others as well. It is worth having and it is worth stealing.

    See also ransomware.

    Data can also be manipulated, so you can be misled. And stolen personal data can be used for fraud or theft.

  • It impairs your vision. Can you still see the forest for the trees?

  • Shifting context blurs the meaning of data. Data collected for one purpose might not be correct when used for another.

What is the cause?

  • Ehm, why do we copy data so much? It looks like hoarding.

    One reason is the computer itself. Originally, computers weren’t connected. So you had to copy data to get it from one computer to the next. And originally they had very little memory, so you had to copy data in and out of the computer. Now everything is connected and we have huge amounts of memory.

    We, as an industry, also have an almost built-in need to work with the original data and to do all the calculations ourselves. Isn’t the actual information we want enough?

Is there another way?

  • Stop copying! Do you really need the data? Locally? Given enough bandwidth and low enough latency, copies are not needed. (See the George Gilder book “Telecosm: The World After Bandwidth Abundance”).

    Is it so strange? No. Who still has CDs? Who has Spotify?

  • Information minimization. There should be a strict rule “don’t collect more information than you need”.

    Design information systems to require as little data as possible.

  • Start with the process. Process-based design instead of data-centric design. When starting with the data, you seem to always need more and more. If you start with the process, you look at what you really need.

  • Contract-based systems. He thinks we’ll slowly go in that direction. Interactions will be much more negotiation-like. You’ll ask for the information you need.

    An example. An online store won’t ask you for your address directly. It will ask you for permission to request the information from the government. There’ll be a “contract” between the store and the government that deals with the use the store is allowed to make of the address information.

    A bartender only needs to know whether you’re 18+ or not. He doesn’t need all the other information in the passport that you currently have to show him to prove your age.
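
To make the bartender example a bit more concrete, here is a minimal Python sketch. It is not from the talk: the `Passport` record and the `is_at_least` function are hypothetical names, made up for illustration. The idea is only that the party holding the data answers one yes/no question, so the asker never sees the name, the birth date or the document number.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Passport:
    # Hypothetical record, kept by whoever issued the document,
    # not by the bartender.
    name: str
    birth_date: date
    document_number: str


def is_at_least(passport: Passport, years: int, today: date) -> bool:
    """Answer a single yes/no question instead of handing over the record."""
    # Has this year's birthday already happened (or is it today)?
    birthday_passed = (today.month, today.day) >= (
        passport.birth_date.month,
        passport.birth_date.day,
    )
    age = today.year - passport.birth_date.year - (0 if birthday_passed else 1)
    return age >= years


# Made-up example data.
passport = Passport("Example Person", date(2000, 6, 15), "X0000000")

# The bartender only ever receives this boolean.
old_enough = is_at_least(passport, 18, today=date(2018, 6, 15))
```

In a real contract-based setup the question would go to a trusted party over the network, with an agreement on what the answer may be used for. The sketch only shows the minimization part: one boolean goes out, the rest stays where it is.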


  • Stop copying.

  • Always ask yourself: do I really need all that data?

  • Information minimization.

  • Stick to the process.

  • Try to think in contracts.

See also the related “Web11” conference:

About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
