Rustic Retreat
I know that we usually fly over webpages, just scanning the text for keywords and structural bits and pieces of information with the least amount of attention we can spare. We often don't really read anymore. But today I would like you to slow down, take a moment, get your favorite beverage and sit down to actually read this, because a part of my current work presented a perfect opportunity to dig into learning, knowledge transfer and inspiration, and of course the misconception of originality as well.
When I was a small boy, grown-ups often asked me what I wanted to be when I grew up. I always answered: I will study cybernetics. The kicker is, I just said that because I knew it would please my mother, so she could enjoy showing off what a smart and ambitious son she had. But as it turned out, I obviously wasn't smart enough for that :)
For the past couple of weeks I've been scrambling like hell, and to get some of the work done I had to create an army of slaves to do it first. So I've been busy building robots of all different kinds. Some of them are made of real hardware, others exist in software only, and one of those creations shall serve as an example to reflect the learning process involved.
When I look at today's ways of “learning” in schools and universities, I am not surprised that we are breeding generations of mindfucked zombies, endlessly repeating the same mistakes, trying to use the one tool they've learned for everything (even if it's completely incompatible). Pointless discussions about originality, plagiarism and unique revolutionary ideas. Patents. Intellectual property. Bullshit. Many people I meet with that kind of background cannot even say “I don't know”. They will scramble and come up with a bullshit answer for everything, only to appear knowledgeable. Why? Because they have been taught that not knowing something is equal to failure. And failure will turn you into an unsuccessful loser, who will not get laid, right?
But how should anyone's brain be able to really learn something when it believes (even when just pretending) that it already knows it? You cannot fill a cup that already believes itself to be full. When you pretend to be an expert, you obviously cannot even ask the questions that would help you really understand something, because then your “expert status” would fade away. So better empty your head, because the more you learn, the more you realize that you don't know shit.
Which turns our focus to failure. From everything I could learn about efficient learning, failure was always the biggest accelerator. If everything went smoothly and I didn't have to do much to learn/realize how or why something worked, I didn't learn anything about it, because I simply didn't have to. Only failure and deep engagement with whatever I tackled really let my brain comprehend things to a level where I can say I've learned something about them.
This is a little drawing I made to model how my personal learning process works, and after looking at a lot of history, it seems to me that we can apply it across all ages and societies as well. Except that we've managed to carry cultural ballast with us, which tries to pretend that the right half of the circle doesn't exist, or, when it isn't denied outright, always associates it with negative educational/social metrics/values (grades/recognition).
Before we had the Internet, it was easy to travel somewhere, copy what other people did, come back and pretend it's one's own “original” work (simply lie about it). No one could really check it, especially not at individual mass scale. It was easy to sell the illusion of revolutionary and “original” work. But then, why are so many pop songs (not to mention the countless covers of these songs) basically based on the melodies of local folk songs from all ages and from all over the world? Or why were the Americans so eager to pull off Operation Paperclip after WW2? Why has there been, and still is, so much industrial/military espionage to get the secret plans of “the other guys” if they were all so original? Well, they weren't, because…
Even if it appears unique and original to us, there was always some other inspiration/model to copy from. Most of what we do is based on ideas and concepts laid out by other people before us. And their ideas also evolved in the same manner. It's basically all about perception. I could present you the final Python robot and say: “This is my awesome original work”. And you might believe it, since it's slick, streamlined and very efficient. But that is just the current result. You wouldn't (and in most cases won't) see how crappy it began and how it evolved into its current form. But this is exactly what we're going to do today.
In order to verify the UCSSPM results, reference data was needed. A good industry-produced pyranometer is too expensive at this time, and hacking a cheap one myself brings up the problem of reference data again. So I searched the net for data sources and found http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php?id=wetter:stadt:messung.
Unfortunately the data isn't accessible through an API or at least some JSON export of the raw data. Which meant I needed to devise a robot that would periodically scrape the data from that web page, extract all needed values and feed them into the UCSSPM, to calculate with real data for reference. Once it has done all that, it has to push all usable raw data and the results of the UCSSPM prediction into an InfluxDB shard running on the stargazer.
This bash script was the first incarnation of this robot, where I tried to get away with just a little wget, sed and awk magic. I just copied the curl examples from InfluxDB's docs and started hacking away with some awk/sed examples I found as well.
#!/bin/bash
# (changed from /bin/sh: the script relies on bash features
#  like [[ =~ ]], BASH_REMATCH and read -a)

ED=0
API="url_to_influxdb_api_with_auth_tokens"

while :
do
    # slow down cowboy
    if [ "${ED}" -ge "10" ]
    then
        wget -q -4 -nv -O ~/tmp/meteo.data "http://www.meteo.physik.uni-muenchen.de/mesomikro/stadt/anzeige.php" 2>&1 >/dev/null
        ED=0
    else
        ED=$((${ED}+1))
    fi

    # pick the raw values out of fixed line numbers of the HTML dump
    OUTT=$(cat ~/tmp/meteo.data | sed '34q;d' | awk '{print $3}')
    OUTH=$(cat ~/tmp/meteo.data | sed '42q;d')
    REGEX="^<TD><span class=normal> (.*) %</span></TD>"
    if [[ $OUTH =~ $REGEX ]]; then OUTH=${BASH_REMATCH[1]}; fi
    OUTP=$(cat ~/tmp/meteo.data | sed '90q;d' | awk '{print $3}')
    WSPD=$(cat ~/tmp/meteo.data | sed '47q;d' | awk '{print $3}')
    WDIR=$(cat ~/tmp/meteo.data | sed '93q;d' | awk '{print $3}')
    SRAD=$(cat ~/tmp/meteo.data | sed '73q;d' | awk '{print $4}')

    # one HTTP request per metric :(
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.degC\",\"columns\":[\"value\"],\"points\":[[${OUTT}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.humidity\",\"columns\":[\"value\"],\"points\":[[${OUTH}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.pressure\",\"columns\":[\"value\"],\"points\":[[${OUTP}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.windspeed\",\"columns\":[\"value\"],\"points\":[[${WSPD}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.winddir\",\"columns\":[\"value\"],\"points\":[[${WDIR}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"aquarius.env.outdoor.srad\",\"columns\":[\"value\"],\"points\":[[${SRAD}]]}]"

    # bash can't do floating point math, so delegate to bc
    PV_T=$(echo "scale=2; ${OUTT} + ((${OUTT}/100.0)*15.0)" | bc)

    IFS='|' read -a ucsspm <<< "$(/home/chrono/src/UCSSPM/ucsspm.py -lat 48.11 -lon 11.11 -at_t ${OUTT} -at_h ${OUTH} -at_p ${OUTP} -pv_t ${PV_T})"

    curl -X POST "${API}" -d "[{\"name\":\"odyssey.ucsspm.etr\",\"columns\":[\"value\"],\"points\":[[${ucsspm[0]}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"odyssey.ucsspm.rso\",\"columns\":[\"value\"],\"points\":[[${ucsspm[1]}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"odyssey.ucsspm.sza\",\"columns\":[\"value\"],\"points\":[[${ucsspm[2]}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"odyssey.ucsspm.max\",\"columns\":[\"value\"],\"points\":[[${ucsspm[3]}]]}]"
    curl -X POST "${API}" -d "[{\"name\":\"odyssey.ucsspm.out\",\"columns\":[\"value\"],\"points\":[[${ucsspm[4]}]]}]"

    echo "${OUTT} ${OUTH} ${OUTP} ${WSPD} ${WDIR} ${SRAD}"
    echo "${ucsspm[0]} ${ucsspm[1]} ${ucsspm[2]} ${ucsspm[3]} ${ucsspm[4]}"

    sleep 7
done
This also represents only the last state of the script; in the beginning it didn't even have the download protector and ran haywire for a little while, causing unnecessary stress on the LMU's server :(
What else is wrong with it?
Since the data is delivered as HTML and wrapped in some weird table construct, it's an absolute pain to reliably scrape. Although technically this robot worked and did its job, it often died or produced faulty prediction results, because an upstream change to the page introduced some incomprehensible whitespace changes, and sometimes the page just delivered 999.9 values. A pain to maintain. And since most relevant values came as floats, there was no other solution than to use bc, since bash can't do floating point math or comparisons (see the sketch below). And finally, the data structure and shipping method to InfluxDB is more than questionable; it would never scale, since each metric produces another new HTTP request, creating a lot of wasteful overhead. But at the time of writing I simply didn't know enough to make it better.
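To illustrate the bc workaround (a minimal sketch, not taken from the script above; the threshold value is made up): bash's built-in arithmetic only handles integers, so any comparison on the scraped float values has to be piped through bc, which prints 1 for a true expression and 0 for a false one.

#!/bin/bash
OUTT="23.4"

# (( OUTT > 30 )) would choke on the decimal point, so ask bc instead
if [ "$(echo "${OUTT} > 30.0" | bc)" -eq 1 ]
then
    echo "too hot"
else
    echo "fine"
fi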
Seeing the bash script fail regularly and having to look after it all the time was no option. So I looked at countless scraping examples using Python. After installing and uninstalling a lot of pip packages like beautifulsoup4, scrapy and all the other tools you can find when searching for python web scraping, I couldn't get anything to work. So I broke the problem down into the most simple tasks and went step by step.
The first step was getting just the data, without all the wiki chrome around it. After searching DokuWiki's docs I found a nice feature for that: doku.php?do=export_xhtmlbody delivers the rendered page content only. This reduces the amount of traffic and also the risk of future changes that might break the scraper again.
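Fetching that export is then a one-liner with requests. A minimal sketch of this step (the timeout is my addition here; the final script further down doesn't use one):

import requests

# DokuWiki's export_xhtmlbody renderer returns just the page body,
# without theme, navigation or other chrome that could change
TARGET = ("http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php"
          "?do=export_xhtmlbody&id=wetter:stadt:messung")

page = requests.get(TARGET, timeout=10)
print(page.text[:200])  # quick sanity check of what came back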
The next step was parsing. After looking at lxml examples again, extracting just the TD elements seemed feasible, and since in this case all the data was wrapped inside TD elements, after a bit of testing this worked pretty well.
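A minimal sketch of this step; note that the index positions of the individual values in the resulting list are specific to this particular page:

import requests
from lxml import html

TARGET = ("http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php"
          "?do=export_xhtmlbody&id=wetter:stadt:messung")

# parse the exported body and collect the text of every <td> cell
# as a flat list, in document order
tree = html.fromstring(requests.get(TARGET, timeout=10).text)
data = tree.xpath('//td/text()')
print(data[:5])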
The last piece was extracting clean float values from the cell text. Well, stackexchange is full of regex examples to copy, and http://www.regexr.com/ offers a nice live testing environment.
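This is the pattern that ended up in the robot: it pulls any signed decimal number out of a string, which makes it robust against surrounding HTML noise. A minimal sketch with a made-up input string:

import re

# optional sign, optional integer part, optional decimal point, fraction
print(re.findall(r"[-+]?[0-9]*\.?[0-9]+",
                 "<TD><span class=normal> 23.4 %</span></TD>"))
# -> ['23.4']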
#!/usr/bin/env python2
# -*- coding: UTF-8 -*-
################################################################################

import requests, re, os, sys, time, subprocess
from lxml import html

API="url_to_influxdb_api_with_auth_tokens"

# Set target URL to scrape
TARGET="http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php?do=export_xhtmlbody&id=wetter:stadt:messung"

def flextract (data,min,max,fallback):

    # Safely extract the first number found in string as float
    regx=re.findall(r"[-+]?[0-9]*\.?[0-9]+", data)
    try:
        WORK=float(regx[0])
        # compare against the float 999.9 (the station's error value);
        # comparing against the string "999.9" would always be unequal
        if WORK <= max and \
           WORK >= min and \
           WORK != 999.9:
            return WORK
    except:
        pass
    # out of range, error value or unparsable: keep the last good value
    return fallback

################################################################################
## MAIN  #######################################################################
################################################################################

def main():

    # Define some sane starting points in case everything fails
    OUTT=25.0
    OUTH=60.0
    OUTP=950.0
    PREV=0.0
    PRET=""
    WSPD=0.0
    WDIR=0
    SRAD=0.0
    DRAD=0.0
    ED=0

    data=[]

    while True:

        if ED == 0:
            try:
                # Get the target's content
                page = requests.get(TARGET)
                # use lxml's html magic to structure the data
                tree = html.fromstring(page.text)
                # gather all values found in <TD> elements
                data = tree.xpath('//td/text()')
            except:
                pass
            time.sleep (9.5)
            ED=1
        else:
            ED=0
            time.sleep (10)

        # nothing fetched yet, try again
        if not data:
            continue

        # Air Temperature (2m) in degC
        OUTT=flextract(data[2],-35,45,OUTT)
        # Air Pressure in hPa
        OUTP=flextract(data[26],0.0,1200.0,OUTP)
        # Air Humidity (2m) in %
        OUTH=flextract(data[8],0.0,100.0,OUTH)
        # Precipitation Volume in mm
        PREV=flextract(data[29],0.0,500.0,PREV)
        # Precipitation Type
        PRET=data[31]
        PRET=PRET.encode('utf-8').strip()
        # Windspeed in m/s
        WSPD=flextract(data[10],0.0,100.0,WSPD)
        # Wind Direction in deg
        WDIR=int(flextract(data[28],0,360,WDIR))
        # Global Solar Radiation (direct)
        SRAD=flextract(data[21],0.0,1200.0,SRAD)
        # Global Solar Radiation (diffuse)
        DRAD=flextract(data[22],0.0,1200.0,DRAD)

        # Give a 15% temp gain (based on OUTT) to PV modules
        # (FIXME: come up with something better based on SRAD
        #  until sensors are in place for real values)
        PV_T=OUTT + ((OUTT/100.0)*15.0)

        # Odyssey UCSSPM Long-Term Evaluation
        try:
            proc = subprocess.Popen(['./ucsspm.py',
                                     '-lat', '48.11', '-lon', '11.11',
                                     '-at_t', str(OUTT), '-at_p', str(OUTP),
                                     '-at_h', str(OUTH), '-pv_t', str(PV_T)],
                                    stdout=subprocess.PIPE)
            for line in proc.stdout.readlines():
                output=line.rstrip()
                ucsspmO=output.split('|')
        except:
            pass

        # Aquarius UCSSPM Long-Term Evaluation
        try:
            proc = subprocess.Popen(['./ucsspm.py',
                                     '-lat', '48.11', '-lon', '11.11',
                                     '-at_t', str(OUTT), '-at_p', str(OUTP),
                                     '-at_h', str(OUTH), '-pv_t', str(PV_T),
                                     '-pv_a', '5.0', '-pv_tc', '0.29',
                                     '-pv_e', '19.4'],
                                    stdout=subprocess.PIPE)
            for line in proc.stdout.readlines():
                output=line.rstrip()
                ucsspmA=output.split('|')
        except:
            pass

        # batch all metrics into a single JSON payload so that one
        # HTTP request ships everything
        payload = []
        payload.append('[{"name": "aquarius.env.outdoor.temp",  "columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (OUTT,'°C'))
        payload.append(' {"name": "aquarius.env.outdoor.baro",  "columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (OUTP,'hPa'))
        payload.append(' {"name": "aquarius.env.outdoor.hygro", "columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (OUTH,'%'))
        payload.append(' {"name": "aquarius.env.outdoor.percip","columns": ["value", "type", "unit"], "points": [[%.1f,"%s","%s"]]},' % (PREV,PRET,'mm'))
        payload.append(' {"name": "aquarius.env.outdoor.wind",  "columns": ["value", "type", "unit"], "points": [[%d,"%s","%s"]]},' % (WDIR,'direction','°'))
        payload.append(' {"name": "aquarius.env.outdoor.wind",  "columns": ["value", "type", "unit"], "points": [[%.1f,"%s","%s"]]},' % (WSPD,'speed','m/s'))
        payload.append(' {"name": "aquarius.env.outdoor.pyrano","columns": ["value", "type", "unit"], "points": [[%.1f,"%s","%s"]]},' % (SRAD,'direct','W/m²'))
        payload.append(' {"name": "aquarius.env.outdoor.pyrano","columns": ["value", "type", "unit"], "points": [[%.1f,"%s","%s"]]},' % (DRAD,'diffuse','W/m²'))
        payload.append(' {"name": "aquarius.ucsspm.etr","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmA[0]),'W/m²'))
        payload.append(' {"name": "aquarius.ucsspm.rso","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmA[1]),'W/m²'))
        payload.append(' {"name": "aquarius.ucsspm.sza","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmA[2]),'°'))
        payload.append(' {"name": "aquarius.ucsspm.max","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmA[3]),'W/m²'))
        payload.append(' {"name": "aquarius.ucsspm.out","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmA[4]),'W/m²'))
        payload.append(' {"name": "odyssey.ucsspm.etr","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmO[0]),'W/m²'))
        payload.append(' {"name": "odyssey.ucsspm.rso","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmO[1]),'W/m²'))
        payload.append(' {"name": "odyssey.ucsspm.sza","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmO[2]),'°'))
        payload.append(' {"name": "odyssey.ucsspm.max","columns": ["value", "unit"], "points": [[%.1f,"%s"]]},' % (float(ucsspmO[3]),'W/m²'))
        payload.append(' {"name": "odyssey.ucsspm.out","columns": ["value", "unit"], "points": [[%.1f,"%s"]]}]' % (float(ucsspmO[4]),'W/m²'))

        # ship everything in a single request
        try:
            requests.post(url=API,data=''.join(payload),timeout=2)
        except:
            pass

################################################################################

if __name__ == '__main__':
    rc = main()
    sys.exit (rc)
And that's that. Success. The only thing left to do is to share this knowledge back, so that the next person looking for ways to scrape data from web pages with Python can copy these examples, adapt them to their use case, fail and learn and come up with new ideas as well. Hopefully in even less time.