===== In the beginning there was the copy =====
Even if it appears unique and original to us, there always was some other inspiration/influence that we copied from and built upon.
===== The Problem =====
In order to start collecting long-term data to prove the [[lab:|UCSSPM]], I needed real measurements as reference input, which are only available on a web page.

Unfortunately the data isn't accessible through an API or at least some JSON export of the raw data, which meant I needed to devise a robot that would periodically scrape the data from that web page, extract all needed values and feed that data into the UCSSPM to calculate with real data for reference. Once it has done all that, it has to push all usable raw data and the results of the UCSSPM prediction into an influxdb shard running on the stargazer:

  * [[https://
  * [[https://
  * [[https://
===== The bash solution =====
**What else is wrong with it?**

Since the data is delivered via HTML and wrapped in some weird HTML table construct, it's an absolute pain to reliably scrape the data. Although technically this robot is working and doing its job, it often died or produced faulty prediction results, quite the opposite of the resilient systems I usually head for.

And finally, the data structure and shipping method to influxdb are more than questionable.
===== The python solution =====
Seeing the bash script fail regularly and having to look after it all the time was not an option.

When I reach that conclusion, I usually turn to python, so I started by looking at countless scraping examples in python and installed and uninstalled a lot of pip packages like beautifulsoup4, before ending up with lxml and requests.

**1. Reduce the amount of data to transfer and parse**

After searching Dokuwiki's documentation I found an export option that delivers just the rendered page content without all the surrounding layout, which already cut down the amount of HTML to transfer and parse considerably.
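
To give an idea of what that looks like in practice, here is a minimal sketch. The URL is just a placeholder, and the ''do=export_xhtmlbody'' parameter is DokuWiki's generic way of returning only the rendered page body, so take it as an assumption about the export option rather than the robot's actual request:

<sxh python>
# Minimal sketch: fetch a stripped-down version of the page.
# URL and export parameter are placeholders/assumptions, not the real target.
import requests

URL = "http://weather.example.org/doku.php?id=station:live"

def fetch_page(url):
    # do=export_xhtmlbody asks DokuWiki for the rendered page body only,
    # without menus, headers and footers -- far less HTML to parse later.
    r = requests.get(url, params={"do": "export_xhtmlbody"}, timeout=10)
    r.raise_for_status()
    return r.text

if __name__ == "__main__":
    print(len(fetch_page(URL)), "bytes of HTML to parse")
</sxh>
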
**2. Try to find a structured way to extract the data**

After looking at lxml examples again it seemed feasible to extract just TD elements and in this case all data was wrapped inside TD elements, so after a bit of testing, this worked pretty well.
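
Here is roughly what that looks like with lxml; the helper name is made up and the HTML snippet is only a stand-in for the real page, not its actual markup:

<sxh python>
# Sketch: pull the text of every TD element out of the fetched HTML.
from lxml import html

def extract_td_values(page_source):
    tree = html.fromstring(page_source)
    # text_content() also catches values wrapped in nested tags inside a cell
    return [td.text_content().strip() for td in tree.xpath("//td")]

if __name__ == "__main__":
    sample = "<table><tr><td>Temperature</td><td>21.4</td></tr></table>"
    print(extract_td_values(sample))  # ['Temperature', '21.4']
</sxh>
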
**3. Increase resilience: Have a reliable regular expression to extract all numbers (signed/unsigned, int/float)**

Well, stackexchange is full of examples for regular expressions to copy and http://
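
The pattern below is one commonly used variant for this job, not necessarily the exact expression the robot ended up with:

<sxh python>
# Sketch: extract every signed/unsigned integer or float from a scraped string.
import re

NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?")

def extract_numbers(text):
    # findall() returns the matched substrings; convert them all to float
    return [float(match) for match in NUMBER_RE.findall(text)]

if __name__ == "__main__":
    print(extract_numbers("Temp: -3.7 C, Hum: 82 %, Pressure: 1013.2 hPa"))
    # [-3.7, 82.0, 1013.2]
</sxh>
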
**4. Learn more about influxdb to restructure the data to reduce the amount of timeseries**

This came almost naturally after looking at so many other examples of metric data structures; I simply copied and merged what I considered best practice.
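
To illustrate the restructuring idea: series and column names are invented here, and the JSON layout is the old 0.8-era influxdb API that was current when this was written, so treat it as an assumption rather than the robot's real schema:

<sxh python>
# Sketch: fewer timeseries by grouping related values as columns of one series.
import json

# naive layout: one series per value -> three series to manage
naive = [
    {"name": "station.temperature", "columns": ["value"], "points": [[21.4]]},
    {"name": "station.humidity",    "columns": ["value"], "points": [[82.0]]},
    {"name": "station.pressure",    "columns": ["value"], "points": [[1013.2]]},
]

# merged layout: a single series with one column per measurement
merged = [
    {"name": "station", "columns": ["temperature", "humidity", "pressure"],
     "points": [[21.4, 82.0, 1013.2]]},
]

print(json.dumps(merged, indent=2))
</sxh>
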
**5. Figure out a way to push a complete dataset in one http post request to reduce overhead**

I brute-forced the correct data format with another shell script feeding curl until I was able to figure out the sequence, since there was nothing in the docs about the structure of requests with multiple timeseries. Influxdb is rather picky about strings and quotes, so it took a little while to figure out how to do it with curl and then to build and escape the structure correctly in python. I played around with append() and join() and really started to appreciate them.
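
Something along these lines, assuming the 0.8-style HTTP endpoint; host, database name and credentials are obviously placeholders:

<sxh python>
# Sketch: ship several series to influxdb in a single HTTP POST.
import json
import requests

INFLUX_URL = "http://stargazer.example.org:8086/db/metrics/series"  # placeholder

def push(series_list, user="robot", password="secret"):
    # one request carries the raw scraped values and the UCSSPM results together
    r = requests.post(
        INFLUX_URL,
        params={"u": user, "p": password, "time_precision": "s"},
        data=json.dumps(series_list),
        headers={"Content-Type": "application/json"},
    )
    r.raise_for_status()

if __name__ == "__main__":
    push([
        {"name": "station", "columns": ["temperature", "humidity"], "points": [[21.4, 82.0]]},
        {"name": "ucsspm",  "columns": ["prediction"], "points": [[512.3]]},
    ])
</sxh>
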
**6. Increase resilience: No single step exception should kill the robot (salvation)**

Well, python lets you try and pass, to fail and fall back very gracefully :)
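
The shape of that pattern in a generic form; the real checks live in the robot's code below, so the names and ranges here are just examples:

<sxh python>
# Generic try/fallback helper: a broken or missing input never kills the
# robot, it just degrades to a safe default value.
def sane_float(raw, fallback, low, high):
    try:
        value = float(raw)
        if low <= value <= high:
            return value
        else:
            return fallback
    except (TypeError, ValueError):
        return fallback

if __name__ == "__main__":
    print(sane_float("21.4", 0.0, -40.0, 60.0))  # 21.4
    print(sane_float("n/a",  0.0, -40.0, 60.0))  # 0.0
</sxh>
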
<sxh python; toolbar: false;>
# ...
            return WORK
        else:
            return fallback
    except:
# ...
# build the influxdb payload line by line
payload.append('
payload.append('
payload.append('
payload.append('
payload.append('
# ...
</sxh>
And that's that. Success. The only thing left to do, in order to close the circle again, was to share this knowledge, so that the next person looking for ways to scrape data from web pages with python can copy these examples, adapt them according to the new use case and fail and learn and come up with new ideas as well. Hopefully in even less time. And it also made it pretty obvious that the [[lab:|UCSSPM]] will benefit from this kind of long-term reference data.

You can see the results of this robot's work in the graphs linked above.

And of course it goes without saying that this also serves to show pretty well how important learning computer languages will become. We cannot create an army of slaves to do our bidding (for that is what all these machines/robots essentially are) without being able to tell them what to do.

But how do we expect people to be able to tell all these machines what and how exactly they're supposed to do it, if they never learn to speak their languages?
{{tag>