<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="http://techbuilders.info/search/datascience">
  <channel>
    <title>Data Science</title>
    <link>http://techbuilders.info/search/datascience</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Google Scholar - finding datasets made easy</title>
  <link>http://techbuilders.info/blog/google-scholar-finding-datasets-made-easy</link>
  <description>&lt;span class="field field--name-title field--type-string field--label-hidden"&gt;Google Scholar - finding datasets made easy&lt;/span&gt;
&lt;span class="field field--name-uid field--type-entity-reference field--label-hidden"&gt;&lt;span lang="" about="http://techbuilders.info/user/40" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;alex@techbuilders&lt;/span&gt;&lt;/span&gt;
&lt;span class="field field--name-created field--type-created field--label-hidden"&gt;Wed, 09/12/2018 - 17:59&lt;/span&gt;

            &lt;div class="clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item"&gt;&lt;p&gt;If you were ever in charge of building a Proof of Concept or visualization contest, you know there is an ocean of websites dedicated to hosting interesting data sets. There are so many options nowadays that it's overwhelming and one doesn't even know where to start. Another issue with the proliferation of websites dedicated to datasets is that data for a specific topic is spread across multiple websites, at a different granularity, different time frames or all of the above. What Google aims to achieve is a single repository where you could search for a specific topic, and you'd see all of the associated datasets across the various websites.&lt;/p&gt;

&lt;p&gt;For a dataset to appear in the search results, Google suggests that data providers follow some guidelines (found &lt;a href="https://developers.google.com/search/docs/data-types/dataset"&gt;here&lt;/a&gt;) and conform to the open standard (&lt;a href="https://schema.org/"&gt;schema.org&lt;/a&gt;) so that a robust dataset ecosystem thrives and proves valuable for scientists, data journalists, and geeks.&lt;/p&gt;

&lt;p&gt;I spent some time working with the Dataset Search and loved how I could easily find datasets on local government, environment, and social sciences through a UI that we are all familiar with. The datasets were mostly free (some were pay to play), and many listed multiple sources in case one of the links was no longer valid. I think this project has a lot of room to run and look forward to an open standard where companies are willing to share their data for the greater good.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Take a look! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://scholar.google.com/"&gt;https://scholar.google.com/&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;
      
  &lt;div class="field field--name-field-thumbnail field--type-image field--label-above"&gt;
    &lt;div class="field__label"&gt;Thumbnail&lt;/div&gt;
              &lt;div class="field__item"&gt;  &lt;img src="http://techbuilders.info/sites/default/files/2018-09/googlescholar_0.PNG" width="895" height="466" alt="" typeof="foaf:Image" /&gt;&lt;/div&gt;
          &lt;/div&gt;

  &lt;div class="field field--name-field-blog-category field--type-entity-reference field--label-above"&gt;
    &lt;div class="field__label"&gt;Blog Category&lt;/div&gt;
          &lt;div class="field__items"&gt;
              &lt;div class="field__item"&gt;&lt;a href="http://techbuilders.info/search/datascience" hreflang="en"&gt;Data Science&lt;/a&gt;&lt;/div&gt;
              &lt;/div&gt;
      &lt;/div&gt;

                    &lt;!--div class="field__items"--&gt;
                            &lt;a href="http://techbuilders.info/taxonomy/term/60" hreflang="en"&gt;Data Science&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/62" hreflang="en"&gt;Visualization&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/80" hreflang="en"&gt;Descriptive Analytics&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/79" hreflang="en"&gt;Predictive Analytics&lt;/a&gt;                        &lt;!--/div--&gt;
        


</description>
  <pubDate>Wed, 12 Sep 2018 17:59:56 +0000</pubDate>
    <dc:creator>alex@techbuilders</dc:creator>
    <guid isPermaLink="false">97 at http://techbuilders.info</guid>
    </item>
<item>
  <title>Up and running with Apache Zeppelin</title>
  <link>http://techbuilders.info/blog/and-running-apache-zeppelin</link>
  <description>&lt;span class="field field--name-title field--type-string field--label-hidden"&gt;Up and running with Apache Zeppelin&lt;/span&gt;
&lt;span class="field field--name-uid field--type-entity-reference field--label-hidden"&gt;&lt;span lang="" about="http://techbuilders.info/user/40" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;alex@techbuilders&lt;/span&gt;&lt;/span&gt;
&lt;span class="field field--name-created field--type-created field--label-hidden"&gt;Mon, 02/26/2018 - 01:08&lt;/span&gt;

            &lt;div class="clearfix text-formatted field field--name-body field--type-text-with-summary field--label-hidden field__item"&gt;&lt;p&gt;&lt;a href="http://zeppelin.apache.org/"&gt;Apache Zeppelin&lt;/a&gt; is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more. It has great features, good support, and the platform is visually appealing and built to scale. The only issue with this solution is that the documentation lacks detail and makes assumptions about what you already know, which makes it painful to even get started. We have built a goof-proof accelerator script to get you up and running as if you knew what you were doing!&lt;/p&gt;

&lt;p&gt;As of the time of this article, Apache Zeppelin is on version &lt;s&gt;&lt;strong&gt;0.7.3&lt;/strong&gt;&lt;/s&gt;&lt;strong&gt; 0.8.0&lt;/strong&gt; and still in beta. The dev team has made some great progress over the past few years, introducing &lt;a href="https://shiro.apache.org/"&gt;Apache Shiro&lt;/a&gt; for multi-user authentication and building custom visualization plugins with Helium. To get started, the project leads assume you're familiar with, have access to, and want to deploy Apache Zeppelin using docker images or Amazon Web Services. For the guy trying to build a proof of concept (as cheap as possible) and see what it's all about, this is pretty frustrating given you have to have access to and experience with other technologies and platforms. At some point, you end up so side-tracked and strung out from going down rabbit holes that you forget what your initial goal was. &lt;strong&gt;My intention is to make getting started as easy as possible for the layman to check out a fantastic open source project.&lt;/strong&gt; In this tutorial, we'll focus on a single node installation; scaling this out to multiple nodes, incorporating authentication, and installing an SSL certificate will follow in later articles.&lt;/p&gt;

&lt;p&gt;For this tutorial, I'm installing on a Virtual Machine (Hyper-V specifically) and have had decent performance for small to medium proofs of concept using the minimum requirements below. If you're new to Ubuntu, I suggest working with the Desktop version as it's a little easier to understand if you're coming from the Windows world and like a GUI to work with. I'll be using the Desktop version for clarity's sake; the Server version is exactly the same but assumes you're comfortable with the terminal.&lt;/p&gt;

&lt;h4&gt;Minimum Requirements:&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.ubuntu.com/"&gt;Ubuntu 16.04.03 LTS&lt;/a&gt; (&lt;a href="https://www.ubuntu.com/download/desktop"&gt;Desktop&lt;/a&gt; or &lt;a href="https://www.ubuntu.com/download/server"&gt;Server&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;4 Virtual Processors&lt;/p&gt;

&lt;p&gt;8GB of RAM&lt;/p&gt;

&lt;p&gt;25GB of Hard Disk (Ubuntu requirement)&lt;/p&gt;
&lt;/blockquote&gt;
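
&lt;p&gt;If you want to confirm that a freshly created virtual machine matches the processor and memory figures above before installing anything, a quick check along these lines may help. This is only a sketch: the helper name 'meets_minimum' is made up for illustration, it assumes a Linux guest where 'nproc' and /proc/meminfo are available, and it does not check disk size.&lt;/p&gt;

&lt;pre&gt;
```shell
# Succeed when the measured value is at least the required minimum.
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# nproc reports the number of available processors; fall back to 0 if it is missing.
cpus=$(nproc 2>/dev/null || echo 0)

# MemTotal in /proc/meminfo is in kB. Round to the nearest whole GB, since the
# kernel usually reports slightly less than the amount assigned to the VM.
mem_gb=$(awk '/^MemTotal/ {print int($2 / 1024 / 1024 + 0.5)}' /proc/meminfo 2>/dev/null || echo 0)

if meets_minimum "${cpus:-0}" 4; then
    echo "Processors: ${cpus:-0} (ok)"
else
    echo "Processors: ${cpus:-0} (below the 4 listed above)"
fi

if meets_minimum "${mem_gb:-0}" 8; then
    echo "Memory: ${mem_gb:-0} GB (ok)"
else
    echo "Memory: ${mem_gb:-0} GB (below the 8 GB listed above)"
fi
```
&lt;/pre&gt;

&lt;p&gt;A smaller machine may still run the installer; the numbers are simply the ones that gave decent performance in this setup.&lt;/p&gt;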

&lt;h4&gt;Assumptions&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;You know how to set up a Virtual Machine (see: &lt;a href="https://youtu.be/jsDB3AsCh2k"&gt;https://youtu.be/jsDB3AsCh2k&lt;/a&gt; )&lt;/p&gt;

&lt;p&gt;Admin access to this Virtual Machine&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are the only assumptions? YES! After you create a bare bones virtual machine, our installer script takes care of most of the rest.&lt;/p&gt;


&lt;h4&gt;Step #1 Find and use the terminal&lt;/h4&gt;

&lt;p&gt;We pick this up at the desktop of a clean install of Ubuntu 16.04.03. Open the terminal and run&lt;/p&gt;

&lt;pre&gt;
sudo apt-get update

sudo apt-get upgrade

wget https://raw.githubusercontent.com/techbui1ders/apachezeppelin-starterkit/master/latest/install-zeppelin.sh

chmod +x install-zeppelin.sh

sudo yes | ./install-zeppelin.sh&lt;/pre&gt;

&lt;p&gt;&lt;img alt="TechBuilders - Apache Zeppelin" data-entity-type="file" data-entity-uuid="29efa7b8-597e-47f6-9c19-829077d4f07c" height="442" src="http://techbuilders.info/sites/default/files/inline-images/techbuilders-apachezeppelin.PNG" width="684" /&gt;&lt;/p&gt;

&lt;p&gt;Then go and get some coffee as it's going to take a bit. Behind the scenes, it's downloading and unpacking the following (and its dependencies) in order:&lt;/p&gt;

&lt;pre&gt;
        Java
        Apache Zeppelin (0.8.0 or current)
        Apache Hadoop 2.7.7
        Apache Spark 2.1.0 with Hadoop 2.7
        SparkR interpreter with handy R packages
            devtools
            googleVis
            knitr
            ggplot2
            mplot
            plotly
        Matplotlib &amp; Numpy for Python&lt;/pre&gt;

&lt;p&gt;At the end of the install, Zeppelin will have an OK status and a readout of the IP address and port that it's now hosted on.&lt;/p&gt;

&lt;p&gt;&lt;img alt="TechBuilders - Apache Zeppelin" data-entity-type="file" data-entity-uuid="56c6013d-4820-4d07-ae1a-b2f7a170481a" height="431" src="http://techbuilders.info/sites/default/files/inline-images/techbuilders-apachezeppelin2.PNG" width="678" /&gt;&lt;/p&gt;

&lt;p&gt;If the install failed and you don't see the OK status, check that Java is installed with&lt;/p&gt;

&lt;pre&gt;
$ java -version&lt;/pre&gt;

&lt;p&gt;If there is no version listed, install Java and run the following to restart Apache Zeppelin&lt;/p&gt;

&lt;pre&gt;
$ sudo apt-get install default-jdk
$ /usr/lib/zeppelin/bin/zeppelin-daemon.sh start&lt;/pre&gt;
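
&lt;p&gt;The two checks above can be rolled into one small script. Treat it as a sketch under a couple of assumptions: the helper name 'have_command' is invented for this example, the daemon path is the one used in the restart command above, and the 'status' argument is assumed to be supported by your copy of zeppelin-daemon.sh.&lt;/p&gt;

&lt;pre&gt;
```shell
# Path taken from the restart command above; adjust if your install differs.
zeppelin_daemon="/usr/lib/zeppelin/bin/zeppelin-daemon.sh"

# Succeed when the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>/dev/null
}

if have_command java; then
    echo "Java found at $(command -v java)"
else
    echo "Java not found; install it with: sudo apt-get install default-jdk"
fi

# Only query the daemon when the script is actually present and executable.
# The 'status' argument is an assumption about the daemon script.
if [ -x "$zeppelin_daemon" ]; then
    "$zeppelin_daemon" status
else
    echo "Zeppelin daemon script not found at $zeppelin_daemon"
fi
```
&lt;/pre&gt;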


&lt;h4&gt;Step #2 Take flight&lt;/h4&gt;

&lt;p&gt;At this point, the engines are warmed and you're cleared for takeoff. Step outside of your virtual machine and browse to the IP address listed during the install. If you missed the IP address, check it again using&lt;/p&gt;

&lt;pre&gt;
$ ifconfig&lt;/pre&gt;

&lt;p&gt;Look for the 'inet addr:' and then use that address plus the port 8080 to get started.&lt;/p&gt;
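
&lt;p&gt;If you'd rather not pick the address out of the ifconfig output by eye, something like the following prints the URL to browse to. It is a sketch, not part of the installer: 'zeppelin_url' is a made-up helper, and it assumes 'hostname -I' is available (it is on Ubuntu) and that the first address it lists is the one reachable from your host machine.&lt;/p&gt;

&lt;pre&gt;
```shell
# Build the notebook URL from an IP address and an optional port (default 8080).
zeppelin_url() {
    printf 'http://%s:%s\n' "$1" "${2:-8080}"
}

# hostname -I prints the machine's addresses separated by spaces; take the first.
first_ip=$(hostname -I 2>/dev/null | awk '{print $1}')

if [ -n "$first_ip" ]; then
    zeppelin_url "$first_ip"
else
    echo "No address found; fall back to ifconfig"
fi
```
&lt;/pre&gt;

&lt;p&gt;On a machine with several network adapters the first address may not be the right one, in which case the ifconfig readout is still the place to look.&lt;/p&gt;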

&lt;p&gt;&lt;img alt="TechBuilders - Apache Zeppelin" data-entity-type="file" data-entity-uuid="b3299954-3627-4bd6-9757-a7fe1b9697a8" height="280" src="http://techbuilders.info/sites/default/files/inline-images/techbuilders-apachezeppelin3.PNG" width="891" /&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;From here, we have one slight adjustment to make in an interpreter so the tutorials flow smoothly. You're currently signed in as 'anonymous'. Click the user name in the top right-hand corner and a dropdown will appear; click 'interpreter' and search for 'sh'. You'll need to raise shell.command.timeout.millisecs from 1 minute (60000 milliseconds) to 3000000 milliseconds (50 minutes).&lt;/p&gt;

&lt;p&gt;&lt;img alt="TechBuilders - Apache Zeppelin" data-entity-type="file" data-entity-uuid="177a7a2a-baea-4338-941b-2300a7e762c7" height="396" src="http://techbuilders.info/sites/default/files/inline-images/techbuilders-apachezeppelin5.PNG" width="637" /&gt;&lt;/p&gt;
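
&lt;p&gt;Because the timeout is entered in milliseconds, it's worth converting a value before pasting it in to confirm how long it really is. A minimal sketch; the two helper names are made up for this example and are not part of Zeppelin.&lt;/p&gt;

&lt;pre&gt;
```shell
# Convert milliseconds to whole minutes (60000 ms per minute).
ms_to_minutes() {
    echo $(( $1 / 60000 ))
}

# Convert minutes to milliseconds.
minutes_to_ms() {
    echo $(( $1 * 60000 ))
}

ms_to_minutes 60000      # prints 1
ms_to_minutes 3000000    # prints 50
minutes_to_ms 5          # prints 300000
```
&lt;/pre&gt;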

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;Now you're all set to begin! Go into the Zeppelin tutorial and run all paragraphs for the first &lt;strong&gt;four&lt;/strong&gt; tutorials to make sure the interpreters and corresponding libraries are installed and running correctly. The R tutorial is more visually appealing and tends to give you a good basis for what can be done.&lt;/p&gt;

&lt;p&gt;&lt;img alt="TechBuilders - Apache Zeppelin" data-entity-type="file" data-entity-uuid="ad2469a1-588b-4bf8-aa50-b804b618f316" src="http://techbuilders.info/sites/default/files/inline-images/techbuilders-apachezeppelin4.PNG" /&gt;&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;p&gt;The 'Using Mahout' and 'Using Pig for querying data' tutorials don't work out of the box (testing before deployment is hard...mmmk!). We'll spend some time correcting these issues in the install script and will update when completed.&lt;/p&gt;

&lt;p&gt;From here, I would get familiar with the interpreters and with creating and sharing notebooks. Understand how to edit the variables in the interpreters and how that factors in, as you'll need to be comfortable changing these configurations to scale this platform out. Since you've installed Ubuntu Desktop, take some time to browse to the Zeppelin configuration folder paths (specifically zeppelin.home) to see the files and the information contained within them.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;Step #3 Go into Orbit&lt;/h4&gt;

&lt;p&gt;If you don't have an S3 bucket already, set one up, place a CSV in it, and connect Zeppelin using your favorite interpreter to do some data manipulation and plotting with it. After this point, you should be pretty comfortable with the basics of a single node installation of Apache Zeppelin. Below are some useful links to start the learning process with how to scale this out.&lt;/p&gt;

&lt;p&gt; &lt;/p&gt;

&lt;h4&gt;What's Next?&lt;/h4&gt;

&lt;p&gt;So we've installed Apache Zeppelin on a single node with anonymous authentication. We need to take this from POC to a solution that's a bit more enterprise. Our next articles on Apache Zeppelin will contain tutorials on Apache Shiro for authentication and connecting multiple nodes for more processing power. Hope you've enjoyed! Please comment with any questions.&lt;/p&gt;
&lt;/div&gt;
      
  &lt;div class="field field--name-field-thumbnail field--type-image field--label-above"&gt;
    &lt;div class="field__label"&gt;Thumbnail&lt;/div&gt;
              &lt;div class="field__item"&gt;  &lt;img src="http://techbuilders.info/sites/default/files/2018-02/techbuilders-apachezeppelin.PNG" width="562" height="400" alt="TechBuilders - Apache Zeppelin" typeof="foaf:Image" /&gt;&lt;/div&gt;
          &lt;/div&gt;

  &lt;div class="field field--name-field-blog-category field--type-entity-reference field--label-above"&gt;
    &lt;div class="field__label"&gt;Blog Category&lt;/div&gt;
          &lt;div class="field__items"&gt;
              &lt;div class="field__item"&gt;&lt;a href="http://techbuilders.info/search/software" hreflang="en"&gt;Software&lt;/a&gt;&lt;/div&gt;
          &lt;div class="field__item"&gt;&lt;a href="http://techbuilders.info/search/tutorials" hreflang="en"&gt;Tutorials&lt;/a&gt;&lt;/div&gt;
          &lt;div class="field__item"&gt;&lt;a href="http://techbuilders.info/search/datascience" hreflang="en"&gt;Data Science&lt;/a&gt;&lt;/div&gt;
              &lt;/div&gt;
      &lt;/div&gt;

                    &lt;!--div class="field__items"--&gt;
                            &lt;a href="http://techbuilders.info/taxonomy/term/69" hreflang="en"&gt;Apache&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/59" hreflang="en"&gt;Machine Learning&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/60" hreflang="en"&gt;Data Science&lt;/a&gt;,                             &lt;a href="http://techbuilders.info/taxonomy/term/62" hreflang="en"&gt;Visualization&lt;/a&gt;                        &lt;!--/div--&gt;
        


</description>
  <pubDate>Mon, 26 Feb 2018 01:08:53 +0000</pubDate>
    <dc:creator>alex@techbuilders</dc:creator>
    <guid isPermaLink="false">95 at http://techbuilders.info</guid>
    </item>

  </channel>
</rss>
