The mission of this piece of software is to use platforms such as Google, Facebook, Twitter, YouTube, news sources, and almost any website or RSS feed, and to harvest what they offer into organized sources of information in a database. The goal is not to steal any data, but to work within the confines of what is provided and create my own content: to find content, to select and organize it, and to share it. In short, it is my home-grown, shareable content discovery platform.
This is done by scraping web pages into semantic structures; by retrieving text, images and RSS feeds; by using the many kinds of embedded content those platforms offer; and by using uploaded content.
This process yields a database that stores data, linkage and semantic information. That information is then read back from the database to drive a UI that displays the content.
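
As a rough illustration of the "data, linkage and semantic information" described above, here is a minimal sketch of such a store. It uses SQLite only because it ships with Python. The table names, column names and helper functions are hypothetical and are not taken from the actual platform.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an illustrative assumption.
SCHEMA = """
CREATE TABLE item (
    id     INTEGER PRIMARY KEY,
    kind   TEXT NOT NULL,   -- e.g. 'embed', 'rss', 'upload', 'link', 'image', 'text'
    source TEXT,            -- URL or file name the item came from
    body   TEXT             -- the stored data itself (text, embed code, ...)
);
CREATE TABLE linkage (
    parent_id INTEGER NOT NULL REFERENCES item(id),
    child_id  INTEGER NOT NULL REFERENCES item(id)
);
CREATE TABLE semantic (
    item_id INTEGER NOT NULL REFERENCES item(id),
    name    TEXT NOT NULL,  -- e.g. 'og:title' or a Schema.org property name
    value   TEXT
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, kind, source, body, semantics=()):
    """Insert one item plus its (name, value) semantic pairs; return the new id."""
    cur = conn.execute(
        "INSERT INTO item (kind, source, body) VALUES (?, ?, ?)",
        (kind, source, body),
    )
    item_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO semantic (item_id, name, value) VALUES (?, ?, ?)",
        [(item_id, name, value) for name, value in semantics],
    )
    return item_id


def load_for_ui(conn, item_id):
    """Read an item back in the shape a UI layer might consume."""
    kind, source, body = conn.execute(
        "SELECT kind, source, body FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    props = dict(
        conn.execute("SELECT name, value FROM semantic WHERE item_id = ?", (item_id,))
    )
    return {"kind": kind, "source": source, "body": body, "semantic": props}
```

The `linkage` table is included only to show where relationships between items could live; the helpers above do not populate it.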

The user interface for doing all this relies on drag and drop and/or copy and paste. You have to sign in to use it.

Where is this content, and how do you extract and use it without treading on copyrights?

There are 3 types of content:
Type 1 - Platform content.

  • Found under the 'Share' icon: look for an 'embed' or a 'link' option.
    • To view some of these platforms with complete fidelity, the end user must be signed in to that provider's platform (e.g. twitter, instagram, etc.).
    • Embedded content platforms are not inherently safe and can be abused, so only known, commonly accepted platforms should be supported.
      Some of the platforms that generally support embedded content:
      twitter, facebook, cbsnews, bbc, nytimes, google (docs, spread, books, arts&culture, drive, maps), npr, bandcamp, soundcloud, tiktok, instagram, nps, pinterest, ted, msdn, wsj, youtube, alltrails, hikingproject, flickr, apple podcasts, apple music, acast, reddit, abcnews, simplecast, slideshare
  • RSS feeds are found under the RSS icon.
    • RSS feeds have recently been made relevant again by Google's impending 'Follow' option, so they might be emerging from years of neglect. They can carry incredible amounts of information and often have rich UIs.
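
To give a feel for how much structure an RSS feed carries, here is a minimal sketch that reads the items out of an RSS 2.0 document with Python's standard library. The sample feed and the function name `parse_rss` are made up for illustration. A real feed would first have to be downloaded, and Atom feeds use different element names that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one would be fetched from the site's RSS URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return the channel title and a list of item dicts from RSS 2.0 text."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    items = []
    for item in channel.findall("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return {"title": channel.findtext("title", default=""), "items": items}


feed = parse_rss(SAMPLE_FEED)
```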

Examples of embedded content from various platforms: [embedded examples not shown]

Type 2 - User content from the user's own hard drive or phone

Type 3 - Web content (link/image/text)

Geek talk: Getting meaningful semantic information by parsing the structures of web pages


semantic: relating to meaning in language or logic.

Working with webpage semantic data

Semantic data is stored within HTML pages using several different methods. Typically more than one method is used within a page. These methods are:
  • ld+json
  • microdata
  • Open Graph
  • HTML header meta tags
  • HTML5 Semantic tags
Much of this semantic data (ld+json and microdata in particular) is defined using the vocabulary published at Schema.org. Schema.org is a collaborative community project that defines and maintains a hierarchy of semantically named items and structures.

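All widely used browsers load pages that carry semantic data without any problem. The data is invisible to the reader and is there for search engines and other software to consume.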

Why use semantic data?
It is in the interest of web sites to include this semantic data if they wish to be found and used by customers who rely on search engines such as Google and Bing. It is in the interest of search engines to have this data so they can produce meaningful results for the user.

In addition, other companies offer services (i.e., not search engines) that use this data. Some examples that directly use the semantic structures defined at Schema.org:
  • an audio internet appliance that speaks recipe ingredients and instructions to you, and steps you through the process
  • tourism and hospitality providers that return restaurant opening hours, a menu, location, access to reservations, customer ratings, price range, etc.
  • retail middle-men whose customers are buying a product and want to compare similar products and prices, and to see user ratings, user reviews, product attributes, photos, etc.
  • book club applications

Working with semantic coding methods

ld+json

  • "ld" stands for "Linked Data". Within the HTML code of a web page, this linked data is stored in JSON format, using the named definitions from the Schema.org vocabulary. It is found within a "script" tag:
    <script type="application/ld+json">
    json data
    </script>
  • JSON (JavaScript Object Notation) is a text format, not a protocol, used to store structured data efficiently.
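
The extraction described above can be sketched with Python's standard library alone. This is a minimal illustration, not the platform's actual code: the class name, the function name `extract_json_ld` and the sample page are assumptions. It returns each ld+json block as parsed, without resolving "@graph" arrays or other JSON-LD features.

```python
import json
from html.parser import HTMLParser

# Made-up sample page carrying one ld+json block.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Pancakes", "recipeIngredient": ["flour", "milk", "eggs"]}
</script>
</head><body><p>Hello</p></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # parsed JSON values, one per script block
        self._in_jsonld = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it rather than fail the whole page


def extract_json_ld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


blocks = extract_json_ld(SAMPLE_PAGE)
```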

Microdata

Microdata is a bit older than ld+json and is likewise defined using the vocabulary found at Schema.org. Microdata items are stored within HTML elements (e.g., <div>, <span>, <meta>, etc.) by adding an attribute named "itemprop". That attribute:
  • stores the name of the schema item, and
  • takes its value from the element that carries it: a "content" attribute on a <meta> element, an attribute such as "href" or "src" on links and images, or otherwise the element's text.
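
Here is a minimal sketch of reading "itemprop" values, using only Python's standard library. It is deliberately simplified: it flattens nested items into one list of (name, value) pairs and only looks at "content", "href" and "src" for attribute-based values. The class name, function name and sample markup are all illustrative assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose itemprop value lives in an attribute rather than in their text.
ATTR_VALUE = {"meta": "content", "link": "href", "a": "href", "img": "src"}

# Made-up sample markup.
SAMPLE_MICRODATA = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Trail Map</span>
  <img itemprop="image" src="/img/map.jpg" alt="">
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="19.99">
    <meta itemprop="priceCurrency" content="USD">
  </div>
</div>
"""


class MicrodataExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = []   # (name, value) pairs
        self._open = []   # stack of [tag, itemprop name or None, text chunks]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name and "itemscope" in attrs:
            name = None   # value is a nested item; its own properties are collected
        elif name and tag in ATTR_VALUE:
            self.props.append((name, attrs.get(ATTR_VALUE[tag]) or ""))
            name = None
        if tag not in VOID_TAGS:
            self._open.append([tag, name, []])

    def handle_data(self, data):
        for entry in self._open:
            if entry[1]:
                entry[2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag name (tolerates sloppy HTML).
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, name, chunks = self._open[i]
                del self._open[i:]
                if name:
                    self.props.append((name, " ".join("".join(chunks).split())))
                return


def extract_microdata(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.props


microdata = dict(extract_microdata(SAMPLE_MICRODATA))
```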

Open Graph

Facebook's Open Graph stores its items within meta tags. When someone posts a web link to Facebook, Facebook reads at a minimum three of those meta tags from the linked page.
When Facebook initially released Open Graph, 50,000 web sites added those meta tags within one week. Open Graph has since become the de facto standard for most web sites, and almost all of them now carry at least those three meta tags. This shows the economic and social power of semantics in linking items.
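
Here is a minimal sketch of collecting Open Graph tags from a page, using only the standard library. Open Graph properties appear as <meta property="og:..." content="..."> elements. The text above does not say which three tags Facebook reads, so the properties in the sample page are simply common ones chosen for illustration, and the class and function names are assumptions.

```python
from html.parser import HTMLParser

# Made-up sample page; the og: properties shown are examples only.
SAMPLE_OG_PAGE = """<html><head>
<meta property="og:title" content="Ten Great Day Hikes">
<meta property="og:image" content="https://example.com/img/hike.jpg">
<meta property="og:url" content="https://example.com/hikes">
<meta name="author" content="Jane Doe">
</head><body></body></html>
"""


class OpenGraphExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, attrs["content"])


def extract_open_graph(html):
    parser = OpenGraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags


og = extract_open_graph(SAMPLE_OG_PAGE)
```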

Meta tags

The oldest tags for storing and exposing metadata within a web page are 'meta' tags, which are found in the HTML header. These <meta> HTML elements are typically used to store information such as the character set used on the page, the keywords of the document, the viewport, the author, and many provider-specific items (such as Facebook's Open Graph tags).
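
The plain header meta tags can be read in much the same way. The sketch below collects the declared character set and the name/content pairs, and ignores any <meta> that appears outside the header. As before, the class name, function name and sample page are illustrative assumptions.

```python
from html.parser import HTMLParser

# Made-up sample page.
SAMPLE_META_PAGE = """<html><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="Jane Doe">
<meta name="keywords" content="hiking, trails, maps">
</head><body><meta name="ignored" content="x"></body></html>
"""


class HeadMetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.charset = None
        self.named = {}        # lower-cased meta name -> content
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self._in_head = True
        elif tag == "meta" and self._in_head:
            attrs = dict(attrs)
            if attrs.get("charset"):
                self.charset = attrs["charset"]
            elif attrs.get("name") and attrs.get("content") is not None:
                self.named[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False


def extract_head_meta(html):
    parser = HeadMetaExtractor()
    parser.feed(html)
    parser.close()
    return parser


head_meta = extract_head_meta(SAMPLE_META_PAGE)
```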

HTML Semantic tags


Other sources:

Some websites also provide 'keyword' tags embedded within their semantic data. These keywords can be single words or multi-word phrases.
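
Keywords tend to arrive either as one comma-separated string or as a list, so it can help to normalize them before storing. Here is a small sketch, assuming those two input shapes; the function name is hypothetical.

```python
def normalize_keywords(raw):
    """Turn a comma-separated string or a list into a clean, de-duplicated list."""
    if raw is None:
        return []
    parts = raw.split(",") if isinstance(raw, str) else list(raw)
    seen = set()
    result = []
    for part in parts:
        keyword = " ".join(str(part).split())   # trim and collapse inner whitespace
        key = keyword.lower()                   # compare case-insensitively
        if keyword and key not in seen:
            seen.add(key)
            result.append(keyword)
    return result
```

Phrases such as "national parks" are kept whole, because only commas (or list boundaries) separate keywords.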


A note about semantic data for product price and availability:
If you extract the price and availability of a product from a web page (both the ld+json and microdata frameworks provide this information in a structured way) and intend to present that data for your own purposes, please note that many businesses do not allow you to fetch it once and leave it static. You must be able to refresh the price and availability dynamically any time you 'offer' the product. Businesses such as Target, Amazon and Best Buy require this, and they provide a web service for getting the updated information. An example is Target's "Redsky" service.
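
One way to honor that requirement is to never display a stored price directly, and to always go through a small wrapper that re-fetches it. The sketch below assumes a caller-supplied `fetch_offer` function standing in for whatever web service the retailer provides. It is a hypothetical placeholder, and no real retailer API is shown. By default every call refreshes; `max_age` exists only for cases where a retailer's terms explicitly allow a short reuse window.

```python
import time


class OfferCache:
    """Returns price/availability, re-fetching whenever the stored copy is too old."""

    def __init__(self, fetch_offer, max_age=0, clock=time.time):
        # fetch_offer: hypothetical callable, product_id -> dict with price/availability
        self._fetch = fetch_offer
        self._max_age = max_age     # seconds; 0 means "refresh on every call"
        self._clock = clock
        self._cache = {}            # product_id -> (fetched_at, offer)

    def current_offer(self, product_id):
        now = self._clock()
        hit = self._cache.get(product_id)
        if hit is None or now - hit[0] >= self._max_age:
            hit = (now, self._fetch(product_id))
            self._cache[product_id] = hit
        return hit[1]
```

Passing the clock in as a parameter is a design choice that makes the refresh behavior easy to test without waiting for real time to pass.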

These three things from a given web site:
  • Links
  • Images
  • Text
along with the embedded semantic information as explained above, are pretty much all of the worthwhile information there is on a web page that is not domain specific.
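
Pulling those three things out of a page can be sketched as follows, again with the standard library only. Relative URLs are resolved against the page's own address, and text inside <script> and <style> is skipped. The names and the sample page are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up sample page.
SAMPLE_ASSET_PAGE = """<html><head><title>Trails</title><style>p{color:red}</style></head>
<body><p>Best <a href="/hikes">hikes</a> nearby.</p>
<img src="img/peak.jpg" alt="Peak"><script>var x = 1;</script></body></html>
"""


class PageAssets(HTMLParser):
    SKIP = {"script", "style"}   # their contents are code, not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []
        self._text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._text.append(" ".join(data.split()))

    @property
    def text(self):
        return " ".join(self._text)


def extract_assets(html, base_url):
    parser = PageAssets(base_url)
    parser.feed(html)
    parser.close()
    return parser


assets = extract_assets(SAMPLE_ASSET_PAGE, "https://example.com/guide/")
```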


Important:
If an image, text or link on a web page is retrieved via the above frameworks or via RSS, you can assume the owner of that webpage has decided to allow those assets to be freely distributed. Otherwise, you must assume that all other assets are copyrighted by the owner of that webpage.