Bitcoin-Data

As of 2017 there are many ways to download raw bitcoin data. To use bitcoin transaction data, however, we first need to understand it, extract it and represent it in a meaningful format, all of which takes time and resources. To make your life easier, we are providing the extracted bitcoin data in CSV format, ready to be used directly for your own requirements.

This blog assumes the reader has some basic knowledge of Bitcoin, the blockchain and the mechanisms involved. Through it, we mainly want to help people who need transaction data that is close to real-world transactions.

Contents

  1. Bitcoin Transaction
    1. Understanding Bitcoin Transaction with an example
    2. Important points on bitcoin transactions
  2. Libraries Used For Extracting And Representing Data
  3. Path to Download Extracted Data
  4. Representing Extracted Bitcoin data
  5. Finding Missing from-address and amount
  6. Representing Extracted Data Similar To Real Transactions
  7. Data-Loss
    1. Quantifying Data-Loss

Bitcoin Transaction

A bitcoin transaction is more complex than a simple transfer of bitcoins from one address (the owner) to another (the recipient). Each transaction transfers bitcoins from one or more inputs to one or more outputs.

Understanding Bitcoin Transaction with an example

Let’s say you run a business and accept bitcoin as a currency, and assume you have three clients: A, B and C. You give a separate receiving address (address-P1, address-P2 and address-P3) to A, B and C respectively, to keep track of who paid what. For your service:

  • A paid 0.9 BTC
  • B paid 1.2 BTC
  • C paid 0.6 BTC

By the end of the day you have a balance of 2.7 BTC, which actually means you have three “unspent outputs”:

  • 0.9 BTC in address-P1
  • 1.2 BTC in address-P2
  • 0.6 BTC in address-P3

Although all the bitcoins are yours, they are associated with different addresses.

Image1

On day two, A pays another 0.2 BTC for a second service. Now you have four “unspent outputs”:

  • 0.9 BTC and 0.2 BTC (two separate outputs) in address-P1
  • 1.2 BTC in address-P2
  • 0.6 BTC in address-P3

This totals 2.9 BTC.

Image2

From the above image we can see that your addresses are the output addresses of each transaction made by A, B and C, and these four transactions show up as “unspent outputs” in your account.

Now assume that providing services to A, B and C cost you 2.55 BTC, owed to your service provider F, and that you pay F 2.55 BTC at his address (address-F).

Image3

From the above image we can see that there are three inputs and two outputs in our transaction to F. The inputs are the amounts you received from A, B and C, and the outputs are the addresses you are transferring bitcoins to.

But wait, why are we seeing two outputs although you paid only F?

You always spend “unspent outputs”: you cannot simply subtract 2.55 BTC from your account, because the bitcoins are tied to the addresses they were sent to. Furthermore, you always spend outputs as a whole; you cannot break an unspent output into smaller pieces. What actually happens is that you send three (out of four) of your unspent outputs:

  • 0.9 BTC from address-P1
  • 1.2 BTC from address-P2
  • 0.6 BTC from address-P3

and get back 0.145 BTC as “change” in a new, automatically generated address (address-B). The outputs are therefore:

  • 2.55 BTC to F’s address (address-F)
  • 0.145 BTC to yourself at address-B

Note: You might expect the total input amount of a bitcoin transaction to always match the total output amount.

But in your transaction the total input amount is 2.7 BTC and the total output amount is 2.695 BTC.

What happened to the remaining 0.005 BTC?
This is where transaction fees come in. For every transaction, the user can set aside some amount as a transaction fee to get the transaction processed faster; the fee is simply the difference between the total inputs and the total outputs. We won’t go deeper into why we pay transaction fees, to whom we pay them or who gains from them. On an abstract level, just think of them as paid to those who validate the transactions.

Finally, after the transaction to F your balance is 0.345 BTC (that is, your unspent outputs sum to 0.345 BTC): the untouched 0.2 BTC output plus the 0.145 BTC change.

Image4
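The arithmetic of this example can be checked with a small Python sketch. It is only an illustration: the function build_payment and the data layout are hypothetical names made up for this post, not part of any bitcoin library, and amounts are held as Decimal values to avoid floating-point rounding.

```python
from decimal import Decimal

# The four unspent outputs from the example, as (address, amount in BTC).
unspent = [
    ("address-P1", Decimal("0.9")),
    ("address-P1", Decimal("0.2")),
    ("address-P2", Decimal("1.2")),
    ("address-P3", Decimal("0.6")),
]


def build_payment(selected, pay_to, pay_amount, fee, change_address):
    """Spend the selected outputs whole and return the transaction outputs."""
    total_in = sum(amount for _, amount in selected)
    change = total_in - pay_amount - fee
    if change < 0:
        raise ValueError("selected outputs do not cover payment plus fee")
    outputs = [(pay_to, pay_amount)]
    if change > 0:
        # Whatever is left after the payment and the fee comes back as change.
        outputs.append((change_address, change))
    return outputs


# Spend three of the four outputs; the 0.2 BTC output stays untouched.
selected = [unspent[0], unspent[2], unspent[3]]
outputs = build_payment(
    selected, "address-F", Decimal("2.55"), Decimal("0.005"), "address-B"
)

change = outputs[1][1]
balance = unspent[1][1] + change
print(outputs)
print("balance after paying F:", balance)
```

The payment output is 2.55 BTC, the change output is 0.145 BTC, and the remaining balance is the 0.2 BTC output plus the change.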

Now, sometimes you are going to see some transactions like this:

Image4

In the above image, the addresses on the left side simply happen to hold unspent outputs that are now used as inputs. But why are there hundreds of outputs in just one transaction?

As we know, every transaction carries a transaction fee. Take as an example a mining pool that makes daily payments to its members. It cannot afford to make one transaction per member every day, as it would spend a lot of bitcoin on transaction fees. To avoid this, the pool sends different amounts of bitcoin to different addresses in a single transaction, which minimizes the fees.

Why can’t we just send the change back to the input address, and why can’t we restrict each user to a single address? Wouldn’t that make this simpler to understand?

One of bitcoin’s main features is anonymity. It hides the user’s identity by letting the user create any number of addresses. This makes it hard to track a particular user: one cannot just dig through all bitcoin transactions of some address “abcdab” and see how much the user has spent and what the user’s balance is. Moreover, one bitcoin transaction may actually contain multiple individual payments made by the user.

Important points on bitcoin transactions

  • Since a bitcoin user can generate a new address for each transaction, we cannot trace transactions back to an individual’s account (wallet).
  • One bitcoin transaction can contain multiple individual payments made by the user.
  • Even within a transaction, nothing indicates how much a given input sent to a given output.

Libraries Used For Extracting And Representing Data

  • We used the open-source Java library Bitcoinj to extract the raw bitcoin data.
  • We used the CSV format to represent the extracted data.
  • Finally, we used the open-source Python library Pandas to process the CSV files.

Path to Download Extracted Data

TODO

Representing Extracted Bitcoin data

tx_id  previous_hash  type  index       address  amount  total   time
4a5…   000…           I     4294967295                           1231006505
a10…   b2a…           I     0                                    1134967512
za1…                  O     2           azx…     500910  910823  1108496751
f30…                  O     1           E        500910  910823  1222096742

The columns are:

  • Tx_id —> transaction id of the particular transaction
  • Previous_hash —> id of the transaction from which the user got the input
  • Type —> input (“I”) or output (“O”)
  • Index —> for an output, its index within the transaction; for an input, the index of the previous transaction’s output being spent
  • Address —> address associated with the input/output
  • Amount —> amount associated with the input/output
  • Total —> sum of all output amounts of the transaction
  • Time —> time at which the block containing the transaction was created

Understanding the represented transactions

In all the rows above, “total” is the sum of all output amounts of the transaction.

  • Row 1: The “previous_hash” is a 64-character hexadecimal hash string whose value is all zeros (“000…”), meaning this is the input of a coinbase transaction. Such an input does not spend any previous output, so it has no address or amount.

  • Row 2: The “address”, “amount” and “total” are missing. Since this row is of type “I” (one of the inputs of a transaction), it carries no details about the amount transferred or the from-address. These values can be found using the provided “previous_hash”.

  • Row 3: Only the “previous_hash” is missing. This row is of type “O” (one of the outputs of a transaction). Outputs carry no “previous_hash” because they are new outputs created by the user. The “address” is the recipient.

  • Row 4: This row is similar to row 3, but its address value is “E”, meaning the address was missing or an error occurred while extracting the raw bitcoin transaction.

Note: After extracting all raw transactions to CSV format, we filtered out the outputs whose address value is “E”.
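As a rough sketch, the four kinds of rows described above could be told apart in Python as follows. The comma-separated layout, the classify helper and the short placeholder ids are assumptions for illustration (the real files hold full hashes). The standard csv module is used here to keep the example dependency-free, although the same filter is easy to express in Pandas.

```python
import csv
import io

COINBASE_HASH = "0" * 64  # an all-zero previous_hash marks a coinbase input

# Hypothetical sample mirroring the four rows above (ids are placeholders).
SAMPLE = "\n".join([
    "tx_id,previous_hash,type,index,address,amount,total,time",
    f"4a5,{COINBASE_HASH},I,4294967295,,,,1231006505",
    "a10,b2a,I,0,,,,1134967512",
    "za1,,O,2,azx,500910,910823,1108496751",
    "f30,,O,1,E,500910,910823,1222096742",
])


def classify(row):
    """Label a row as coinbase, input, output or error_output."""
    if row["type"] == "I":
        if row["previous_hash"] == COINBASE_HASH:
            return "coinbase"
        return "input"  # address and amount must be looked up later
    if row["address"] == "E":
        return "error_output"  # address could not be extracted
    return "output"


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
labels = [classify(row) for row in rows]

# Drop the outputs whose address is "E", as described in the note above.
kept = [row for row, label in zip(rows, labels) if label != "error_output"]
print(labels)
print(len(kept), "rows kept")
```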

Finding Missing from-address and amount

From the above discussion it is clear that the outputs of one transaction become the inputs of another, and that these inputs and outputs are linked by their index values. Hence, to find the missing address and amount of an input, we trace back to the transaction named by its previous_hash. For example:

tx_id  previous_hash  type  index  address  amount  total   time
1a5…   cs1…           I     2                               1231006505
v1s…   fc0…           I     1                               1134967512
cs1…                  O     2      azx…     311822  510452  1108496751
fc0…                  O     1      abz…     500910  916923  1222096742
2b5…   1s1…           I     0                               1231006505

In the above table:

  • Row 1 is of type “I” with index value “2”, and its previous_hash is “cs1…”. Looking for the row of type “O” with tx_id “cs1…” and index “2”, we find that row 3 matches. Hence the address and amount values of row 3 are filled into row 1.
  • Similarly, row 2 is of type “I” with index value “1”, and its previous_hash is “fc0…”. Looking for the row of type “O” with tx_id “fc0…” and index “1”, we find that row 4 matches. Hence the address and amount values of row 4 are filled into row 2.
  • Row 5 stays incomplete, because the table holds no output with tx_id “1s1…”.

The resulting table then looks like:

tx_id  previous_hash  type  index  address  amount  total   time
1a5…   cs1…           I     2      azx…     311822          1231006505
v1s…   fc0…           I     1      abz…     500910          1134967512
cs1…                  O     2      azx…     311822  510452  1108496751
fc0…                  O     1      abz…     500910  916923  1222096742
2b5…   1s1…           I     0                               1231006505
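The lookup described above can be sketched in Python as a dictionary keyed by (tx_id, index). This is a minimal illustration under our own assumptions: the function name fill_missing_inputs and the shortened placeholder ids are hypothetical, and the real processing was done on full CSV files rather than an in-memory list.

```python
# Rows from the example table; ids are shortened placeholders.
rows = [
    {"tx_id": "1a5", "previous_hash": "cs1", "type": "I", "index": 2,
     "address": None, "amount": None},
    {"tx_id": "v1s", "previous_hash": "fc0", "type": "I", "index": 1,
     "address": None, "amount": None},
    {"tx_id": "cs1", "previous_hash": None, "type": "O", "index": 2,
     "address": "azx", "amount": 311822},
    {"tx_id": "fc0", "previous_hash": None, "type": "O", "index": 1,
     "address": "abz", "amount": 500910},
    {"tx_id": "2b5", "previous_hash": "1s1", "type": "I", "index": 0,
     "address": None, "amount": None},
]


def fill_missing_inputs(rows):
    """Copy address and amount from the matching output into each input."""
    # Index every output by the transaction it belongs to and its position.
    outputs = {
        (row["tx_id"], row["index"]): row for row in rows if row["type"] == "O"
    }
    filled = []
    for row in rows:
        row = dict(row)  # work on a copy, leave the original rows untouched
        if row["type"] == "I":
            source = outputs.get((row["previous_hash"], row["index"]))
            if source is not None:
                row["address"] = source["address"]
                row["amount"] = source["amount"]
        filled.append(row)
    return filled


filled = fill_missing_inputs(rows)
for row in filled:
    print(row["tx_id"], row["type"], row["address"], row["amount"])
```

Inputs whose previous transaction is not available, like the last row, simply keep their missing values.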

Representing Extracted Data Similar To Real Transactions

Our main idea is to represent the available data in a form close to real transactions. To do so, we made some assumptions:

  • Each address represents a unique account.
  • Transaction fees are excluded.
  • Since there is no indicator linking inputs to outputs within a transaction, we divided each input’s amount proportionately among the outputs.

Image5

From the above diagram we can see that each input’s amount is distributed proportionately among all outputs, which gives the diagram below. The output amounts come out slightly higher because transaction fees are ignored.

Image6
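One way to read the proportional split is that every input sends each output a share equal to that output’s fraction of the total output amount. The sketch below follows that reading; the exact formula is our assumption based on the description above, the name split_proportionally is hypothetical, and the amounts reuse the earlier payment to F expressed in satoshi (1 BTC = 100,000,000 satoshi).

```python
from fractions import Fraction


def split_proportionally(inputs, outputs):
    """Return (from_address, to_address, amount) rows for one transaction.

    Each input's amount is divided among the outputs in proportion to the
    output amounts, so the fee is spread over the outputs rather than removed.
    """
    total_out = sum(amount for _, amount in outputs)
    transfers = []
    for from_address, in_amount in inputs:
        for to_address, out_amount in outputs:
            share = Fraction(out_amount, total_out)
            transfers.append((from_address, to_address, in_amount * share))
    return transfers


# Inputs and outputs of the payment to F, in satoshi.
inputs = [
    ("address-P1", 90_000_000),
    ("address-P2", 120_000_000),
    ("address-P3", 60_000_000),
]
outputs = [("address-F", 255_000_000), ("address-B", 14_500_000)]

transfers = split_proportionally(inputs, outputs)
received_by_f = sum(amount for _, to, amount in transfers if to == "address-F")

for from_address, to_address, amount in transfers:
    print(from_address, "->", to_address, round(float(amount), 2))
```

Three inputs and two outputs give six from-address/to-address rows. Because the fee is not subtracted, address-F is credited slightly more than the 255,000,000 satoshi it actually received, which is the small rise mentioned above.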

Finally, the resulting transactions are represented in a CSV file as:

tx_id  from-address  to-address  amount    time
1a5…   abc…          axq…        2092.97   1231006505
v1s…   axa…          zaq…        23097.23  1134967512
cs1…   zxq…          zza…        33972.2   1108496751
fc0…   zap…          qqe…        1000.91   1222096742
2b5…   1s1…          mnq…        201746.0  1231006505

Download a sample file to view the extracted transactions

Sample File

Data-Loss

There is some loss of data because of:

  • The bitcoin data extractor (Bitcoinj)
    • It may miss some blocks.
    • It cannot extract all the information from each block.
  • The back-tracking needed to find each input’s address and amount, which we limited to around 15 lakh (1.5 million) blocks.

Quantifying Data-Loss

  • Total transactions —> 181,956,095
  • Total outputs —> 505,748,719
  • Total inputs —> 464,248,536
  • Total coinbase transactions (inputs) —> 445,007
  • Number of inputs which were calculated by back-tracking the blocks —> 391,667,619
  • Uncalculated inputs —> 72,135,910
  • Outputs with missing address —> 3,544,300

Information Loss = (Uncalculated inputs + Outputs with missing address) / Total inputs = 16.3 %
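The figure can be reproduced with a few lines of Python using the counts listed above. The variable names are ours, and the consistency check (total inputs = coinbase inputs + calculated inputs + uncalculated inputs) is an observation about the published numbers rather than something stated by the extraction tool.

```python
total_inputs = 464_248_536
coinbase_inputs = 445_007
calculated_inputs = 391_667_619
uncalculated_inputs = 72_135_910
outputs_missing_address = 3_544_300

# The three kinds of inputs add up to the total number of inputs.
assert coinbase_inputs + calculated_inputs + uncalculated_inputs == total_inputs

information_loss = (uncalculated_inputs + outputs_missing_address) / total_inputs
print(f"Information loss: {information_loss:.1%}")  # Information loss: 16.3%
```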

About Manoj Kumar
R&D Engineer