As of 2017 there are many ways to download raw bitcoin data. To use this transaction data, however, we first need to understand, extract and represent it in some meaningful format, which takes time and resources. To make your life easier, we have extracted the bitcoin data into CSV format so that it can be used directly for your own requirements.
This blog assumes the reader has some basic knowledge of Bitcoin, blockchain and the mechanisms involved. Through this blog, we mainly aim to help people who need transaction data that is close to real-world transactions.
- Bitcoin Transaction
- Understanding Bitcoin Transaction with an example
- Important points on bitcoin transactions
- Libraries Used For Extracting And Representing Data
- Path to Download Extracted Data
- Representing Extracted Bitcoin data
- Finding Missing from-address and amount
- Representing Extracted Data Similar To Real Transactions
- Quantifying Data-Loss
A bitcoin transaction is more complex than simply transferring bitcoins from one address (owner) to another (recipient). Each transaction transfers bitcoins from one or more inputs to one or more outputs.
Understanding Bitcoin Transaction with an example
Let’s say you have a business and accept bitcoin as a currency, and assume you have 3 clients: A, B and C. To keep track of who paid what, you give the receiving addresses address-P1, address-P2 and address-P3 to A, B and C respectively. For your service:
- A paid 0.9 BTC
- B paid 1.2 BTC
- C paid 0.6 BTC
By the end of the day, you have a balance of 2.7 BTC, which actually means you have three “unspent outputs”:
- 0.9 BTC in address-P1
- 1.2 BTC in address-P2
- 0.6 BTC in address-P3
On day two, A paid another 0.2 BTC for another of your services. Now you have four “unspent outputs” spread across the same three addresses:
- 0.9 BTC and 0.2 BTC (two separate outputs) in address-P1
- 1.2 BTC in address-P2
- 0.6 BTC in address-P3
This totals 2.9 BTC.
From the above image, we can see that your addresses are the output addresses of each transaction made by A, B and C, and that these four transactions are represented as “unspent outputs” in your account.
Now, assume you incurred a cost of 2.55 BTC to F for providing services to A, B and C, and you paid 2.55 BTC to your service provider F at his address (address-F).
From the above image we can see that there are three inputs and two outputs in our transaction to F. Here the inputs are the amounts you received from A, B and C, and the outputs are the addresses you are transferring bitcoins to.
But wait, why are we seeing two outputs although you have made a transaction only to F?
You always spend “unspent outputs”: you cannot simply subtract 2.55 BTC from your account, because the bitcoins are tied to the address they were sent to. Furthermore, you always spend outputs as a whole; you cannot break an unspent output into smaller pieces. What actually happens is that you spend three (out of four) of your unspent outputs:
- 0.9 BTC that was in address-P1
- 1.2 BTC in address-P2
- 0.6 BTC from address-P3
and get back 0.145 BTC as “change” in a new, automatically generated address (address-B). The outputs are therefore:
- 2.55 BTC to F's address (address-F)
- 0.145 BTC to yourself in address-B
Note: You might expect the total input amount of a bitcoin transaction to match the total output amount exactly.
But in your transaction the total input amount is 2.7 BTC and the total output amount is 2.695 BTC.
What happened to the remaining 0.005 BTC?
This is where transaction fees come in: for every transaction, the user can allocate some amount as a transaction fee to get the transaction processed faster. We won't go deeper into why we pay transaction fees, to whom we pay them or who gains from them. At an abstract level, just think of these fees as paid to those who validate the transactions.
Finally, your balance is now 0.345 BTC (that is, your unspent outputs sum to 0.345 BTC) after the transaction to F.
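The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the bookkeeping described above; the variable names are ours, and `Decimal` is used so that the BTC amounts are not distorted by floating-point rounding.

```python
from decimal import Decimal

# Unspent outputs being spent (the inputs of the payment to F), in BTC.
inputs = [Decimal("0.9"), Decimal("1.2"), Decimal("0.6")]

# Outputs created by the transaction: the payment to address-F
# and the change sent back to address-B.
outputs = [Decimal("2.55"), Decimal("0.145")]

# The fee is not stored as a separate field in this example;
# it is simply what is left over after the outputs are paid.
fee = sum(inputs) - sum(outputs)

# Balance afterwards: the untouched 0.2 BTC output plus the change.
balance = Decimal("0.2") + Decimal("0.145")

print("fee:", fee, "BTC")
print("balance:", balance, "BTC")
```

The fee comes out as the difference between total inputs and total outputs, which is why it never appears as an output of its own.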
Now, sometimes you are going to see some transactions like this:
In the above image, the addresses on the left side simply happen to hold a sum of unspent outputs that are now inputs. But why are there hundreds of outputs in just one transaction?
As we know, every transaction carries a transaction fee. Take as an example a mining pool that makes daily payments to its members: it cannot afford to make one transaction to each member every day, because it would spend a lot of bitcoin on transaction fees. To avoid this, the pool sends different amounts of bitcoins to different addresses in a single transaction, which minimizes the transaction fees.
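To get a feel for the saving, here is a rough back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the size formula is a commonly used approximation for older-style transactions (about 148 bytes per input, 34 bytes per output and 10 bytes of overhead), and the fee rate and the number of pool members are invented.

```python
def estimated_size(n_inputs, n_outputs):
    """Approximate size in bytes of a transaction (rule of thumb, not exact)."""
    return 10 + 148 * n_inputs + 34 * n_outputs

FEE_RATE = 50   # satoshis per byte (hypothetical value)
MEMBERS = 100   # pool members to pay (hypothetical value)

# Option 1: one transaction per member, each with 1 input
# and 2 outputs (the member and the change).
separate_bytes = MEMBERS * estimated_size(1, 2)

# Option 2: a single transaction with 1 input and one output
# per member, plus one change output.
batched_bytes = estimated_size(1, MEMBERS + 1)

separate_fee = separate_bytes * FEE_RATE
batched_fee = batched_bytes * FEE_RATE

print("separate payments:", separate_bytes, "bytes,", separate_fee, "satoshis")
print("one batched payment:", batched_bytes, "bytes,", batched_fee, "satoshis")
```

Under these assumptions the batched payment is several times smaller, because the cost of the input and the fixed overhead is paid once instead of once per member.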
Why can’t we just send the change back to the input address, and why can’t we just restrict each user to a single address per account? Wouldn’t that make this simpler to understand?
One of bitcoin's main features is anonymity. It hides the user's identity by letting the user create any number of addresses. This makes it hard to track a particular user: one cannot just dig through all bitcoin transactions of some address “abcdab” and see how much the user has spent and what the user's balance is. Moreover, one bitcoin transaction may actually contain multiple individual payments made by the user.
Important points on bitcoin transactions
- Since a bitcoin user can generate multiple addresses for multiple transactions, we cannot trace them back to the actual individual's account (wallet).
- One bitcoin transaction can contain multiple individual payments made by the user.
- Even within a transaction, there is no indication of how much a particular input A sent to a particular output B.
Libraries Used For Extracting And Representing Data
- We used the open-source Java library bitcoinj for extracting raw bitcoin data.
- For representing the extracted data we used CSV format.
- And finally, we used the open-source Python library Pandas for processing the CSV files.
Path to Download Extracted Data
Representing Extracted Bitcoin data
- Tx_id ---> transaction id of the particular transaction
- Previous_hash ---> transaction id from where the user got inputs
- Type ---> input or output
- Index ---> index associated with type
- Address ---> associated address to the input/outputs
- Amount ---> amount associated to inputs/outputs
- Total ---> sum of all output amounts of the transaction
- Time ---> time at which the block containing the transaction was created
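As a minimal sketch of how such a file could be loaded with Pandas, consider the snippet below. The lowercase column names, their order and the sample rows are assumptions made for illustration; the real files may differ, so adjust the names to match the header of the file you download.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for a real extracted CSV file.
sample_csv = io.StringIO(
    "tx_id,previous_hash,type,index,address,amount,total,time\n"
    "aa1,,O,0,addr-1,50.0,50.0,2017-01-01 00:00:00\n"
    "bb2,aa1,I,0,,,,2017-01-02 00:00:00\n"
    "bb2,,O,0,addr-2,49.9,49.9,2017-01-02 00:00:00\n"
)

df = pd.read_csv(
    sample_csv,
    # Keep hashes and addresses as strings.
    dtype={"tx_id": str, "previous_hash": str, "address": str},
    parse_dates=["time"],
)

# Split the rows by type: "I" for inputs, "O" for outputs.
inputs = df[df["type"] == "I"]
outputs = df[df["type"] == "O"]

print(len(inputs), "input rows,", len(outputs), "output rows")
```

To read a real file, pass its path to `pd.read_csv` in place of `sample_csv`.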
Understanding the represented transaction
In all the above rows, "total" is the sum of all output amounts.
Row 1: In this row, “previous_hash” is a 64-character hexadecimal string (a 256-bit hash) consisting entirely of zeros. This means the row is the input of a coinbase transaction, which has only one input and does not spend any previous output.
Row 2: The "address", "amount" and "total" are missing. Since this row is of type "I" (one of the inputs of a transaction), it carries no details about the amount transferred or the from-address. These values can, however, be found using the provided “previous_hash”.
Row 3: Here only "previous_hash" is missing. This row is of type "O" (one of the outputs of a transaction). Outputs carry no "previous_hash", as they are new outputs created by the user. The "address" represents the recipient.
Row 4: This row is similar to row 3, but its address value is "E", meaning the address was missing or an error occurred while extracting the raw bitcoin transaction.
Note: After extracting all raw transactions to CSV format, we filtered out the outputs whose address value is "E".
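A filter of this kind could look roughly like the following in Pandas. The column names and rows are hypothetical, and this is a sketch of the idea rather than the exact code we ran.

```python
import pandas as pd

# Hypothetical rows: two outputs, one of which failed address
# extraction (marked "E"), and one input with no address yet.
df = pd.DataFrame({
    "tx_id": ["aa1", "aa1", "bb2"],
    "type": ["O", "O", "I"],
    "index": [0, 1, 0],
    "address": ["addr-1", "E", None],
})

# Drop only the output rows whose address is the error marker "E".
is_bad_output = (df["type"] == "O") & (df["address"] == "E")
clean = df[~is_bad_output].reset_index(drop=True)

print(clean)
```

Input rows are kept even though their address is empty, since those values are filled in later.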
Finding Missing from-address and amount
From the above discussion it is clear that the outputs of one transaction become the inputs of another, and that these inputs and outputs are linked by their index values. Hence, the logic for finding the missing address and amount of an input is to trace back to the previous transaction named by its previous_hash. For example, in the above table:
- Row 1 is of type "I" with index value "2", and its previous_hash is "cs1...". If we look for the row whose tx_id is "cs1..." and whose type is "O" with the same index value "2", row 3 matches. Hence the address and amount values in row 3 are filled into row 1.
- Similarly, row 2 is of type "I" with index value "1", and its previous_hash is "fc0...". If we look for the row whose tx_id is "fc0..." and whose type is "O" with the same index value "1", row 4 matches. Hence the address and amount values in row 4 are filled into row 2.
The resultant table then looks like:
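In Pandas, this lookup can be sketched as a join of the input rows against the output rows. The rows below are made up to mirror the example (the transaction id "tx9", the addresses and the amounts are invented), and the column names are assumptions.

```python
import pandas as pd

# Made-up rows: two inputs with unknown address and amount,
# and the two earlier outputs that they spend.
rows = pd.DataFrame({
    "tx_id":         ["tx9", "tx9", "cs1", "fc0"],
    "previous_hash": ["cs1", "fc0", None, None],
    "type":          ["I", "I", "O", "O"],
    "index":         [2, 1, 2, 1],
    "address":       [None, None, "addr-X", "addr-Y"],
    "amount":        [None, None, 0.5, 1.25],
})

# Inputs lack address and amount; outputs carry them.
inputs = rows[rows["type"] == "I"].drop(columns=["address", "amount"])
outputs = rows[rows["type"] == "O"][["tx_id", "index", "address", "amount"]]

# Match each input's (previous_hash, index) to an output's (tx_id, index).
filled = inputs.merge(
    outputs.rename(columns={"tx_id": "previous_hash"}),
    on=["previous_hash", "index"],
    how="left",
)

print(filled)
```

With `how="left"`, an input whose previous output cannot be found keeps an empty address and amount instead of being dropped, so such inputs can be counted afterwards.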
Representing Extracted Data Similar To Real Transactions
Since our main idea is to represent the available data as close to real transactions as possible, we have made some assumptions:
- Each address represents a unique account.
- Transaction fees are excluded.
- Since there is no indicator linking inputs to outputs within a transaction, we divided the input amounts proportionately among the outputs.
From the above diagram we can see that each input's amount is proportionately distributed among all outputs. As a result of this we get the below diagram. We see a small rise in the output amounts because transaction fees are ignored.
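One possible reading of this proportional split, sketched in Python: each input sends every output a share equal to that output's fraction of the total output amount. The function name is ours, and the numbers reuse the payment to F from the earlier example.

```python
def distribute(inputs, outputs):
    """Split every input across all outputs in proportion to output size.

    Returns a list of (from_address, to_address, amount) tuples.
    """
    total_out = sum(outputs.values())
    flows = []
    for src, amount_in in inputs.items():
        for dst, amount_out in outputs.items():
            flows.append((src, dst, amount_in * amount_out / total_out))
    return flows

inputs = {"address-P1": 0.9, "address-P2": 1.2, "address-P3": 0.6}
outputs = {"address-F": 2.55, "address-B": 0.145}

flows = distribute(inputs, outputs)
for src, dst, amount in flows:
    print(src, "->", dst, round(amount, 6))

# Because the fee is ignored, the full 2.7 BTC of inputs is shared out,
# so address-F appears to receive slightly more than 2.55 BTC.
received_by_f = sum(amount for _, dst, amount in flows if dst == "address-F")
print("address-F receives:", round(received_by_f, 6))
```

Three inputs and two outputs therefore become six simple one-to-one transactions.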
Finally, the resulting transactions are represented in a CSV file as:
Download a sample file to view the extracted transactions
There is some loss of data because of:
- Limitations of the bitcoin data extractor (bitcoinj).
- The chance of missing some blocks.
- Being unable to extract all the information from each block.
- The need to trace back to find each input's address and amount, with the trace-back length limited to around 15 lakh blocks.
- Total transactions ---> 18,19,56,095
- Total outputs ---> 50,57,48,719
- Total inputs ---> 46,42,48,536
- Total coinbase transactions (inputs) ---> 4,45,007
- Number of inputs which were calculated by back tracking the blocks ---> 39,16,67,619
- Uncalculated inputs ---> 7,21,35,910
- Outputs with missing address ---> 35,44,300
Information Loss = (Uncalculated inputs + Outputs with missing address) / Total inputs = 16.3 %
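The figure can be reproduced from the counts listed above. The snippet below simply restates those numbers (without the digit grouping) and also checks that the three kinds of inputs add up to the total.

```python
total_inputs = 464_248_536            # 46,42,48,536
coinbase_inputs = 445_007             # 4,45,007
calculated_inputs = 391_667_619       # 39,16,67,619
uncalculated_inputs = 72_135_910      # 7,21,35,910
outputs_missing_address = 3_544_300   # 35,44,300

# Sanity check: every input is either coinbase, calculated or uncalculated.
accounted_for = coinbase_inputs + calculated_inputs + uncalculated_inputs
print("inputs add up:", accounted_for == total_inputs)

information_loss = (uncalculated_inputs + outputs_missing_address) / total_inputs
print(f"Information loss: {information_loss:.1%}")
```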